Next Generation Data Science and Data Engineering Workflow Frameworks
A Look Below the Surface
Several new Python-based, open-source data engineering / data science frameworks have come out in the past few years. I didn’t see any grouped comparison of these tools in a single post, so I decided to go through each tool’s tutorial and write down my impressions. I’ll be looking at the following frameworks (though I know there are more out there):
- MLflow (ML management)
- Metaflow (data workflow + ML management)
- Kedro (data workflow + project structure / prototyping)
- Prefect (data workflow)
- Dagster (data workflow)
Spoiler alert — I think the two data workflow tools at the end are aiming to replace the two big batch processing tools that I know about in this space: Airflow and Luigi. Much has been written about these two tools as they’ve matured over the past 5–10 years, so I won’t go into too much detail on these.
As a caveat, I did all the tutorials in Python 3.7, locally on my MacBook, and I did not test out the cloud integrations that many of these tools have. I haven’t used any of these tools in production, and many of them are still in the early or middle stages of development. These are just my thoughts after reading the documentation and going through the tutorials; I’m sure there are lower-level details that I’m missing.
Also, I’d be remiss if I didn’t at least mention a project that my own company works on in this space: https://opendatahub.io/ . It’s a wide-ranging suite of tools aiming to be a complete, end-to-end AI platform. It’s not as easy to work with locally, so I’ll save it for a separate post.
First up, MLflow (ML management)
https://www.mlflow.org/docs/latest/tutorials-and-examples/tutorial.html
Origin and First Impressions
MLflow is a data-science-specific framework developed by Databricks and open sourced in mid-2018. It is not so much a generic DAG-centric execution engine as a high-level tool that aims to simplify the workflow of a data scientist specifically, in three ways:
- Tracking experiments and runs in a nice UI
- Easily creating a reproducible project and runtime
- Serving and managing ML models
Tasks / Workflow
I was able to get through the tutorial with only a few minor (mostly syntactic) issues and see all three components work. Depending on which aspects of the project you want to use, incorporating it into your existing workflow is basically as simple as adding a few context managers and some log-like statements to your code base. The CLI also has integrations with Azure ML and AWS SageMaker.
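To give a sense of how light that touch is, here is a minimal tracking sketch (the run, parameter and metric names are placeholders of my own):

```python
import mlflow

# Everything logged inside the run context is recorded against a single run
# and shows up in the tracking UI (launched with `mlflow ui`).
with mlflow.start_run(run_name="baseline"):
    mlflow.log_param("alpha", 0.5)             # hyperparameters
    mlflow.log_metric("rmse", 0.78)            # evaluation results
    mlflow.log_artifact("model_summary.txt")   # any file worth keeping with the run
```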
Concluding Thoughts
The abstraction layer presented here is about right for what it’s trying to accomplish. It’s not too complex, overly verbose or hard to start using, and the benefits can be seen quickly. Another nice thing about MLflow is that the three main pieces of functionality are linked together, but you can fairly easily pick and choose which ones you want to use if you don’t want to incorporate all three. I particularly enjoyed the tracking server UI, and am actively considering using just this piece at my place of work. I could see this project being useful in conjunction with one of the other projects here, specifically for data scientists.
Metaflow (data workflow + ML management)
https://docs.metaflow.org/getting-started/tutorials
Origin and First Impressions
Metaflow, “a framework for real-life data science,” was open sourced by Netflix in late 2019. It’s a tool they developed internally to streamline and simplify the workflow of their data scientists and improve their productivity. Installing it seemed lightweight, and the CLI has options for pulling and running the tutorial.
This is what the tutorial set itself up to cover:
- Explore fault tolerant flows with @catch, @retry, @timeout decorators
- Try a library like Tensorflow with @conda and @batch
- Debug failed runs with resume
- Access large data fast with metaflow.S3
- Explore results across experiments with the Client API
- See how to organize results and collaborate with namespaces and tags
Tasks / Workflow
In the step method definitions of your flow class (their units of work), you call self.next() to point each step at its successor, and every flow finishes at a step named end. The CLI will complain if you don’t wire this up correctly when you try to run your pipeline. I didn’t see simple function decorators, only the FlowSpec class to inherit from and the @step decorator for its methods.
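To make that concrete, here is roughly what a minimal flow looks like (a sketch of my own, following the tutorial’s pattern):

```python
from metaflow import FlowSpec, step


class HelloFlow(FlowSpec):

    @step
    def start(self):
        # Attributes assigned to self are persisted and available in later steps.
        self.message = "hello, Metaflow"
        self.next(self.end)  # wire this step to its successor

    @step
    def end(self):
        # Every flow finishes at a step named `end`.
        print(self.message)


if __name__ == "__main__":
    HelloFlow()  # run with: python hello_flow.py run
```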
Apparently, you can specify particular versions of libraries for individual flows / steps, which I think is a fairly unique feature. Conda is needed for this, and I had trouble getting the functionality to work locally. Adopting Metaflow in an existing code base would, I think, be a little more challenging, though maybe not by much. It persists the results of each step, so you could probably add production run-time data QA checks without too much trouble. There are retries for failing tasks, decorators for catching errors, and the ability to resume a pipeline without recomputing everything leading up to a failure. For more complicated workflows, I saw ways of fairly easily parallelizing tasks, but I didn’t see dynamic workflow features as extensive as, say, Prefect’s.
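The dependency pinning and fault tolerance come from step-level decorators. A rough sketch of how they are applied (I couldn’t get @conda working locally, so the pinned version here is purely illustrative):

```python
from metaflow import FlowSpec, step, retry, catch, conda


class RobustFlow(FlowSpec):

    @conda(libraries={"pandas": "1.0.3"})  # pin a library version for just this step
    @retry(times=2)                        # re-run the step on failure
    @catch(var="compute_error")            # record the exception instead of failing the flow
    @step
    def start(self):
        import pandas as pd
        self.row_count = len(pd.DataFrame({"a": [1, 2, 3]}))
        self.next(self.end)

    @step
    def end(self):
        print(self.row_count)


if __name__ == "__main__":
    RobustFlow()  # conda steps need: python robust_flow.py --environment=conda run
```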
Runs produce lots of metadata in the local file system; I assume this location can be changed to S3 or the like. It also looks like their CLI / runner integrates with pylint, which is cool. I saw a testing philosophy discussed in the “Internals of Metaflow” section of the documentation, but I didn’t see explicit testing techniques in the user documentation, so I’m not sure how hard it would be to test with this framework. The latter half of the tutorial involved working with AWS or an AWS sandbox that they provide. As that seemed complicated and out of scope for how I’d likely use the tool, I didn’t complete that part.
Concluding Thoughts
Metaflow seems fairly tied to AWS for its cloud integrations, so if you use a different cloud provider or want to do something different, it may be hard to adapt to your use case. But I think it’s fairly extensible as well, so maybe it’s just a matter of getting a community contributing other integrations back to it. It’s a mature product for the perhaps narrower audience it appeals to.
Kedro (data workflow + project structure / prototyping)
https://kedro.readthedocs.io/en/stable/03_tutorial/01_workflow.html
Origin and First Impressions
Kedro was open sourced in mid-2019 by QuantumBlack, a data and performance consulting company under McKinsey. Kedro is “a Python library that implements software engineering best-practice for data and ML pipelines”, and its tagline is “The centre of your data pipeline.”
Kedro seems to be opinionated about a lot. For example, it has a project-level folder structure that it wants you to adopt. There are also lots of config files, and those need to be hooked up to all of your data sources, logging, secrets, etc. I think these kinds of conventions are good, and certainly a step up from many data scientists’ typical jungle of Jupyter Notebooks, if the aim is to productionize workflows. In terms of adoptability, it may be easier for new projects than for existing ones. In addition, installation looked heavy, pulling in lots of dependencies.
Tasks / Workflow
As best I can tell, you write plain functions (not classes) that are then wrapped in node (unit of work) and pipeline structures. I was slightly confused about how the order of tasks is defined. It looks like you can run pipelines sequentially or in parallel, and you can do partial pipeline runs, but I actually had trouble running my pipeline.
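Here is my rough understanding of the core abstraction (the function and dataset names are made up):

```python
from kedro.pipeline import Pipeline, node


def clean_companies(companies):
    # Ordinary Python: a DataFrame in, a DataFrame out.
    return companies.dropna()


def create_pipeline(**kwargs):
    return Pipeline(
        [
            # Execution order is inferred from how inputs and outputs
            # are named in the data catalog, not from list order.
            node(clean_companies, inputs="companies", outputs="clean_companies"),
        ]
    )
```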
The CLI for Kedro seemed powerful, but I had a few issues with it, and it seemed a little unconventional. You can boot up IPython and Jupyter sessions that are “Kedro aware”, but I seemed to need to be at the same directory level as the CLI script just to use it. A cool feature I saw is that it has a way of exporting your Jupyter Notebook work to Python files. I’m not sure if, or how, this is different from Papermill.
In terms of testing, the nodes are simply functions, so that seemed relatively straightforward. But beyond regular testing of functions, I didn’t see anything more specific in the documentation about testing pipelines, other than running them. At the end, I was able to build my package by running kedro package, which created wheels and eggs for the project.
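Because nodes wrap plain functions, a unit test can target the function directly with pytest (re-using the hypothetical clean_companies node from the sketch above; the module path is also hypothetical):

```python
import pandas as pd

# Hypothetical module path; Kedro projects keep node functions in a nodes.py module.
from my_project.pipelines.data_engineering.nodes import clean_companies


def test_clean_companies_drops_missing_rows():
    companies = pd.DataFrame({"id": [1, 2, None]})
    cleaned = clean_companies(companies)
    assert cleaned["id"].isna().sum() == 0
```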
Concluding Thoughts
I think Kedro is a good tool if you’re looking for an opinionated data project framework. And, like I mentioned, it’s certainly a step up from many data scientists’ default workflow. It felt a little on the verbose side though, and the learning curve for using it effectively felt steep.
Prefect (data workflow)
https://docs.prefect.io/core/tutorial/01-etl-before-prefect.html
Origin and First Impressions
Prefect was started a few years ago by Jeremiah Lowin, a core contributor to Airflow. The sense I get is that it’s meant to be an Airflow 2.0, addressing some of Airflow’s shortcomings, and Prefect has written extensively comparing their tool to Airflow. Their business model looks to be a freemium one, and they’ve open sourced most, if not all, of the core parts of their product ecosystem. Getting through the tutorial was relatively straightforward, and I got the sense that I was just scratching the surface of Prefect, all the while feeling like I was writing regular old Python with a few decorators and context managers.
Tasks / Workflow
Prefect has a decorator for functions (@task) and Task classes you can inherit from for your units of compute. It seemed dead simple to get started and incorporate into existing code. There is check-pointing of state after each task runs, which encourages small tasks, and data can flow between tasks; perhaps you could add runtime data QA checks there with a tool like Great Expectations. There’s also caching support around tasks, which makes it easy to avoid recomputing tasks unnecessarily. Tasks can optionally receive inputs and parameters and produce outputs. You’re able to specify LOTS of logic in DAG generation / flow via signals, triggers, etc., and they have lots of specific kinds of tasks ready to go out of the box. Tasks get put into pipelines (called “Flows”), which seemed simple enough to use.
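A minimal flow, roughly in the shape the tutorial uses (the task bodies are placeholders of my own):

```python
from prefect import task, Flow, Parameter


@task
def extract():
    return [1, 2, 3]


@task
def transform(data, multiplier):
    return [x * multiplier for x in data]


@task
def load(data):
    print(f"Loaded {data}")


# Building the flow only records the DAG; nothing runs until flow.run().
with Flow("etl") as flow:
    multiplier = Parameter("multiplier", default=10)
    load(transform(extract(), multiplier))


if __name__ == "__main__":
    flow.run()
```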
Unit testing seems reasonable to implement, at least from the examples. The docstrings didn’t seem super helpful, but the tool does seem to be type-annotation aware and friendly. The metadata around tasks and the workflow seems fairly rich as well. Their CLI seemed reasonable, and it looked like it connected with their cloud services.
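For unit tests, my understanding is that a decorated task can be exercised directly through its .run() method, without building a flow (re-using the transform task from the sketch above; the module name is hypothetical):

```python
from etl_flow import transform  # hypothetical module holding the earlier sketch


def test_transform_multiplies_each_element():
    # Task.run() invokes the wrapped function directly, outside of any Flow.
    assert transform.run([1, 2, 3], 2) == [2, 4, 6]
```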
Installing the UI was fairly straightforward. It pulled lots of dependencies into local Docker containers and seemed to start up without error. I had trouble running the tutorial example flow in the UI, though; at first it looked to me like this was a premium-only feature, but I see that they have open sourced their UI recently too. Update: I needed to first run a command that wasn’t clear to me in the tutorial (prefect backend server). Once I did that, I was able to see the runs I had completed locally. The UI felt to me like a flashier, upgraded version of the Airflow UI.
Two nice features that I was able to get working without difficulty were using a different executor back-end and the scheduler. Prefect uses Dask as an executor of tasks (easily parallelizable) by default, but this system looks to be pluggable with a relatively straightforward API (as far as I know, their system as a whole is fairly pluggable). And their scheduler provides many different configurations out of the box.
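Swapping the executor and attaching a schedule looked roughly like this in the version I used (import paths have moved between releases, so treat this as a sketch):

```python
from datetime import timedelta

from prefect.engine.executors import DaskExecutor
from prefect.schedules import IntervalSchedule

from etl_flow import flow  # the Flow built in the earlier sketch (hypothetical module)

# Attach a schedule so the flow re-runs every ten minutes.
flow.schedule = IntervalSchedule(interval=timedelta(minutes=10))

# Run the flow's tasks on a local Dask cluster instead of the default executor.
flow.run(executor=DaskExecutor())
```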
Concluding Thoughts
Prefect seems very promising: packed with features and well thought out, with a simple, beautiful and powerful abstraction layer. It really did feel to me like a rewritten Airflow version 2.0. They want you to be able to specify any kind of workflow without difficulty, no matter how complex it gets. The only thing I was a little confused about is their business model. Update (04/25/2020): I heard back from Jeremiah that they recently evolved it to a more traditional open-core model. In his words:
“We offer an open-source workflow system (the UI being the last piece, released three weeks ago), and we also offer a managed version of our orchestration platform (server + UI), which includes some more advanced features interesting to large companies (like global concurrency limits, roles and permissions, etc.). The system is designed so that no matter whether you run it locally or through Prefect Cloud, your code and data are always executed on your own infrastructure — that’s why we work so hard to make the back-end switch a single CLI command. By doing this, we made it easy for companies in regulated industries, especially financial services and healthcare, to adopt Prefect by running it locally to PoC and then immediately transition to our managed product.”
Finally, Dagster (data workflow)
https://docs.dagster.io/docs/tutorial
Origin and First Impressions
I discovered Dagster recently while listening to the Data Engineering Podcast (Kedro and Prefect for that matter, too; great podcast, by the way, and a shout out to Tobias Macey and company for all you’ve done for the Python community over the years!). Dagster was created by Nick Schrock, an ex-Facebook engineer best known for co-creating GraphQL, and open sourced in mid-2019. Dagster is supposed to be to data applications what React was to front-end web applications. That’s a high bar to set for a project; let’s see how it fares.
The tutorial takes a simple example of reading a CSV and doing a simple data manipulation on it, and incrementally introduces the abstractions Dagster provides. I don’t think it would be hard at all to start incorporating this into an existing library. The docstrings were well-formed and informative, which is helpful while learning a new library. Getting through to the end of the tutorial was the most conceptually demanding of the tools here (the config aspect especially), but I think it’s well worth it. I liked that I was able to do all of the tutorial on my local machine and get a sense of its power. If you really want to get into the nitty-gritty, it may be worth cloning the Dagster repo and looking at their examples folder; it’s slightly easier to follow along that way. The features that stuck out to me most are:
- Its own optional and gradual type system (still allowing for mypy compliance)
- Parameterization system, through config
- Testing as a first class concept — unit, integration and real stubs for automatic data quality checks
- Pluggable runtime compute system with lots of integrations already in place
- Built-in, cron-based scheduler (though I had trouble getting it to run locally)
- Rich and powerful UI (called “Dagit”)
Tasks / Workflow
There were a few class-based aspects of their system in the tutorial, but for the most part, Dagster implements tasks (called “solids”) and pipelines (called, simply, “pipelines”) through richly configured decorators, with an emphasis on functional programming. Solids are meant to be configurable, re-usable and composable, and Dagster accordingly distinguishes between defining a solid and invoking it. Downstream solids can depend on any number of upstream solids, which means you can write pipelines where some solids execute conditionally based on the presence of an upstream output. You can fairly easily persist arbitrary artifacts from solids and the outputs passed between them (called “materialization”), and in turn use these instead of re-computing parts of pipelines if need be.
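A stripped-down example of the define-then-compose pattern, as I understood it from the tutorial (the names are my own):

```python
from dagster import execute_pipeline, pipeline, solid


@solid
def load_cereals(context):
    # The tutorial reads a CSV here; a hard-coded list keeps the sketch self-contained.
    cereals = [{"name": "Corn Flakes", "calories": 100}]
    context.log.info(f"Loaded {len(cereals)} cereals")
    return cereals


@solid
def count_cereals(context, cereals):
    return len(cereals)


@pipeline
def cereal_pipeline():
    # Invoking solids here only builds the dependency graph; nothing executes yet.
    count_cereals(load_cereals())


if __name__ == "__main__":
    execute_pipeline(cereal_pipeline)
```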
Pipelines can be organized into what are called “repositories”, so that you can run and monitor any number of different pipelines in the UI (Dagit). You can pass widely used resources (access to file systems, databases, cloud services) into pipelines, which allows those resources to be swapped out at runtime very easily. And if the resources share the same API, you can define different modes for your pipeline to run in, allowing business logic to be tested independently of the environment it runs in. Resources are accessed in solids through a context object that lives throughout the life-cycle of a run.
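A hedged sketch of how resources and modes fit together (the resource here is a stand-in, not a real database client):

```python
from dagster import ModeDefinition, pipeline, resource, solid


@resource
def sqlite_db(init_context):
    # Stand-in for a real connection object; a production mode would supply a real client.
    return {"engine": "sqlite"}


@solid(required_resource_keys={"db"})
def write_results(context):
    # The resource is reached through the run's context object.
    context.log.info(f"Writing with {context.resources.db['engine']}")


@pipeline(mode_defs=[ModeDefinition("local", resource_defs={"db": sqlite_db})])
def results_pipeline():
    write_results()
```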
The custom type system and automatic data QA checks (called “expectations”) were the most interesting aspects of Dagster to me, as this is a common problem in data pipelines that I haven’t seen addressed very much. Along with this, testing (unit, integration, QA) was addressed in several different ways and sections of the tutorial. It’s not straightforward to answer the question: what does the state of the world need to look like before and after this arbitrary computation? This aspect of Dagster is crucial in helping to answer it.
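A rough sketch of how I understood those two pieces fitting together (the type check and the expectation are invented for illustration):

```python
from dagster import (
    DagsterType,
    ExpectationResult,
    Output,
    OutputDefinition,
    solid,
)

# A custom type: its check function runs whenever a value of this type
# crosses a solid boundary.
NonEmptyList = DagsterType(
    name="NonEmptyList",
    type_check_fn=lambda _context, value: isinstance(value, list) and len(value) > 0,
)


@solid(output_defs=[OutputDefinition(NonEmptyList)])
def load_records(context):
    records = [{"calories": 100}, {"calories": 70}]

    # Expectations are data-quality assertions surfaced in Dagit alongside the run.
    yield ExpectationResult(
        success=all(r["calories"] > 0 for r in records),
        label="calories_positive",
    )
    yield Output(records)
```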
At this point, it doesn’t seem that dynamic creation and execution of DAGs is natively supported in Dagster, but I think some of these needs can be addressed by the execution engine that is chosen; I saw documentation for in-process, multi-process, Celery and Dask engines. There’s lots of potential config to be written (in Python dicts, YAML files and/or the UI config editor), and when these configs are unlikely to change much, they can be preset (dynamically or statically). Dagster also comes with many other integrations out of the box, including Airflow, Spark, Datadog, Jupyter / Pandas, Snowflake, AWS, and GCP. I did not explore or test these integrations, as they were not part of the tutorial.
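Config is supplied per run, either as YAML in Dagit’s playground or as a dict in code. A hedged sketch of the in-code form (the solid and its config field are hypothetical, and the keyword names config_schema / run_config come from newer releases; older ones used config / environment_dict):

```python
from dagster import execute_pipeline, pipeline, solid


@solid(config_schema={"csv_path": str})
def read_csv(context):
    # Config values arrive through context.solid_config at run time.
    context.log.info(f"Would read {context.solid_config['csv_path']}")


@pipeline
def ingest_pipeline():
    read_csv()


if __name__ == "__main__":
    execute_pipeline(
        ingest_pipeline,
        run_config={"solids": {"read_csv": {"config": {"csv_path": "cereal.csv"}}}},
    )
```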
Concluding Thoughts
One word came to mind as I completed this tutorial: abstraction. And in order to truly be a tool that can unify and meaningfully glue together the mess that is data pipelining today, it would have to be! It’s the kind of system you can adopt gradually, and the more you do, the more benefits you reap. A system that fully takes advantage of Dagster could, I think, mostly be considered config (which is a good thing). It makes a sharp distinction between business logic and compute / environment, and provides the most comprehensive tooling I have seen for separating those concerns.
Have you used any of these frameworks? I’d love to hear about your experience with them. Feel free to leave comments and questions, and let me know if there are other frameworks you’d like to see my initial thoughts on!