Thursday, June 24, 2021

Apache Airflow (Google Cloud Composer) and Apache Beam (Google Cloud Dataflow)

Airflow is a task management system. The nodes of its directed acyclic graph (DAG) are tasks, and Airflow makes sure to run them in the proper order, starting a task only once its dependency tasks have finished. Dependent tasks never run at the same time, only one after another; independent tasks can run concurrently.

Airflow can run pretty much anything. It ships with a BashOperator and a PythonOperator, which means a task can run any bash script or any Python code.
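For example, a minimal sketch of a DAG (assuming Airflow 2.x import paths; the DAG id, task ids, and the greet function are made-up names for illustration):

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def greet():
    # Any Python code can run here.
    print("hello from a PythonOperator task")


with DAG(
    dag_id="example_bash_and_python",  # hypothetical DAG name
    start_date=datetime(2021, 6, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Runs an arbitrary shell command.
    extract = BashOperator(task_id="extract", bash_command="echo extracting data")

    # Runs an arbitrary Python callable.
    transform = PythonOperator(task_id="transform", python_callable=greet)

    # Airflow starts transform only after extract has finished.
    extract >> transform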

Airflow is a way to organize (set up complicated data pipeline DAGs), schedule, monitor, and trigger re-runs of data pipelines, all in an easy-to-view and easy-to-use UI.

Also, it is easy to set up, and everything is written in familiar Python code.

Airflow manages tasks, which depend on one another. While this dependency can consist of one task passing data to the next one, that is not a requirement. In fact, Airflow doesn't even care what the tasks do; it just needs to start them and see whether they finished or failed. If tasks need to pass data to one another, you need to coordinate that yourself, telling each task where to read and write its data, e.g. a local file path or a web service somewhere. Tasks can consist of Python code, but they can also be any external program or a web service call.
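As a rough sketch of that hand-off (the shared path and task names below are made up), the first task writes to an agreed-upon location and the second reads from it, while Airflow only guarantees the ordering:

import json
import os
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Both tasks agree on this location out of band; Airflow itself never sees the data.
# In a multi-worker environment this should be a shared location such as a GCS path.
SHARED_PATH = "/tmp/example_pipeline/records.json"


def produce():
    os.makedirs(os.path.dirname(SHARED_PATH), exist_ok=True)
    with open(SHARED_PATH, "w") as f:
        json.dump([{"id": 1}, {"id": 2}], f)


def consume():
    with open(SHARED_PATH) as f:
        records = json.load(f)
    print(f"read {len(records)} records")


with DAG(
    dag_id="example_handoff",  # hypothetical DAG name
    start_date=datetime(2021, 6, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    write_task = PythonOperator(task_id="produce", python_callable=produce)
    read_task = PythonOperator(task_id="consume", python_callable=consume)

    # Airflow enforces only the order; moving the data is the tasks' own responsibility.
    write_task >> read_task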

When using Google Cloud (Cloud Composer), we may need a folder to save temporary data between steps, or to store a custom Python script so the DAG can call it.

Here are the available paths. Cloud Composer maps folders in the environment's Cloud Storage bucket to local paths on the Airflow workers:

- gs://<environment-bucket>/dags maps to /home/airflow/gcs/dags
- gs://<environment-bucket>/data maps to /home/airflow/gcs/data
- gs://<environment-bucket>/plugins maps to /home/airflow/gcs/plugins
- gs://<environment-bucket>/logs maps to /home/airflow/gcs/logs

The data folder is the usual place for temporary files shared between tasks, and the dags folder is where DAG files and any custom Python modules they import are uploaded.
Here is a sample DAG file:

https://github.com/GoogleCloudPlatform/python-docs-samples/blob/c5635d1146fc2c0ff284c41d4b2d1132b25ae270/composer/workflows/simple.py


Apache Beam is a wrapper for the many data processing frameworks out there (Spark, Flink, etc.).

The intent is that you only have to learn Beam and can then run your pipelines on multiple backends (Beam runners).

If you are familiar with Keras and TensorFlow/Theano/Torch, the relationship between Keras and its backends is similar to the relationship between Beam and its data processing backends.

Beam is a dataflow engine. The nodes of the DAG form a (possibly branching) pipeline. All the nodes in the DAG are active at the same time, passing data elements from one to the next, with each node doing some processing on the elements as they flow through.

In Beam, your step definitions are tightly integrated with the engine. You define the steps in a supported programming language and they run inside a Beam process. Handling the computation in an external process would be difficult. Your steps only need to worry about the computation they're performing, not about storing or transferring the data. Transferring the data between different steps is handled entirely by the framework.
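A minimal sketch with the Beam Python SDK (the input path and parsing logic are made up for illustration): each step is an ordinary Python function wired into the pipeline, and Beam handles moving the elements between steps:

import apache_beam as beam


def parse_line(line):
    # Each step only worries about its own computation.
    word, count = line.split(",")
    return word, int(count)


with beam.Pipeline() as pipeline:  # uses the local DirectRunner by default
    counts = (
        pipeline
        | "Read" >> beam.io.ReadFromText("gs://my-bucket/input.csv")  # hypothetical path
        | "Parse" >> beam.Map(parse_line)
        | "SumPerKey" >> beam.CombinePerKey(sum)
    )

    # The pipeline can branch: the same PCollection feeds two independent steps.
    counts | "WriteText" >> beam.io.WriteToText("gs://my-bucket/output/counts")
    counts | "Log" >> beam.Map(print)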

Beam is actually an abstraction layer. Beam pipelines can run on Apache Spark, Apache Flink, Google Cloud Dataflow and others. All of these support a more or less similar programming model. 
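Switching backends is mostly a matter of pipeline options. A sketch under that assumption (the project, region, and bucket values are placeholders):

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Local development: the DirectRunner executes the pipeline in-process.
local_options = PipelineOptions(runner="DirectRunner")

# The same pipeline submitted to Google Cloud Dataflow instead
# (placeholder project, region, and bucket values).
dataflow_options = PipelineOptions(
    runner="DataflowRunner",
    project="my-gcp-project",
    region="us-central1",
    temp_location="gs://my-bucket/tmp",
)

with beam.Pipeline(options=local_options) as pipeline:
    (
        pipeline
        | beam.Create(["a", "b", "a"])
        | beam.combiners.Count.PerElement()
        | beam.Map(print)
    )

Only the options change; the pipeline code itself stays the same across runners.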
