What is Airflow?

Airflow is an open-source platform for developing, scheduling, and monitoring batch-oriented workflows.

The philosophy is "Workflows as Code", which serves several purposes:

  • Dynamic: pipelines are defined in Python code, allowing dynamic pipeline generation.
  • Extensible: connects with numerous technologies (other packages, databases, and services).
  • Flexible: workflow parameterization leverages the Jinja templating engine.
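To illustrate the flexibility point, here is a minimal sketch of how a Jinja-templated command resolves, using the jinja2 library directly (assuming it is installed). In Airflow itself, templated fields such as bash_command are rendered against built-in context variables like {{ ds }}, the run's logical date; the context value below is illustrative.

```python
from jinja2 import Template

# Airflow renders templated operator fields against a runtime context.
# This standalone sketch mimics that step with jinja2 directly.
command = Template("echo run date is {{ ds }}")
rendered = command.render(ds="2022-01-01")
print(rendered)  # echo run date is 2022-01-01
```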

Here's a simple example:

from datetime import datetime

from airflow import DAG
from airflow.decorators import task
from airflow.operators.bash import BashOperator

# A DAG represents a workflow, a collection of tasks
with DAG(dag_id="demo", start_date=datetime(2022, 1, 1), schedule="0 0 * * *") as dag:
    # Tasks are represented as operators
    marco = BashOperator(task_id="marco", bash_command="echo marco")

    @task()
    def polo():
        print("polo")

    # Set dependencies between tasks
    marco >> polo()

There are three main concepts to understand.

  • DAGs: describe the work (tasks) and the order in which to carry out the workflow
  • Operators: classes that act as templates for carrying out some work
  • Tasks: the units of work (nodes) in a DAG; each task is an instance of an operator

Here we have a DAG (Directed Acyclic Graph) named "demo", scheduled to run daily at midnight (the cron expression "0 0 * * *"), starting on January 1st, 2022.

We have two tasks defined:

  • a BashOperator running a bash command
  • a Python function defined with the @task decorator

>> is the bitshift operator, which Airflow overloads to define the dependency and ordering of tasks.

Here it means marco runs first, then polo().
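This works because Python lets a class overload the >> operator via the __rshift__ method. A toy sketch of the idea (not Airflow's actual implementation, and the class below is illustrative):

```python
class Task:
    """A toy stand-in for an Airflow task, showing how >> can set dependencies."""

    def __init__(self, task_id):
        self.task_id = task_id
        self.downstream = []

    def __rshift__(self, other):
        # left >> right: record that 'other' runs after 'self'
        self.downstream.append(other)
        return other  # returning 'other' allows chaining: a >> b >> c

marco = Task("marco")
polo = Task("polo")
marco >> polo
print([t.task_id for t in marco.downstream])  # ['polo']
```

Because __rshift__ returns its right-hand operand, longer chains like a >> b >> c read left to right as the execution order.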

Why Airflow?

  • Coding > clicking
  • version control: you can roll back to previous versions of a workflow
  • workflows can be developed by multiple people simultaneously
  • tests can be written to validate functionality
  • components are extensible
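On the testing point, one common pattern (sketched here in plain Python; the function names are illustrative) is to keep a task's logic in an ordinary function so it can be unit-tested without running a scheduler; the @task decorator in the DAG file would simply wrap it:

```python
# Keep the task's logic in a plain function so it is testable on its own.
def polo_message():
    return "polo"

# A simple unit test for the task logic, runnable with any test runner.
def test_polo_message():
    assert polo_message() == "polo"

test_polo_message()
print("ok")
```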

Why not Airflow?

  • not suited to infinitely-running, event-based workflows (that is streaming, the domain of tools like Kafka)
  • if you prefer clicking to coding