Apache Airflow is an open-source workflow scheduling platform widely used in data engineering. Here is everything you need to know about this data engineering tool: how it works, its use cases, its main components…
The story of Apache Airflow begins in 2015, in the offices of Airbnb. At that time, the vacation rental platform founded in 2008 was experiencing meteoric growth and was overwhelmed by an increasingly massive volume of data.
The Californian company was hiring Data Scientists, Data Analysts, and Data Engineers in droves, and they had to automate numerous processes by writing scheduled batch jobs. To help them, data engineer Maxime Beauchemin created an open-source tool called Airflow.
This scheduling tool aims to let teams create, monitor, and iterate on batch data pipelines. Within a few years, Airflow became a standard in the data engineering field.
In April 2016, the project joined the official Apache Foundation incubator. It continued to develop and was granted “top-level” project status in January 2019. Almost two years later, in December 2020, Airflow had more than 1,400 contributors, 11,230 contributions, and 19,800 stars on GitHub.
Airflow 2.0 has been available since December 17, 2020, bringing new features and many improvements. The tool is used by thousands of Data Engineers around the world.
What is Apache Airflow?
The Apache Airflow platform allows you to create, schedule and monitor workflows through computer programming. It is a completely open-source solution, very useful for architecting and orchestrating complex data pipelines and task launches.
It has several advantages. First of all, it is a dynamic platform, since anything that can be done with Python code can be done on Airflow.
It is also extensible, thanks to many plugins allowing interaction with most common external systems. It is also possible to create new plugins to meet specific needs.
In addition, Airflow provides elasticity: data engineering teams can use it to run thousands of different tasks every day.
Workflows are architected and expressed as Directed Acyclic Graphs (DAGs), where each node represents a specific task. Airflow is designed as a “code-first” platform, allowing teams to iterate very quickly on their workflows. This philosophy offers a higher degree of scalability than other pipeline tools.
What is Airflow used for?
Airflow can be used for any batch data pipeline, so its use cases are as numerous as they are diverse. Due to its scalability, this platform particularly excels at orchestrating tasks with complex dependencies on multiple external systems.
By writing pipelines in code and using the various plugins available, it is possible to integrate Airflow with any dependent systems from a unified platform for orchestration and monitoring.
As an example, Airflow can be used to aggregate daily sales team updates from Salesforce to send a daily report to company executives.
In addition, the platform can be used to organize and launch Machine Learning tasks running on external Spark clusters. It can also load website or application data to a data warehouse once an hour.
What are the different components of Airflow?
The Airflow architecture is based on several components. Here are the main ones.
In Airflow, pipelines are represented as DAGs (Directed Acyclic Graphs) defined in Python.
A graph is a structure composed of objects (nodes) in which certain pairs of objects are connected by edges. These graphs are “directed”, meaning that the edges are oriented and therefore represent one-way links.
They are also “acyclic”, because the graphs contain no cycles: a node B downstream of node A cannot also be upstream of node A. This ensures that pipelines cannot fall into infinite loops.
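To see why acyclicity guarantees that a pipeline can actually be executed, the property can be illustrated in plain Python (no Airflow required): Kahn's topological-sort algorithm finds a valid execution order exactly when the graph has no cycle.

```python
# Plain-Python illustration of the DAG property: Kahn's algorithm
# returns a valid execution order, or None if the graph has a cycle.
from collections import deque

def execution_order(edges):
    """Return a topological order of the nodes, or None if a cycle exists."""
    nodes = {n for edge in edges for n in edge}
    indegree = {n: 0 for n in nodes}
    downstream = {n: [] for n in nodes}
    for a, b in edges:              # edge a -> b: a must run before b
        downstream[a].append(b)
        indegree[b] += 1

    # Start with the tasks that depend on nothing
    ready = deque(n for n in nodes if indegree[n] == 0)
    order = []
    while ready:
        n = ready.popleft()
        order.append(n)
        for m in downstream[n]:     # running n unblocks its downstream tasks
            indegree[m] -= 1
            if indegree[m] == 0:
                ready.append(m)
    return order if len(order) == len(nodes) else None

print(execution_order([("A", "B"), ("B", "C")]))  # ['A', 'B', 'C']
print(execution_order([("A", "B"), ("B", "A")]))  # None: B is both upstream and downstream of A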
Each node in a DAG represents a task, so a DAG is a representation of the sequence of tasks to be performed: a pipeline. The jobs represented are defined by operators.
Operators are the building blocks of the Airflow platform. They determine the work to be done: each operator defines an individual task (a node of the DAG) and how that task will be executed.
The DAG ensures that the operators are scheduled and executed in a specific order, while the operators define the jobs to be executed at each step of the process.
There are three main categories of operators. First, action operators perform a function; examples include the PythonOperator and the BashOperator.
Transfer operators allow the transfer of data from a source to a destination, like the S3ToRedshiftOperator.
Finally, sensors wait for a condition to be met. For example, the FileSensor operator can be used to wait for a file to appear in a given folder before the pipeline continues.
Each operator is defined individually. However, operators can communicate information to each other using XComs.
On Airflow, Hooks allow interfacing with third-party systems. They provide connections to external APIs and databases such as Hive, S3, GCS, MySQL, and Postgres…
Confidential information, such as login credentials, is kept out of the Hooks themselves. It is stored in an encrypted metadata database associated with the current Airflow instance.
Airflow plugins can be described as a combination of Hooks and Operators. They are used to accomplish specific tasks involving an external application.
An example would be transferring data from Salesforce to Redshift. There is an extensive open-source collection of plugins created by the user community, and each user can create plugins to meet their specific needs.
Connections allow Airflow to store the information it needs to connect to external systems, such as API credentials or tokens.
They are managed directly from the platform’s user interface. The data is encrypted and stored as metadata in a Postgres or MySQL database.