Apache Airflow has become the de facto standard in the workflow orchestration market, and companies like it for many reasons. It has a nice UI for visualising task dependencies, parallel execution, task-level retries, isolated logging and extensibility; thanks to the open source community it already ships with multiple operators, and on top of that companies can define their own operators as well. Are you ready to use it? Alright, only one thing remains: deployment. If our stack is already on Google Cloud, we can choose Cloud Composer, which is certainly an easy start. On AWS there is no Airflow as a service, so we have to deploy it ourselves, which requires a bit more expertise. The two available cluster types on AWS are AWS ECS and Kubernetes (EKS). This post shows you a secure deployment concept on AWS ECS provided by Infinite Lambda.
IaC (Infrastructure as Code) is one of the main principles we have to follow; in the case of AWS it usually means CloudFormation. The whole stack is split into 4 main layers. This modularisation makes development and maintenance of the stack easier, because the layers are logically separated. For example, if we want to change something on the top layer, we can do so without deleting or modifying any of the layers below.
As you can see, at the top layer the Airflow service can have multiple instances, so we can deploy multiple environments, for example. Deploying separate (DEV/TEST/UAT) environments is recommended, so that a fully managed CI/CD pipeline for the DAG jobs can be defined on top of this architecture as well.
1.) Network Layer
This layer is the fundamental part of the deployment, not only because it is the bottommost layer, but also because this is where we lay down the security concept of the whole installation. The main component is the VPC where Airflow will live in isolation; it wraps around 2 private and 2 public subnets spread across 2 availability zones. The host instances running the Docker images are deployed only in the private subnets, so we avoid direct access from the outside; to keep the services from being fully isolated and give them internet access, we use NAT Gateways in the public subnets. Moreover, we use one bastion host in a public subnet to access the services provided by the hosts from the company network. The bastion host is not part of this layer; we will provision it in another layer.
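The shape of this network layer can be sketched as the following CloudFormation fragment (a minimal, illustrative sketch: resource names, CIDR ranges and the single AZ shown are placeholders, not the actual template, which repeats the pattern for the second availability zone):

```yaml
# Illustrative fragment: VPC with one public/private subnet pair in one AZ.
Resources:
  AirflowVPC:
    Type: AWS::EC2::VPC
    Properties:
      CidrBlock: 10.0.0.0/16
      EnableDnsSupport: true
      EnableDnsHostnames: true

  PublicSubnetA:
    Type: AWS::EC2::Subnet
    Properties:
      VpcId: !Ref AirflowVPC
      CidrBlock: 10.0.0.0/24
      AvailabilityZone: !Select [0, !GetAZs '']
      MapPublicIpOnLaunch: true

  PrivateSubnetA:
    Type: AWS::EC2::Subnet
    Properties:
      VpcId: !Ref AirflowVPC
      CidrBlock: 10.0.10.0/24
      AvailabilityZone: !Select [0, !GetAZs '']

  # The NAT Gateway sits in the public subnet and gives the
  # private subnets outbound internet access without inbound exposure.
  NatEIP:
    Type: AWS::EC2::EIP
    Properties:
      Domain: vpc

  NatGatewayA:
    Type: AWS::EC2::NatGateway
    Properties:
      AllocationId: !GetAtt NatEIP.AllocationId
      SubnetId: !Ref PublicSubnetA
```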
2.) Storage Layer (Redis + RDS)
Once we have the network layer, we can deploy the storage layer. The storage layer can vary from installation to installation, but the most common recommendation is Postgres/MySQL for the Airflow metadata backend and Redis for the Celery backend. These two components can be defined in one CloudFormation template, or, to be more granular, we can create one for each. To avoid plain-text passwords in the templates, CloudFormation has built-in support for AWS Secrets Manager: we just store the passwords in Secrets Manager and reference the corresponding path in the template like this:
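A minimal sketch of such a dynamic reference (the secret name `airflow/db`, the instance class and the storage size are illustrative placeholders):

```yaml
# Illustrative fragment: the DB password is resolved from Secrets Manager
# at deploy time and never appears in the template.
AirflowMetadataDB:
  Type: AWS::RDS::DBInstance
  Properties:
    Engine: postgres
    DBInstanceClass: db.t3.medium
    AllocatedStorage: '20'
    MultiAZ: true        # standby replica in a second availability zone
    MasterUsername: airflow
    MasterUserPassword: '{{resolve:secretsmanager:airflow/db:SecretString:password}}'
```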
The databases are configured to be deployed with MultiAZ, which provides a higher level of availability.
3.) Cluster Layer
The cluster layer handles the basic configuration and the components we need in order to run services on top of it. First of all, this layer contains the bastion host, which we can later use as a jump host to access port 8080 (the default Airflow UI) and 5555 (the default Flower UI for Celery). The cluster object itself is easy to create, unlike the launch configuration of the EC2 instances or their autoscaling. Autoscaling is one of the trickiest parts of this whole stack and, in my opinion, its weakest point as well. It is worth mentioning that there are two kinds of scaling rules: one for the instances themselves, and one at service level. Both are controlled by AWS CloudWatch alarms, regardless of the direction of the scaling. In this stack we define only the instance-level scaling policy; the service-level scaling policy will be defined one level above, at the Airflow service level. The alarms constantly monitor CPU and memory metrics, and in reaction to changes in the system's performance the number of EC2 hosts increases or decreases accordingly.
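The instance-level scaling described above can be sketched as a scaling policy wired to a CloudWatch alarm (a hedged sketch: the referenced resource names, the metric choice and the thresholds are illustrative assumptions):

```yaml
# Illustrative fragment: scale out the EC2 host fleet when cluster CPU
# reservation stays high. Names and thresholds are placeholders.
ScaleOutPolicy:
  Type: AWS::AutoScaling::ScalingPolicy
  Properties:
    AutoScalingGroupName: !Ref ECSHostAutoScalingGroup
    AdjustmentType: ChangeInCapacity
    ScalingAdjustment: 1
    Cooldown: '300'

CPUHighAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    Namespace: AWS/ECS
    MetricName: CPUReservation
    Dimensions:
      - Name: ClusterName
        Value: !Ref ECSCluster
    Statistic: Average
    Period: 60
    EvaluationPeriods: 3
    Threshold: 75
    ComparisonOperator: GreaterThanThreshold
    AlarmActions:
      - !Ref ScaleOutPolicy
```

A mirrored alarm and policy with a lower threshold handle scaling back in; as noted above, getting these thresholds right for your load profile is the hard part.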
There is one more really important player in this game: AWS EFS, the serverless filesystem service of AWS. We use it as the common base of the system, where the DAG definition files are copied, so every container, regardless of which service runs on it, sees the volume as if it were its own. Any change we make in this folder (a new DAG or a DAG modification) is immediately visible to all the workers.
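One way to wire this up is to mount the EFS filesystem on every ECS host at boot, so the containers can later bind-mount the shared DAG folder. A minimal sketch, assuming the `amazon-efs-utils` helper on an ECS-optimised AMI (the parameter names and the mount path are illustrative):

```yaml
# Illustrative fragment: every host mounts the shared DAG filesystem at boot.
DagsFileSystem:
  Type: AWS::EFS::FileSystem
  Properties:
    PerformanceMode: generalPurpose

ECSHostLaunchConfiguration:
  Type: AWS::AutoScaling::LaunchConfiguration
  Properties:
    ImageId: !Ref ECSOptimizedAMI      # placeholder parameter
    InstanceType: t3.medium
    UserData:
      Fn::Base64: !Sub |
        #!/bin/bash
        yum install -y amazon-efs-utils
        mkdir -p /mnt/efs
        mount -t efs ${DagsFileSystem}:/ /mnt/efs
        # register the host with the ECS cluster
        echo ECS_CLUSTER=${ECSCluster} >> /etc/ecs/ecs.config
```

Mount targets in each private subnet (omitted here) are also required so the hosts can reach the filesystem.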
4.) Service Layer
This layer, being the top layer, deploys Airflow itself. As mentioned in the intro, multiple instances can be deployed, so many different environments can sit on top of the infrastructure we have built so far. This module contains an Application Load Balancer, which serves as the common access point to the cluster. We also have to configure the TargetGroup and Listener components here, which leads us to the most important part: the task definitions themselves. We have to configure four different definitions: Flower (the Celery UI), Worker, Scheduler and Webserver. In each TaskDefinition resource we specify the environment variables, the Docker CPU and memory resources and the logging configuration. Every task uses basically the same Docker image, but for each Airflow service we define a different initial script. We should not forget the service-level scaling policies either: instead of watching instance-level metrics, they check service-level metrics, so for example a high load on a Worker container can trigger the launch of the next Worker container.
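One of the four task definitions, the webserver, can be sketched like this (the image name, resource sizes, paths and log group are illustrative placeholders; the Worker, Scheduler and Flower definitions reuse the same image with a different command):

```yaml
# Illustrative fragment: the webserver task definition, bind-mounting
# the shared DAG folder from the EFS mount on the host.
WebserverTaskDefinition:
  Type: AWS::ECS::TaskDefinition
  Properties:
    Family: airflow-webserver
    ContainerDefinitions:
      - Name: webserver
        Image: mycompany/airflow:latest
        Cpu: 512
        Memory: 1024
        Command: ["webserver"]
        PortMappings:
          - ContainerPort: 8080
        Environment:
          - Name: EXECUTOR
            Value: Celery
        MountPoints:
          - SourceVolume: dags
            ContainerPath: /usr/local/airflow/dags
        LogConfiguration:
          LogDriver: awslogs
          Options:
            awslogs-group: /ecs/airflow
            awslogs-region: !Ref AWS::Region
            awslogs-stream-prefix: webserver
    Volumes:
      - Name: dags
        Host:
          SourcePath: /mnt/efs/dags
```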
Deployment on AWS ECS is not easy; there are a lot of tiny components you really have to take care of. The scaling policies in particular are the most time-consuming part of the whole installation, and sometimes they do not behave as we would expect. The freedom of configuration, normally a really useful thing, actually becomes the opposite in this case: you have to have a really clear idea upfront about your future system's performance and load characteristics.
If you are interested in how Airflow would fit into your data stack, contact us for a POC. You can then compare the results with your existing system and get an overview of how it could help your data engineering development and operation processes. Infinite Lambda helps you optimise your existing pipelines or migrate them to Airflow. We offer initial setup, data engineering development and support at very affordable rates. Reach out via the contact form below.