Airflow in ECS with Redis - Part 1: Overview

How to set up a containerised Airflow installation on AWS ECS, using Redis as the message broker for its task queue.

A bit of background

A few years ago I joined a Data team where we processed a lot of analytics information coming from online search engines. This ETL process consisted of three main stages: fetch raw data from external APIs, transform it into something meaningful for our applications and load that data into a database.

All of this was carried out with in-house DAG scripts orchestrated by Airflow. The problem was that the intake was so slow and unstable that it had become a real drag on our daily operations.

Having previous ECS experience, I thought we could try to improve the process with two main changes:

  • Decouple the Airflow controller from the runners, using ECS.

  • Add a message queue for improved parallelism that could shorten processing times, using Airflow's native Celery executor with Redis as the broker.

What is Airflow?

Airflow is a DAG orchestrator: a platform for managing workflows defined as DAGs (Directed Acyclic Graphs).

You can create, schedule and monitor workflows that execute a set of interdependent tasks in a predefined order. Airflow controls the execution of those tasks and provides insights through a UI where everything can be monitored, among many other features.
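
To make this concrete, here is a minimal sketch of a fetch → transform → load DAG like the one described above, assuming Airflow 2.x; the task names and callables are purely illustrative, not the original pipeline.

```python
# Minimal ETL DAG sketch: three tasks that run in order, each depending
# on the previous one. The functions are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def fetch():
    ...  # pull raw data from an external API


def transform():
    ...  # reshape the raw data into something meaningful


def load():
    ...  # write the transformed data into the database


with DAG(
    dag_id="example_etl",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    fetch_task = PythonOperator(task_id="fetch", python_callable=fetch)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    load_task = PythonOperator(task_id="load", python_callable=load)

    # Declare the dependency chain: fetch, then transform, then load
    fetch_task >> transform_task >> load_task
```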

Have a look at the Airflow documentation for more information.

What is ECS?

AWS ECS is a container orchestrator to deploy, manage and scale your containerised applications. It integrates with a lot of AWS services, which makes it very flexible and easy to use if your infrastructure is already in AWS.

Check out a few ECS use cases.

What is Redis?

Redis is an in-memory data store that can be used as a database, message broker, cache and more.

AWS ElastiCache offers Redis as one of its available engines.

A bit more on Redis.

How does everything glue together?

Remote execution using Redis

Airflow supports many executors. An executor is the mechanism by which tasks get run, either locally or remotely. You can configure Airflow to execute tasks locally - fine for small, single-machine installations - or remotely if you plan to run a considerable number of tasks per workflow and have access to a pool of resources, as with ECS in this case.

One of the remote executors is the Celery executor, which lets you fan out work to many workers and supports different message brokers, such as RabbitMQ and Redis.
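
To give an idea of what this looks like in practice, here is a minimal sketch of the settings that switch Airflow to the Celery executor with a Redis broker, expressed as environment variables and built as a Python dict for an ECS container definition. The endpoints and credentials are placeholders, and the variable names assume Airflow 2.x.

```python
# Sketch of the Airflow settings for Celery + Redis, as environment
# variables. Hostnames, ports and credentials below are placeholders.
airflow_environment = {
    # Use the Celery executor instead of the default local/sequential one
    "AIRFLOW__CORE__EXECUTOR": "CeleryExecutor",
    # Redis (e.g. an ElastiCache endpoint) acts as the Celery message broker
    "AIRFLOW__CELERY__BROKER_URL": "redis://my-redis.example.cache.amazonaws.com:6379/0",
    # Task states/results are stored in the metadata database
    "AIRFLOW__CELERY__RESULT_BACKEND": "db+postgresql://airflow:airflow@my-db.example.rds.amazonaws.com:5432/airflow",
    # Metadata database connection used by the scheduler and webserver
    "AIRFLOW__CORE__SQL_ALCHEMY_CONN": "postgresql+psycopg2://airflow:airflow@my-db.example.rds.amazonaws.com:5432/airflow",
}

# In an ECS task definition this would become the container's "environment"
# list of {"name": ..., "value": ...} pairs.
ecs_environment = [{"name": k, "value": v} for k, v in airflow_environment.items()]
```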

DAGs storage

In this decentralised pattern, every worker needs access to the scripts (DAGs) to run its assigned tasks. One way to do this is to mount an S3 bucket - containing all the scripts - as a local volume in every ECS task, using a driver such as s3fs-fuse or rexray/s3fs.

This S3 access is granted by attaching the required IAM permissions to the ECS tasks, allowing both the scheduler and the workers to interact with the mounted bucket.
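
For illustration, here is a hedged sketch of attaching read-only S3 permissions to the task role with boto3; the role name, policy name and bucket name are made up for the example.

```python
# Attach a read-only policy for the DAGs bucket to the ECS task role so the
# scheduler and workers can read the mounted scripts. Names are placeholders.
import json

import boto3

iam = boto3.client("iam")

dag_bucket_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:ListBucket"],
            "Resource": "arn:aws:s3:::my-airflow-dags",
        },
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": "arn:aws:s3:::my-airflow-dags/*",
        },
    ],
}

iam.put_role_policy(
    RoleName="airflow-ecs-task-role",
    PolicyName="read-dags-bucket",
    PolicyDocument=json.dumps(dag_bucket_policy),
)
```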

Decoupling the Airflow scheduler and the workers

The whole idea of this pattern is to separate the brain from the muscle. Using ECS services, you can create a service for the scheduler (controller) and another for the worker tasks, controlled by the executor.

The former runs with a minimum of one task to keep things going, along with access to its webserver (UI). The latter scales up and down depending on the workload sent by the scheduler.
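
Here is a rough sketch of how those two services could be created with boto3, with basic auto scaling registered for the worker service; the cluster, service and task definition names are placeholders.

```python
# Create two ECS services: one for the scheduler/webserver ("the brain")
# and one for the Celery workers ("the muscle"), then register the worker
# service as an auto scaling target. All names below are placeholders.
import boto3

ecs = boto3.client("ecs")
autoscaling = boto3.client("application-autoscaling")

# Scheduler + webserver: keep exactly one task running
ecs.create_service(
    cluster="airflow",
    serviceName="airflow-scheduler",
    taskDefinition="airflow-scheduler-task",
    desiredCount=1,
)

# Celery workers: start with one task, scale between 1 and 10
ecs.create_service(
    cluster="airflow",
    serviceName="airflow-workers",
    taskDefinition="airflow-worker-task",
    desiredCount=1,
)

autoscaling.register_scalable_target(
    ServiceNamespace="ecs",
    ResourceId="service/airflow/airflow-workers",
    ScalableDimension="ecs:service:DesiredCount",
    MinCapacity=1,
    MaxCapacity=10,
)
```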

Some additional comments

  • The IAM credentials are fetched from AWS Secrets Manager or Parameter Store and injected into each task as environment variables (see the sketch after this list).

  • The S3 mounting driver installation is added to each ECS EC2 instance's user data, so the driver gets installed and initialised and the bucket mounted when the instance boots.
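
As a rough illustration of those two points, here are the relevant fragments of an ECS task definition (as they would be passed to boto3's ecs.register_task_definition): credentials injected from Secrets Manager and the DAGs bucket mounted through a Docker volume driver. Every ARN, image name and path below is a placeholder, not the original setup.

```python
# Fragments of an ECS task definition: a secret exposed as an environment
# variable, and a volume backed by the s3fs Docker driver installed via the
# instance's user data. All identifiers are placeholders.
container_definition = {
    "name": "airflow-worker",
    "image": "my-registry/airflow:latest",
    # Secrets Manager / Parameter Store values exposed as environment variables
    "secrets": [
        {
            "name": "AIRFLOW__CELERY__BROKER_URL",
            "valueFrom": "arn:aws:secretsmanager:eu-west-1:123456789012:secret:airflow/broker-url",
        },
    ],
    # Mount the bucket-backed volume inside the container
    "mountPoints": [
        {"sourceVolume": "dags", "containerPath": "/opt/airflow/dags"},
    ],
}

volumes = [
    {
        "name": "dags",
        # Docker volume created by the s3fs driver on the EC2 instance
        "dockerVolumeConfiguration": {
            "driver": "rexray/s3fs",
            "scope": "shared",
            "autoprovision": True,
        },
    },
]
```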

Improvements

  • Use a git repository as the DAG store and have each runner clone it at startup.

  • Docker added an ECS integration for docker compose, enabled by first configuring a Docker context for ECS.

    • With this integration there is no need to create ECS task definitions; you can work directly with Airflow's docker-compose.yaml file and create all the infrastructure resources in AWS (via the native CloudFormation integration).

      • One of the key differences is that it uses AWS EFS as the filesystem instead of S3, so there is no need to mount buckets as task volumes - and no need for external drivers either.
    • This integration was added in 2020 (way after my experience), so I will also include this new deployment approach in the next posts of this series.

  • Explore Airflow's Kubernetes executor.

What's next?

I will be sharing a working example of the original implementation, as well as an improved version that uses the docker compose ECS integration.


Thank you for stopping by! Do you know other ways to do this? Please let me know in the comments - I always like to learn how to do things differently.
