An Airflow setup that aims to work well with Hadoop and Spark. This is a base image that should be derived further by individual projects as needed. Unless the defaults are sufficient, derivative images would have to set the necessary configurations (see below).
This image is based off the python-spark
image and contains standard Python, Hadoop and Spark installations.
- Python 3.5
- Airflow 1.10 (with PostgreSQL 9.6)
- Spark 1.6 and 2.4.1
- Hadoop 2.7.3
- Sqoop 1.4.6 (with JDBC connectors for PostgreSQL, MySQL and SQL Server)
The following instructions refer to configuring the Compose files of derivative images. It is assumed that the Dockerfile the Compose files are using is based on this image. The repository contains example Compose files for reference.
Password authentication is enabled as a security mechanism for administering Airflow via its admin UI.
Set AIRFLOW_USER
, AIRFLOW_EMAIL
and AIRFLOW_PASSWORD
under webserver
service in the docker compose file before starting the container.
Every time the airflow web server starts, it will create the user if it does not exist.
Place the airflow DAGs in ./dags which will be copied into the image
The default docker user is 'afpuser' and group is 'hadoop'. Subsequent images that bases on this image can change the user by specifying docker env vars for 'USER' and 'GROUP' respectively. For example, the Compose file that is based on this image would look something like this:
scheduler:
environment:
USER: someuser
GROUP: somegroup
command: ["some-scheduler"]
...
Based on the specified docker user and group, see Dockerfile. Therefore, your Hadoop admin should also add the same user and group to your hadoop cluster. Also grant the necessary HDFS directory/file permissions.
To write to HDFS and connect to the YARN ResourceManager, the (client side) configuration files for Hadoop, Yarn and Hive must be added. Unlike previous versions of this image, the Hadoop configuration files has to be included when building this or any derivative images as they will be copied into the image.
In your derivative images, you should copy these configuration files and set the environment variables HADOOP_CONF_DIR
, HIVE_CONF_DIR
and YARN_CONF_DIR
appropriately.
The configuration contained in this directory will be distributed to the YARN cluster so that all containers used by the application use the same configuration.
See also http://spark.apache.org/docs/latest/running-on-yarn.html
The Airflow scheduler and webserver containers can be deployed on an Amazon EMR cluster in the master node. Docker engine and Docker compose has to be installed on the EMR master instance either by preparing a custom AMI or using bootstrap actions. The building of the custom AMI, secrets management and encryption are beyond the scope of this README. Regardless, consider provisioning an EMR cluster quickly using Terraform and the terraform-aws-emr-cluster module
To start the Airflow containers in the master instance,
sudo /usr/local/bin/docker-compose -f docker-compose.emr.yml up -d
Logging to S3 bucket is enabled by the following lines in docker-compose.emr.yml
:
AIRFLOW__CORE__TASK_LOG_READER: s3.task
AIRFLOW__CORE__LOGGING_CONFIG_CLASS: s3_log_config.LOGGING_CONFIG
S3_LOG_FOLDER: s3://fixme/
To disable S3 logging, simply comment out the above three lines of configuration.
Airflow webserver runs at port 8888 as port 8080 is used by Tomcat (Apache Tez).
Tested on Amazon EMR 5.
You might have to docker login
first before you can build any images.
To change the environment variables, edit docker-compose.yml
instead of
Dockerfile
without the need to rebuild the docker image.
To start, use only the docker-compose.yml
file i.e. docker-compose -p afp -f docker-compose.yml up --build -d
To start with Macvlan networking mode, use only the docker-compose.macvlan.yml
file i.e. docker-compose -p afp -f docker-compose.macvlan.yml up --build -d
Make sure that the containers also have name resolution configured so it can communicate with resources on the network. (e.g. extra_hosts
, dns
and dns_search
configurations in Compose)
docker network create -d macvlan --subnet=192.168.150.0/24 --ip-range=192.168.150.48/28 -o parent=p2p1 afpnet
To start distributed Airflow (using Celery), docker-compose -f docker-compose.yml -f docker-compose.celeryexecutor.yml up --scale worker=3 -d
with three Airflow workers.
To bring the containers up for development, use also the docker-compose.override.yml
. This will additionally create a volume at ./dags
and mounted in the container at /airflow/dags
, allowing you to do edit the DAG files directly on your development machine and having them updated with the container immediately.
Since credentials are managed as environment variables, it is recommended that your env file or docker-compose.production.yml
be stored securely. Do not commit them to any source code repositories.
In the given docker-compose.yml
, the environment variables used to store credentials are placeholders only.
To run tests, use docker-compose -f tests/docker-compose.test.yml up --build
To follow docker logs, use docker-compose -p afp -f docker-compose.yml logs --tail=10 -f
- Ensure that container is deployed on your server
- SSH into server
- Access the container's bash shell:
docker exec -ti afp_airflow_1 bash
- Change user to afpuser:
gosu afpuser bash
- Run the airflow backfill:
airflow backfill -s <YYYY-MM-DD> -e <YYYY-MM-DD> <dag_name>
- Run the airflow DAG:
airflow trigger_dag <dag_name>