[Feature] Refactor and add support for schedule conditions in DAG configuration: #320

Open · wants to merge 9 commits into main

Conversation

ErickSeo

Description

This feature introduces an enhancement to DAG scheduling in Airflow, enabling support for dynamic schedules based on dataset conditions. By leveraging dataset filters and logical conditions, users can create more flexible and precise scheduling rules tailored to their workflows.

Key Features:

  • Condition-Based Scheduling: Allows defining schedules using logical conditions between datasets (e.g., ('dataset_1' & 'dataset_2') | 'dataset_3'), enabling workflows to trigger dynamically based on dataset availability.

  • Dynamic Dataset Processing: Introduced the process_file_with_datasets function to evaluate and process dataset URIs from external files, supporting both simple and condition-based schedules.

  • Improved Dataset Evaluation: Developed the evaluate_condition_with_datasets function to transform dataset URIs into valid variable names and evaluate logical conditions securely.
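
A minimal sketch of how `evaluate_condition_with_datasets` might look, assuming it takes the condition string plus the list of dataset URIs; the sanitising regex and the restricted `eval` namespace are illustrative assumptions, not necessarily the exact implementation in this PR:

```python
import re

from airflow.datasets import Dataset


def evaluate_condition_with_datasets(condition: str, uris: list[str]):
    """Resolve a condition such as "((a & b) | c)" into an Airflow dataset
    expression (Airflow 2.9+ supports & / | between Dataset objects)."""
    namespace = {}
    for uri in uris:
        # URIs such as "s3://bucket-cjmm/raw/dataset_custom_1" are not valid
        # Python identifiers, so map each one to a sanitised variable name.
        var_name = re.sub(r"\W|^(?=\d)", "_", uri)
        condition = condition.replace(uri, var_name)
        namespace[var_name] = Dataset(uri)
    # Evaluate against the dataset namespace only, never globals().
    return eval(condition, {"__builtins__": {}}, namespace)
```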

Workflow Example:
Given the following condition:

example_custom_config_condition_dataset_consumer_dag:
  description: "Example DAG consumer custom config condition datasets"
  schedule:
    file: $CONFIG_ROOT_DIR/datasets/example_config_datasets.yml
    datasets: ['dataset_custom_1', 'dataset_custom_2', 'dataset_custom_3']
    conditions: "((dataset_custom_1 & dataset_custom_2) | dataset_custom_3)"
  tasks:
    task_1:
      operator: airflow.operators.bash_operator.BashOperator
      bash_command: "echo 'consumer datasets'"
example_without_custom_config_condition_dataset_consumer_dag:
  description: "Example DAG consumer custom config condition datasets"
  schedule:
    datasets: ['s3://bucket-cjmm/raw/dataset_custom_1', 's3://bucket-cjmm/raw/dataset_custom_2', 's3://bucket-cjmm/raw/dataset_custom_3']
    conditions: "((s3://bucket-cjmm/raw/dataset_custom_1 & s3://bucket-cjmm/raw/dataset_custom_2) | s3://bucket-cjmm/raw/dataset_custom_3)"
  tasks:
    task_1:
      operator: airflow.operators.bash_operator.BashOperator
      bash_command: "echo 'consumer datasets'"

The system evaluates the datasets, ensures the references are valid, and triggers the DAG dynamically when the condition resolves to True.
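
For reference, the second example above should resolve to roughly the same schedule as writing the Airflow 2.9+ dataset condition directly in Python:

```python
from airflow.datasets import Dataset

# Equivalent of "((dataset_custom_1 & dataset_custom_2) | dataset_custom_3)"
schedule = (
    Dataset("s3://bucket-cjmm/raw/dataset_custom_1")
    & Dataset("s3://bucket-cjmm/raw/dataset_custom_2")
) | Dataset("s3://bucket-cjmm/raw/dataset_custom_3")
```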

Example Use Case:
Consider a data pipeline that processes files only when multiple interdependent datasets are updated. With this feature, users can create dynamic DAG schedules that automatically adjust based on dataset availability and conditions, optimizing resource allocation and execution timing.

Images:
Three screenshots (dated 2024-12-16) are attached to the PR.

- Added support for schedules defined by conditions, enabling dynamic scheduling based on dataset filters and conditions.
- Introduced `configure_schedule` function to streamline DAG schedule setup based on Airflow version and parameters (a sketch follows after this list).
- Created `process_file_with_datasets` function to handle dataset processing and conditional evaluation from files.
- Implemented `evaluate_condition_with_datasets` to evaluate schedule conditions while ensuring valid variable names for dataset URIs.
- Replaced repetitive code with reusable functions for better modularity and maintainability.
- Enhanced code readability by adding detailed docstrings for all functions, following a standard format.
- Improved safety by avoiding reliance on `globals()` in `evaluate_condition_with_datasets`.
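
A rough sketch of how `configure_schedule` could branch on the schedule keys and the Airflow version, reusing the `evaluate_condition_with_datasets` helper sketched earlier; the signature, the `packaging` version check, and the fallback to a plain dataset list are assumptions for illustration:

```python
from packaging import version

from airflow import __version__ as AIRFLOW_VERSION
from airflow.datasets import Dataset


def configure_schedule(schedule, dag_kwargs: dict) -> None:
    """Resolve the `schedule` section of a dag-factory config into the value
    passed to the DAG constructor. Illustrative sketch only."""
    if isinstance(schedule, dict) and "datasets" in schedule:
        datasets = schedule.pop("datasets")
        conditions = schedule.pop("conditions", None)
        if conditions and version.parse(AIRFLOW_VERSION) >= version.parse("2.9.0"):
            # Condition-based schedule, e.g. "((a & b) | c)".
            dag_kwargs["schedule"] = evaluate_condition_with_datasets(conditions, datasets)
        else:
            # No condition (or an older Airflow): trigger on all listed datasets.
            dag_kwargs["schedule"] = [Dataset(uri) for uri in datasets]
    else:
        # Cron strings, timedeltas, timetables, etc. pass through unchanged.
        dag_kwargs["schedule"] = schedule
```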
ErickSeo added 7 commits December 17, 2024 09:32
- remove self from unit test
- Implemented logic to handle schedules with both file and datasets attributes.
- Added support for evaluating conditions with datasets for Airflow version 2.9 and above.
- Cleaned up schedule dictionary by removing processed keys.
- Added logic to handle schedules with both file and datasets attributes.
- Implemented support for evaluating conditions with datasets for Airflow version 2.9 and above.
- Cleaned up schedule dictionary by removing processed keys after use.

codecov-commenter commented Dec 19, 2024

Codecov Report

Attention: Patch coverage is 95.31250% with 3 lines in your changes missing coverage. Please review.

Project coverage is 93.33%. Comparing base (017bc30) to head (89854a8).

| Files with missing lines | Patch % | Lines |
|--------------------------|---------|-------|
| dagfactory/utils.py | 76.92% | 3 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #320      +/-   ##
==========================================
+ Coverage   93.29%   93.33%   +0.03%     
==========================================
  Files          10       10              
  Lines         776      825      +49     
==========================================
+ Hits          724      770      +46     
- Misses         52       55       +3     

