[Feature] Refactor and add support for schedule conditions in DAG configuration: #320

Open · wants to merge 9 commits into main

Conversation

ErickSeo

Description

This feature introduces an enhancement to DAG scheduling in Airflow, enabling support for dynamic schedules based on dataset conditions. By leveraging dataset filters and logical conditions, users can create more flexible and precise scheduling rules tailored to their workflows.

Key Features:

  • Condition-Based Scheduling: Allows defining schedules using logical conditions between datasets (e.g., ('dataset_1' & 'dataset_2') | 'dataset_3'), enabling workflows to trigger dynamically based on dataset availability.

  • Dynamic Dataset Processing: Introduced the process_file_with_datasets function to evaluate and process dataset URIs from external files, supporting both simple and condition-based schedules.

  • Improved Dataset Evaluation: Developed the evaluate_condition_with_datasets function to transform dataset URIs into valid variable names and evaluate logical conditions securely.
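
A minimal sketch of how `evaluate_condition_with_datasets` might look, assuming it takes the condition string plus the list of dataset URIs; the sanitising regex and the restricted `eval` namespace are illustrative assumptions, not necessarily the exact implementation in this PR:

```python
import re

from airflow.datasets import Dataset


def evaluate_condition_with_datasets(condition: str, uris: list[str]):
    """Resolve a condition such as "((a & b) | c)" into an Airflow dataset
    expression (Airflow 2.9+ supports & / | between Dataset objects)."""
    namespace = {}
    for uri in uris:
        # URIs such as "s3://bucket-cjmm/raw/dataset_custom_1" are not valid
        # Python identifiers, so map each one to a sanitised variable name.
        var_name = re.sub(r"\W|^(?=\d)", "_", uri)
        condition = condition.replace(uri, var_name)
        namespace[var_name] = Dataset(uri)
    # Evaluate against the dataset namespace only, never globals().
    return eval(condition, {"__builtins__": {}}, namespace)
```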

Workflow Example:
Given the following condition:

example_custom_config_condition_dataset_consumer_dag:
  description: "Example DAG consumer custom config condition datasets"
  schedule:
    file: $CONFIG_ROOT_DIR/datasets/example_config_datasets.yml
    datasets: ['dataset_custom_1', 'dataset_custom_2', 'dataset_custom_3']
    conditions: "((dataset_custom_1 & dataset_custom_2) | dataset_custom_3)"
  tasks:
    task_1:
      operator: airflow.operators.bash_operator.BashOperator
      bash_command: "echo 'consumer datasets'"
example_without_custom_config_condition_dataset_consumer_dag:
  description: "Example DAG consumer custom config condition datasets"
  schedule:
    datasets: ['s3://bucket-cjmm/raw/dataset_custom_1', 's3://bucket-cjmm/raw/dataset_custom_2', 's3://bucket-cjmm/raw/dataset_custom_3']
    conditions: "((s3://bucket-cjmm/raw/dataset_custom_1 & s3://bucket-cjmm/raw/dataset_custom_2) | s3://bucket-cjmm/raw/dataset_custom_3)"
  tasks:
    task_1:
      operator: airflow.operators.bash_operator.BashOperator
      bash_command: "echo 'consumer datasets'"

The system evaluates the datasets, ensures the references are valid, and triggers the DAG dynamically when the condition resolves to True.
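
For reference, the second example above should resolve to roughly the same schedule as writing the Airflow 2.9+ dataset condition directly in Python:

```python
from airflow.datasets import Dataset

# Equivalent of "((dataset_custom_1 & dataset_custom_2) | dataset_custom_3)"
schedule = (
    Dataset("s3://bucket-cjmm/raw/dataset_custom_1")
    & Dataset("s3://bucket-cjmm/raw/dataset_custom_2")
) | Dataset("s3://bucket-cjmm/raw/dataset_custom_3")
```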

Example Use Case:
Consider a data pipeline that processes files only when multiple interdependent datasets are updated. With this feature, users can create dynamic DAG schedules that automatically adjust based on dataset availability and conditions, optimizing resource allocation and execution timing.

Images:
Three screenshots (dated 2024-12-16) are attached to the PR.

- Added support for schedules defined by conditions, enabling dynamic scheduling based on dataset filters and conditions.
- Introduced `configure_schedule` function to streamline DAG schedule setup based on Airflow version and parameters (a sketch follows after this list).
- Created `process_file_with_datasets` function to handle dataset processing and conditional evaluation from files.
- Implemented `evaluate_condition_with_datasets` to evaluate schedule conditions while ensuring valid variable names for dataset URIs.
- Replaced repetitive code with reusable functions for better modularity and maintainability.
- Enhanced code readability by adding detailed docstrings for all functions, following a standard format.
- Improved safety by avoiding reliance on `globals()` in `evaluate_condition_with_datasets`.
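
A rough sketch of how `configure_schedule` could branch on the schedule keys and the Airflow version, reusing the `evaluate_condition_with_datasets` helper sketched earlier; the signature, the `packaging` version check, and the fallback to a plain dataset list are assumptions for illustration:

```python
from packaging import version

from airflow import __version__ as AIRFLOW_VERSION
from airflow.datasets import Dataset


def configure_schedule(schedule, dag_kwargs: dict) -> None:
    """Resolve the `schedule` section of a dag-factory config into the value
    passed to the DAG constructor. Illustrative sketch only."""
    if isinstance(schedule, dict) and "datasets" in schedule:
        datasets = schedule.pop("datasets")
        conditions = schedule.pop("conditions", None)
        if conditions and version.parse(AIRFLOW_VERSION) >= version.parse("2.9.0"):
            # Condition-based schedule, e.g. "((a & b) | c)".
            dag_kwargs["schedule"] = evaluate_condition_with_datasets(conditions, datasets)
        else:
            # No condition (or an older Airflow): trigger on all listed datasets.
            dag_kwargs["schedule"] = [Dataset(uri) for uri in datasets]
    else:
        # Cron strings, timedeltas, timetables, etc. pass through unchanged.
        dag_kwargs["schedule"] = schedule
```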
ErickSeo added 7 commits December 17, 2024 09:32
- remove self from unit test
- Implemented logic to handle schedules with both file and datasets attributes.
- Added support for evaluating conditions with datasets for Airflow version 2.9 and above.
- Cleaned up schedule dictionary by removing processed keys.
- Added logic to handle schedules with both file and datasets attributes.
- Implemented support for evaluating conditions with datasets for Airflow version 2.9 and above.
- Cleaned up schedule dictionary by removing processed keys after use.

codecov-commenter commented Dec 19, 2024

Codecov Report

Attention: Patch coverage is 95.31250% with 3 lines in your changes missing coverage. Please review.

Project coverage is 93.33%. Comparing base (017bc30) to head (89854a8).

| Files with missing lines | Patch % | Lines |
|--------------------------|---------|-------|
| dagfactory/utils.py | 76.92% | 3 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #320      +/-   ##
==========================================
+ Coverage   93.29%   93.33%   +0.03%     
==========================================
  Files          10       10              
  Lines         776      825      +49     
==========================================
+ Hits          724      770      +46     
- Misses         52       55       +3     

