This document describes how to create Pull Requests and the coding standards we use when implementing them.
The outline for this document is available in GitHub via the button at the top-right corner (the one with three dots and three lines).
Before you submit a Pull Request (PR) from your forked repo, check that it meets these guidelines:
Include tests, either as doctests, unit tests, or both, in your pull request.
The airflow repo uses GitHub Actions to run the tests and codecov to track coverage. You can set up both for free on your fork. They will help you make sure you do not break the build with your PR and that you help increase coverage. We also advise installing pre-commit hooks locally to apply various checks, code generation and formatting at the time you make a local commit. This gives you near-immediate feedback on things you need to fix before you push your code to the PR, and in many cases the hooks will even fix issues for you locally so that you can add and commit them straight away.
Follow our project's Coding style and best practices. We usually attempt to enforce these practices with appropriate pre-commit checks, although some of them are not currently enforced programmatically (either because automating them is too hard or simply not done yet).
Maintainers will not merge a PR that regresses linting or does not pass CI tests (unless you have a good justification that it is a transient error or something that is being fixed in another PR).
Maintainers will not merge PRs that have unresolved conversations. Note! This is experimental - at the end of January 2024 we will assess whether we want to continue it.
We prefer that you rebase your PR (and do it quite often) rather than merge. Rebasing leads to easier reviews and cleaner changes, where you know exactly what changes you have made. You can learn more about the rebase vs. merge workflow in Rebase and merge your pull request and Rebase your fork. Make sure to resolve all conflicts during rebase. When merging PRs, maintainers will use Squash and Merge, which means your PR will be merged as one commit, regardless of the number of commits in your PR. During the review cycle, you can keep a commit history for easier review, but if you need to, you can also squash all commits to reduce the maintenance burden during rebase.
Add an Apache License header to all new files. If you have pre-commit installed, pre-commit will do it automatically for you. If you hesitate to install pre-commit for your local repository - for example because it takes a few seconds to commit your changes - this one thing alone might be a good enough reason to do it.

If your PR adds functionality, make sure to update the docs as part of the same PR, not only code and tests. A docstring is often sufficient. Make sure to follow the Sphinx-compatible standards.
Make sure your code passes all the static code checks we have. The easiest way to ensure that is - again - to install the pre-commit hooks.
Make sure your PR is small and focused on one change only - avoid adding unrelated changes and mixing feature additions with refactoring. Keeping to that rule will make it easier to review your PR and will make it easier for release managers if they decide that your change should be cherry-picked into a bug-fix release of Airflow. If you want to add a new feature and refactor the code, it is better to split the work into several smaller PRs. It is also quite a good and common idea to keep a big Draft PR if you have a bigger change that you want to make, and then create smaller PRs from it that are easier to review, merge and cherry-pick. It takes a long time (and a lot of attention and focus of a reviewer) to review big PRs, so by splitting your change into smaller PRs you actually speed up the review process and make it easier for your change to be eventually merged.

Run relevant tests locally before opening a PR. Often tests are placed in files that correspond to the changed code (for example, for airflow/cli/cli_parser.py changes you have tests in tests/cli/test_cli_parser.py). However, there are a number of cases where the tests that should run are placed elsewhere - you can either run tests for the whole TEST_TYPE that is relevant (see the breeze testing core-tests --help or breeze testing providers-tests --help output for available test types for each of the testing commands), or you can run all tests, or eventually you can push your code to the PR and see the results of the tests in the CI.

You can use any supported Python version to run the tests, but it is best to check that your change works with the oldest supported version (Python 3.9 currently). In rare cases tests might fail with the oldest version when you use features that are only available in newer Python versions. For that purpose we have the airflow.compat package, where we keep back-ported useful features from newer versions.

Adhere to the guidelines for commit messages described in this article. This makes the lives of those who come after you (and your future self) a lot easier.
In December 2023 we enabled - experimentally - the requirement to resolve all the open conversations in a PR in order to make it merge-able. You will see in the status of the PR that it needs to have all the conversations resolved before it can be merged.
This is an experiment and we will evaluate it by the end of January 2024. If it turns out to be a good idea, we will keep it enabled in the future.
The goal of this experiment is to make it easier to see when there are conversations that are not yet resolved, for everyone involved in the PR - the author, reviewers and maintainers who try to figure out whether the PR is ready to merge and, eventually, merge it. The goal is also to use conversations more as a "soft" way to request changes and to limit the use of the Request changes status to only those cases when the maintainer is sure that the PR should not be merged in its current state. That should lead to a faster review/merge cycle and fewer problems with stalled PRs that have the Request changes status even though all the issues are already solved (assuming that maintainers will start treating conversations this way).
Most of our coding style rules are enforced programmatically by ruff and mypy, which are run automatically via static checks and on every Pull Request (PR), but there are some rules that are not yet automated and are more Airflow-specific or semantic than stylistic.
Our community agreed that, for various reasons, we do not use assert in production code of Apache Airflow.
For details check the relevant mailing list thread.
In other words, instead of doing:
assert some_predicate()
you should do:
if not some_predicate():
    handle_the_case()
The one exception to this is an assert needed for type checking (which should be almost a last resort); in that case you can do this:
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    assert isinstance(x, MyClass)
Explicit is better than implicit. If a function accepts a session parameter, it should not commit the transaction itself. Session management is up to the caller.
To make this easier, there is the create_session
helper:
from sqlalchemy.orm import Session

from airflow.utils.session import create_session


def my_call(x, y, *, session: Session):
    ...
    # You MUST not commit the session here.


with create_session() as session:
    my_call(x, y, session=session)
Warning
DO NOT add a default to the session
argument unless @provide_session
is used.
If this function is designed to be called by "end-users" (i.e. DAG authors) then using the @provide_session
wrapper is okay:
from sqlalchemy.orm import Session

from airflow.utils.session import NEW_SESSION, provide_session


@provide_session
def my_method(arg, *, session: Session = NEW_SESSION):
    ...
    # You SHOULD not commit the session here. The wrapper will take care of commit()/rollback() if an exception occurs.
In both cases, the session argument is a keyword-only argument. This is the preferred form where possible, although there are some exceptions in the code base where this cannot be used, due to backward compatibility considerations. In most cases, the session argument should be last in the argument list.
If you wish to compute the time difference between two events within the same process, use time.monotonic(), not time.time() nor timezone.utcnow().
If you are measuring duration for performance reasons, then time.perf_counter() should be used. (On many platforms, this uses the same underlying clock mechanism as monotonic, but perf_counter is guaranteed to be the highest-accuracy clock on the system, while monotonic is simply "guaranteed" to not go backwards.)
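To make the distinction concrete, here is a minimal sketch (the do_work function is just a placeholder, not an Airflow API):

import time


def do_work():
    # Placeholder for the code whose elapsed time we want to measure.
    sum(range(1_000_000))


# Elapsed time between two events in the same process: use a monotonic clock.
start = time.monotonic()
do_work()
elapsed = time.monotonic() - start

# Duration measured for performance reasons: use the highest-resolution clock.
start = time.perf_counter()
do_work()
duration = time.perf_counter() - start
print(f"elapsed={elapsed:.6f}s duration={duration:.6f}s")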
If you wish to time how long a block of code takes, use Stats.timer()
-- either with a metric name, which
will be timed and submitted automatically:
from airflow.stats import Stats

...

with Stats.timer("my_timer_metric"):
    ...
or to time but not send a metric:
from airflow.stats import Stats

...

with Stats.timer() as timer:
    ...

log.info("Code took %.3f seconds", timer.duration)
For full docs on timer(), check out airflow/stats.py.
If the start_date of a duration calculation needs to be stored in a database, then this has to be done using datetime objects. In all other cases, using datetime for duration calculations MUST be avoided, as creating and diffing datetime objects is (comparatively) slow.
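As a short illustrative sketch of the two cases (the record dict below is just a stand-in for a row that would be written to the database, not an Airflow model):

import time

from airflow.utils import timezone

# A timestamp that will be persisted to the database: store a datetime.
record = {"start_date": timezone.utcnow()}

# A duration computed purely in-process: use a monotonic clock instead of
# subtracting datetime objects.
start = time.monotonic()
...  # the work being measured
duration = time.monotonic() - start

The datetime is only used because the value has to be persisted; the in-process duration never touches datetime arithmetic.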
Airflow Operators might have some fields added to the list of template_fields. Such fields should be set in the constructor (the __init__ method) of the operator, and usually their values should come from the __init__ method arguments. The reason is that the templated fields are evaluated at the time of operator execution: when you pass arguments to the operator in the DAG, the fields that are set on the class are processed through the templating engine just before the execute method is called, and the field values are set to the result of applying the templating engine (in case the field is a structure such as a dict or list, the templating engine is applied to all the values of the structure).

That's why we expect two things in case of template fields:
- with a few exceptions, only self.field = field assignments should happen in the operator's constructor
- validation of the fields should be done in the execute method, not in the constructor, because in the constructor the field value might be a templated value, not the final value
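For illustration, here is a minimal sketch of a hypothetical operator following both rules (MyHelloOperator and its my_field parameter are made up for this example and are not part of Airflow):

from airflow.models.baseoperator import BaseOperator


class MyHelloOperator(BaseOperator):
    template_fields = ("my_field",)

    def __init__(self, *, my_field: str, **kwargs):
        super().__init__(**kwargs)
        # Only assignment happens here - my_field may still be an unrendered template string.
        self.my_field = my_field

    def execute(self, context):
        # Templating has been applied by now, so validation happens here.
        if not self.my_field:
            raise ValueError("my_field must not be empty")
        self.log.info("my_field rendered to: %s", self.my_field)

Note that the value used in execute is the rendered value, while the value assigned in __init__ may still contain Jinja template markers.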
The exceptions are cases where we want to assign an empty default value to a mutable field (a list or dict), or when we have a more complex structure that we want to convert into a different format (say a dict or list), but where we want to keep the original strings in the converted structure.
In such cases we can usually do something like this:
def __init__(self, *, my_field: list[str] | None = None, **kwargs):
    super().__init__(**kwargs)
    my_field = my_field or []
    self.my_field = my_field
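And a similar sketch for the second kind of exception - converting a structure while keeping the original strings intact (the tags parameter is made up for illustration):

def __init__(self, *, tags: list[str] | None = None, **kwargs):
    super().__init__(**kwargs)
    # Convert the list into a dict for easier lookup later, but keep the
    # original (possibly still templated) strings untouched as the values.
    self.tags = {tag: tag for tag in (tags or [])}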
The reason for doing this is that we are working on cleaning up our code so that we can add a pre-commit hook that detects, in PRs, cases where logic (such as validation or complex conversion) is done in the constructor.
If you want to learn about the options for your development environment, refer to the Development environments document.