-
Notifications
You must be signed in to change notification settings - Fork 14.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Split providers out of the main "airflow/" tree into a UV workspace project #42505
Conversation
7171386
to
9331a5a
Compare
Oh, prepare provider packages is failing for prod image build. I guess that's going to have to be in this PR then |
056127e
to
2e952e5
Compare
Smalll comment (I am returning back from holidays) and slowly going through the two week's history.
I tried various things before (including There are two problems with using uv cache:
UV cache benefits from the fact that the same cache is potentially reused by multiple projects you have installed on the same machine. The UV cache is stored in HOME dir and reused between all venvs used in the same machine - so in docker it's nearly useless because it multiplies the space used for both - installed and cached files but cached files are not reusable in any way for other venvs (because we have only one venv in docker container). And since layers in docker image cannot be removed, even if we delete the cache after final installation step - the cache will remain in the docker image layer additionally to installed packages - increasing significantly size of the final image. Of course - you could potentially use locally mounted cache volumes in docker to keep the UV cache on, in order to not pollute docke image, but that would only be useful after you build your image locally once and you won't be able to use remotely cached layers which are currently used when you build the image. Using remote cache in the way we do it now, is much nicer, because the cache is actually the same as most installed packages (with the exception of the new/ updated packages change after the original cached layer is created. This has the nice property that every time new patchlevel base image is released by python (every two-three weeks or so) the cache is rebuit from scratch, and we get a new "fresh" main cache with latest package versions.
Whatever that command will produce as cache will be invalidated the moment the source dependency files (pyproject.toml, hatch_build.py, provider.yaml files, generated/provider_dependencies.json) - so if you want to effectively use cache, you need to build it BEFORE you copy any of those files into the new image. This is precisely what "our" caching is doing - it does not use any "local file" and does not copy them to the image, but uses "https://github" URL to install the dependencies for the first time - which means that next time that layer will not be invalidated when pyproject.toml or hatch_build.py or any other files that decide about dependencies will change. Now - I'd love to make it simpler, but so far all the attempts to find a better solution failed because the one we have is the only one that has good "caching" properties. |
But maybe you can come up with a better approach that I have not thought about and tried already, of course :) |
Yeah, that comment was mostly a note to myself, and not something I plan on even looking at as part of this work. |
ecf9f37
to
7cb016b
Compare
d7e5864
to
ddc7402
Compare
ddc7402
to
9cd5774
Compare
I think I'm getting closer now 🤞🏻 . Oddly enough a lot of the time the er diagram static check is failing in CI , even though I've run Edit: this turned out to be a (purposefully) untracked file in the migrations folder in my local checkout. |
0956d95
to
36ab177
Compare
…#43307) While the #43260 attempted to address the problem where example dag importability tests should skip provider tests on non-main, it did not actually solve the problem. While debugging it, it turned out that since #42505, the provider tests were not executed in main "at all" - the "providers" directory was not included in the list of places to check for the example dags (they were in "airflow" in v2-10-test") this is why it "looked like" the solution worked in "main". This PR fixes both problems: * brings back importability of provider's example_dags in main branch * properly excludes the providers examples in non-main branch
Another teething problem after moving providers in apache#42505. After moving providers, the history of the current folder in "providers" only contains changes after the move - it does not include changes from before the move - and since we always regenerate the full list of commits - they were missing. We cannot use `--follow` - because `git log --follow` only works for single files, not directories, but since the move was very predictable ("airflow/providers/nnn" -> "airflow/providers/src/airflow/providers/nnn") we can add the old path to `git log` command to get both - pre and post move commit history.
Another teething problem after moving providers in apache#42505. After moving providers, the history of the current folder in "providers" only contains changes after the move - it does not include changes from before the move - and since we always regenerate the full list of commits - they were missing. We cannot use `--follow` - because `git log --follow` only works for single files, not directories, but since the move was very predictable ("airflow/providers/nnn" -> "airflow/providers/src/airflow/providers/nnn") we can add the old path to `git log` command to get both - pre and post move commit history.
Another teething problem after moving providers in #42505. After moving providers, the history of the current folder in "providers" only contains changes after the move - it does not include changes from before the move - and since we always regenerate the full list of commits - they were missing. We cannot use `--follow` - because `git log --follow` only works for single files, not directories, but since the move was very predictable ("airflow/providers/nnn" -> "airflow/providers/src/airflow/providers/nnn") we can add the old path to `git log` command to get both - pre and post move commit history.
After apache#42505, you need to get through extra hoops to develop providers in Airflow's monorepo. This is a simple (not yet uv-specific) documentation on how to install providers in editable mode when you want to develop providers, so that you can run unit test. Copied mostly from apache#43082 It's not yet full set of docs explaining how to use workspaces and UV. This shoudl be handled via apache#43200
…3468) * Add simple instructions for installing providers in editable mode After #42505, you need to get through extra hoops to develop providers in Airflow's monorepo. This is a simple (not yet uv-specific) documentation on how to install providers in editable mode when you want to develop providers, so that you can run unit test. Copied mostly from #43082 It's not yet full set of docs explaining how to use workspaces and UV. This shoudl be handled via #43200 * Apply suggestions from code review Co-authored-by: Amogh Desai <[email protected]> --------- Co-authored-by: Kaxil Naik <[email protected]> Co-authored-by: Amogh Desai <[email protected]>
This was never supposed to work. the parameters that you can pass to "testing tests" is really "extra pytest paramaters" you can pass - and it's not really inteded to use "pytest files" there. That's why for example the extra parameters are not autocomplete'able with files/paths. The "tests" command is really meant to run the whole "test type" and you can easily use it now. For example running all airbyte tests can be achieved by: breeze testing tests --test-type "Providers[airbyte]" And it nicely works now. |
(so I think indeed all known "teething" problems" are solved - and i can try to attempt to split out individual providers). |
When providers have been moved in apache#42505 broke passing parameters when provider tests were passed as extra args of "testing tests" command of breeze. Previously all tests were under "tests" folder and there was an exclusion to disable "All" tests when any test was passed as parameter. But after moving tests to "providers" this stopped working. Additional exclusion needs to be added for "providers/tests" and "providers/tests_sdk/". This PR also adds autocompletion for tests passed this way by setting the click type to Path for the extra args (but without the need for the Path to exist). Also during this check it turned out that "All" tests are not working in the intended way - but this should not impact our CI only local runs. Appropriate comment has been added and it's captured in apache#42632
When providers have been moved in apache#42505 broke passing parameters when provider tests were passed as extra args of "testing tests" command of breeze. Previously all tests were under "tests" folder and there was an exclusion to disable "All" tests when any test was passed as parameter. But after moving tests to "providers" this stopped working. Additional exclusion needs to be added for "providers/tests" and "providers/tests_sdk/". This PR also adds autocompletion for tests passed this way by setting the click type to Path for the extra args (but without the need for the Path to exist). Also during this check it turned out that "All" tests are not working in the intended way - but this should not impact our CI only local runs. Appropriate comment has been added and it's captured in apache#42632
When providers have been moved in #42505 broke passing parameters when provider tests were passed as extra args of "testing tests" command of breeze. Previously all tests were under "tests" folder and there was an exclusion to disable "All" tests when any test was passed as parameter. But after moving tests to "providers" this stopped working. Additional exclusion needs to be added for "providers/tests" and "providers/tests_sdk/". This PR also adds autocompletion for tests passed this way by setting the click type to Path for the extra args (but without the need for the Path to exist). Also during this check it turned out that "All" tests are not working in the intended way - but this should not impact our CI only local runs. Appropriate comment has been added and it's captured in #42632
The compatibility tests in CI are using providers built as packages from sources, so the compatibility tests run there using "providers/tests" work just fine, because all providers are installed in the airflow.providers site library. However when we are iterating and debugging backwards compatiblity provider tests, we should be able to use local provider sources, rather than installed packages and we have the possibility of mounting both - providers sources and tests to the image. See `contributing-docs/testing/unit_tests.rst` on how to do it by using ``--mount-sources providers-and-tests`` flag connected with `--use-airflow-version`. However as of apache#42505 this has been broken, because currently in main we rely on airflow having "pkgutil" namespace package for both - airflow, and airflow.providers packages (previous airflow versions had implicit package for airflow.providers package) - so providers installed locally cannot be used as "another" source of providers. Previously it was working because both "installed" and "sources" `airflow.providers` package were implicit namespace packages. As explained in https://packaging.python.org/en/latest/guides/packaging-namespace-packages/#native-namespace-packages > Every distribution that uses the namespace package must include such > an `__init__.py`. If any distribution does not, it will cause the > namespace logic to fail and the other sub-packages will not be > importable. Any additional code in __init__.py will be inaccessible. So because old airflow uses implicit provider's packages and main airflow from source uses "explicit" provider's package, the only way we can make the "source" providers is to mount them or symbolically link them to inside installed distribution of airflow package (in site directory) (or dynamically remove the __init__.py from provider's source directory. We cannot mount the provider package sources ot inside the installed airflow - because when --use-airflow-version is used, airflow is installed dynamically inside the container - after the container is started. This PR solves the problem by adding an env variable that will make the initialization script to remove the installed airflow.providers folder after installing airflow and linking the "providers/src/airflow/providers" folder there. This has the added benefit that all providers (including the preinstalled ones) are used from "main" sources rather than from installed packages - which was problematic for the past way of using providers from sources - which used the fact that both "airflow.providers" in the site-library and the one in sources were implicit namespace packages.
…ws (#43617) * Enable back iterative development of latest providers with old airflows The compatibility tests in CI are using providers built as packages from sources, so the compatibility tests run there using "providers/tests" work just fine, because all providers are installed in the airflow.providers site library. However when we are iterating and debugging backwards compatiblity provider tests, we should be able to use local provider sources, rather than installed packages and we have the possibility of mounting both - providers sources and tests to the image. See `contributing-docs/testing/unit_tests.rst` on how to do it by using ``--mount-sources providers-and-tests`` flag connected with `--use-airflow-version`. However as of #42505 this has been broken, because currently in main we rely on airflow having "pkgutil" namespace package for both - airflow, and airflow.providers packages (previous airflow versions had implicit package for airflow.providers package) - so providers installed locally cannot be used as "another" source of providers. Previously it was working because both "installed" and "sources" `airflow.providers` package were implicit namespace packages. As explained in https://packaging.python.org/en/latest/guides/packaging-namespace-packages/#native-namespace-packages > Every distribution that uses the namespace package must include such > an `__init__.py`. If any distribution does not, it will cause the > namespace logic to fail and the other sub-packages will not be > importable. Any additional code in __init__.py will be inaccessible. So because old airflow uses implicit provider's packages and main airflow from source uses "explicit" provider's package, the only way we can make the "source" providers is to mount them or symbolically link them to inside installed distribution of airflow package (in site directory) (or dynamically remove the __init__.py from provider's source directory. We cannot mount the provider package sources ot inside the installed airflow - because when --use-airflow-version is used, airflow is installed dynamically inside the container - after the container is started. This PR solves the problem by adding an env variable that will make the initialization script to remove the installed airflow.providers folder after installing airflow and linking the "providers/src/airflow/providers" folder there. This has the added benefit that all providers (including the preinstalled ones) are used from "main" sources rather than from installed packages - which was problematic for the past way of using providers from sources - which used the fact that both "airflow.providers" in the site-library and the one in sources were implicit namespace packages. * Update Dockerfile.ci Co-authored-by: GPK <[email protected]> * Update scripts/docker/entrypoint_ci.sh Co-authored-by: GPK <[email protected]> --------- Co-authored-by: GPK <[email protected]>
This is a no-op change right now, but as part of the provider re-org in apache#42505 this sets us up to be able to load the providers code in the tests The reason this change is done separately is that changes to breeze code form forks doesn't take effect, and this small change makes it easier to land on main without having to re-create that large PR.
…roject (apache#42505) This is only a partial split so far. It moves all the code and tests, but leaves the creation of `core/` to a separate PR as this is already large enough. In addition to the straight file rename the other changes I had to make here are: - Some mypy/typing fixes. Mypy can be fragile about what it picks up when, so maybe some of those changes were caused by that. But the typing changes aren't large. - Improve typing in common.sql type stub Again, likely a mypy file oddity, but the types should be safe - Removed the `check-providers-init-file-missing` check This isn't needed now that airflow/providers shouldn't exist at all in the main tree. - Create a "dev.tests_common" package that contains helper files and common pytest fixtures Since the provider tests are no longer under tests/ they don't automatically share the fixtures from the parent `tests/conftest.py` so they needed extracted. Ditto for `tests.test_utils` -- they can't be easily imported in provider tests anymore, so they are moved to a more explicit shared location. In future we should switch how the CI image is built to make better use of UV caching than our own approach as that would remvoe a lot of custom code.
…roject (apache#42505) (apache#42624) This is only a partial split so far. It moves all the code and tests, but leaves the creation of `core/` to a separate PR as this is already large enough. In addition to the straight file rename the other changes I had to make here are: - Some mypy/typing fixes. Mypy can be fragile about what it picks up when, so maybe some of those changes were caused by that. But the typing changes aren't large. - Improve typing in common.sql type stub Again, likely a mypy file oddity, but the types should be safe - Removed the `check-providers-init-file-missing` check This isn't needed now that airflow/providers shouldn't exist at all in the main tree. - Create a "dev.tests_common" package that contains helper files and common pytest fixtures Since the provider tests are no longer under tests/ they don't automatically share the fixtures from the parent `tests/conftest.py` so they needed extracted. Ditto for `tests.test_utils` -- they can't be easily imported in provider tests anymore, so they are moved to a more explicit shared location. In future we should switch how the CI image is built to make better use of UV caching than our own approach as that would remvoe a lot of custom code. Co-authored-by: Ash Berlin-Taylor <[email protected]> Co-authored-by: Ryan Hatter <[email protected]>
* Move tests common without changes * Fix docstrings in tests_common * Move tests_common from "dev" to top-level. Follow-up after apache#42505 fixing teething problem with tests_common. Originally in apache#42505 common test code was moved to "dev" folder, but the "dev" folder is really dedicated to "build" scripts and the problem with moving "tests_common" to the folder was that the whole "dev" folder is replaced (for non-committer PRs) with the content from the target branch. This is done for security reasons, because we can accidentally use any of the scripts from dev in the CI build scripts and we might not notice, which will open us to a security issue where a file in "dev" coming from PR could be accidentally executed during the "pull_request_target" workflow - which would expose our secrets and GitHub Package write permissions to a contributor coming from a fork. This change moves the files, fixes pre-commit specification and docs, also fixes a number of "doc" issues detected by "ruff" in the tests_common folder as they were detected after the move. The tests_common folder is added to folders mounted when breeze is executed with local folders mounted (in order to avoid accidental mounting of randomly generated files to inside the breeze container). All imports for the common tests were updated to reflect this move.
…ache#43173) Cleans-up airflow and providers `__init__.py" files in order to get providers import work again. This is done by excluding the two `__init__.py` files from automated ruff isort rules adding `from __future__ import annotations`. That should finally get rid of the Intellij teething import problem that has been introduced in apache#42505. There were earlier - unsuccessful - attempts to fix it in the apache#43116 and apache#43081 that followed apache#42951 - but the key is that Pycharm requires the namespace's extend_path to be first "real" line of code in the `__init__.py` to understand that the package is an "explicit" namespace package - and it conflicts with the requirement of "from __future__ import annotations" to be the first line of Python code. Also this PR fixes following problem: * pytest_plugin expecting .asf.yml in "airflow" sources - even during compatibility tests with older version of airflow (where the .asf.yml is not present) --------- Co-authored-by: Kaxil Naik <[email protected]>
…43273) While the apache#43260 attempted to address the problem where example dag importability tests should skip provider tests on non-main, it did not actually solve the problem. While debugging it, it turned out that since apache#42505, the provider tests were not executed in main "at all" - the "providers" directory was not included in the list of places to check for the example dags (they were in "airflow" in v2-10-test") this is why it "looked like" the solution worked in "main". This PR fixes both problems: * brings back importability of provider's example_dags in main branch * properly excludes the providers examples in non-main branch
Another teething problem after moving providers in apache#42505. After moving providers, the history of the current folder in "providers" only contains changes after the move - it does not include changes from before the move - and since we always regenerate the full list of commits - they were missing. We cannot use `--follow` - because `git log --follow` only works for single files, not directories, but since the move was very predictable ("airflow/providers/nnn" -> "airflow/providers/src/airflow/providers/nnn") we can add the old path to `git log` command to get both - pre and post move commit history.
…ache#43468) * Add simple instructions for installing providers in editable mode After apache#42505, you need to get through extra hoops to develop providers in Airflow's monorepo. This is a simple (not yet uv-specific) documentation on how to install providers in editable mode when you want to develop providers, so that you can run unit test. Copied mostly from apache#43082 It's not yet full set of docs explaining how to use workspaces and UV. This shoudl be handled via apache#43200 * Apply suggestions from code review Co-authored-by: Amogh Desai <[email protected]> --------- Co-authored-by: Kaxil Naik <[email protected]> Co-authored-by: Amogh Desai <[email protected]>
…43529) When providers have been moved in apache#42505 broke passing parameters when provider tests were passed as extra args of "testing tests" command of breeze. Previously all tests were under "tests" folder and there was an exclusion to disable "All" tests when any test was passed as parameter. But after moving tests to "providers" this stopped working. Additional exclusion needs to be added for "providers/tests" and "providers/tests_sdk/". This PR also adds autocompletion for tests passed this way by setting the click type to Path for the extra args (but without the need for the Path to exist). Also during this check it turned out that "All" tests are not working in the intended way - but this should not impact our CI only local runs. Appropriate comment has been added and it's captured in apache#42632
…ws (apache#43617) * Enable back iterative development of latest providers with old airflows The compatibility tests in CI are using providers built as packages from sources, so the compatibility tests run there using "providers/tests" work just fine, because all providers are installed in the airflow.providers site library. However when we are iterating and debugging backwards compatiblity provider tests, we should be able to use local provider sources, rather than installed packages and we have the possibility of mounting both - providers sources and tests to the image. See `contributing-docs/testing/unit_tests.rst` on how to do it by using ``--mount-sources providers-and-tests`` flag connected with `--use-airflow-version`. However as of apache#42505 this has been broken, because currently in main we rely on airflow having "pkgutil" namespace package for both - airflow, and airflow.providers packages (previous airflow versions had implicit package for airflow.providers package) - so providers installed locally cannot be used as "another" source of providers. Previously it was working because both "installed" and "sources" `airflow.providers` package were implicit namespace packages. As explained in https://packaging.python.org/en/latest/guides/packaging-namespace-packages/#native-namespace-packages > Every distribution that uses the namespace package must include such > an `__init__.py`. If any distribution does not, it will cause the > namespace logic to fail and the other sub-packages will not be > importable. Any additional code in __init__.py will be inaccessible. So because old airflow uses implicit provider's packages and main airflow from source uses "explicit" provider's package, the only way we can make the "source" providers is to mount them or symbolically link them to inside installed distribution of airflow package (in site directory) (or dynamically remove the __init__.py from provider's source directory. We cannot mount the provider package sources ot inside the installed airflow - because when --use-airflow-version is used, airflow is installed dynamically inside the container - after the container is started. This PR solves the problem by adding an env variable that will make the initialization script to remove the installed airflow.providers folder after installing airflow and linking the "providers/src/airflow/providers" folder there. This has the added benefit that all providers (including the preinstalled ones) are used from "main" sources rather than from installed packages - which was problematic for the past way of using providers from sources - which used the fact that both "airflow.providers" in the site-library and the one in sources were implicit namespace packages. * Update Dockerfile.ci Co-authored-by: GPK <[email protected]> * Update scripts/docker/entrypoint_ci.sh Co-authored-by: GPK <[email protected]> --------- Co-authored-by: GPK <[email protected]>
closes #42857
As discussed https://lists.apache.org/thread/dyv5jhvt65xs6l5o2byc2b67f4wlwf6r, this is the first part of the new layout.
This is only a partial split so far. It moves all the code and tests, but the provider release generation code hasn't moved yet as it's a few weeks until the next batch, and I wan't to try and keep this already huge PR as small as possible. Likewise the creation of
core/
will be a follow on PRIn addition to the straight file rename the other changes I had to make here are:
Some mypy/typing fixes.
Mypy can be fragile about what it picks up when, so maybe some of those
changes were caused by that. But the typing changes aren't large.
Improve typing in common.sql type stub
Again, likely a mypy file oddity, but the types should be safe
Removed the
check-providers-init-file-missing
checkThis isn't needed now that airflow/providers shouldn't exist at all in the
main tree.
Create a "dev.tests_common" package that contains helper files and common
pytest fixtures
Since the provider tests are no longer under tests/ they don't automatically
share the fixtures from the parent
tests/conftest.py
so they neededextracted.
Ditto for
tests.test_utils
-- they can't be easily imported in providertests anymore, so they are moved to a more explicit shared location.
In future we should switch how the CI image is built to make better use of UV caching than our own approach as that would remvoe a lot of custom code.
^ Add meaningful description above
Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named
{pr_number}.significant.rst
or{issue_number}.significant.rst
, in newsfragments.