The PostCommit Python job is flaky #30513
It first failed on https://github.com/apache/beam/actions/runs/8210266873. The failed task's traceback:
[traceback omitted]
Added the owner of the commit at which the post-commit job first failed.
I think we can pretty comfortably rule out that change; it was to the YAML SDK, which is unrelated to portableWordCountSparkRunnerBatch. Note that this workflow runs on a schedule, not per commit, though none of the commits in that scheduled window look particularly harmful.
I see. It was red for the last two weeks and flaky before that too.
Permared right now.
Only sorta - each component job is actually not permared; e.g. there are 2 successes here: https://github.com/apache/beam/actions/runs/8873798546
The whole workflow is permared just because our flake percentage is so high.
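To illustrate with assumed numbers (not measurements from this workflow): when a workflow fans out into many component jobs, even a modest per-job flake rate makes the combined run red most of the time.

```python
# Rough illustration with assumed numbers (not measured from this workflow):
# probability that a workflow of N independently flaky jobs comes back all green.
per_job_flake_rate = 0.05  # assume each component job flakes 5% of the time
num_jobs = 30              # assume ~30 component jobs per workflow run

p_all_green = (1 - per_job_flake_rate) ** num_jobs
print(f"P(whole workflow green) = {p_all_green:.1%}")      # ~21.5%
print(f"P(whole workflow red)   = {1 - p_all_green:.1%}")  # ~78.5%
```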
Yeah, let's work out how to get top-level signal.
The lowest and highest Python versions (3.8, 3.11) run more tests than (3.9, 3.10); it could be one of those extra tests or tasks that is permared.
Could make sense to find a way to get a separate top-level signal per Python version, assuming we can use software engineering to share everything necessary so they don't get out of sync.
Yeah, we used to have this for Jenkins, where each Python PostCommit had its own task.
The Vertex AI package version issue (we do not import this directly, so it should be fine).
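One way to sanity-check that claim is to read the installed version from package metadata without importing the package. A minimal sketch; the distribution name google-cloud-aiplatform is an assumption about which Vertex AI package is meant:

```python
# Minimal sketch: check which version of a (possibly transitive) dependency is
# installed without importing it. "google-cloud-aiplatform" is an assumed name
# for the Vertex AI distribution in question; substitute the actual package.
from importlib import metadata

try:
    print("google-cloud-aiplatform", metadata.version("google-cloud-aiplatform"))
except metadata.PackageNotFoundError:
    print("google-cloud-aiplatform is not installed in this environment")
```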
A new flaky test in py39, related to #29617: https://ge.apache.org/s/hb7syztoolfhu/console-log?page=17
Great. Thanks @liferoad
Reopening since the workflow is still flaky.
2024-08-30T07:28:39.6571287Z if setup_options.setup_file is not None:
Currently failing test: gradlew :sdks:python:test-suites:portable:py312:portableLocalRunnerJuliaSetWithSetupPy
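For context, a minimal sketch (illustrative paths, not the actual juliaset test) of the option that test exercises; the log line quoted above is the same condition the SDK checks before staging a package:

```python
# Minimal sketch (illustrative, not the actual test): constructing pipeline
# options with --setup_file, the flag the juliaset-with-setup.py test exercises.
from apache_beam.options.pipeline_options import PipelineOptions, SetupOptions

options = PipelineOptions(["--setup_file=./setup.py"])  # hypothetical path
setup_options = options.view_as(SetupOptions)

# Same condition as the log excerpt above: an sdist is only built and staged
# when a setup file was actually supplied on the command line.
if setup_options.setup_file is not None:
    print(f"Would build and stage a package from {setup_options.setup_file}")
```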
This is red again - https://github.com/apache/beam/actions/workflows/beam_PostCommit_Python.yml?query=branch%3Amaster
It looks like there are currently 2 issues:
[issue list omitted]
@jrmccluskey would you mind taking a look at these?
Failure in the 3.9 postcommit is apache_beam/examples/fastavro_it_test.py::FastavroIT::test_avro_it; will dive deeper into that shortly.
The problem in the TensorRT container is that we seem to have two different versions of CUDA installed, one at 11.8 and the other at 12.1 (we want everything at 12.1).
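A quick way to check for that kind of mix is to list the CUDA/TensorRT-related distributions in the container. A rough diagnostic sketch only; it sees pip-installed wheels, not toolkits installed via apt:

```python
# Rough diagnostic sketch: list installed distributions whose names look
# CUDA-, NVIDIA-, or TensorRT-related, to spot a container mixing 11.8 and
# 12.1 wheels. Only covers pip-installed packages, not system CUDA toolkits.
from importlib import metadata

keywords = ("cuda", "nvidia", "tensorrt")
for dist in metadata.distributions():
    name = (dist.metadata["Name"] or "").lower()
    if any(k in name for k in keywords):
        print(f"{dist.metadata['Name']}=={dist.version}")
```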
Looks like after sickbaying the TensorRT tests, there are still failures. https://ge.apache.org/s/27igat7sfmcsu/console-log/task/:sdks:python:test-suites:portable:py310:portableWordCountSparkRunnerBatch?anchor=60&page=1 is an example; it looks like we're failing because we're missing a class in the Spark runner. @Abacn would you mind taking a look? It's unclear why this is happening now, but I'm guessing it may be related to #32976 (and maybe some caching kept it from showing up?).
It's a bad Gradle cache. Cannot reproduce locally on the master branch. Also inspected the expansion jar. For some reason, the Gradle cache for shadowJar has been breaking more frequently recently.
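For reference, inspecting a built jar for a missing class only takes a few lines. A sketch under assumptions: the jar path and class entry below are hypothetical, not the actual expansion jar or the class that was missing:

```python
# Sketch of "inspecting the expansion jar": check whether a class entry is
# present in a built jar. The jar path and class name here are hypothetical.
import zipfile

jar_path = "path/to/expansion-service.jar"  # hypothetical path to the built jar
wanted = "org/apache/beam/runners/spark/SparkPipelineRunner.class"  # hypothetical entry

with zipfile.ZipFile(jar_path) as jar:
    print(f"{wanted} present: {wanted in jar.namelist()}")
```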
It started to fail again last week (Friday) since the distroless Python SDK PR: 81f35ab (@damondouglas)
There is no
@damondouglas, could you confirm that?
sg, will see if the fix I have in mind can work.
Ok, took another look at this.
The Kafka error message is shown below:
@Abacn, could you check this and see if we need to roll it back?
Thanks for taking care of it. I am +1 for rollback. The first distroless PR was expected to be a no-op for the 2.61.0 release. Good to know it broke something before the release cut.
Green now.
The PostCommit Python job is failing over 50% of the time.
Please visit https://github.com/apache/beam/actions/workflows/beam_PostCommit_Python.yml?query=is%3Afailure+branch%3Amaster to see the logs.