
[Failing Test]: PythonPostCommit is Extremely Flaky #29214

Closed · 1 of 16 tasks · jrmccluskey opened this issue Oct 31, 2023 · 14 comments · Fixed by #29334

Labels: bug · done & done (Issue has been reviewed after it was closed for verification, followups, etc.) · failing test · P1 · permared · python · tests

Comments

@jrmccluskey
Contributor

jrmccluskey commented Oct 31, 2023

What happened?

The apache_beam/io/external/xlang_kinesisio_it_test.py::CrossLanguageKinesisIOTest::test_kinesis_write test is failing in the Python PostCommit with a consistent error message:

botocore.exceptions.ConnectionClosedError: Connection was closed before we received a valid response from endpoint URL: "http://localhost:32770/".

The test is defined here: https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/external/xlang_kinesisio_it_test.py#L94

Specifically, the failure is in create_stream():

      if self.use_localstack:
>       self.kinesis_helper.create_stream(self.aws_kinesis_stream)
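
For context, here is a minimal, hedged sketch of the kind of call that is failing, wrapped in a retry loop that would tolerate the localstack container still starting up. This is illustrative only: the endpoint URL, credentials, and function are assumptions, not the actual KinesisHelper implementation.

```python
# Illustrative sketch, not the actual KinesisHelper: create a Kinesis stream
# against a localstack endpoint, retrying on connection errors while the
# container is still coming up. All names and values here are assumptions.
import time

import boto3
from botocore.exceptions import ConnectionClosedError, EndpointConnectionError


def create_stream_with_retry(stream_name, endpoint_url, attempts=5, delay=2.0):
    client = boto3.client(
        'kinesis',
        endpoint_url=endpoint_url,          # e.g. the mapped localstack port
        region_name='us-east-1',
        aws_access_key_id='accesskey',      # dummy creds accepted by localstack
        aws_secret_access_key='secretkey')
    for attempt in range(1, attempts + 1):
        try:
            client.create_stream(StreamName=stream_name, ShardCount=1)
            return
        except (ConnectionClosedError, EndpointConnectionError):
            if attempt == attempts:
                raise
            time.sleep(delay)  # give the localstack container time to come up


# create_stream_with_retry('beam-kinesis-test', 'http://localhost:32770')
```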

Issue Failure

Failure: Test is continually failing

Issue Priority

Priority: 1 (unhealthy code / failing or flaky postcommit so we cannot be sure the product is healthy)

Issue Components

  • Component: Python SDK
  • Component: Java SDK
  • Component: Go SDK
  • Component: Typescript SDK
  • Component: IO connector
  • Component: Beam YAML
  • Component: Beam examples
  • Component: Beam playground
  • Component: Beam katas
  • Component: Website
  • Component: Spark Runner
  • Component: Flink Runner
  • Component: Samza Runner
  • Component: Twister2 Runner
  • Component: Hazelcast Jet Runner
  • Component: Google Cloud Dataflow Runner
@AnandInguva
Contributor

Is this still relevant?

@jrmccluskey
Contributor Author

Yep, still a problem. @damccorm said there had been a few issues along these lines with the self-hosted runners.

@volatilemolotov
Contributor

I'm looking into this one. There are issues with the runners that will soon be fixed, but that is still not the final fix needed for the Python PostCommit. I'll keep you posted.

@volatilemolotov
Contributor

.take-issue

@damccorm
Contributor

damccorm commented Nov 7, 2023

@volatilemolotov I don't think you meant to auto-close this with the PR, is that right? If yes, we can reclose after a green signal anyways I guess

@damccorm damccorm reopened this Nov 7, 2023
@volatilemolotov
Contributor

I did not mean to auto-close; it's only part of the fix. Sorry, I'm not aware of these mechanisms, I've worked on a lot of different systems :)

@damccorm
Contributor

damccorm commented Nov 7, 2023

No worries, it's actually a GitHub feature - https://docs.github.com/en/issues/tracking-your-work-with-issues/linking-a-pull-request-to-an-issue#linking-a-pull-request-to-an-issue-using-a-keyword

@github-actions github-actions bot added this to the 2.53.0 Release milestone Nov 7, 2023
@volatilemolotov
Contributor

So a scheduled run failed, with only one job actually passing. The other three jobs fail in different places:
https://github.com/apache/beam/actions/runs/6794201259/job/18470220889#step:9:37734

Any ideas what is going on? Could it be because of the parallel runs? (I had a fully green run in my fork.)

@damccorm
Contributor

damccorm commented Nov 8, 2023

Definitely seems like we've upgraded from permared jobs to test flakiness, so I don't think this is a runner/actions problem anymore. For example https://github.com/apache/beam/actions/runs/6797617071/job/18480096798 already has 3 green jobs (with a 4th still running)

At least some of it is caused by #29076 - I see a bunch of failures related to that test in the workflow you linked.

I have #29197 to fix that; I was holding off on merging since there was a lot going on causing issues, but it might be time to merge. I'm running https://github.com/apache/beam/actions/runs/6802293125 to make sure I'm correctly sickbaying it, but once that runs (assuming it's working as expected) I think we should merge the PR.
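
For readers unfamiliar with the term, "sickbaying" a flaky test usually just means skipping it with a pointer to the tracking issue, so the suite can go green while the root cause is investigated. A minimal sketch of what that typically looks like in the Python SDK follows; the test and class names are made up and this is not the actual change in #29197.

```python
# Illustrative sketch of sickbaying a flaky test: skip it with a reference to
# the tracking issue. Names below are hypothetical.
import unittest


class SomeFlakyIT(unittest.TestCase):

  @unittest.skip(
      'Sickbayed: flaky, see https://github.com/apache/beam/issues/29076')
  def test_flaky_scenario(self):
    self.fail('Not executed while the test is sickbayed.')
```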

@volatilemolotov
Contributor

The workflow you referenced is green:
https://github.com/apache/beam/actions/runs/6797617071/job/18480096798

So yeah, flakiness. Glad to have it sorted; we were lucky that the MTU issues did not cause bigger problems.

@jrmccluskey jrmccluskey changed the title [Failing Test]: PythonPostCommit is Perma-Red [Failing Test]: PythonPostCommit is Extremely Flaky Dec 6, 2023
@Abacn
Contributor

Abacn commented Jan 10, 2024

It is still flaky, though at a lower frequency: https://github.com/apache/beam/runs/20274778848

There are other flaky tests as well, e.g.

apache_beam.examples.fastavro_it_test.FastavroIT

apache_beam.runners.dataflow.dataflow_runner.DataflowRuntimeException: Dataflow pipeline failed. State: FAILED, Error:
Traceback (most recent call last):
  File "apache_beam/runners/common.py", line 1435, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam/runners/common.py", line 636, in apache_beam.runners.common.SimpleInvoker.invoke_process
  File "apache_beam/runners/common.py", line 1611, in apache_beam.runners.common._OutputHandler.handle_process_outputs
  File "/opt/apache/beam-venv/beam-venv-worker-sdk-0-0/lib/python3.10/site-packages/apache_beam/io/filebasedsource.py", line 380, in process
    source = self._source_from_file(metadata.path)
  File "/opt/apache/beam-venv/beam-venv-worker-sdk-0-0/lib/python3.10/site-packages/apache_beam/io/filebasedsource.py", line 127, in __init__
    self._validate()
  File "/opt/apache/beam-venv/beam-venv-worker-sdk-0-0/lib/python3.10/site-packages/apache_beam/options/value_provider.py", line 193, in _f
    return fnc(self, *args, **kwargs)
  File "/opt/apache/beam-venv/beam-venv-worker-sdk-0-0/lib/python3.10/site-packages/apache_beam/io/filebasedsource.py", line 190, in _validate
    raise IOError('No files found based on the file pattern %s' % pattern)
OSError: No files found based on the file pattern gs://temp-storage-for-end-to-end-tests/py-it-cloud/output/e26d0c72-c41a-43e4-aa51-598e6994b277/fastavro-00003-of-00004
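
The failure above looks like a race between the pipeline finishing and the verification step listing the output shards. A minimal sketch of one way such a check could poll for the expected shards before reading them back; the function, pattern, and shard count are hypothetical and this is not the actual FastavroIT code.

```python
# Illustrative sketch: poll until every expected output shard is visible
# before reading it back, to avoid "No files found based on the file pattern".
import time

from apache_beam.io.filesystems import FileSystems


def wait_for_shards(pattern, expected_shards, timeout=300, poll=10):
    deadline = time.time() + timeout
    while time.time() < deadline:
        match_result = FileSystems.match([pattern])[0]
        if len(match_result.metadata_list) >= expected_shards:
            return [m.path for m in match_result.metadata_list]
        time.sleep(poll)
    raise IOError('Timed out waiting for %d shards matching %s' %
                  (expected_shards, pattern))


# shards = wait_for_shards('gs://<output-prefix>/fastavro-*', expected_shards=4)
```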

@lostluck
Contributor

There's one week until the 2.54.0 cut and this issue is tagged for that release. If possible/necessary, please complete the necessary work before then, or move this to the 2.55.0 Release milestone.

This one seems like it may need a cherry pick, though, if additional fixes land.

@volatilemolotov
Contributor

It's still flaky, failing differently:

https://github.com/apache/beam/actions/runs/7563095321/job/20594841089#step:9:37337

Could be the test is somehow broken, but I cannot see a pattern right now.

@lostluck
Contributor

lostluck commented Feb 6, 2024

Closing this one, as all the python postcommits are passing on the release branch.

@lostluck lostluck closed this as completed Feb 6, 2024
@damccorm damccorm added the done & done Issue has been reviewed after it was closed for verification, followups, etc. label Feb 13, 2024