The PostCommit Python Dependency job is flaky #30799

Closed
github-actions bot opened this issue Mar 29, 2024 · 15 comments

Comments

@github-actions
Contributor

The PostCommit Python Dependency job is failing over 50% of the time.
Please visit https://github.com/apache/beam/actions/workflows/beam_PostCommit_Python_Dependency.yml?query=is%3Afailure+branch%3Amaster to see the logs.

@liferoad
Collaborator

@github-actions github-actions bot added this to the 2.56.0 Release milestone Apr 13, 2024
@github-actions github-actions bot reopened this Oct 12, 2024
Contributor Author

Reopening since the workflow is still flaky

@liferoad liferoad removed this from the 2.56.0 Release milestone Oct 29, 2024
@liferoad
Collaborator

2024-10-29T11:15:12.3297202Z       if not SentenceTransformer:
2024-10-29T11:15:12.3297585Z >       raise ImportError(
2024-10-29T11:15:12.3298106Z             "sentence-transformers is required to use "
2024-10-29T11:15:12.3298603Z             "SentenceTransformerEmbeddings."
2024-10-29T11:15:12.3299255Z             "Please install it with using `pip install sentence-transformers`.")
2024-10-29T11:15:12.3300468Z E       ImportError: sentence-transformers is required to use SentenceTransformerEmbeddings.Please install it with using `pip install sentence-transformers`.
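
The traceback above shows the test reaching code that needs the optional `sentence-transformers` package in an environment where it is not installed. As a minimal sketch, assuming the test is meant to be optional in such environments, a test could be skipped when the module is absent. The helper `has_module` and the class `EmbeddingsSmokeTest` are hypothetical names for illustration and are not taken from the Beam test suite.

```python
import importlib.util
import unittest


def has_module(name: str) -> bool:
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(name) is not None


# Hypothetical test class: skipped entirely when the optional package is missing.
@unittest.skipUnless(
    has_module("sentence_transformers"),
    "sentence-transformers is not installed")
class EmbeddingsSmokeTest(unittest.TestCase):
    def test_import(self):
        # Only runs when the package is available in this environment.
        import sentence_transformers  # noqa: F401
```

Whether skipping is the right behavior here, as opposed to installing the package in the dependency job, is a judgment call for the maintainers.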

@damccorm
Contributor

@github-actions github-actions bot added this to the 2.61.0 Release milestone Oct 30, 2024
@github-actions github-actions bot reopened this Nov 24, 2024
Contributor Author

Reopening since the workflow is still flaky

@liferoad
Collaborator

liferoad commented Nov 24, 2024

target/.tox-py39-tensorflow-212/py39-tensorflow-212/lib/python3.9/site-packages/tensorflow/python/util/nest_util.py:920: in _tf_core_pack_sequence_as
    return sequence_fn(structure, packed)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

instance = {}, args = []

    def sequence_like(instance, args):
      """Converts the sequence `args` to the same type as `instance`.
    
      Args:
        instance: an instance of `tuple`, `list`, `namedtuple`, `dict`,
          `collections.OrderedDict`, or `composite_tensor.Composite_Tensor` or
          `type_spec.TypeSpec`.
        args: items to be converted to the `instance` type.
    
      Returns:
        `args` with the type of `instance`.
      """
      if _is_mutable_mapping(instance):
        # Pack dictionaries in a deterministic order by sorting the keys.
        # Notice this means that we ignore the original order of `OrderedDict`
        # instances. This is intentional, to avoid potential bugs caused by mixing
        # ordered and plain dicts (e.g., flattening a dict but using a
        # corresponding `OrderedDict` to pack it back).
>       result = dict(zip(_tf_core_sorted(instance), args))
E       Failed: Timeout >600.0s [while running 'RunInference/BeamML_RunInference']
...
=========================== short test summary info ============================
FAILED apache_beam/ml/inference/tensorflow_inference_test.py::TFRunInferenceTest::test_predict_tensor_with_batch_size
FAILED apache_beam/ml/inference/tensorflow_inference_test.py::TFRunInferenceTest::test_predict_tensor_with_large_model
FAILED apache_beam/ml/inference/tensorflow_inference_test.py::TFRunInferenceTest::test_predict_numpy_with_batch_size
FAILED apache_beam/ml/inference/tensorflow_inference_test.py::TFRunInferenceTest::test_predict_numpy_with_large_model
====== 4 failed, 10 passed, 11 skipped, 18 warnings in 1244.76s (0:20:44) ======
py39-tensorflow-212: exit 1 (1247.30 seconds) /runner/_work/beam/beam/sdks/python/test-suites/tox/py39/build/srcs/sdks/python> /bin/sh -c 'pytest -o junit_suite_name=py39-tensorflow-212 --junitxml=pytest_py39-tensorflow-212.xml -n 6 -m uses_tf ; ret=$?; [ $ret = 5 ] && exit 0 || exit $ret' pid=18090
py39-tensorflow-212: commands_post[0]> bash /runner/_work/beam/beam/sdks/python/test-suites/tox/py39/build/srcs/sdks/python/scripts/run_tox_cleanup.sh
  py39-tensorflow-212: FAIL code 1 (1605.92=setup[356.45]+cmd[0.01,0.24,1.38,0.03,0.44,1247.30,0.07] seconds)
  evaluation failed :( (1606.10 seconds)

> Task :sdks:python:test-suites:tox:py39:testPy39tensorflow-212 FAILED

@damccorm damccorm removed this from the 2.61.0 Release milestone Nov 25, 2024
@kennknowles
Member

Should this be escalated to a clearer bug about incompatibility with TensorFlow 2.12? Is it new and reproducible?

@kennknowles
Member

$ git log 3f8fabeeb532fa8fca3a50a75ed21905ca44fc11..a06454a22084242b4fe089b570fd090810885885 --oneline
a06454a2208 Upgrade GCP-BOM to 26.49.0 (#32864)
10c7eb34af1 fix the flink runner doc (#33182)
5d3088f7489 Bump zetasql version to 2024.11.1 (#32902)
d2e9928da49 Revert "Fixed the broken beam python on flink with PortableRunner" (#33178)
fc9083b35f9 Create Python SDK Distroless variant (#33160)

@kennknowles
Member

kennknowles commented Nov 25, 2024

Succeeded at 3f8fabe, then perma-red since a06454a.
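
A range with a known-good start and a known-bad end can be narrowed by bisection. The following is a minimal sketch, assuming the failure is deterministic per commit and is actually caused by one of the commits in the range; neither is established for a flaky timeout. The function `first_bad` and the predicate `is_bad` are hypothetical; in practice `is_bad` would check out the commit and run the failing suite.

```python
from typing import Callable, Sequence


def first_bad(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Binary-search for the first failing commit.

    Assumes `commits` is ordered oldest to newest, that the last one is bad,
    and that once a commit is bad every later commit is bad too.
    """
    lo, hi = 0, len(commits) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid  # the first bad commit is at mid or earlier
        else:
            lo = mid + 1  # the first bad commit is after mid
    return commits[lo]


# Commits from the `git log` output in this thread, reordered oldest first.
candidates = [
    "fc9083b35f9",
    "d2e9928da49",
    "5d3088f7489",
    "10c7eb34af1",
    "a06454a2208",
]
```

With five candidates this needs at most three runs of the suite, which matters when each run takes around twenty minutes.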

@kennknowles
Member

@damondouglas @liferoad each have a Python-related commit in there.

@kennknowles
Member

kennknowles commented Nov 25, 2024

Confirmed by random inspection that the TensorFlow 2.12 variant is always the failure, and always due to timeout.

@liferoad
Collaborator

I suspect a race condition when we save models with the same name. I will try to fix it later.
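
If the suspicion above is right, one common remedy is to give every test its own save location, since the suite runs with `pytest -n 6` and several workers may write at the same time. This is a minimal sketch under that assumption, not the fix that was applied; `unique_model_path` is a hypothetical helper and the file name `saved_model` is illustrative.

```python
import os
import tempfile


def unique_model_path(prefix: str = "model") -> str:
    """Create a fresh directory and return a model path inside it.

    `tempfile.mkdtemp` creates a uniquely named directory, so two parallel
    test workers never receive the same path.
    """
    directory = tempfile.mkdtemp(prefix=prefix + "_")
    return os.path.join(directory, "saved_model")
```

A test would call `unique_model_path()` once, save the model there, and pass the same path to the code that loads it.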

@liferoad
Collaborator

liferoad commented Nov 26, 2024

The past successful job has this:

2024-11-20T17:48:30.5153308Z py39-tensorflow-212: commands_pre[2]> pip check
2024-11-20T17:48:32.3144642Z No broken requirements found.
2024-11-20T17:48:32.4144869Z py39-tensorflow-212: commands_pre[3]> bash /runner/_work/beam/beam/sdks/python/test-suites/tox/py39/build/srcs/sdks/python/scripts/run_tox_cleanup.sh
2024-11-20T17:48:32.5144417Z py39-tensorflow-212: commands[0]> /bin/sh -c 'pip freeze | grep -E tensorflow'
2024-11-20T17:48:33.0143264Z tensorflow==2.18.0
2024-11-20T17:48:33.0146899Z tensorflow-estimator==2.12.0
2024-11-20T17:48:33.0149237Z tensorflow-hub==0.16.1
2024-11-20T17:48:33.0156149Z tensorflow-io-gcs-filesystem==0.37.1

The recently failed one has:

2024-11-25T20:55:45.8184959Z py39-tensorflow-212: commands[0]> /bin/sh -c 'pip freeze | grep -E tensorflow'
2024-11-25T20:55:46.2184387Z tensorflow==2.15.1
2024-11-25T20:55:46.2185214Z tensorflow-estimator==2.15.0
2024-11-25T20:55:46.2185940Z tensorflow-hub==0.16.1
2024-11-25T20:55:46.2187825Z tensorflow-io-gcs-filesystem==0.37.1
2024-11-25T20:55:46.2190040Z tensorflow-metadata==1.15.0
2024-11-25T20:55:46.2192088Z tensorflow-serving-api==2.15.1
2024-11-25T20:55:46.2193701Z tensorflow-transform==1.15.0

Very strange.

@jrmccluskey
Contributor

That test environment is defined here

Even the 2.18.0 job is wrong; this test environment is supposed to be pinned to TF 2.12.
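
A mismatch like this could be caught early by comparing the `pip freeze` output against the intended pin before any tests run. The following is a minimal sketch of such a check; `installed_version` and `matches_pin` are hypothetical helpers, and the sample lines are the `tensorflow` entries from the two logs above with the timestamps stripped.

```python
from typing import Iterable, Optional


def installed_version(freeze_lines: Iterable[str], package: str) -> Optional[str]:
    """Return the version of `package` from `pip freeze`-style lines, if present."""
    for line in freeze_lines:
        name, sep, version = line.strip().partition("==")
        if sep and name.lower() == package.lower():
            return version
    return None


def matches_pin(version: Optional[str], pinned_minor: str) -> bool:
    """True if `version` equals `pinned_minor` or is a patch release of it."""
    return version is not None and (
        version == pinned_minor or version.startswith(pinned_minor + "."))


passing_job = ["tensorflow==2.18.0", "tensorflow-estimator==2.12.0"]
failing_job = ["tensorflow==2.15.1", "tensorflow-estimator==2.15.0"]

for job in (passing_job, failing_job):
    version = installed_version(job, "tensorflow")
    # Neither job satisfies a 2.12 pin, matching the observation above.
    print(version, matches_pin(version, "2.12"))
```

Failing fast on the pin would make the environment problem visible as a clear error rather than as a test timeout.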

@liferoad
Collaborator

liferoad commented Dec 9, 2024

Green for the last two weeks.

@liferoad liferoad closed this as completed Dec 9, 2024
@github-actions github-actions bot added this to the 2.62.0 Release milestone Dec 9, 2024