
[Bug]: Number of Python examples are failing for Flink and Spark on 2.43.0 release branch #23907

Closed
chamikaramj opened this issue Oct 31, 2022 · 16 comments
Labels
bug, done & done, examples, P0, python

Comments

@chamikaramj
Contributor

What happened?

It seems the following Python examples are failing for Flink and Spark on the 2.43.0 release branch.

  • bigquery_tornadoes
  • hourly_team_score
  • filters_test

For example:
https://ci-beam.apache.org/job/beam_PostCommit_Python_Examples_Spark_PR/14/
https://ci-beam.apache.org/job/beam_PostCommit_Python_Examples_Flink_PR/12/

For bigquery_tornadoes, the error is the following:

RuntimeError: Pipeline BeamApp-jenkins-1028002928-4325f8c5_d8258ff0-6727-4335-867f-56bc846d9f3e failed in state FAILED: java.lang.IllegalArgumentException: PCollectionNodes [PCollectionNode{id=ref_PCollection_PCollection_52, PCollection=unique_name: "61Write/BigQueryBatchFileLoads/TriggerLoadJobsWithoutTempTables.None"
coder_id: "ref_Coder_FastPrimitivesCoder_3"
is_bounded: BOUNDED
windowing_strategy_id: "ref_Windowing_Windowing_1"
}] were consumed but never produced

I found #21300, which probably explains some of the example failures, but it doesn't explain all of the failures above.

Valentyn, are there any known issues that explain these failures?

Issue Priority

Priority: 0

Issue Component

Component: examples-python

@chamikaramj
Contributor Author

P0 since this is blocking the ongoing Beam release.

@chamikaramj added the P0 label on Oct 31, 2022
@tvalentyn
Contributor

Did these pass on the previous release? If so, the errors should be bisectable.

@Abacn
Contributor

Abacn commented Oct 31, 2022

Looks like the failed tests are all BigQuery tests. Both suites have been failing since Sept 14th. Possibly related changes from the 14th: #23122 and/or #23012

@tvalentyn
Contributor

Thanks, @Abacn. The first change affects the Go SDK only. #23012 looks suspicious. @ahmedabu98, could you please take a look?

@chamikaramj
Contributor Author

Confirmed that this:

  • Passes on 2.42.0 branch
  • Fails on 2.43.0 branch
  • Fails on master

Will also try a revert.

@chamikaramj
Contributor Author

Fails for commit ac37784.

Passes for commit 2d4f61c (the one immediately before).

So the culprit seems to be ac37784.

@ahmedabu98, can you please take a look? We can either do a forward fix in the release branch or revert this change if it reverts cleanly.

To reproduce locally, you can run the command below:

./gradlew :sdks:python:test-suites:portable:py37:flinkExamples

@ahmedabu98
Contributor

Seeing a lot of ValueError: Unable to parse jar URL "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Examples_Spark/src/runners/spark/2/job-server/build/libs/beam-runners-spark-3-job-server-2.44.0-SNAPSHOT.jar". If using a full URL, make sure the scheme is specified. If using a local file path, make sure the file exists; you may have to first build the job server using './gradlew runners:spark:3:job-server:shadowJar'.
Is this familiar to anyone?

@Abacn
Contributor

Abacn commented Nov 1, 2022

> Seeing a lot of ValueError: Unable to parse jar URL "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Examples_Spark/src/runners/spark/2/job-server/build/libs/beam-runners-spark-3-job-server-2.44.0-SNAPSHOT.jar". If using a full URL, make sure the scheme is specified. If using a local file path, make sure the file exists; you may have to first build the job server using './gradlew runners:spark:3:job-server:shadowJar'. Is this familiar to anyone?

Looks like it's trying to load the Spark 3 job server jar from the Spark 2 directory: note spark/2 versus spark-3 in the path.

@mosche
Member

mosche commented Nov 2, 2022

> Seeing a lot of ValueError: Unable to parse jar URL "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Examples_Spark/src/runners/spark/2/job-server/build/libs/beam-runners-spark-3-job-server-2.44.0-SNAPSHOT.jar". If using a full URL, make sure the scheme is specified. If using a local file path, make sure the file exists; you may have to first build the job server using './gradlew runners:spark:3:job-server:shadowJar'. Is this familiar to anyone?

Sorry @Abacn @ahmedabu98 @chamikaramj, my bad! This slipped in #23751 when migrating to the Spark 3 job-server. Fixed it here: #23936

@ahmedabu98
Contributor

Great 👍🏽, that solves the Spark issues. Still looking into why those BQ tests are not starting.

@ahmedabu98
Contributor

ahmedabu98 commented Nov 2, 2022

The affected tests have a step that writes to BQ with FILE_LOADS. I've been reproducing locally with other tests and found that pipelines using this write method don't even start on the Flink runner. They do pass with the STREAMING_INSERTS method, though.

With @mosche's fix in #23936 (here), the Spark examples now show the same error:
RuntimeError: Pipeline BeamApp-jenkins-1102122925-c4be4910_cbaf26e4-874a-42dc-adf1-f3a63242d10c failed in state FAILED: java.lang.IllegalArgumentException: PCollectionNodes [PCollectionNode{id=ref_PCollection_PCollection_52, PCollection=unique_name: "61Write/BigQueryBatchFileLoads/TriggerLoadJobsWithoutTempTables.None" coder_id: "ref_Coder_FastPrimitivesCoder_3" is_bounded: BOUNDED windowing_strategy_id: "ref_Windowing_Windowing_1" }] were consumed but never produced
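
For reference, a minimal sketch of the kind of write step involved; the table, schema, and temp bucket below are placeholders, not values from the failing examples:

```python
import apache_beam as beam
from apache_beam.io.gcp.bigquery import WriteToBigQuery

with beam.Pipeline() as p:
    rows = p | beam.Create([{'month': 1, 'tornado_count': 5}])

    # The affected examples write with FILE_LOADS, the method that fails to
    # start on the portable Flink/Spark runners. FILE_LOADS also needs a GCS
    # temp location for its load files.
    _ = rows | 'WriteWithFileLoads' >> WriteToBigQuery(
        'project:dataset.table',                          # placeholder table
        schema='month:INTEGER,tornado_count:INTEGER',
        method=WriteToBigQuery.Method.FILE_LOADS,
        custom_gcs_temp_location='gs://my-bucket/temp')   # placeholder bucket

    # Per the observation above, the same pipelines run when the method is
    # switched to WriteToBigQuery.Method.STREAMING_INSERTS.
```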

@ahmedabu98
Contributor

ahmedabu98 commented Nov 2, 2022

Update: this fails when the pipeline writes with FILE_LOADS and is paired with a BigqueryMatcher to verify (example here).

Update #2: it's actually not the BigqueryMatcher; it's when test args are passed into the Pipeline() instantiation (eg here)
---> i.e. telling the pipeline to use the FlinkRunner... so nothing too helpful yet.
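
To make the "test args" part concrete, here is a rough sketch of the pattern; the specific flags below are assumptions, not the exact args the examples pass:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Hypothetical test args; the real example tests supply their own set of flags.
test_args = [
    '--runner=FlinkRunner',
    '--environment_type=LOOPBACK',
]

# The test args are handed to Pipeline() via PipelineOptions, which is the
# step where the pipeline is told to use the FlinkRunner.
with beam.Pipeline(options=PipelineOptions(test_args)) as p:
    _ = p | beam.Create([1, 2, 3]) | beam.Map(print)
```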

@ahmedabu98
Contributor

The error is caused by this beam.Flatten() operation, which merges the results of two transforms. In most cases one of those transforms, labeled TriggerLoadJobsWithoutTempTables, is not used and so has no elements to output. Hence that transform's output is reported as "consumed but never produced" by the Flatten operation.

This is not an issue for the DirectRunner or DataflowRunner, but it is caught by Flink and Spark.
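
A toy illustration of that shape (not the actual BigQueryBatchFileLoads code): two branches are merged with beam.Flatten(), and one branch may produce no elements at all.

```python
import apache_beam as beam

with beam.Pipeline() as p:
    inputs = p | beam.Create([1, 2, 3])

    # Two branches, loosely analogous to the with/without-temp-tables load paths.
    with_temp = inputs | 'LoadWithTempTables' >> beam.Filter(lambda x: x <= 10)
    without_temp = inputs | 'LoadWithoutTempTables' >> beam.Filter(lambda x: x > 10)  # empty for this data

    # Flatten consumes both PCollections even when one of them never emits an
    # element; this runs fine on the DirectRunner.
    merged = (with_temp, without_temp) | beam.Flatten()
    merged | beam.Map(print)
```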

@chamikaramj
Contributor Author

Hmm, that's strange. I don't think Flatten requires the input PCollections to be non-empty, but there might be an existing Flink/Spark bug here.

@ahmedabu98
Contributor

Yeah, I'm beginning to doubt my assessment because this was always the setup...

One thing #23012 did change, though: the main output of TriggerLoadJobs is now produced by a return statement, whereas previously it was produced by a yield statement.
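
To illustrate the difference being referred to, a generic DoFn sketch (not the actual TriggerLoadJobs code):

```python
import apache_beam as beam
from apache_beam.pvalue import TaggedOutput

class EmitWithYield(beam.DoFn):
    def process(self, element):
        # Emit outputs by yielding: process() acts as a generator.
        yield element
        yield TaggedOutput('side', element)

class EmitWithReturn(beam.DoFn):
    def process(self, element):
        # Emit the same outputs by returning an iterable instead.
        return [element, TaggedOutput('side', element)]
```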

@ahmedabu98
Contributor

#23954 is a workaround. The relevant BQ tests in that PR are passing now, though other tests in the Flink and Spark example suites started failing relatively recently (just a day ago).

@tvalentyn added the done & done label on Nov 29, 2022