
[Bug]: Number of Python examples are failing for Flink and Spark on 2.43.0 release branch #23907

Closed
chamikaramj opened this issue Oct 31, 2022 · 16 comments
Labels
bug, done & done, examples, P0, python

Comments

@chamikaramj
Contributor

What happened?

It seems the following Python examples are failing for Flink and Spark on the 2.43.0 release branch.

  • bigquery_tornadoes
  • hourly_team_score
  • filters_test

For example:
https://ci-beam.apache.org/job/beam_PostCommit_Python_Examples_Spark_PR/14/
https://ci-beam.apache.org/job/beam_PostCommit_Python_Examples_Flink_PR/12/

For bigquery_tornadoes, the error is the following:

RuntimeError: Pipeline BeamApp-jenkins-1028002928-4325f8c5_d8258ff0-6727-4335-867f-56bc846d9f3e failed in state FAILED: java.lang.IllegalArgumentException: PCollectionNodes [PCollectionNode{id=ref_PCollection_PCollection_52, PCollection=unique_name: "61Write/BigQueryBatchFileLoads/TriggerLoadJobsWithoutTempTables.None"
coder_id: "ref_Coder_FastPrimitivesCoder_3"
is_bounded: BOUNDED
windowing_strategy_id: "ref_Windowing_Windowing_1"
}] were consumed but never produced

I found #21300, which probably explains some of the example failures, but it doesn't explain all of the failures above.

Valentyn, are there any known issues that explain these failures?

Issue Priority

Priority: 0

Issue Component

Component: examples-python

@chamikaramj
Contributor Author

P0 since this is blocking the ongoing Beam release.

@chamikaramj added the P0 label on Oct 31, 2022
@tvalentyn
Contributor

Did these pass on the previous release? If so, the errors should be bisectable.

@Abacn
Contributor

Abacn commented Oct 31, 2022

Looks like the failed tests are all BigQuery tests. Both suites have been failing since Sept 14th. Possibly related changes from the 14th: #23122 and/or #23012

@tvalentyn
Contributor

Thanks, @Abacn. The first change affects the Go SDK only. #23012 looks suspicious. @ahmedabu98, could you please take a look?

@chamikaramj
Contributor Author

Confirmed that this:

  • Passes on 2.42.0 branch
  • Fails on 2.43.0 branch
  • Fails on master

Will also try a revert.

@chamikaramj
Contributor Author

Fails for commit ac37784.

Passes for commit 2d4f61c (the one immediately before).

So the culprit seems to be ac37784.

@ahmedabu98, can you please take a look? We can either do a forward fix in the release branch or revert this change if it reverts cleanly.

To reproduce locally, you can run the command below:

./gradlew :sdks:python:test-suites:portable:py37:flinkExamples

@ahmedabu98
Contributor

Seeing a lot of ValueError: Unable to parse jar URL "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Examples_Spark/src/runners/spark/2/job-server/build/libs/beam-runners-spark-3-job-server-2.44.0-SNAPSHOT.jar". If using a full URL, make sure the scheme is specified. If using a local file path, make sure the file exists; you may have to first build the job server using './gradlew runners:spark:3:job-server:shadowJar'.
Is this familiar to anyone?

@Abacn
Contributor

Abacn commented Nov 1, 2022

> Seeing a lot of ValueError: Unable to parse jar URL "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Examples_Spark/src/runners/spark/2/job-server/build/libs/beam-runners-spark-3-job-server-2.44.0-SNAPSHOT.jar". If using a full URL, make sure the scheme is specified. If using a local file path, make sure the file exists; you may have to first build the job server using './gradlew runners:spark:3:job-server:shadowJar'. Is this familiar to anyone?

Looks like it's trying to load the Spark 3 job server jar from the Spark 2 directory: note spark/2 versus spark-3 in the path.

@mosche
Member

mosche commented Nov 2, 2022

> Seeing a lot of ValueError: Unable to parse jar URL "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Examples_Spark/src/runners/spark/2/job-server/build/libs/beam-runners-spark-3-job-server-2.44.0-SNAPSHOT.jar". If using a full URL, make sure the scheme is specified. If using a local file path, make sure the file exists; you may have to first build the job server using './gradlew runners:spark:3:job-server:shadowJar'. Is this familiar to anyone?

Sorry @Abacn @ahmedabu98 @chamikaramj, my bad! This slipped in #23751 when migrating to the Spark 3 job-server. Fixed it here: #23936

@ahmedabu98
Contributor

Great 👍🏽, that solves the Spark issues. Still looking into why those BQ tests are not starting.

@ahmedabu98
Contributor

ahmedabu98 commented Nov 2, 2022

The affected tests have a step that writes to BQ with FILE_LOADS. I've been reproducing locally with other tests and found that pipelines using this write method don't even start on the Flink runner. They do pass with the STREAMING_INSERTS method, though.

With @mosche's fix in #23936 (here), the Spark examples now show the same error:
RuntimeError: Pipeline BeamApp-jenkins-1102122925-c4be4910_cbaf26e4-874a-42dc-adf1-f3a63242d10c failed in state FAILED: java.lang.IllegalArgumentException: PCollectionNodes [PCollectionNode{id=ref_PCollection_PCollection_52, PCollection=unique_name: "61Write/BigQueryBatchFileLoads/TriggerLoadJobsWithoutTempTables.None" coder_id: "ref_Coder_FastPrimitivesCoder_3" is_bounded: BOUNDED windowing_strategy_id: "ref_Windowing_Windowing_1" }] were consumed but never produced
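
For reference, a minimal sketch of the kind of write step involved; the table, schema, and temp bucket below are placeholders, not values from the failing examples:

```python
import apache_beam as beam
from apache_beam.io.gcp.bigquery import WriteToBigQuery

with beam.Pipeline() as p:
    rows = p | beam.Create([{'month': 1, 'tornado_count': 5}])

    # The affected examples write with FILE_LOADS, the method that fails to
    # start on the portable Flink/Spark runners. FILE_LOADS also needs a GCS
    # temp location for its load files.
    _ = rows | 'WriteWithFileLoads' >> WriteToBigQuery(
        'project:dataset.table',                          # placeholder table
        schema='month:INTEGER,tornado_count:INTEGER',
        method=WriteToBigQuery.Method.FILE_LOADS,
        custom_gcs_temp_location='gs://my-bucket/temp')   # placeholder bucket

    # Per the observation above, the same pipelines run when the method is
    # switched to WriteToBigQuery.Method.STREAMING_INSERTS.
```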

@ahmedabu98
Contributor

ahmedabu98 commented Nov 2, 2022

Update: this fails when the pipeline writes with FILE_LOADS and is paired with a BigqueryMatcher to verify (example here).

Update #2: it's actually not the BigqueryMatcher; it's when test args are passed into the Pipeline() instantiation (eg here)
---> i.e. telling the pipeline to use the FlinkRunner... so nothing too helpful yet.
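
To make the "test args" part concrete, here is a rough sketch of the pattern; the specific flags below are assumptions, not the exact args the examples pass:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Hypothetical test args; the real example tests supply their own set of flags.
test_args = [
    '--runner=FlinkRunner',
    '--environment_type=LOOPBACK',
]

# The test args are handed to Pipeline() via PipelineOptions, which is the
# step where the pipeline is told to use the FlinkRunner.
with beam.Pipeline(options=PipelineOptions(test_args)) as p:
    _ = p | beam.Create([1, 2, 3]) | beam.Map(print)
```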

@ahmedabu98
Contributor

The error is caused by this beam.Flatten() operation, which merges the results of two transforms. In most cases one of those transforms, labeled TriggerLoadJobsWithoutTempTables, is not used and so has no elements to output. Hence that transform's output is reported as "consumed but never produced" by the Flatten operation.

This is not an issue for the DirectRunner or DataflowRunner, but it is caught by Flink and Spark.
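
A toy illustration of that shape (not the actual BigQueryBatchFileLoads code): two branches are merged with beam.Flatten(), and one branch may produce no elements at all.

```python
import apache_beam as beam

with beam.Pipeline() as p:
    inputs = p | beam.Create([1, 2, 3])

    # Two branches, loosely analogous to the with/without-temp-tables load paths.
    with_temp = inputs | 'LoadWithTempTables' >> beam.Filter(lambda x: x <= 10)
    without_temp = inputs | 'LoadWithoutTempTables' >> beam.Filter(lambda x: x > 10)  # empty for this data

    # Flatten consumes both PCollections even when one of them never emits an
    # element; this runs fine on the DirectRunner.
    merged = (with_temp, without_temp) | beam.Flatten()
    merged | beam.Map(print)
```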

@chamikaramj
Contributor Author

Hmm, that's strange. I don't think Flatten requires the input PCollections to be non-empty, but there might be an existing Flink/Spark bug here.

@ahmedabu98
Contributor

Yeah, I'm beginning to doubt my assessment because this was always the setup...

One thing #23012 did change, though: the main output of TriggerLoadJobs is now produced by a return statement, whereas previously it was produced by a yield statement.
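
To illustrate the difference being referred to, a generic DoFn sketch (not the actual TriggerLoadJobs code):

```python
import apache_beam as beam
from apache_beam.pvalue import TaggedOutput

class EmitWithYield(beam.DoFn):
    def process(self, element):
        # Emit outputs by yielding: process() acts as a generator.
        yield element
        yield TaggedOutput('side', element)

class EmitWithReturn(beam.DoFn):
    def process(self, element):
        # Emit the same outputs by returning an iterable instead.
        return [element, TaggedOutput('side', element)]
```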

@ahmedabu98
Contributor

#23954 is a workaround. The relevant BQ tests in that PR are passing now, though other tests in the Flink and Spark example suites started failing relatively recently (just a day ago).

@tvalentyn added the done & done label on Nov 29, 2022