-
Notifications
You must be signed in to change notification settings - Fork 4.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Spark Dataset runner] Fix SparkSessionFactory to better support running on a cluster. #24862
[Spark Dataset runner] Fix SparkSessionFactory to better support running on a cluster. #24862
Conversation
…ing on a cluster (fixes apache#24861).
Run Spark StructuredStreaming ValidatesRunner |
R: @aromanenko-dev |
Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control |
@mosche reviewing ... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for your PR Moiritz.
.../java/org/apache/beam/runners/spark/structuredstreaming/translation/SparkSessionFactory.java
Show resolved
Hide resolved
.../java/org/apache/beam/runners/spark/structuredstreaming/translation/SparkSessionFactory.java
Outdated
Show resolved
Hide resolved
.../java/org/apache/beam/runners/spark/structuredstreaming/translation/SparkSessionFactory.java
Outdated
Show resolved
Hide resolved
.../java/org/apache/beam/runners/spark/structuredstreaming/translation/SparkSessionFactory.java
Outdated
Show resolved
Hide resolved
.../java/org/apache/beam/runners/spark/structuredstreaming/translation/SparkSessionFactory.java
Outdated
Show resolved
Hide resolved
.../java/org/apache/beam/runners/spark/structuredstreaming/translation/SparkSessionFactory.java
Show resolved
Hide resolved
Thanks for the review @echauchot , I've pushed the null check on |
Run Spark ValidatesRunner |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks LGTM, merging
The most common way to submit jobs to a Spark cluster is to use
spark-submit
.SparkSessionFactory
doesn't handle that well. Despitespark.master
being set byspark-submit
, it's overwritten by the respectivePipelineOption
which defaults tolocal[*]
.spark.master
if configured already, use the effective Spark master going forward.Staging of classpath artifacts is necessary when running on a cluster using a local driver. This is required to populate
spark.jars
as it would be done byspark-submit
otherwise. Unfortunately this is broken as staging is done after the session was already created.Enable
userClassPathFirst
to deal with conflicting dependency versions. For Spark & Beam that's usually the case for Jackson and Guava, but potentially also others.Correctly stage classpath if not in local mode and if
spark.jars
is not set. Exclude Spark jars and similar that are already available on the cluster and that cause conflicts if enablinguserClassPathFirst
.(fixes #24861).
Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:
addresses #123
), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, commentfixes #<ISSUE NUMBER>
instead.CHANGES.md
with noteworthy changes.See the Contributor Guide for more tips on how to make review process smoother.
To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md
GitHub Actions Tests Status (on master branch)
See CI.md for more information about GitHub Actions CI.