[Spark Dataset runner] Fix SparkSessionFactory to better support running on a cluster. #24862

Merged 4 commits on Jan 16, 2023

@mosche (Member) commented Jan 3, 2023:

The most common way to submit jobs to a Spark cluster is spark-submit, but SparkSessionFactory doesn't handle that well. Although spark-submit sets spark.master, the factory overwrites it with the corresponding PipelineOption, which defaults to local[*].

  • Fix the factory so that it does not overwrite spark.master if it is already configured, and use the effective Spark master from then on.
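
To illustrate the intended precedence, here is a minimal sketch; it is not the code from this PR. A plain `Map` stands in for the Spark configuration so the example runs without Spark on the classpath, and the class name `EffectiveMaster`, the method `resolve`, and the parameter `optionMaster` are hypothetical. With a real `SparkConf`, `setIfMissing` could play a similar role.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch only: a plain Map stands in for the Spark configuration (assumption). */
public class EffectiveMaster {

  /**
   * Returns the effective master. A spark.master value that is already
   * configured (for example by spark-submit) wins; otherwise the value of
   * the pipeline option (hypothetical parameter) is applied.
   */
  static String resolve(Map<String, String> conf, String optionMaster) {
    // putIfAbsent leaves an existing spark.master untouched.
    conf.putIfAbsent("spark.master", optionMaster);
    return conf.get("spark.master");
  }

  public static void main(String[] args) {
    Map<String, String> submitted = new HashMap<>();
    submitted.put("spark.master", "yarn");
    // The master set at submission time is kept.
    System.out.println(resolve(submitted, "local[*]"));
    // Without a configured master, the option default is used.
    System.out.println(resolve(new HashMap<>(), "local[*]"));
  }
}
```

Everything downstream should then read the returned value rather than the pipeline option, so that decisions such as whether to stage the classpath are based on the master actually in effect.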

Staging of classpath artifacts is necessary when running on a cluster with a local driver: it populates spark.jars, which spark-submit would otherwise do. Unfortunately, this is broken because staging happens after the session has already been created.

  • Enable userClassPathFirst to deal with conflicting dependency versions. For Spark & Beam that's usually the case for Jackson and Guava, but potentially also others.

  • Correctly stage the classpath if not in local mode and if spark.jars is not set. Exclude Spark jars and similar jars that are already available on the cluster and that would cause conflicts when userClassPathFirst is enabled.
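
The staging rules above can be sketched roughly as follows. This is an illustration under stated assumptions, not the implementation in this PR: a plain `Map` again stands in for the Spark configuration, the class `ClasspathStaging`, its methods, and the `PROVIDED_PREFIXES` exclusion list are hypothetical, and enabling both `spark.driver.userClassPathFirst` and `spark.executor.userClassPathFirst` together with staging is an assumption about how the two bullets combine.

```java
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch only: decides which classpath entries to list in spark.jars. */
public class ClasspathStaging {

  // Hypothetical rule: jars with these name prefixes are assumed to be
  // provided by the cluster and are therefore not staged.
  static final List<String> PROVIDED_PREFIXES = List.of("spark-", "scala-", "hadoop-");

  /** True for "local" and "local[...]" masters; "local-cluster[...]" is not local. */
  static boolean isLocal(String master) {
    return master.equals("local") || master.startsWith("local[");
  }

  /** Keeps jar entries whose file name does not match a provided prefix. */
  static List<String> jarsToStage(List<String> classpath) {
    List<String> result = new ArrayList<>();
    for (String entry : classpath) {
      String name = Paths.get(entry).getFileName().toString();
      boolean provided = PROVIDED_PREFIXES.stream().anyMatch(name::startsWith);
      if (name.endsWith(".jar") && !provided) {
        result.add(entry);
      }
    }
    return result;
  }

  /** Must run before the session is created, since spark.jars is read at startup. */
  static void configure(Map<String, String> conf, List<String> classpath) {
    String master = conf.getOrDefault("spark.master", "local[*]");
    if (isLocal(master) || conf.containsKey("spark.jars")) {
      return; // local mode, or jars already supplied (e.g. by spark-submit)
    }
    conf.put("spark.jars", String.join(",", jarsToStage(classpath)));
    // Assumption: prefer user jars over the cluster's versions (e.g. Jackson, Guava).
    conf.put("spark.driver.userClassPathFirst", "true");
    conf.put("spark.executor.userClassPathFirst", "true");
  }

  public static void main(String[] args) {
    Map<String, String> conf = new HashMap<>();
    conf.put("spark.master", "spark://host:7077"); // hypothetical cluster URL
    configure(conf, List.of(
        "/app/lib/my-pipeline.jar",           // hypothetical paths
        "/app/lib/spark-core_2.12-3.2.2.jar",
        "/app/lib/jackson-databind-2.14.1.jar"));
    System.out.println(conf.get("spark.jars"));
  }
}
```

The ordering is the important part: because the configuration is completed before any session exists, the staged jars are in place when the session starts.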

(fixes #24861).


@mosche (Member, Author) commented Jan 3, 2023:

Run Spark StructuredStreaming ValidatesRunner

@mosche (Member, Author) commented Jan 3, 2023:

R: @aromanenko-dev
R: @echauchot

@github-actions bot (Contributor) commented Jan 3, 2023:

Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control

@echauchot self-requested a review on January 9, 2023, 15:04

@echauchot (Contributor) commented:

@mosche reviewing ...

@echauchot (Contributor) left a comment:

Thanks for your PR, Moritz.

@mosche (Member, Author) commented Jan 11, 2023:

Thanks for the review, @echauchot. I've pushed the null check on appName.

@echauchot (Contributor) commented:

Run Spark ValidatesRunner

@echauchot (Contributor) left a comment:

Thanks, LGTM. Merging.

@echauchot merged commit 1aa5acc into apache:master on Jan 16, 2023
@aromanenko-dev mentioned this pull request on Jan 17, 2023
@mosche deleted the 24861_spark_ds_fix_sessionfactory branch on May 11, 2023, 12:16
Successfully merging this pull request may close these issues.

[Bug]: SparkSession factory in Spark Dataset runner is difficult to use with production like use cases.