java.io.InvalidClassException with Spark 3.1.2 #21092
I can confirm that I'm facing the same issue using a job service based on apache/beam_spark3_job_server:2.41.0 while trying to run a Beam pipeline on a bitnami/spark:3.1 based Spark cluster. stderr log from the task:
Based on my understanding, this issue is usually linked to a mismatch between the Spark version deployed on the cluster and the version used by the client submitting the job. In this case, I'm guessing the client is the job service. I can also confirm, after running a quick test, that downloading the same Spark SDK (matching the cluster version) locally and submitting a job from a local client to the cluster works fine. I don't know what goes on inside the Beam Spark job server, but this leads me to think that there is probably an incompatibility between the Spark client version on the Beam job server and what the target cluster is running. Is there an easy workaround for this? Is there any way to know the exact version of Spark that the job server uses? Would greatly appreciate any feedback on this. Thanks!
@aymanfarhat Just FYI, this is a known Scala issue. You can find more details on the problem here, including hints on how to resolve it.
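To illustrate why this manifests as an `InvalidClassException`: when a `Serializable` class does not declare a `serialVersionUID`, the JVM derives one from the class's structure, so two builds of "the same" class with different internals end up with different UIDs and refuse to deserialize each other's streams. This is what happens with `scala.collection.mutable.WrappedArray$ofRef` across scala-library builds. A minimal sketch (the `WidgetV1`/`WidgetV2` classes are hypothetical stand-ins, not Beam or Scala classes):

```java
import java.io.ObjectStreamClass;
import java.io.Serializable;

// Two structurally different classes stand in for the same class compiled in
// two different library builds. Neither declares serialVersionUID, so the JVM
// derives one from the class shape; any structural drift changes the UID.
public class SerialUidDemo {
    static class WidgetV1 implements Serializable { int size; }
    static class WidgetV2 implements Serializable { int size; String label; }

    public static void main(String[] args) {
        long v1 = ObjectStreamClass.lookup(WidgetV1.class).getSerialVersionUID();
        long v2 = ObjectStreamClass.lookup(WidgetV2.class).getSerialVersionUID();
        System.out.println("v1 uid = " + v1);
        System.out.println("v2 uid = " + v2);
        // Different computed UIDs are exactly what triggers
        // java.io.InvalidClassException on deserialization.
        System.out.println("uids differ: " + (v1 != v2));
    }
}
```

Declaring an explicit `private static final long serialVersionUID` avoids this, but that is up to the library (here, scala-library), not the application.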
From the link you provided, it appears that Beam is using Scala 2.12, and when we build Beam it is built with Scala 2.12.14, which is incompatible with Spark 3.1.2. How do we set the Scala version for Beam's job server?
@aromanenko-dev I'm not really familiar with the portable runner / the job server, would you have additional insights here? As far as I can see, both beam-spark and beam-spark-job-server are using Scala 2.12.15. An option might be to bump Beam to use Spark 3.2.2 or later (with Scala 2.12.15).
@mosche AFAIK, there are no specific Scala dependencies for the Spark portable runner and job server. They sit in the same package branch and have the same Spark/Scala dependencies as the native runner. I'm sure that @ibzib knows much better than me. Also, I'm wondering why the Scala version was specifically set to 2.12.15 here, since by default it is that way for Spark 3:
Indeed, the right solution would be to align the Spark/Scala versions. The question here is whether we need to support different versions or not.
I tried setting Scala to 2.12.10 in spark_runner.gradle; it did not work. Do I have to set the Scala version somewhere else? Also, when I use the gradlew :runners:spark:3:job-server:runShadow command to build and run the job server, I see that a jar file over 200 MB, named beam-runners-spark-3-job-server-2.41.0-SNAPSHOT.jar, is created under the runners/spark/3/job-server/build/libs folder. I opened that jar in 7-Zip to look at the contents; the file scala-xml.properties says the Scala version is 2.12.8, whereas I had explicitly set it to 2.12.10. Why is that happening?
Also, what is the working and tested Spark job server version and its compatible Spark version? If I cannot use Beam with Spark 3.1.2, then I will have to downgrade the Spark version.
@nitinlkoin1984 I finally found some time to look deeper into this. Sorry for the hassle, finding the job-server in this state is a bit disappointing.
Unfortunately this is a weakness of the existing test infrastructure: it uses Spark in local mode, and in that setup such a classpath issue won't be discovered. Anyway, I've done some testing:
Alternatively, you could build yourself a custom Spark 3.1.2 image that contains Scala 2.12.15 (instead of 2.12.10) on the classpath, but I don't think that's generally a feasible option. Let me know if any of these options help! @aromanenko-dev The 2nd option would fix the job-server without having to bump Spark. On the other hand, bumping to Spark 3.2.2 seems to be the more robust and long-term solution, but the biggest concern there is the Avro dependency upgrade (1.10). What do you think? Anyone else who could chime in?
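For anyone trying to verify which scala-library actually ends up on a given classpath (cluster image or shaded job-server jar): scala-library.jar ships a `/library.properties` resource whose `version.number` entry carries the exact patch version. A small sketch that reads it from the running JVM's classpath (the class name `ScalaVersionCheck` is just for illustration):

```java
import java.io.InputStream;
import java.util.Properties;

public class ScalaVersionCheck {
    public static void main(String[] args) throws Exception {
        // scala-library.jar bundles /library.properties; its version.number
        // key holds the exact Scala version (e.g. 2.12.10 vs 2.12.15).
        try (InputStream in =
                 ScalaVersionCheck.class.getResourceAsStream("/library.properties")) {
            if (in == null) {
                // No scala-library on this classpath at all.
                System.out.println("scala-library not found on classpath");
                return;
            }
            Properties props = new Properties();
            props.load(in);
            System.out.println("scala-library version: "
                + props.getProperty("version.number"));
        }
    }
}
```

Running this with the cluster's jars directory on the classpath (or unpacking the shaded jar and checking the same file) shows immediately whether the client and cluster agree on the Scala patch version.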
This was reported on the mailing list.
----
Using Spark downloaded from the link below,
https://www.apache.org/dyn/closer.lua/spark/spark-3.1.2/spark-3.1.2-bin-hadoop3.2.tgz
I get the following error when submitting a pipeline.
Full error is on https://gist.github.com/yuwtennis/7b0c1dc0dcf98297af1e3179852ca693.
------------------------------------------------------------------------------------------------------------------
21/08/16 01:10:26 WARN TransportChannelHandler: Exception in connection from /192.168.11.2:35601
java.io.InvalidClassException: scala.collection.mutable.WrappedArray$ofRef; local class incompatible: stream classdesc serialVersionUID = 3456489343829468865, local class serialVersionUID = 1028182004549731694
at java.base/java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:689)
...
------------------------------------------------------------------------------------------------------------------
The SDK harness and job service are deployed as follows:
sudo docker run --net=host apache/beam_spark3_job_server:2.31.0 --spark-master-url=spark://localhost:7077 --clean-artifacts-per-job true
sudo docker run --net=host apache/beam_python3.8_sdk:2.31.0 --worker_pool
https://gist.github.com/yuwtennis/2e4c13c79f71e8f713e947955115b3e2
Spark 2.4.8 succeeded without any errors using the components above.
https://archive.apache.org/dist/spark/spark-2.4.8/spark-2.4.8-bin-hadoop2.7.tgz
Imported from Jira BEAM-12762. Original Jira may contain additional context.
Reported by: ibzib.
This issue has child subcomponents which were not migrated over. See the original Jira for more information.