
Spark Driver Pod Getting Stuck in init state due to no driver configmap found #1574

Closed
gangahiremath opened this issue Jul 8, 2022 · 10 comments

Comments

@gangahiremath

gangahiremath commented Jul 8, 2022

Hello,

The Spark application driver pod gets stuck in the Init state because its driver ConfigMap is not found. What could be the reason for this? It also causes the Spark Operator to stall, because the process (shown below) responsible for spark-submit gets stuck as well. The problem is not consistent: it was hit by the sample Pi Spark application, which has run successfully a number of times in the same environment.

sparkxpix75ba3dfb-driver 0/1 Init:0/1 0 47m

Warning FailedMount 19s (x20 over 31m) kubelet (combined from similar events): MountVolume.SetUp failed for volume "spark-conf-volume" : configmap "sparkxpix75ba3dfb-1657310554583-driver-conf-map" not found

The process started by spark-submit is also stuck:

/opt/tools/Linux/jdk/openjdk1.8.0.332_8.62.0.20_x64/bin/java -cp /opt/spark/conf/:/opt/spark/jars/* org.apache.spark.deploy.SparkSubmit --master k8s://https://1.2.3.1:443 --deploy-mode cluster --conf spark.kubernetes.container.image.pullPolicy=IfNotPresent --conf spark.executor.memory=512m --conf spark.driver.memory=512m --conf spark.network.crypto.enabled=true --conf spark.driver.cores=0.100000 --conf spark.io.encryption.enabled=true --conf spark.kubernetes.driver.limit.cores=200m --conf spark.kubernetes.driver.label.version=3.0.1 --conf spark.app.name=sparkxpix75ba3dfb --conf spark.kubernetes.submission.waitAppCompletion=false --conf spark.executor.cores=1 --conf spark.authenticate=true --conf spark.kubernetes.driver.label.sparkoperator.k8s.io/launched-by-spark-operator=true --conf spark.kubernetes.namespace=abc-watch --conf spark.kubernetes.container.image==test:1 --conf spark.kubernetes.executor.label.sparkoperator.k8s.io/launched-by-spark-operator=true --conf spark.kubernetes.driver.label.sparkoperator.k8s.io/submission-id=e7075bf4-c30d-4d53-b924-0d2011555ce1 --conf spark.kubernetes.executor.label.sparkoperator.k8s.io/app-name=sparkxpix75ba3dfb --conf spark.kubernetes.executor.label.sparkoperator.k8s.io/submission-id=e7075bf4-c30d-4d53-b924-0d2011555ce1 --conf spark.kubernetes.driver.pod.name=sparkxpix75ba3dfb-driver --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark-driver-abc-watch --conf spark.executor.instances=1 --conf spark.kubernetes.executor.label.version=3.0.1 --conf spark.kubernetes.driver.label.sparkoperator.k8s.io/app-name=sparkxpix75ba3dfb --class org.apache.spark.examples.SparkPi --jars local:///sample-apps/sample-basic-spark-operator/extra-jars/* local:///sample-apps/sample-basic-spark-operator/sample-basic-spark-operator.jar

These blocked processes accumulate over time, and the Spark Operator stops processing new SparkApplications.
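For anyone debugging the same symptom, one way to confirm the failure mode is to check whether the ConfigMap each pending driver pod mounts actually exists. Below is a minimal diagnostic sketch using client-go, not part of the operator; the namespace and label selector are taken from the report above and may need adjusting.

```go
// Diagnostic sketch (illustrative, not operator code): list driver pods that
// are still Pending and report any ConfigMap-backed volume whose ConfigMap is
// missing, which is exactly what produces the MountVolume.SetUp failure above.
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	ns := "abc-watch" // namespace from the report; adjust as needed

	// "spark-role=driver" is the label Spark puts on driver pods.
	pods, err := client.CoreV1().Pods(ns).List(context.TODO(), metav1.ListOptions{
		LabelSelector: "spark-role=driver",
	})
	if err != nil {
		panic(err)
	}
	for _, pod := range pods.Items {
		if pod.Status.Phase != corev1.PodPending {
			continue
		}
		// Every ConfigMap-backed volume must resolve, otherwise the kubelet keeps
		// retrying MountVolume.SetUp and the pod never leaves Init.
		for _, vol := range pod.Spec.Volumes {
			if vol.ConfigMap == nil {
				continue
			}
			_, err := client.CoreV1().ConfigMaps(ns).Get(context.TODO(), vol.ConfigMap.Name, metav1.GetOptions{})
			if errors.IsNotFound(err) {
				fmt.Printf("pod %s is waiting on missing ConfigMap %s\n", pod.Name, vol.ConfigMap.Name)
			}
		}
	}
}
```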

@jdonnelly-apixio

I seem to hit these when there are too many old spark applications in my cluster. If we keep them below a couple thousand we seem to be ok.
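A hedged sketch of one mitigation along these lines: periodically delete SparkApplications that finished long ago so the operator's cache and work queue stay small (if your operator version supports the spec.timeToLiveSeconds field on SparkApplication, that achieves the same without extra code). The GVR, state names, namespace, and retention window below are assumptions for the v1beta2 API; adjust them for your setup.

```go
// Cleanup sketch (illustrative): delete COMPLETED/FAILED SparkApplications
// older than a retention window using the dynamic client.
package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/rest"
)

// GVR for the v1beta2 SparkApplication CRD.
var sparkAppGVR = schema.GroupVersionResource{
	Group:    "sparkoperator.k8s.io",
	Version:  "v1beta2",
	Resource: "sparkapplications",
}

func cleanup(client dynamic.Interface, namespace string, olderThan time.Duration) error {
	apps, err := client.Resource(sparkAppGVR).Namespace(namespace).List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		return err
	}
	cutoff := time.Now().Add(-olderThan)
	for _, app := range apps.Items {
		state, _, _ := unstructured.NestedString(app.Object, "status", "applicationState", "state")
		finished := state == "COMPLETED" || state == "FAILED"
		if finished && app.GetCreationTimestamp().Time.Before(cutoff) {
			if err := client.Resource(sparkAppGVR).Namespace(namespace).Delete(context.TODO(), app.GetName(), metav1.DeleteOptions{}); err != nil {
				return err
			}
			fmt.Println("deleted", app.GetName())
		}
	}
	return nil
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	// Keep only apps from the last 3 days in the example namespace.
	if err := cleanup(dynamic.NewForConfigOrDie(cfg), "abc-watch", 72*time.Hour); err != nil {
		panic(err)
	}
}
```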

@jiamin13579

I have the same question.

@JasonRD

JasonRD commented Feb 28, 2023

Same here.

@gangahiremath
Author

We identified that the bottleneck behind this performance issue is the spark-submit functionality, which lives in JVM-based Scala code. We have ported the spark-submit functionality to golang and see 4x performance gains. The code base is in master...gangahiremath:spark-on-k8s-operator:master
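For readers wanting a feel for what the golang port replaces, here is only a minimal illustrative sketch (not the code in the branch above): instead of forking a JVM to run spark-submit, the operator builds the driver Pod spec itself and creates it with client-go. All names, labels, images, and settings below are illustrative.

```go
// Sketch of submitting a driver pod directly from Go rather than via the
// spark-submit JVM. This omits almost everything real spark-submit does
// (conf propagation, volumes, the driver ConfigMap, owner references, etc.).
package main

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func submitDriver(client kubernetes.Interface, namespace, appName, image string) error {
	driver := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{
			Name:      appName + "-driver",
			Namespace: namespace,
			Labels: map[string]string{
				"spark-role":                    "driver",
				"sparkoperator.k8s.io/app-name": appName,
			},
		},
		Spec: corev1.PodSpec{
			ServiceAccountName: "spark-driver-" + namespace, // illustrative, mirrors the report above
			RestartPolicy:      corev1.RestartPolicyNever,
			Containers: []corev1.Container{{
				Name:  "spark-kubernetes-driver",
				Image: image,
				Args:  []string{"driver"}, // the Spark image entrypoint starts the driver
			}},
		},
	}
	_, err := client.CoreV1().Pods(namespace).Create(context.TODO(), driver, metav1.CreateOptions{})
	return err
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	if err := submitDriver(client, "abc-watch", "sparkxpix75ba3dfb", "test:1"); err != nil {
		panic(err)
	}
}
```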

@huskysun
Contributor

@liyinan926 Any thoughts about this ^? We shared the same information in the k8s Slack but haven't gotten a reply from you yet, so I'm copying the messages here for visibility. Please take a look and share what you think about this effort, and we can discuss how to contribute it back to OSS. Thanks!

Ganga H 12:12 PM
Hi [@liyinan926]
@Shiqi Sun and I work in a team at Salesforce. We provide a managed Spark service (Spark and Spark Operator on K8s) to internal customers. Our customers have faced issues with Spark Operator performance: at most, it can process 80 SparkApplication submissions per minute. We did a deep dive into the Apache Spark and Spark Operator code bases and found that the bottleneck is the spark-submit step. We have ported and rewritten spark-submit within the Spark Operator golang code base. With this, we see a performance improvement of 4-5x (300 to 380 SparkApplication submissions per minute). We want to contribute this back to open source in a way that both the Spark and Spark Operator communities are open to and can embrace as the default mechanism. Please guide us on how to go about this. We are happy to share additional information and code. Thank you.
@Shiqi Sun, please add/correct.

Shiqi Sun 12:59 PM
@liyinan926 The performance was a problem for us because one of our customers submits a lot of batch jobs across tens of namespaces in our cluster, and with the existing Spark Operator performance (i.e. 80 SparkApps per minute) we couldn't keep up with their peak request rate. That backed up the Spark Operator's internal queue, which caused both latency issues and a lot of job failures. We couldn't resolve it by scaling up the Spark Operator, either vertically (giving Spark Operator pods more cpu/memory/controller-threads) or horizontally (spinning up more Spark Operator pods, since only one leader pod does the work anyway).

The bottleneck we found was in spark-submit, where spinning up the JVM takes a lot of resources (cpu/mem) and time, which significantly hurt throughput. All this heavy JVM does is run the cluster-manager code in Spark to create the driver pod and service objects, which doesn't make much sense to us: we shouldn't need a JVM for that, and golang is the cloud-native way of doing it, so why not let the Spark Operator do it in golang? Therefore, @Ganga H rewrote the driver pod/svc creation in golang inside the Spark Operator (vs. the existing Scala code in the Spark source), and we gained a lot of performance, as Ganga mentioned above. This also decouples the Spark Operator from bundling Spark inside it, which eases the maintenance pain of matching Spark and Spark Operator versions.

One issue we do see with the golang rewrite is that we constantly need to translate features added to the Spark Scala code into golang. We have some ideas for how to solve this, and we can discuss them further. In any case, we benefited a lot from the work @Ganga H did, and we are thinking about contributing it back to the OSS Spark Operator project so it can help other Spark and Spark Operator users who hit this scalability issue. Since this is a relatively big change, we wanted to provide the context and talk with you about the path forward before opening the PR. Let us know if you have any questions or concerns, thanks!
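To complement the pod-creation sketch above: the other object spark-submit creates per application is the headless driver Service that executors use to reach the driver. The snippet below is again only an illustrative sketch, not the ported code; the service name, selector, and port numbers are assumptions based on common Spark-on-K8s defaults.

```go
// Sketch of creating the headless driver Service in Go (illustrative only).
package drivers

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func createDriverService(client kubernetes.Interface, namespace, appName string) error {
	svc := &corev1.Service{
		ObjectMeta: metav1.ObjectMeta{
			Name:      appName + "-driver-svc",
			Namespace: namespace,
		},
		Spec: corev1.ServiceSpec{
			// Headless: executors resolve the driver pod's IP directly.
			ClusterIP: corev1.ClusterIPNone,
			Selector: map[string]string{
				"spark-role":                    "driver",
				"sparkoperator.k8s.io/app-name": appName,
			},
			Ports: []corev1.ServicePort{
				{Name: "driver-rpc-port", Port: 7078}, // assumed spark.driver.port
				{Name: "blockmanager", Port: 7079},    // assumed spark.driver.blockManager.port
			},
		},
	}
	_, err := client.CoreV1().Services(namespace).Create(context.TODO(), svc, metav1.CreateOptions{})
	return err
}
```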

@bnetzi

bnetzi commented Mar 31, 2024

Hi, we also hit an issue with Spark Operator slowness, but we found that the real reason for the slowness at scale is the controller's delayed-queue mutex.

We saw that even with up to 10000 controller threads and a very large machine, CPU usage stayed under 5%, because all threads contend on one single mutex, even though that does not seem necessary.

We decided to create a branch with a very different approach: a queue per SparkApplication, so all events of the same app still happen in order, but different apps do not block each other.
We tested it at large scale (creating up to 1000 apps at once) and saw a latency of less than 20 seconds at most (and our 96-core instance reached 100% CPU usage). A sketch of the idea follows the branch link below.

This is our branch:
https://github.com/kubeflow/spark-operator/compare/master...bnetzi:spark-operator:master?expand=1
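For context, here is a rough sketch of the queue-per-application idea (assumptions only, not the code in the branch): each application key gets its own serial worker, so events for one app stay ordered while different apps proceed in parallel instead of contending on one shared queue.

```go
// Per-application queues: one ordered worker per app key, illustrative only.
package main

import (
	"fmt"
	"sync"
)

type perAppQueues struct {
	mu     sync.Mutex
	queues map[string]chan string // app key -> ordered event stream
	wg     sync.WaitGroup
}

func newPerAppQueues() *perAppQueues {
	return &perAppQueues{queues: make(map[string]chan string)}
}

// Enqueue routes an event to the queue owned by its app, creating the queue
// and its worker goroutine on first use. Events for one app are processed in
// order; events for different apps never block each other.
func (p *perAppQueues) Enqueue(appKey, event string) {
	p.mu.Lock()
	q, ok := p.queues[appKey]
	if !ok {
		q = make(chan string, 128)
		p.queues[appKey] = q
		p.wg.Add(1)
		go func() {
			defer p.wg.Done()
			for ev := range q {
				// Placeholder for the real per-app reconcile logic.
				fmt.Printf("app=%s event=%s\n", appKey, ev)
			}
		}()
	}
	p.mu.Unlock()
	q <- event
}

// Close drains all workers; call only after the last Enqueue.
func (p *perAppQueues) Close() {
	p.mu.Lock()
	for _, q := range p.queues {
		close(q)
	}
	p.mu.Unlock()
	p.wg.Wait()
}

func main() {
	qs := newPerAppQueues()
	qs.Enqueue("ns/app-a", "submitted")
	qs.Enqueue("ns/app-b", "submitted")
	qs.Enqueue("ns/app-a", "running")
	qs.Close()
}
```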


@gangahiremath
Author

@bnetzi, I do not see momentum on getting the proposed PR merged, with the required documentation and unit-test coverage added. Please share your thoughts on the future plan for this work.

@bnetzi

bnetzi commented Jun 16, 2024

@bnetzi, I do not see momentum on getting the proposed PR merged, with the required documentation and unit-test coverage added. Please share your thoughts on the future plan for this work.

So, I actually presented this PR in the last Spark Operator community meeting and it seems like it's on a roll. However, as it is a lot of new code, it will take some time for it to get in.
In the meantime you can try it in your environment; it has been running in ours for a while now without issues.


This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.


github-actions bot commented Oct 5, 2024

This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.
