
Spark Driver Pod Getting Stuck in init state due to no driver configmap found #1574

Closed
gangahiremath opened this issue Jul 8, 2022 · 10 comments

Comments

@gangahiremath

gangahiremath commented Jul 8, 2022

Hello,

The Spark application driver pod gets stuck in the Init state because its driver ConfigMap is not found. What could be the reason for this? It also causes the Spark Operator to stall, because the process (shown below) responsible for spark-submit gets stuck as well. The problem is not consistent: it was hit by the sample Pi Spark application, which has run successfully a number of times in the same environment.

sparkxpix75ba3dfb-driver 0/1 Init:0/1 0 47m

Warning FailedMount 19s (x20 over 31m) kubelet (combined from similar events): MountVolume.SetUp failed for volume "spark-conf-volume" : configmap "sparkxpix75ba3dfb-1657310554583-driver-conf-map" not found

The process started by spark-submit is also stuck:

/opt/tools/Linux/jdk/openjdk1.8.0.332_8.62.0.20_x64/bin/java -cp /opt/spark/conf/:/opt/spark/jars/* org.apache.spark.deploy.SparkSubmit --master k8s://https://1.2.3.1:443 --deploy-mode cluster --conf spark.kubernetes.container.image.pullPolicy=IfNotPresent --conf spark.executor.memory=512m --conf spark.driver.memory=512m --conf spark.network.crypto.enabled=true --conf spark.driver.cores=0.100000 --conf spark.io.encryption.enabled=true --conf spark.kubernetes.driver.limit.cores=200m --conf spark.kubernetes.driver.label.version=3.0.1 --conf spark.app.name=sparkxpix75ba3dfb --conf spark.kubernetes.submission.waitAppCompletion=false --conf spark.executor.cores=1 --conf spark.authenticate=true --conf spark.kubernetes.driver.label.sparkoperator.k8s.io/launched-by-spark-operator=true --conf spark.kubernetes.namespace=abc-watch --conf spark.kubernetes.container.image==test:1 --conf spark.kubernetes.executor.label.sparkoperator.k8s.io/launched-by-spark-operator=true --conf spark.kubernetes.driver.label.sparkoperator.k8s.io/submission-id=e7075bf4-c30d-4d53-b924-0d2011555ce1 --conf spark.kubernetes.executor.label.sparkoperator.k8s.io/app-name=sparkxpix75ba3dfb --conf spark.kubernetes.executor.label.sparkoperator.k8s.io/submission-id=e7075bf4-c30d-4d53-b924-0d2011555ce1 --conf spark.kubernetes.driver.pod.name=sparkxpix75ba3dfb-driver --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark-driver-abc-watch --conf spark.executor.instances=1 --conf spark.kubernetes.executor.label.version=3.0.1 --conf spark.kubernetes.driver.label.sparkoperator.k8s.io/app-name=sparkxpix75ba3dfb --class org.apache.spark.examples.SparkPi --jars local:///sample-apps/sample-basic-spark-operator/extra-jars/* local:///sample-apps/sample-basic-spark-operator/sample-basic-spark-operator.jar

These blocked processes accumulate over time, and the Spark Operator stops processing new SparkApplications.
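For anyone debugging the same symptom, one way to confirm the failure mode is to check whether the ConfigMap each pending driver pod mounts actually exists. Below is a minimal diagnostic sketch using client-go, not part of the operator; the namespace and label selector are taken from the report above and may need adjusting.

```go
// Diagnostic sketch (illustrative, not operator code): list driver pods that
// are still Pending and report any ConfigMap-backed volume whose ConfigMap is
// missing, which is exactly what produces the MountVolume.SetUp failure above.
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	ns := "abc-watch" // namespace from the report; adjust as needed

	// "spark-role=driver" is the label Spark puts on driver pods.
	pods, err := client.CoreV1().Pods(ns).List(context.TODO(), metav1.ListOptions{
		LabelSelector: "spark-role=driver",
	})
	if err != nil {
		panic(err)
	}
	for _, pod := range pods.Items {
		if pod.Status.Phase != corev1.PodPending {
			continue
		}
		// Every ConfigMap-backed volume must resolve, otherwise the kubelet keeps
		// retrying MountVolume.SetUp and the pod never leaves Init.
		for _, vol := range pod.Spec.Volumes {
			if vol.ConfigMap == nil {
				continue
			}
			_, err := client.CoreV1().ConfigMaps(ns).Get(context.TODO(), vol.ConfigMap.Name, metav1.GetOptions{})
			if errors.IsNotFound(err) {
				fmt.Printf("pod %s is waiting on missing ConfigMap %s\n", pod.Name, vol.ConfigMap.Name)
			}
		}
	}
}
```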

@jdonnelly-apixio

I seem to hit these when there are too many old spark applications in my cluster. If we keep them below a couple thousand we seem to be ok.
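A hedged sketch of one mitigation along these lines: periodically delete SparkApplications that finished long ago so the operator's cache and work queue stay small (if your operator version supports the spec.timeToLiveSeconds field on SparkApplication, that achieves the same without extra code). The GVR, state names, namespace, and retention window below are assumptions for the v1beta2 API; adjust them for your setup.

```go
// Cleanup sketch (illustrative): delete COMPLETED/FAILED SparkApplications
// older than a retention window using the dynamic client.
package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/rest"
)

// GVR for the v1beta2 SparkApplication CRD.
var sparkAppGVR = schema.GroupVersionResource{
	Group:    "sparkoperator.k8s.io",
	Version:  "v1beta2",
	Resource: "sparkapplications",
}

func cleanup(client dynamic.Interface, namespace string, olderThan time.Duration) error {
	apps, err := client.Resource(sparkAppGVR).Namespace(namespace).List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		return err
	}
	cutoff := time.Now().Add(-olderThan)
	for _, app := range apps.Items {
		state, _, _ := unstructured.NestedString(app.Object, "status", "applicationState", "state")
		finished := state == "COMPLETED" || state == "FAILED"
		if finished && app.GetCreationTimestamp().Time.Before(cutoff) {
			if err := client.Resource(sparkAppGVR).Namespace(namespace).Delete(context.TODO(), app.GetName(), metav1.DeleteOptions{}); err != nil {
				return err
			}
			fmt.Println("deleted", app.GetName())
		}
	}
	return nil
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	// Keep only apps from the last 3 days in the example namespace.
	if err := cleanup(dynamic.NewForConfigOrDie(cfg), "abc-watch", 72*time.Hour); err != nil {
		panic(err)
	}
}
```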

@jiamin13579

I have the same question.

@JasonRD

JasonRD commented Feb 28, 2023

Same here.

@gangahiremath
Author

We identified that the bottleneck behind this performance issue is the spark-submit functionality, which lives in JVM-based Scala code. We have ported the spark-submit functionality to golang and see 4x performance gains. The code base is in master...gangahiremath:spark-on-k8s-operator:master
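For readers wanting a feel for what the golang port replaces, here is only a minimal illustrative sketch (not the code in the branch above): instead of forking a JVM to run spark-submit, the operator builds the driver Pod spec itself and creates it with client-go. All names, labels, images, and settings below are illustrative.

```go
// Sketch of submitting a driver pod directly from Go rather than via the
// spark-submit JVM. This omits almost everything real spark-submit does
// (conf propagation, volumes, the driver ConfigMap, owner references, etc.).
package main

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func submitDriver(client kubernetes.Interface, namespace, appName, image string) error {
	driver := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{
			Name:      appName + "-driver",
			Namespace: namespace,
			Labels: map[string]string{
				"spark-role":                    "driver",
				"sparkoperator.k8s.io/app-name": appName,
			},
		},
		Spec: corev1.PodSpec{
			ServiceAccountName: "spark-driver-" + namespace, // illustrative, mirrors the report above
			RestartPolicy:      corev1.RestartPolicyNever,
			Containers: []corev1.Container{{
				Name:  "spark-kubernetes-driver",
				Image: image,
				Args:  []string{"driver"}, // the Spark image entrypoint starts the driver
			}},
		},
	}
	_, err := client.CoreV1().Pods(namespace).Create(context.TODO(), driver, metav1.CreateOptions{})
	return err
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	if err := submitDriver(client, "abc-watch", "sparkxpix75ba3dfb", "test:1"); err != nil {
		panic(err)
	}
}
```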

@huskysun
Contributor

@liyinan926 Any thoughts about this ^? We shared the same information in the k8s Slack but haven't gotten a reply from you yet, so I'm copying the messages here for visibility. Please take a look and share what you think about this effort, and we can discuss how to contribute it back to OSS. Thanks!

Ganga H 12:12 PM
Hi [@liyinan926]
@Shiqi Sun and I work in a team at Salesforce. We provide a managed Spark service (Spark and Spark Operator on K8s) to internal customers. Our customers have faced issues with Spark Operator performance: at most, it can process 80 SparkApplication submissions per minute. We did a deep dive into the Apache Spark and Spark Operator code bases and found that the bottleneck is the spark-submit step. We have ported and rewritten spark-submit within the Spark Operator golang code base. With this, we see a performance improvement of 4-5x (300 to 380 SparkApplication submissions per minute). We want to contribute this back to open source in a way that both the Spark and Spark Operator communities are open to and can embrace as the default mechanism. Please guide us on how to go about this. We are happy to share additional information and code. Thank you.
@Shiqi Sun, please add/correct.

Shiqi Sun 12:59 PM
@liyinan926 The performance was a problem for us because one of our customers submits a lot of batch jobs across tens of namespaces in our cluster, and with the existing Spark Operator performance (i.e. 80 SparkApps per minute) we couldn't keep up with their peak request rate. That backed up the Spark Operator's internal queue, which caused both latency issues and a lot of job failures. We couldn't resolve it by scaling up the Spark Operator, either vertically (giving Spark Operator pods more cpu/memory/controller-threads) or horizontally (spinning up more Spark Operator pods, since only one leader pod does the work anyway).

The bottleneck we found was in spark-submit, where spinning up the JVM takes a lot of resources (cpu/mem) and time, which significantly hurt throughput. All this heavy JVM does is run the cluster-manager code in Spark to create the driver pod and service objects, which doesn't make much sense to us: we shouldn't need a JVM for that, and golang is the cloud-native way of doing it, so why not let the Spark Operator do it in golang? Therefore, @Ganga H rewrote the driver pod/svc creation in golang inside the Spark Operator (vs. the existing Scala code in the Spark source), and we gained a lot of performance, as Ganga mentioned above. This also decouples the Spark Operator from bundling Spark inside it, which eases the maintenance pain of matching Spark and Spark Operator versions.

One issue we do see with the golang rewrite is that we constantly need to translate features added to the Spark Scala code into golang. We have some ideas for how to solve this, and we can discuss them further. In any case, we benefited a lot from the work @Ganga H did, and we are thinking about contributing it back to the OSS Spark Operator project so it can help other Spark and Spark Operator users who hit this scalability issue. Since this is a relatively big change, we wanted to provide the context and talk with you about the path forward before opening the PR. Let us know if you have any questions or concerns, thanks!
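To complement the pod-creation sketch above: the other object spark-submit creates per application is the headless driver Service that executors use to reach the driver. The snippet below is again only an illustrative sketch, not the ported code; the service name, selector, and port numbers are assumptions based on common Spark-on-K8s defaults.

```go
// Sketch of creating the headless driver Service in Go (illustrative only).
package drivers

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func createDriverService(client kubernetes.Interface, namespace, appName string) error {
	svc := &corev1.Service{
		ObjectMeta: metav1.ObjectMeta{
			Name:      appName + "-driver-svc",
			Namespace: namespace,
		},
		Spec: corev1.ServiceSpec{
			// Headless: executors resolve the driver pod's IP directly.
			ClusterIP: corev1.ClusterIPNone,
			Selector: map[string]string{
				"spark-role":                    "driver",
				"sparkoperator.k8s.io/app-name": appName,
			},
			Ports: []corev1.ServicePort{
				{Name: "driver-rpc-port", Port: 7078}, // assumed spark.driver.port
				{Name: "blockmanager", Port: 7079},    // assumed spark.driver.blockManager.port
			},
		},
	}
	_, err := client.CoreV1().Services(namespace).Create(context.TODO(), svc, metav1.CreateOptions{})
	return err
}
```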

@bnetzi

bnetzi commented Mar 31, 2024

Hi, we also hit an issue with Spark Operator slowness, but we found that the real reason for the slowness at scale is the controller's delayed-queue mutex.

We saw that even with up to 10000 controller threads and a very large machine, CPU usage stayed under 5%, because all threads contend on one single mutex, even though that does not seem necessary.

We decided to create a branch with a very different approach: a queue per SparkApplication, so all events of the same app still happen in order, but different apps do not block each other.
We tested it at large scale (creating up to 1000 apps at once) and saw a latency of less than 20 seconds at most (and our 96-core instance reached 100% CPU usage). A sketch of the idea follows the branch link below.

This is our branch:
https://github.com/kubeflow/spark-operator/compare/master...bnetzi:spark-operator:master?expand=1
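For context, here is a rough sketch of the queue-per-application idea (assumptions only, not the code in the branch): each application key gets its own serial worker, so events for one app stay ordered while different apps proceed in parallel instead of contending on one shared queue.

```go
// Per-application queues: one ordered worker per app key, illustrative only.
package main

import (
	"fmt"
	"sync"
)

type perAppQueues struct {
	mu     sync.Mutex
	queues map[string]chan string // app key -> ordered event stream
	wg     sync.WaitGroup
}

func newPerAppQueues() *perAppQueues {
	return &perAppQueues{queues: make(map[string]chan string)}
}

// Enqueue routes an event to the queue owned by its app, creating the queue
// and its worker goroutine on first use. Events for one app are processed in
// order; events for different apps never block each other.
func (p *perAppQueues) Enqueue(appKey, event string) {
	p.mu.Lock()
	q, ok := p.queues[appKey]
	if !ok {
		q = make(chan string, 128)
		p.queues[appKey] = q
		p.wg.Add(1)
		go func() {
			defer p.wg.Done()
			for ev := range q {
				// Placeholder for the real per-app reconcile logic.
				fmt.Printf("app=%s event=%s\n", appKey, ev)
			}
		}()
	}
	p.mu.Unlock()
	q <- event
}

// Close drains all workers; call only after the last Enqueue.
func (p *perAppQueues) Close() {
	p.mu.Lock()
	for _, q := range p.queues {
		close(q)
	}
	p.mu.Unlock()
	p.wg.Wait()
}

func main() {
	qs := newPerAppQueues()
	qs.Enqueue("ns/app-a", "submitted")
	qs.Enqueue("ns/app-b", "submitted")
	qs.Enqueue("ns/app-a", "running")
	qs.Close()
}
```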


@gangahiremath
Author

@bnetzi, I do not see momentum on getting the proposed PR merged, with the required documentation and unit-test coverage added. Please share your thoughts on the future plan for this work.

@bnetzi

bnetzi commented Jun 16, 2024

@bnetzi, I do not see momentum on getting the proposed PR merged, with the required documentation and unit-test coverage added. Please share your thoughts on the future plan for this work.

So, I actually presented this PR in the last Spark Operator community meeting and it seems like it's on a roll. However, as it is a lot of new code, it will take some time for it to get in.
In the meantime you can try it in your environment; it has been running in ours for a while now without issues.


This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.


github-actions bot commented Oct 5, 2024

This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.
