Spark Driver Pod Getting Stuck in init state due to no driver configmap found #1574
Comments
I seem to hit these when there are too many old Spark applications in my cluster. If we keep them below a couple thousand, we seem to be OK.
I have the same question.
Same here.
We identified that the bottleneck behind this performance issue is that the spark-submit functionality is implemented in JVM-based Scala code. We have ported the spark-submit functionality to Go and see a 4x performance gain. The code base is at master...gangahiremath:spark-on-k8s-operator:master
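For illustration only, here is a minimal Go sketch of the general idea (this is not the code in the linked branch; all names and fields are assumptions): instead of forking a JVM to run org.apache.spark.deploy.SparkSubmit for every application, the driver pod is built and created directly through the Kubernetes API.

```go
// Sketch only: create the driver pod via the Kubernetes API from Go instead of
// launching a JVM spark-submit process per application. Names are assumptions,
// not the actual code in the linked branch.
package submitsketch

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func submitDriver(ctx context.Context, client kubernetes.Interface, namespace, appName, image string) error {
	pod := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{
			Name:      appName + "-driver",
			Namespace: namespace,
			Labels:    map[string]string{"sparkoperator.k8s.io/app-name": appName},
		},
		Spec: corev1.PodSpec{
			RestartPolicy: corev1.RestartPolicyNever,
			Containers: []corev1.Container{{
				Name:  "spark-kubernetes-driver",
				Image: image,
				Args:  []string{"driver"},
			}},
		},
	}
	// Creating the pod through the API server avoids the JVM start-up cost of
	// org.apache.spark.deploy.SparkSubmit on every submission.
	_, err := client.CoreV1().Pods(namespace).Create(ctx, pod, metav1.CreateOptions{})
	return err
}
```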
@liyinan926 Any thoughts about this ^? We shared the same information in the k8s Slack but have not received a reply from you yet. I'm porting the messages here for visibility. Please take a look and share what you think about this effort, and we can discuss how to contribute it back to OSS. Thanks!
Hi, we also hit an issue with Spark Operator slowness, but we found that the real reason for the slowness at scale is the controller's delayed work queue mutex. We saw that even with the controller thread count set as high as 10000 on a very large machine, CPU usage stayed under 5%, because all threads share a single mutex, even though that does not appear to be necessary. We decided to create a branch with a very different approach, sketched below: one queue per SparkApplication, so all events for the same app are always processed in order, but different apps do not block each other. This is our branch:
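To make the idea above concrete, here is a minimal sketch (assumed names, not the actual branch code) of keeping one workqueue per SparkApplication, so events for the same app stay ordered while different apps never contend on a shared processing mutex:

```go
// Sketch only, not the actual branch code: one work queue per SparkApplication.
// Events for the same app are processed in order by a dedicated worker, while
// different apps never share a processing mutex. Names are assumptions.
package appqueues

import (
	"sync"

	"k8s.io/client-go/util/workqueue"
)

type perAppQueues struct {
	mu     sync.Mutex
	queues map[string]workqueue.RateLimitingInterface
}

func newPerAppQueues() *perAppQueues {
	return &perAppQueues{queues: make(map[string]workqueue.RateLimitingInterface)}
}

// queueFor returns (creating if needed) the queue dedicated to one app key.
// The mutex is only held while looking up the map, not while processing items.
func (p *perAppQueues) queueFor(appKey string) workqueue.RateLimitingInterface {
	p.mu.Lock()
	defer p.mu.Unlock()
	q, ok := p.queues[appKey]
	if !ok {
		q = workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())
		p.queues[appKey] = q
		go func() { // one worker per app keeps that app's events ordered
			for {
				item, shutdown := q.Get()
				if shutdown {
					return
				}
				// process(item) would go here
				q.Done(item)
			}
		}()
	}
	return q
}
```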
@bnetzi, I do not see momentum on getting the proposed PR merged with the required documentation and unit test coverage added. Please share your thoughts on the future plan for this work.
So, I actually presented this PR at the last Spark Operator community meeting, and it seems to be gaining momentum. However, as it introduces a lot of new code, it will take some time to get in.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.
Hello,
A Spark application's driver pod gets stuck in the Init state because the driver ConfigMap is not found. What could be the possible reason for this? It also results in the Spark Operator stopping, because the process (shown below) responsible for spark-submit gets stuck as well. This is not consistent; it was encountered with a sample Pi Spark application that has run successfully a number of times in the same environment.
sparkxpix75ba3dfb-driver 0/1 Init:0/1 0 47m
Warning FailedMount 19s (x20 over 31m) kubelet (combined from similar events): MountVolume.SetUp failed for volume "spark-conf-volume" : configmap "sparkxpix75ba3dfb-1657310554583-driver-conf-map" not found
The process invoked by spark-submit is also stuck:
/opt/tools/Linux/jdk/openjdk1.8.0.332_8.62.0.20_x64/bin/java -cp /opt/spark/conf/:/opt/spark/jars/* org.apache.spark.deploy.SparkSubmit --master k8s://https://1.2.3.1:443 --deploy-mode cluster --conf spark.kubernetes.container.image.pullPolicy=IfNotPresent --conf spark.executor.memory=512m --conf spark.driver.memory=512m --conf spark.network.crypto.enabled=true --conf spark.driver.cores=0.100000 --conf spark.io.encryption.enabled=true --conf spark.kubernetes.driver.limit.cores=200m --conf spark.kubernetes.driver.label.version=3.0.1 --conf spark.app.name=sparkxpix75ba3dfb --conf spark.kubernetes.submission.waitAppCompletion=false --conf spark.executor.cores=1 --conf spark.authenticate=true --conf spark.kubernetes.driver.label.sparkoperator.k8s.io/launched-by-spark-operator=true --conf spark.kubernetes.namespace=abc-watch --conf spark.kubernetes.container.image==test:1 --conf spark.kubernetes.executor.label.sparkoperator.k8s.io/launched-by-spark-operator=true --conf spark.kubernetes.driver.label.sparkoperator.k8s.io/submission-id=e7075bf4-c30d-4d53-b924-0d2011555ce1 --conf spark.kubernetes.executor.label.sparkoperator.k8s.io/app-name=sparkxpix75ba3dfb --conf spark.kubernetes.executor.label.sparkoperator.k8s.io/submission-id=e7075bf4-c30d-4d53-b924-0d2011555ce1 --conf spark.kubernetes.driver.pod.name=sparkxpix75ba3dfb-driver --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark-driver-abc-watch --conf spark.executor.instances=1 --conf spark.kubernetes.executor.label.version=3.0.1 --conf spark.kubernetes.driver.label.sparkoperator.k8s.io/app-name=sparkxpix75ba3dfb --class org.apache.spark.examples.SparkPi --jars local:///sample-apps/sample-basic-spark-operator/extra-jars/* local:///sample-apps/sample-basic-spark-operator/sample-basic-spark-operator.jar
The number of these blocked processes increases over time, and the Spark Operator stops processing new SparkApplications.
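As a first diagnostic step, one could check whether the ConfigMap named in the FailedMount event was ever created; if the spark-submit process hung before creating it, that would match the driver pod being stuck in Init. A hedged sketch in Go using client-go (function name and wiring are assumptions, not part of the operator):

```go
// Sketch only (assumed names): checks whether the driver ConfigMap named in the
// FailedMount event exists in the application's namespace, e.g. namespace
// "abc-watch" and name "sparkxpix75ba3dfb-1657310554583-driver-conf-map".
package diag

import (
	"context"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func driverConfigMapExists(ctx context.Context, client kubernetes.Interface, namespace, name string) (bool, error) {
	_, err := client.CoreV1().ConfigMaps(namespace).Get(ctx, name, metav1.GetOptions{})
	if apierrors.IsNotFound(err) {
		// The operator never created (or already deleted) the driver ConfigMap.
		return false, nil
	}
	if err != nil {
		return false, err
	}
	return true, nil
}
```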