[🐛 Bug]: Session Times Out #2093
Comments
@Staicul, thank you for creating this issue. We will troubleshoot it as soon as we can.

Info for maintainers: triage this issue by using labels.
- If information is missing, add a helpful comment and then the applicable label.
- If the issue is a question, add the applicable label.
- If the issue is valid but there is no time to troubleshoot it, consider adding the applicable label.
- If the issue requires changes or fixes from an external project (e.g., ChromeDriver, GeckoDriver, MSEdgeDriver, W3C), add the applicable label.
- After troubleshooting the issue, please add the applicable label.

Thank you!
I see the timeout, but how can we reproduce the issue?
Yes, that's a question I've asked myself and don't have the answer to yet. Can you advise me on how to get more details?
Also, do you maybe have any useful pointers on this?
It is hard to say because it is about debugging your tests. I would identify which ones usually fail and where they fail. Also, run them locally and see if that also happens.
I am pretty sure it's not related to the tests. They used to work without problems with older versions, and it's not always the same ones failing... I will try to come up with more details...
I increased the logging level and am trying to make sense of the logs (300 MB for 30 min). There are a few of these:
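A quick way to cut a hub log of that size down to the relevant entries is to filter it by the failing session/request ids. A minimal sketch in Python; the ids below are the three failing sessions listed in the issue description, and the log file name is a placeholder:

```python
import sys

# Session ids taken from the issue description (the three failing sessions).
IDS = {
    "0b737c5b3589c87ab9265ee4fe67f778",
    "32b070ab4b7c82b437c42fe626c66d98",
    "e350ea44563eb21a1b29519002fea78d",
}

def filter_log(path: str) -> None:
    """Print only the log lines that mention one of the session ids, in order."""
    with open(path, errors="replace") as log:
        for line in log:
            if any(session_id in line for session_id in IDS):
                sys.stdout.write(line)

if __name__ == "__main__":
    # Usage: python filter_log.py hub.log
    filter_log(sys.argv[1])
```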
Similar issue here; also willing to help with the investigation, as right now I'm a bit lost and out of ideas. For us the issue is quite random: we have roughly 1 job hanging in Jenkins every 2 weeks or so. Contrary to the OP, we always have only one timed-out session hanging our job. We run 10+ jobs per day, usually with 50 to 65 parallel nodes. It has been hard to debug because the issue happens very randomly and we don't have a pattern yet.

We're currently using the latest 4.18.0-20240220, though we saw the same with the previous 2-3 versions. I think the first time we saw this was in December, when we were probably using Selenium 4.16 and Chrome 120.
I also tried to set up CI tests to verify the parallel-execution scenario, with both docker-compose and autoscaling on Kubernetes.
The exception from the binding:
selenium.common.exceptions.SessionNotCreatedException: Message: Could not start a new session. Could not start a new session. Unable to create new session
Host info: host: 'selenium-distributor-6fc9845c64-8dg9c', ip: '10.244.0.145'
Build info: version: '4.18.1', revision: 'b1d3319b48'
System info: os.name: 'Linux', os.arch: 'amd64', os.version: '5.4.0-167-generic', java.version: '11.0.21'
Driver info: driver.version: unknown
Stacktrace:
at org.openqa.selenium.grid.sessionqueue.SessionNotCreated.execute (SessionNotCreated.java:56)
at org.openqa.selenium.remote.http.Route$TemplatizedRoute.handle (Route.java:192)
at org.openqa.selenium.remote.http.Route.execute (Route.java:69)
at org.openqa.selenium.grid.security.RequiresSecretFilter.lambda$apply$0 (RequiresSecretFilter.java:62)
at org.openqa.selenium.remote.http.Filter$1.execute (Filter.java:63)
at org.openqa.selenium.remote.http.Route$CombinedRoute.handle (Route.java:346)
at org.openqa.selenium.remote.http.Route.execute (Route.java:69)
at org.openqa.selenium.grid.sessionqueue.NewSessionQueue.execute (NewSessionQueue.java:128)
at org.openqa.selenium.remote.http.Route$CombinedRoute.handle (Route.java:346)
at org.openqa.selenium.remote.http.Route.execute (Route.java:69)
at org.openqa.selenium.remote.AddWebDriverSpecHeaders.lambda$apply$0 (AddWebDriverSpecHeaders.java:35)
at org.openqa.selenium.remote.ErrorFilter.lambda$apply$0 (ErrorFilter.java:44)
at org.openqa.selenium.remote.http.Filter$1.execute (Filter.java:63)
at org.openqa.selenium.remote.ErrorFilter.lambda$apply$0 (ErrorFilter.java:44)
at org.openqa.selenium.remote.http.Filter$1.execute (Filter.java:63)
at org.openqa.selenium.netty.server.SeleniumHandler.lambda$channelRead0$0 (SeleniumHandler.java:44)
at java.util.concurrent.Executors$RunnableAdapter.call (Executors.java:515)
at java.util.concurrent.FutureTask.run (FutureTask.java:264)
at java.util.concurrent.ThreadPoolExecutor.runWorker (ThreadPoolExecutor.java:1128)
at java.util.concurrent.ThreadPoolExecutor$Worker.run (ThreadPoolExecutor.java:628)
at java.lang.Thread.run (Thread.java:829)

From the tracing view (Jaeger), it looks like there was an internal error and the session could not be created on the first attempt. The Distributor retries by adding the request back to the front of the queue, but at that point no slot is available. After a few retries with still no slot available to pick it up, the Distributor gives up and throws the exception that the session could not be created. At this point I don't know how the retry mechanism works, e.g. how many times it retries or whether it keeps retrying for the same period as the configured timeout. I also looked at the Distributor pod logs; based on the request id and timestamps, from the first retry to the ending request it is around 17 seconds:

04:40:13.934 DEBUG [LocalDistributor.reserveSlot] - No slots found for request 9a0f0183-45e6-4dd1-959c-b5049efcb1b7 and capabilities Capabilities {acceptInsecureCerts: true, browserName: firefox, moz:debuggerAddress: true, moz:firefoxOptions: {profile: UEsDBBQAAAAIAA1RXFiYyAuilAM...}, pageLoadStrategy: normal, se:downloadsEnabled: true, se:recordVideo: true}
04:40:13.935 INFO [LocalDistributor.newSession] - Unable to find a free slot for request 9a0f0183-45e6-4dd1-959c-b5049efcb1b7.
...
04:40:30.732 DEBUG [HttpClientImpl$SelectorManager.run] - [HttpClient-2-SelectorManager] [163s 402ms] HttpClientImpl(2) next timeout: 0
04:40:30.732 DEBUG [JdkHttpClient.execute0] - Ending request (POST) /se/grid/newsessionqueue/session/9a0f0183-45e6-4dd1-959c-b5049efcb1b7/failure in 6ms
04:40:30.732 DEBUG [HttpClientImpl$SelectorManager.run] - [HttpClient-2-SelectorManager] [163s 403ms] HttpClientImpl(2) next expired: 1199268
04:40:30.732 DEBUG [HttpClientImpl$SelectorManager.run] - [HttpClient-2-SelectorManager] [163s 403ms] HttpClientImpl(2) Next deadline is 3000

My guess is that whenever the Distributor retries a request and there is still no slot available after a few attempts, the request fails. Especially with autoscaling on Kubernetes, the scaler listens to the number of queued requests; I suspect that when the Distributor says "Retry adding to front of queue", the queue count is not updated accordingly, so the scaler does not scale up a new Node in time to serve the retried request. I think something could be fixed here.
@diemol, can you read through my comment and advise? Thanks!
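One way to cross-check whether retried requests remain visible to the autoscaler would be to poll the Grid's GraphQL endpoint and watch the reported queue size while a job runs, then compare it with the value KEDA acts on. A minimal sketch, assuming the Grid exposes /graphql with a grid { sessionQueueSize } field (as in recent 4.x versions; verify against your Grid's schema) and using a placeholder hub address:

```python
import json
import time
import urllib.request

GRAPHQL_URL = "http://localhost:4444/graphql"  # placeholder hub/router address
QUERY = "{ grid { sessionQueueSize } }"

def queue_size() -> int:
    """POST the GraphQL query and return the reported new-session queue size."""
    payload = json.dumps({"query": QUERY}).encode("utf-8")
    request = urllib.request.Request(
        GRAPHQL_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)["data"]["grid"]["sessionQueueSize"]

if __name__ == "__main__":
    # Poll every 5 seconds and compare with the value the autoscaler acts on.
    while True:
        print(time.strftime("%H:%M:%S"), "sessionQueueSize =", queue_size())
        time.sleep(5)
```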
OK, so this seems to happen in the autoscaling scenario where Nodes are not available quickly enough, and then the session request times out. What if you give a longer timeout for that scenario?
In that scenario, I can confirm
The issue in this use case is that the Distributor removes the session request from the queue and then tries to create the session. Just to let you know, the Distributor takes session requests from the queue only when slots are available on the Nodes. The problem, I guess, is that autoscaling uses the queue to boot more pods, and then it sees nothing in the queue and scales down.
FYI, I have seen the problem described above in my test pipeline, and it happens randomly... I have also noticed that once it happens, no matter how many tests I try to run after that, they always end up with a "session could not be created" exception. It only works, and not always, when I delete and install the Helm chart again. For ref: #2129 (comment)
@VietND96 @diemol just a quick update: after we upgraded our images to 4.18 and chart 0.28.x, I have noticed that when the "Session could not be created" problem happens, restarting keda-operator and selenium-hub (not clear which one does the magic) fixes it, and Nodes can be created and sessions assigned again. With that in mind, I have implemented a retry mechanism in my test pipeline plus a restart of hub/keda, which basically addresses the issue (even after we restart and have some successful runs, the problem randomly appears again and a new restart is necessary). A big workaround, of course, while we don't find the root cause.

What really causes trouble for us is when already-used Chrome Nodes do not terminate even after their processes get SIGTERM. The Nodes pile up until a new one cannot be created due to insufficient resources, and then "Session could not be created" comes again, now with a cause that makes sense. I commented quickly about that to @VietND96 here, and now that we have upgraded the images we need to obtain the pre-stop logs as suggested in that message.
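For reference, a client-side retry along those lines might look like the following. This is only a minimal sketch using the Selenium Python bindings with a placeholder Grid URL (the thread's test project is WebdriverIO, so the idea would need translating); the hub/keda restart step from the comment above is deliberately left out:

```python
import time

from selenium import webdriver
from selenium.common.exceptions import SessionNotCreatedException

GRID_URL = "http://localhost:4444"  # placeholder; point at your hub/router

def create_session_with_retry(attempts: int = 3, backoff_seconds: int = 30):
    """Try to open a Remote session, retrying when the Grid rejects the request."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return webdriver.Remote(command_executor=GRID_URL,
                                    options=webdriver.ChromeOptions())
        except SessionNotCreatedException as error:
            last_error = error
            print(f"Attempt {attempt} failed: {error.msg}; retrying...")
            time.sleep(backoff_seconds)
    raise last_error

if __name__ == "__main__":
    driver = create_session_with_retry()
    try:
        driver.get("https://www.selenium.dev")
    finally:
        driver.quit()
```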
Hi all, we had the same issue again today: one of our jobs was hanging with 1 test still supposedly running after 4 hours. In the node logs I can see there was some timeout, causing the faulty session to be stopped:
As per the hub logs, it looks like it detected that the session was stopped.
However, after aborting our job in Jenkins, it looks like the hub tried again to find that session, throwing:
From the test logs, it looks like no action was performed after starting the WebDriver instance (usually we should see some info after the "Starting the test" line, like the steps from the feature file, etc.):
We're currently using Docker Selenium. It looks like something is hanging somewhere, but I'm not sure how to continue investigating or what we could try in order to reproduce this issue, as it happens very randomly. Complete logs from the faulty node & hub: Thanks.
I have the same issues.
WebDriver parameters:
When creating a second WebDriver, the session is added to the queue and times out.
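A minimal way to reproduce that situation could look like this: a sketch, assuming a Grid with a single Chrome slot and a placeholder hub URL, where the second session request should sit in the new-session queue and fail if no slot frees up within the Grid's request timeout:

```python
import time
from concurrent.futures import ThreadPoolExecutor

from selenium import webdriver

GRID_URL = "http://localhost:4444"  # placeholder; a Grid configured with one slot
HOLD_SECONDS = 360  # hold the slot longer than the Grid's session-request timeout

def open_session(name: str) -> None:
    """Open a Remote session, hold the only slot for a while, then release it."""
    driver = webdriver.Remote(command_executor=GRID_URL,
                              options=webdriver.ChromeOptions())
    print(f"{name}: session {driver.session_id} created")
    time.sleep(HOLD_SECONDS)
    driver.quit()

if __name__ == "__main__":
    # The first request takes the slot; the second waits in the new-session queue
    # and should fail with SessionNotCreatedException if no slot frees up in time.
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = [pool.submit(open_session, f"client-{i}") for i in (1, 2)]
        for future in futures:
            future.result()
```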
I think this would be fixed via SeleniumHQ/selenium#14272
This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs. |
What happened?
We have a WebdriverIO test automation project.
When triggering a test job from Jenkins (usually between 10-15 parallel sessions), 99% of the time there are between 1 and 4 sessions that time out.
We are normally using the dynamic grid, but switched to a fixed-size one for debugging, so that is what is shown below.
I tried various (latest) hub/node version combinations and could not find one that makes this problem disappear.
Attached is the Selenium hub log where 15 sessions were triggered, 3 of them having failed. The failing ones are:
0b737c5b3589c87ab9265ee4fe67f778
32b070ab4b7c82b437c42fe626c66d98
e350ea44563eb21a1b29519002fea78d
Also attached is the node log of failing session e350ea44563eb21a1b29519002fea78d.
I was not able to find the chromedriver log inside the node container. Is it possible to get it? If yes, how/where is it?
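I'm not sure whether the node image writes a chromedriver log anywhere by default, but for a local repro (as suggested earlier in the thread) chromedriver itself can be asked for a verbose log. A minimal sketch with the Python bindings, where the log path is a placeholder; on a Grid node the driver is launched by the node, so the equivalent flags would have to be passed through the node's configuration instead:

```python
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Ask chromedriver for a verbose log written next to the test (path is a placeholder).
service = Service(service_args=["--verbose", "--log-path=chromedriver.log"])

driver = webdriver.Chrome(service=service, options=webdriver.ChromeOptions())
try:
    driver.get("https://www.selenium.dev")
finally:
    driver.quit()
```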
Command used to start Selenium Grid with Docker (or Kubernetes)
Relevant log output
Operating System
Ubuntu
Docker Selenium version (image tag)
4.16.1
Selenium Grid chart version (chart version)
No response