Accessing server API at gitpod.io from within workspaces might fail sometimes #8703
Copy/pasting some findings from this Slack thread (internal):
Also:
Other reports of this problem:
Scheduling for Team WebApp due to this request:
I removed the workspace & IDE labels to avoid noise/cross-chat. IMO we need to investigate first, and then maybe derive something for IDE / workspace. Also, moving this into "needs design", because it's not actionable yet.
Thanks Gero, I agree. 👍
Interesting 💡 maybe we need a sort of "needs investigation" category too (or we could rename "needs design" -- not sure about this) 🤔 💭
I'm curious how the reconnection behavior actually looks for the workspace/supervisor, and how the IDE is supposed to re-request data which might have failed due to a rollout of new server instances.
@AlexTugarev The process is:
Is there any way you can think of that we delete token 1 before it is actually retrieved? 🤔 Hypothesis 1: Maybe by creating another token and deleting the old one? What could trigger this? Hypothesis 2: During rollout the GLB decides that new connections go to the other (US) cluster, where the token is not (yet) properly synced. (Do we retry here?)
Actually, this seems to happen even without webapp deployments. I just hit it this morning, while opening #8757 in Gitpod:
Will update the title & description.
I think we should put
@geropl and I reviewed the auth flow of supervisor and now assume that it very well might be an issue with server pods not being in sync. The problem might be: when we start a workspace from a server in one cluster, but the supervisor's requests are handled by a server in the other cluster, the tokens in the DB might not be synced yet. With #9130 (merged) we should be able to verify the assumption.
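For illustration only, a minimal Go sketch of what retrying the token fetch could look like, assuming the supervisor obtains its token via a single server API call; `fetchToken`, the error value, and all timings below are hypothetical placeholders, not the actual supervisor code:

```go
package tokenfetch

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// errUnauthorized stands in for the 401 ("reconnecting-ws: bad handshake: code 401")
// the supervisor sees when the server pod it reached does not know the token yet.
var errUnauthorized = errors.New("unauthorized")

// fetchToken is a hypothetical placeholder for the actual server API call.
func fetchToken(ctx context.Context) (string, error) {
	return "", errUnauthorized
}

// fetchTokenWithRetry retries the token fetch with a fixed backoff, so that a
// request routed to a cluster whose DB has not yet received the token does not
// fail the workspace immediately.
func fetchTokenWithRetry(ctx context.Context, attempts int, backoff time.Duration) (string, error) {
	var lastErr error
	for i := 0; i < attempts; i++ {
		token, err := fetchToken(ctx)
		if err == nil {
			return token, nil
		}
		if !errors.Is(err, errUnauthorized) {
			return "", err // only the "token not known yet" case is worth retrying
		}
		lastErr = err
		select {
		case <-ctx.Done():
			return "", ctx.Err()
		case <-time.After(backoff):
		}
	}
	return "", fmt.Errorf("token not available after %d attempts: %w", attempts, lastErr)
}
```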
This bug just happened to me again right now: https://gitpodio-gitpod-4l1i0gza3k4.ws-eu38xl.gitpod.io/ Closing the tab and starting a new workspace resolved it. |
I looked at a single failing case, starting from the log message "jsonrpc2: protocol error: reconnecting-ws: bad handshake: code 401". The respective workspace instance is still running, but the user has no token for that workspace instance, in any database. |
Ok, this turned out differently than expected: we actually delete the token here, because we receive a stopped event for that workspace instance 😢:
The sequence clearly shows that the bridge receives a series of well-ordered events (as proven by @kylos101). I'm inclined to ask if this could be related to the recently introduced "retry" feature, but I think we've been seeing this for longer than that feature has existed. Could someone from workspace look into this? Maybe with support from the WebApp side? I don't think we can do much about this in the bridge, except be a bit louder when we notice badly ordered status updates. 🤔
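To illustrate the "be a bit louder about badly ordered status updates" idea, here is a minimal sketch. It is written in Go only to match the other sketches in this thread and is not how the actual bridge code looks; the phase values and map-based bookkeeping are illustrative:

```go
package bridge

import "log"

// phase is an illustrative, ordered subset of the workspace instance
// lifecycle phases the bridge receives from ws-manager.
type phase int

const (
	phasePending phase = iota
	phaseCreating
	phaseInitializing
	phaseRunning
	phaseStopping
	phaseStopped
)

// lastPhase remembers the most recent phase seen per workspace instance ID.
var lastPhase = map[string]phase{}

// observePhase logs loudly when a status update moves backwards, e.g. a
// creating/running update arriving after stopped was already handled for the
// same instance, which is the situation in which the token has already been
// deleted although a pod for that instance is running again.
func observePhase(instanceID string, p phase) {
	if prev, ok := lastPhase[instanceID]; ok && p < prev {
		log.Printf("badly ordered status update for instance %s: phase %d after %d", instanceID, p, prev)
	}
	lastPhase[instanceID] = p
}
```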
I see that @sagor999 is looking at this now; he's definitely the right person. 👍 Do you think we're emitting status updates many times while exponentially backing off, and that this is causing the behavior?
I think this might be due to the retry logic. 🤔 When it deletes the pod, it also removes the finalizer first. In gitpod/components/ws-manager/pkg/manager/status.go (lines 329 to 351 in 4d48ccb), it will check if the pod is being deleted and if the finalizer is removed, and then it will update the workspace status to stopped. Back in the ws-manager retry logic, it will now create a new pod, and a subsequent call to
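For reference, a simplified sketch of the check described above; the actual logic lives in components/ws-manager/pkg/manager/status.go (lines 329 to 351 at 4d48ccb), and the function and finalizer names used here are illustrative:

```go
package manager

import corev1 "k8s.io/api/core/v1"

// isWorkspaceStopped approximates the status derivation discussed above:
// a pod that is being deleted and no longer carries the ws-manager finalizer
// is reported as stopped.
func isWorkspaceStopped(pod *corev1.Pod) bool {
	// A pod that is being deleted carries a deletion timestamp.
	if pod.DeletionTimestamp == nil {
		return false
	}

	// ws-manager removes its finalizer just before it lets the pod go away.
	for _, f := range pod.Finalizers {
		if f == "gitpod.io/finalizer" { // illustrative finalizer name
			return false
		}
	}

	// Deleted pod plus removed finalizer means "stopped", even if the retry
	// logic is about to create a replacement pod, which is the premature
	// stopped event the bridge reacts to.
	return true
}
```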
Let me think about how to best fix this. In theory we would want to short-circuit
We could add an annotation when we create the pod and remove it once we're out of the retry logic. This annotation would inform the status updates we send out from ws-manager. |
That's what I was thinking as well ^ @csweichel |
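A minimal sketch of that proposal, assuming a hypothetical annotation key and function name (neither is the actual implementation): ws-manager would set the annotation when it (re)creates the pod, remove it once the retry logic is done, and hold back a stopped status while it is present.

```go
package manager

import corev1 "k8s.io/api/core/v1"

// retryInProgressAnnotation is a hypothetical annotation key; it would be set
// when ws-manager creates the pod and removed once the retry logic has finished.
const retryInProgressAnnotation = "gitpod.io/pod-recreation-in-progress"

// suppressStoppedStatus reports whether a "stopped" status update should be
// held back because the pod deletion is part of the retry logic and a
// replacement pod is about to be created.
func suppressStoppedStatus(pod *corev1.Pod) bool {
	_, inRetry := pod.Annotations[retryInProgressAnnotation]
	return inRetry
}
```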
fwiw, I also just ran into this issue within this workspace: https://gitpodio-website-kfjstnnof79.ws-eu34.gitpod.io/ My IDE theme is still dark though, contrary to what was said in the original issue description:
Ran into it again (today @11:02 GMT+2) here: https://gitpodio-website-8vw6v27o15p.ws-eu41.gitpod.io/ FYI @sagor999 --> This time it switched my theme to white 🤔 |
@lucasvaltl it is deployed with gen42 clusters. Yours was still on gen41. |
Bug description
Sometimes, when you start a workspace around the same time as a Gitpod WebApp deployment (typically on Tuesdays and Thursdays around ~08:30 UTC), it can get into a bad state.
Symptoms: the Gitpod: Stop Workspace command is missing (if you look for it in F1, you won't find it).
The only solution is to stop the workspace (e.g. from your workspaces list) and restart it.
Steps to reproduce
Workspace affected
gitpodio-gitpod-ykfphcb8pwd
Expected behavior
Example repository
No response
Anything else?
No response