
Accessing server API at gitpod.io from within workspaces might fail sometimes #8703

Closed
jankeromnes opened this issue Mar 10, 2022 · 21 comments · Fixed by #9438
Assignees
Labels
  • operations: past incident - This issue arose during a past incident or its post-mortem
  • priority: highest (user impact) - Directly user impacting
  • team: webapp - Issue belongs to the WebApp team
  • team: workspace - Issue belongs to the Workspace team
  • type: bug - Something isn't working

Comments

@jankeromnes (Contributor) commented Mar 10, 2022

Bug description

Sometimes, when you start a workspace around the same time as a Gitpod WebApp deployment (typically on Tuesdays and Thursdays around ~08:30 UTC), it can get into a bad state.

Symptoms:

  • The IDE theme is always Light (quite noticeable if you normally use Dark theme)
  • The Git credential-helper doesn't work (e.g. if you try to push to a new branch, it will ask you for username and password)
  • Some Gitpod features are missing (e.g. if you look for Gitpod: Stop Workspace in F1, you won't find it)

The only solution is to stop the workspace (e.g. from your workspaces list) and restart it.

Screenshot 2022-03-10 at 09 29 12

Steps to reproduce

  1. Start a workspace ~~on a Tuesday or Thursday around ~08:30 UTC~~
  2. Try to push something to a branch
  3. It will ask for username and password

Workspace affected

gitpodio-gitpod-ykfphcb8pwd

Expected behavior

  1. The workspace should not end up in such a bad state or require a restart

Example repository

No response

Anything else?

No response

@jankeromnes added the type: bug, operations: past incident, team: webapp, team: workspace, and team: IDE labels on Mar 10, 2022
@jankeromnes (Contributor, Author) commented Mar 10, 2022

Copy/pasting some findings from this Slack thread (internal):

It seems that supervisor cannot access the GitHub token from server.

Uh, it seems that the workspace could not establish a connection to server at all:
Screenshot 2022-03-08 at 09 21 33

From the logs, server does not want to auth VS Code (actually everything in the workspace), although it granted a token.

Is it because the traffic is going to the incorrect cluster?
It mainly happened between the two spikes on the left: the first is the EU rollout, the second the US one. Between those, db-sync is offline. This could be a combination of that plus the traffic being redirected to another cluster.

Also:

Check the log /tmp/gitpod-git-credential-helper.log for details.


Other reports of this problem:

@jankeromnes (Contributor, Author):

Scheduling for Team WebApp due to this request:

It would be cool if someone could investigate before the next webapp deployment; it looks like something leads to an inconsistent DB state, which was not the case a week ago.

@jankeromnes moved this to Scheduled in 🍎 WebApp Team on Mar 10, 2022
@geropl removed the team: workspace and team: IDE labels on Mar 11, 2022
@geropl (Member) commented Mar 11, 2022

I removed the workspace & IDE labels to avoid noise/cross-chat. IMO we need to investigate first, and then maybe derive something for IDE / workspace.

Also, moving this into "needs design", because it's not actionable, yet.

@geropl moved this from Scheduled to Needs Design in 🍎 WebApp Team on Mar 11, 2022
@jankeromnes (Contributor, Author):

Thanks Gero, I agree. 👍

Also, moving this into "needs design", because it's not actionable, yet.

Interesting 💡 maybe we need a sort of "needs investigation" category too (or we could rename "needs design" -- not sure about this) 🤔 💭

@AlexTugarev (Member):

I'm curious what the reconnection behavior actually looks like for the workspace/supervisor, and how the IDE is supposed to re-request data that might have failed due to a rollout of new server instances.

@geropl (Member) commented Mar 11, 2022

@AlexTugarev The process is:

  • startWorkspace (instance)
    • create token 1
    • create OTS 1
  • wsManager.startWorkspace
  • ...
  • dashboard:
  • fetch OTS 1 from server, now holding a GP token 1 <-- @akosyakov reported that he saw frontend actually had a token
  • ... <-- what happens here???
  • dashboard tries to execute API call using GP token 1: 💥

Can you think of any way in which we might delete token 1 before it is actually retrieved? 🤔

Hypothesis 1: Maybe we are creating another token and deleting the old one? What could trigger this?

Hypothesis 2: During a rollout, the GLB decides that new connections go to the other (US) cluster, where the token is not (yet) properly synced. (Do we retry here?)
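
To illustrate the "do we retry here?" question in hypothesis 2: a minimal, self-contained sketch (not Gitpod's actual code; the endpoint and token values are made up) of retrying an API call whose 401 may be transient because the token has not yet been synced to the cluster serving the request:

package main

import (
    "fmt"
    "net/http"
    "time"
)

// callWithRetry retries an authenticated request with exponential backoff,
// treating a 401 as potentially transient (e.g. the token row has not been
// synced to the cluster that happened to serve this request yet).
func callWithRetry(url, token string, attempts int) (*http.Response, error) {
    backoff := 500 * time.Millisecond
    var lastErr error
    for i := 0; i < attempts; i++ {
        req, err := http.NewRequest(http.MethodGet, url, nil)
        if err != nil {
            return nil, err
        }
        req.Header.Set("Authorization", "Bearer "+token)

        resp, err := http.DefaultClient.Do(req)
        if err == nil && resp.StatusCode != http.StatusUnauthorized {
            return resp, nil // success, or a failure that retrying will not fix
        }
        if err != nil {
            lastErr = err
        } else {
            resp.Body.Close()
            lastErr = fmt.Errorf("got %d, token may not be synced yet", resp.StatusCode)
        }
        time.Sleep(backoff)
        backoff *= 2
    }
    return nil, fmt.Errorf("giving up after %d attempts: %w", attempts, lastErr)
}

func main() {
    // Hypothetical endpoint and token, purely for illustration.
    if _, err := callWithRetry("https://gitpod.io/api/example", "gp-token-1", 5); err != nil {
        fmt.Println(err)
    }
}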

@jankeromnes (Contributor, Author) commented Mar 14, 2022

Actually, this seems to happen even without webapp deployments.

I just hit it this morning, while opening #8757 in Gitpod:

Screenshot 2022-03-14 at 09 51 46

  • Light theme (expected: Dark)
  • Broken Git authentication
  • No F1 > Gitpod: Stop Workspace command
$ cat /tmp/gitpod-git-credential-helper.log
time="2022-03-14T08:51:37Z" level=info msg="error getting token from supervisior" error="rpc error: code = NotFound desc = no token available"

Will update the title & description.

@jankeromnes changed the title from "When starting a workspace around the same time as a webapp deployment, it can get into a bad state (needs restart)" to "Workspaces can sometimes start in a bad state (requires workspace restart)" on Mar 14, 2022
@akosyakov (Member):

I think we should add priority: highest (user impact); deployments are consistently breaking workspaces. It happened again: https://gitpod.slack.com/archives/C01KGM9BH54/p1647519959995009

@geropl (Member) commented Mar 18, 2022

From internal discussion:

Queries:

@AlexTugarev mentioned:

Also, could it be that there are 2 Gitpod tokens stored? I'm not familiar with the bearer token auth in the code, but it calls getOne on the user~token join, so if there were more tokens stored in the DB, it could also pick one that is not accepted.

From the git history it looks like you could answer this: where is the GitpodTokenType.API_AUTH_TOKEN, which is required/checked in the bearer auth, actually written? I cannot find any place where this is written to the DB.

@geropl added the priority: highest (user impact) label on Mar 18, 2022
@geropl moved this from Needs Design to Scheduled in 🍎 WebApp Team on Mar 18, 2022
@geropl changed the title from "Workspaces can sometimes start in a bad state (requires workspace restart)" to "Accessing server API at gitpod.io from within workspaces might fail sometimes" on Apr 5, 2022
roboquat pushed a commit that referenced this issue Apr 5, 2022
@AlexTugarev (Member):

@geropl and I reviewed the auth flow of supervisor and now assume that it might very well be an issue with server pods not being in sync. The problem might be: when we start a workspace from a server in one cluster, but the supervisor's requests are handled by a server in the other cluster, the tokens in the DB might not be synced yet.

with #9130 (merged) we should be able to verify the assumption.

@jankeromnes (Contributor, Author) commented Apr 6, 2022

This bug just happened to me again right now:

https://gitpodio-gitpod-4l1i0gza3k4.ws-eu38xl.gitpod.io/

Screenshot 2022-04-06 at 15 13 47

Closing the tab and starting a new workspace resolved it.

@svenefftinge (Member):

Regarding #9130, the logs show that the region and the server location seem to match, so this is probably not a db-sync issue.

@geropl self-assigned this on Apr 20, 2022
@geropl moved this from Scheduled to In Progress in 🍎 WebApp Team on Apr 20, 2022
@geropl (Member) commented Apr 20, 2022

I looked at a single failing case, starting from the log message "jsonrpc2: protocol error: reconnecting-ws: bad handshake: code 401". The respective workspace instance is still running, but the user has no token for that workspace instance, in any database.

@geropl (Member) commented Apr 20, 2022

OK, this turned out differently than expected: we actually delete the token here, because we receive a stopped event for that workspace instance 😢:

  • here is the sequence of events as received by ws-manager-bridge: logs
  • here is the specific stopped event in the logs: event
  • we're lucky, and also have a trace proving all of this: span in honeycomb

The sequence clearly shows that the bridge receives a series of well-ordered events (as proven by statusVersion) whose phase flickers between pending (1) and stopped (6). 💥

@kylos101 I'm inclined to ask whether this could be related to the recently introduced "retry" feature. But I think we've been seeing this for longer than that feature has existed. Could someone from Workspace look into this? Maybe with support from the WebApp side? I don't think we can do much about this in bridge, except be a bit louder when we notice badly ordered status updates. 🤔
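
Purely to illustrate that last point, here is a minimal sketch (written in Go to match the other snippets in this thread; the real ws-manager-bridge code is not shown here and may differ) of what "being a bit louder" could look like: remember which instances were already reported stopped, and warn when a later update claims they are pending again.

package main

import (
    "log"
    "sync"
)

type StatusUpdate struct {
    InstanceID    string
    StatusVersion uint64
    Phase         string // "pending", "running", "stopped", ...
}

type Bridge struct {
    mu      sync.Mutex
    stopped map[string]uint64 // instance ID -> statusVersion of its stopped event
}

func NewBridge() *Bridge {
    return &Bridge{stopped: make(map[string]uint64)}
}

// HandleUpdate records stopped instances and logs loudly when an instance that
// was already reported stopped comes back with a non-stopped phase later on.
func (b *Bridge) HandleUpdate(u StatusUpdate) {
    b.mu.Lock()
    defer b.mu.Unlock()

    if v, wasStopped := b.stopped[u.InstanceID]; wasStopped && u.Phase != "stopped" {
        log.Printf("WARN: instance %s was reported stopped (statusVersion %d) but is now %q (statusVersion %d)",
            u.InstanceID, v, u.Phase, u.StatusVersion)
    }
    if u.Phase == "stopped" {
        b.stopped[u.InstanceID] = u.StatusVersion
    }
}

func main() {
    b := NewBridge()
    // Mimics the flicker described above: pending, stopped, then pending again.
    b.HandleUpdate(StatusUpdate{InstanceID: "ws-abc", StatusVersion: 1, Phase: "pending"})
    b.HandleUpdate(StatusUpdate{InstanceID: "ws-abc", StatusVersion: 6, Phase: "stopped"})
    b.HandleUpdate(StatusUpdate{InstanceID: "ws-abc", StatusVersion: 7, Phase: "pending"}) // triggers the warning
}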

@kylos101 (Contributor):

I see that @sagor999 is looking at this now; he's definitely the right person. 👍 Do you think we're emitting the status many times while exponentially backing off, and that is causing this behavior?

@sagor999 (Contributor) commented Apr 20, 2022

I think this might be due to retry logic. 🤔
Reason for thinking so:
In the logs provided, I see a few updates with this message: pod is pending. (And I do see that the next ws-bridge update happens exactly 5 seconds later with the status set to stopped.)
When a pod is pending and cannot get scheduled within 5 seconds, ws-manager will delete that pod and try to create a new one.

When it deletes the pod, it also removes the finalizer first.
If, at the same time, a call is made to extractStatusFromPod:

if isPodBeingDeleted(pod) {
    result.Phase = api.WorkspacePhase_STOPPING

    _, podFailedBeforeBeingStopped := pod.Annotations[workspaceFailedBeforeStoppingAnnotation]
    if !podFailedBeforeBeingStopped {
        // While the pod is being deleted we do not care or want to know about any failure state.
        // If the pod got stopped because it failed we will have sent out a Stopping status with a "failure"
        result.Conditions.Failed = ""
    }

    var hasFinalizer bool
    for _, f := range wso.Pod.Finalizers {
        if f == gitpodFinalizerName {
            hasFinalizer = true
            break
        }
    }
    if !hasFinalizer {
        // We do this independently of the disposal status because pods only get their finalizer
        // once they're running. If they fail before they reach the running phase we'll never see
        // a disposal status, hence would never stop the workspace.
        result.Phase = api.WorkspacePhase_STOPPED
    }

it will check whether the pod is being deleted and whether the finalizer has been removed; if so, it will update the workspace status to stopped.

Back in the ws-manager retry logic, it will now create a new pod, and a subsequent call to extractStatusFromPod will update the status back to pending:

result.Phase = api.WorkspacePhase_PENDING

Let me think about how to fix this properly. In theory we would want to short-circuit extractStatusFromPod if ws-manager is still attempting to create the pod... without introducing some other odd behavior. 😄

@sagor999 added the team: workspace label on Apr 20, 2022
@sagor999 moved this to In Progress in 🌌 Workspace Team on Apr 20, 2022
@csweichel (Contributor):

We could add an annotation when we create the pod and remove it once we're out of the retry logic. This annotation would inform the status updates we send out from ws-manager.

@sagor999 (Contributor) commented Apr 20, 2022

That's what I was thinking as well ^ @csweichel
I think it is fairly safe; the only exception is if, for some reason, we fail to update the pod to remove that annotation afterwards. But I guess we can be very loud about that error if it happens.
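
For illustration, a minimal sketch of that annotation idea: the annotation key, the helper functions, and the simplified phase logic below are hypothetical placeholders, not ws-manager's actual code.

package main

import (
    "fmt"

    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Hypothetical annotation marking that ws-manager is still inside its
// create-pod retry loop for this workspace.
const creationRetryAnnotation = "gitpod.io/creationInProgress"

// newWorkspacePod creates the pod with the marker already set, so the very
// first status extraction after creation sees it.
func newWorkspacePod(name string) *corev1.Pod {
    return &corev1.Pod{
        ObjectMeta: metav1.ObjectMeta{
            Name:        name,
            Annotations: map[string]string{creationRetryAnnotation: "true"},
        },
    }
}

// phaseForPod mirrors the shape of the extractStatusFromPod excerpt quoted above:
// while the retry marker is present, a deleted pod without a finalizer is
// reported as PENDING instead of STOPPED, so no stopped event reaches the bridge.
func phaseForPod(pod *corev1.Pod, beingDeleted, hasFinalizer bool) string {
    if _, retrying := pod.Annotations[creationRetryAnnotation]; retrying {
        return "PENDING"
    }
    switch {
    case beingDeleted && !hasFinalizer:
        return "STOPPED"
    case beingDeleted:
        return "STOPPING"
    default:
        return "RUNNING"
    }
}

// clearCreationRetry would run once ws-manager leaves its retry loop; if this
// update ever fails, that is the case to be "very loud" about.
func clearCreationRetry(pod *corev1.Pod) {
    delete(pod.Annotations, creationRetryAnnotation)
}

func main() {
    pod := newWorkspacePod("ws-abc")
    fmt.Println(phaseForPod(pod, true, false)) // PENDING: creation retry still in progress
    clearCreationRetry(pod)
    fmt.Println(phaseForPod(pod, true, false)) // STOPPED: deletion is now final
}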

@lucasvaltl (Contributor):

fwiw, I also just ran into this issue within this workspace: https://gitpodio-website-kfjstnnof79.ws-eu34.gitpod.io/

My IDE theme is still dark though, contrary to what was said in the original issue description:

The IDE theme is always Light (quite noticeable if you normally use Dark theme)

Repository owner moved this from In Progress to Done in 🌌 Workspace Team Apr 21, 2022
@lucasvaltl (Contributor):

Ran into it again (today @11:02 GMT+2) here: https://gitpodio-website-8vw6v27o15p.ws-eu41.gitpod.io/

FYI @sagor999

--> This time it switched my theme to white 🤔

@sagor999 (Contributor):

@lucasvaltl the fix is deployed with gen42 clusters. Yours was still on gen41.
