
Accessing server API at gitpod.io from within workspaces might fail sometimes #8703

Closed
jankeromnes opened this issue Mar 10, 2022 · 21 comments · Fixed by #9438
Assignees
Labels
  • operations: past incident - This issue arose during a past incident or its post-mortem
  • priority: highest (user impact) - Directly user impacting
  • team: webapp - Issue belongs to the WebApp team
  • team: workspace - Issue belongs to the Workspace team
  • type: bug - Something isn't working

Comments

@jankeromnes (Contributor) commented Mar 10, 2022

Bug description

Sometimes, when you start a workspace around the same time as a Gitpod WebApp deployment (typically on Tuesdays and Thursdays around ~08:30 UTC), it can get into a bad state.

Symptoms:

  • The IDE theme is always Light (quite noticeable if you normally use Dark theme)
  • The Git credential-helper doesn't work (e.g. if you try to push to a new branch, it will ask you for username and password)
  • Some Gitpod features are missing (e.g. if you look for Gitpod: Stop Workspace in F1, you won't find it)

The only solution is to stop the workspace (e.g. from your workspaces list) and restart it.

Screenshot 2022-03-10 at 09 29 12

Steps to reproduce

  1. Start a workspace ~~on a Tuesday or Thursday around ~08:30 UTC~~
  2. Try to push something to a branch
  3. It will ask for username and password

Workspace affected

gitpodio-gitpod-ykfphcb8pwd

Expected behavior

  1. The workspace should not end up in such a bad state or require a restart

Example repository

No response

Anything else?

No response

@jankeromnes added the type: bug, operations: past incident, team: webapp, team: workspace, and team: IDE labels on Mar 10, 2022
@jankeromnes (Contributor, Author) commented Mar 10, 2022

Copy/pasting some findings from this Slack thread (internal):

It seems that supervisor cannot access the GitHub token from server.

Uh, it seems that the workspace could not establish a connection to server at all:
Screenshot 2022-03-08 at 09 21 33

From the logs, server does not want to auth VS Code (actually everything in the workspace), although it granted a token.

Is it because the traffic is going to the incorrect cluster?
It mainly happened between the two spikes on the left: the first is the EU rollout, the second the US one. Between those, db-sync is offline. This could be a combination of that plus the traffic being redirected to another cluster.

Also:

Check the log /tmp/gitpod-git-credential-helper.log for details.


Other reports of this problem:

@jankeromnes (Contributor, Author):

Scheduling for Team WebApp due to this request:

It would be cool if someone could investigate before the next webapp deployment; it looks like something leads to an inconsistent DB state, which was not the case a week ago.

@jankeromnes moved this to Scheduled in 🍎 WebApp Team on Mar 10, 2022
@geropl removed the team: workspace and team: IDE labels on Mar 11, 2022
@geropl (Member) commented Mar 11, 2022

I removed the workspace & IDE labels to avoid noise/cross-chat. IMO we need to investigate first, and then maybe derive something for IDE / workspace.

Also, moving this into "needs design", because it's not actionable, yet.

@geropl moved this from Scheduled to Needs Design in 🍎 WebApp Team on Mar 11, 2022
@jankeromnes (Contributor, Author):

Thanks Gero, I agree. 👍

Also, moving this into "needs design", because it's not actionable, yet.

Interesting 💡 maybe we need a sort of "needs investigation" category too (or we could rename "needs design" -- not sure about this) 🤔 💭

@AlexTugarev (Member):

I'm curious what the reconnection behavior actually looks like for the workspace/supervisor, and how the IDE is supposed to re-request data that might have failed due to a rollout of new server instances.

@geropl (Member) commented Mar 11, 2022

@AlexTugarev The process is:

  • startWorkspace (instance)
    • create token 1
    • create OTS 1
  • wsManager.startWorkspace
  • ...
  • dashboard:
  • fetch OTS 1 from server, now holding a GP token 1 <-- @akosyakov reported that he saw frontend actually had a token
  • ... <-- what happens here???
  • dashboard tries to execute API call using GP token 1: 💥

Can you think of any way in which we might delete token 1 before it is actually retrieved? 🤔

Hypothesis 1: Maybe we are creating another token and deleting the old one? What could trigger this?

Hypothesis 2: During a rollout, the GLB decides that new connections go to the other (US) cluster, where the token is not (yet) properly synced. (Do we retry here?)
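
To illustrate the "do we retry here?" question in hypothesis 2: a minimal, self-contained sketch (not Gitpod's actual code; the endpoint and token values are made up) of retrying an API call whose 401 may be transient because the token has not yet been synced to the cluster serving the request:

package main

import (
    "fmt"
    "net/http"
    "time"
)

// callWithRetry retries an authenticated request with exponential backoff,
// treating a 401 as potentially transient (e.g. the token row has not been
// synced to the cluster that happened to serve this request yet).
func callWithRetry(url, token string, attempts int) (*http.Response, error) {
    backoff := 500 * time.Millisecond
    var lastErr error
    for i := 0; i < attempts; i++ {
        req, err := http.NewRequest(http.MethodGet, url, nil)
        if err != nil {
            return nil, err
        }
        req.Header.Set("Authorization", "Bearer "+token)

        resp, err := http.DefaultClient.Do(req)
        if err == nil && resp.StatusCode != http.StatusUnauthorized {
            return resp, nil // success, or a failure that retrying will not fix
        }
        if err != nil {
            lastErr = err
        } else {
            resp.Body.Close()
            lastErr = fmt.Errorf("got %d, token may not be synced yet", resp.StatusCode)
        }
        time.Sleep(backoff)
        backoff *= 2
    }
    return nil, fmt.Errorf("giving up after %d attempts: %w", attempts, lastErr)
}

func main() {
    // Hypothetical endpoint and token, purely for illustration.
    if _, err := callWithRetry("https://gitpod.io/api/example", "gp-token-1", 5); err != nil {
        fmt.Println(err)
    }
}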

@jankeromnes (Contributor, Author) commented Mar 14, 2022

Actually, this seems to happen even without webapp deployments.

I just hit it this morning, while opening #8757 in Gitpod:

Screenshot 2022-03-14 at 09 51 46

  • Light theme (expected: Dark)
  • Broken Git authentication
  • No F1 > Gitpod: Stop Workspace command
$ cat /tmp/gitpod-git-credential-helper.log
time="2022-03-14T08:51:37Z" level=info msg="error getting token from supervisior" error="rpc error: code = NotFound desc = no token available"

Will update the title & description.

@jankeromnes changed the title from "When starting a workspace around the same time as a webapp deployment, it can get into a bad state (needs restart)" to "Workspaces can sometimes start in a bad state (requires workspace restart)" on Mar 14, 2022
@akosyakov (Member):

I think we should add priority: highest (user impact); deployments are consistently breaking workspaces. It happened again: https://gitpod.slack.com/archives/C01KGM9BH54/p1647519959995009

@geropl (Member) commented Mar 18, 2022

From internal discussion:

Queries:

@AlexTugarev mentioned:

Also, could it be that there are 2 Gitpod tokens stored? I'm not familiar with the bearer token auth in the code, but it calls getOne on the user~token join, so if there were more tokens stored in the DB, it could also pick one that is not accepted.

From the git history it looks like you could answer this: where is the GitpodTokenType.API_AUTH_TOKEN, which is required/checked in the bearer auth, actually written? I cannot find any place where this is written to the DB.

@geropl added the priority: highest (user impact) label on Mar 18, 2022
@geropl moved this from Needs Design to Scheduled in 🍎 WebApp Team on Mar 18, 2022
@geropl changed the title from "Workspaces can sometimes start in a bad state (requires workspace restart)" to "Accessing server API at gitpod.io from within workspaces might fail sometimes" on Apr 5, 2022
roboquat pushed a commit that referenced this issue Apr 5, 2022
@AlexTugarev (Member):

@geropl and I reviewed the auth flow of supervisor and now assume that it might very well be an issue with server pods not being in sync. The problem might be: when we start a workspace from a server in one cluster, but the supervisor's requests are handled by a server in the other cluster, the tokens in the DB might not be synced yet.

with #9130 (merged) we should be able to verify the assumption.

@jankeromnes (Contributor, Author) commented Apr 6, 2022

This bug just happened to me again right now:

https://gitpodio-gitpod-4l1i0gza3k4.ws-eu38xl.gitpod.io/

Screenshot 2022-04-06 at 15 13 47

Closing the tab and starting a new workspace resolved it.

@svenefftinge (Member):

Regarding #9130, the logs show that the region and the server location seem to match, so this is probably not a db-sync issue.

@geropl self-assigned this on Apr 20, 2022
@geropl moved this from Scheduled to In Progress in 🍎 WebApp Team on Apr 20, 2022
@geropl (Member) commented Apr 20, 2022

I looked at a single failing case, starting from the log message "jsonrpc2: protocol error: reconnecting-ws: bad handshake: code 401". The respective workspace instance is still running, but the user has no token for that workspace instance, in any database.

@geropl (Member) commented Apr 20, 2022

OK, this turned out differently than expected: we actually delete the token here, because we receive a stopped event for that workspace instance 😢:

  • here is the sequence of events as received by ws-manager-bridge: logs
  • here is the specific stopped event in the logs: event
  • we're lucky, and also have a trace proving all of this: span in honeycomb

The sequence clearly shows that the bridge receives a series of well-ordered events (as proven by statusVersion) whose phase flickers between pending (1) and stopped (6). 💥

@kylos101 I'm inclined to ask whether this could be related to the recently introduced "retry" feature. But I think we've been seeing this for longer than that feature has existed. Could someone from Workspace look into this? Maybe with support from the WebApp side? I don't think we can do much about this in bridge, except be a bit louder when we notice badly ordered status updates. 🤔
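
Purely to illustrate that last point, here is a minimal sketch (written in Go to match the other snippets in this thread; the real ws-manager-bridge code is not shown here and may differ) of what "being a bit louder" could look like: remember which instances were already reported stopped, and warn when a later update claims they are pending again.

package main

import (
    "log"
    "sync"
)

type StatusUpdate struct {
    InstanceID    string
    StatusVersion uint64
    Phase         string // "pending", "running", "stopped", ...
}

type Bridge struct {
    mu      sync.Mutex
    stopped map[string]uint64 // instance ID -> statusVersion of its stopped event
}

func NewBridge() *Bridge {
    return &Bridge{stopped: make(map[string]uint64)}
}

// HandleUpdate records stopped instances and logs loudly when an instance that
// was already reported stopped comes back with a non-stopped phase later on.
func (b *Bridge) HandleUpdate(u StatusUpdate) {
    b.mu.Lock()
    defer b.mu.Unlock()

    if v, wasStopped := b.stopped[u.InstanceID]; wasStopped && u.Phase != "stopped" {
        log.Printf("WARN: instance %s was reported stopped (statusVersion %d) but is now %q (statusVersion %d)",
            u.InstanceID, v, u.Phase, u.StatusVersion)
    }
    if u.Phase == "stopped" {
        b.stopped[u.InstanceID] = u.StatusVersion
    }
}

func main() {
    b := NewBridge()
    // Mimics the flicker described above: pending, stopped, then pending again.
    b.HandleUpdate(StatusUpdate{InstanceID: "ws-abc", StatusVersion: 1, Phase: "pending"})
    b.HandleUpdate(StatusUpdate{InstanceID: "ws-abc", StatusVersion: 6, Phase: "stopped"})
    b.HandleUpdate(StatusUpdate{InstanceID: "ws-abc", StatusVersion: 7, Phase: "pending"}) // triggers the warning
}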

@kylos101 (Contributor):

I see that @sagor999 is looking at this now; he's definitely the right person. 👍 Do you think we're emitting the status many times while exponentially backing off, and that is causing this behavior?

@sagor999 (Contributor) commented Apr 20, 2022

I think this might be due to retry logic. 🤔
Reason for thinking so:
In the logs provided, I see a few updates with this message: pod is pending. (And I do see that the next ws-bridge update happens exactly 5 seconds later with the status set to stopped.)
When a pod is pending and cannot get scheduled within 5 seconds, ws-manager will delete that pod and try to create a new one.

When it deletes the pod, it also removes the finalizer first.
If, at the same time, a call is made to extractStatusFromPod:

if isPodBeingDeleted(pod) {
    result.Phase = api.WorkspacePhase_STOPPING

    _, podFailedBeforeBeingStopped := pod.Annotations[workspaceFailedBeforeStoppingAnnotation]
    if !podFailedBeforeBeingStopped {
        // While the pod is being deleted we do not care or want to know about any failure state.
        // If the pod got stopped because it failed we will have sent out a Stopping status with a "failure"
        result.Conditions.Failed = ""
    }

    var hasFinalizer bool
    for _, f := range wso.Pod.Finalizers {
        if f == gitpodFinalizerName {
            hasFinalizer = true
            break
        }
    }
    if !hasFinalizer {
        // We do this independently of the disposal status because pods only get their finalizer
        // once they're running. If they fail before they reach the running phase we'll never see
        // a disposal status, hence would never stop the workspace.
        result.Phase = api.WorkspacePhase_STOPPED
    }

it will check whether the pod is being deleted and whether the finalizer has been removed; if so, it will update the workspace status to stopped.

Back in the ws-manager retry logic, it will now create a new pod, and a subsequent call to extractStatusFromPod will update the status back to pending:

result.Phase = api.WorkspacePhase_PENDING

Let me think about how to fix this properly. In theory we would want to short-circuit extractStatusFromPod if ws-manager is still attempting to create the pod... without introducing some other odd behavior. 😄

@sagor999 added the team: workspace label on Apr 20, 2022
@sagor999 moved this to In Progress in 🌌 Workspace Team on Apr 20, 2022
@csweichel (Contributor):

We could add an annotation when we create the pod and remove it once we're out of the retry logic. This annotation would inform the status updates we send out from ws-manager.

@sagor999 (Contributor) commented Apr 20, 2022

That's what I was thinking as well ^ @csweichel
I think it is fairly safe; the only exception is if, for some reason, we fail to update the pod to remove that annotation afterwards. But I guess we can be very loud about that error if it happens.
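
For illustration, a minimal sketch of that annotation idea: the annotation key, the helper functions, and the simplified phase logic below are hypothetical placeholders, not ws-manager's actual code.

package main

import (
    "fmt"

    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Hypothetical annotation marking that ws-manager is still inside its
// create-pod retry loop for this workspace.
const creationRetryAnnotation = "gitpod.io/creationInProgress"

// newWorkspacePod creates the pod with the marker already set, so the very
// first status extraction after creation sees it.
func newWorkspacePod(name string) *corev1.Pod {
    return &corev1.Pod{
        ObjectMeta: metav1.ObjectMeta{
            Name:        name,
            Annotations: map[string]string{creationRetryAnnotation: "true"},
        },
    }
}

// phaseForPod mirrors the shape of the extractStatusFromPod excerpt quoted above:
// while the retry marker is present, a deleted pod without a finalizer is
// reported as PENDING instead of STOPPED, so no stopped event reaches the bridge.
func phaseForPod(pod *corev1.Pod, beingDeleted, hasFinalizer bool) string {
    if _, retrying := pod.Annotations[creationRetryAnnotation]; retrying {
        return "PENDING"
    }
    switch {
    case beingDeleted && !hasFinalizer:
        return "STOPPED"
    case beingDeleted:
        return "STOPPING"
    default:
        return "RUNNING"
    }
}

// clearCreationRetry would run once ws-manager leaves its retry loop; if this
// update ever fails, that is the case to be "very loud" about.
func clearCreationRetry(pod *corev1.Pod) {
    delete(pod.Annotations, creationRetryAnnotation)
}

func main() {
    pod := newWorkspacePod("ws-abc")
    fmt.Println(phaseForPod(pod, true, false)) // PENDING: creation retry still in progress
    clearCreationRetry(pod)
    fmt.Println(phaseForPod(pod, true, false)) // STOPPED: deletion is now final
}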

@lucasvaltl (Contributor):

fwiw, I also just ran into this issue within this workspace: https://gitpodio-website-kfjstnnof79.ws-eu34.gitpod.io/

My IDE theme is still dark though, contrary to what was said in the original issue description:

The IDE theme is always Light (quite noticeable if you normally use Dark theme)

Repository owner moved this from In Progress to Done in 🌌 Workspace Team Apr 21, 2022
@lucasvaltl (Contributor):

Ran into it again (today @11:02 GMT+2) here: https://gitpodio-website-8vw6v27o15p.ws-eu41.gitpod.io/

FYI @sagor999

--> This time it switched my theme to white 🤔

@sagor999 (Contributor):

@lucasvaltl the fix is deployed with gen42 clusters. Yours was still on gen41.
