cr-syncer does not retry on failed updates #64
I made another observation. Sometimes the status of robots is not updated in the cloud cluster, even though the cr-syncer logs give the impression that the status was synced. On the robot the status is continuously updated and the cr-syncer receives the update event, but it is not synced to the cloud cluster. Looking into the nginx log of the cloud cluster, there is a 409 error for each robot in the meantime, and then no more log entries for those CRs for a while. Thus, the cr-syncer is probably not calling the cloud cluster kube-api, even though it is still receiving update events.
It seems that this issue affects all CRs of the Robot CRD: either the CR updates of all robots fail or they all succeed. The same issue might exist for other CRDs too, but it does not necessarily happen at the same time.
I just looked up the 409 error: the API server returns it when there is a conflict. One reason for this is that the cr-syncer refers to a version of the CR that is too old. I don't know why this happens with the Robot CR, because we only change it manually when changing the labels, but it has already happened several times since yesterday.
I was wrong about the number of 409 requests. There is not just one request per CR but ~40 requests within a couple of seconds. Then there is a break of about 10 minutes before requests for this CRD start again, and those requests are successful HTTP 200 requests. Maybe this storm of failed requests leads the rate-limiting interface to back off for a while until it starts with new requests again. How about using Patch instead of Update and UpdateStatus? In our case it is clear that upstream is always the origin of the spec and downstream is always the origin of the status, so there is no risk that upstream changes something that downstream is about to update, or vice versa.
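For illustration, a minimal sketch of that Patch idea, assuming the cr-syncer works with the dynamic client (the function and variable names here are hypothetical, not the actual cr-syncer code): the downstream status is sent upstream as a JSON merge patch against the status subresource, so the call no longer depends on a current resourceVersion.

```go
// Sketch only: push the downstream status upstream as a JSON merge patch
// instead of UpdateStatus. gvr and obj stand in for whatever the real
// syncer tracks; this is not the actual cr-syncer implementation.
package example

import (
	"context"
	"encoding/json"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/dynamic"
)

func pushStatusPatch(ctx context.Context, client dynamic.Interface, gvr schema.GroupVersionResource, obj *unstructured.Unstructured) error {
	status, found, err := unstructured.NestedFieldNoCopy(obj.Object, "status")
	if err != nil || !found {
		return fmt.Errorf("no status on %s: %v", obj.GetName(), err)
	}
	// The patch carries only the status; the upstream spec and metadata are
	// left untouched, matching the "downstream owns status" assumption.
	patch, err := json.Marshal(map[string]interface{}{"status": status})
	if err != nil {
		return err
	}
	_, err = client.Resource(gvr).Namespace(obj.GetNamespace()).Patch(
		ctx, obj.GetName(), types.MergePatchType, patch,
		metav1.PatchOptions{}, "status")
	return err
}
```

The same pattern applied to the spec in the other direction would avoid the resourceVersion conflicts entirely, at the cost of the divergence concern discussed further down in this thread.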
Questions for you:
This is plausible. I've not seen the 409 storms, and I only see upstream 409s when a separate process is updating the upstream resource too, which makes me wonder what is different in your setup. The 409s I see are immediately and successfully retried. Given that 409s are quite common, I'm surprised the controller-runtime is handling them so badly here: I would expect to see exponential backoff rather than many-retries-then-a-long-pause.
I can't reproduce this behavior, so something weird is going on. In your video, this was already broken at the start. There are a couple of codepaths that print neither success nor failure:
If we can't come up with a better idea, adding print debugging to the start of syncDownstream and to the silent return paths would narrow this down.
I'm not 100% sure about the mechanics of Patch, but it seems like it would lead to divergence between the upstream and downstream resources. For example, if you add a property downstream and then remove it, you would need some way of crafting the patch so that the property is deleted upstream as well. Update/UpdateStatus have the semantics that we want here. Also, if we can't trust the resourceVersion that we're getting, we can't trust the spec/status either, so this would probably just break differently.
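To make the divergence concern concrete, here is a small sketch (the `debugInfo` field is made up) of why a merge patch built only from the new downstream object never deletes a field upstream, whereas a patch diffed against the previous revision does, by emitting an explicit null:

```go
// Sketch: merge-patch deletion semantics, using the evanphx/json-patch
// library as an example diff implementation.
package main

import (
	"fmt"

	jsonpatch "github.com/evanphx/json-patch"
)

func main() {
	previous := []byte(`{"status":{"state":"RUNNING","debugInfo":"foo"}}`)
	current := []byte(`{"status":{"state":"RUNNING"}}`) // debugInfo was removed downstream

	// A patch that is just the current object leaves the stale debugInfo
	// field in place when applied upstream.
	naive := current

	// A diff-based merge patch deletes it via an explicit null:
	// {"status":{"debugInfo":null}}
	diff, err := jsonpatch.CreateMergePatch(previous, current)
	if err != nil {
		panic(err)
	}
	fmt.Printf("naive patch: %s\ndiff patch:  %s\n", naive, diff)
}
```

Diffing against the last synced revision only works if the syncer keeps that revision around, which reintroduces a variant of the staleness problem the 409s are already signalling.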
This is how the issue started on the robot side. You find like 8
I wondered why the issue occurs simultaneously on all robots, and I may have found the reason in the cloud cluster, which is running on version
That explanation sounds right on the money. The logs show:
The cr-syncer never receives the
Maybe an exponential backoff is not the best choice for the cr-syncer. As you can see here, there is a series of very fast retries with a high probability that the key is not yet in sync, followed by a very long backoff time. A more linear approach would be better for this case. Btw, I fixed our
I changed the limiters in this commit.
Anyhow, it is running better, because it at least refreshes every 5 minutes when the resyncPeriod is over. If we reduce this to 1 minute, that is at least a workaround.
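For reference, a rough sketch of what a flatter rate limiter for the work queue could look like (the durations here are illustrative; the actual commit may use different values): the per-item backoff is capped at a few seconds so that a burst of 409s cannot silence a key for many minutes.

```go
// Sketch: a work-queue rate limiter with a capped per-item backoff.
// The concrete durations are placeholders, not the values from the commit.
package example

import (
	"time"

	"golang.org/x/time/rate"
	"k8s.io/client-go/util/workqueue"
)

func newFlatterRateLimiter() workqueue.RateLimiter {
	return workqueue.NewMaxOfRateLimiter(
		// Per-item exponential backoff from 100ms, capped at 5s instead of
		// growing into minutes.
		workqueue.NewItemExponentialFailureRateLimiter(100*time.Millisecond, 5*time.Second),
		// Overall limit of 10 requeues per second with a burst of 100,
		// to keep a retry storm from hammering the apiserver.
		&workqueue.BucketRateLimiter{Limiter: rate.NewLimiter(rate.Limit(10), 100)},
	)
}
```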
I'm not sure if you mentioned this anywhere, but does/did this issue occur before the recent changes? I don't see why the removal of the finalizer would affect the timing of update events from the cloud cluster, so I'd expect the recent changes to be unrelated, but it would be nice to know for sure.
It seems that either the cloud apiserver, the nginx-ingress, or the SharedInformerCache is preventing the cr-syncer from seeing update events, but if the apiserver and ingress are healthy I can't think of a cause.
I just checked the logs, and the issue appeared before too, but far less often than now. Additionally, it appeared rarely, and not on the Robot CRD but on other CRDs which already had a status subresource. My guess is that this is somehow related to the activation of status subresources in CRDs, and we notice it right now because of the special circumstances with the Robot CRD.
Thanks, that seems most likely. I would guess that the apiserver is misbehaving. I will try updating my cluster to 1.16.15-gke.7800 and seeing if I can reproduce the issue, although I was already on 1.16.15-gke.6000, so the difference is probably due to the differing use patterns (I was doing frequent status updates but hadn't been doing them for a long time). I guess the easiest workaround for you would be to either revert to the older release, or to disable the status subresource on the Robot CR and set CR_SYNCER_RBAC=false.
During my last test, when I just disabled the status subresource, it went well for a while. Then the errors appeared again... I don't know why. It supports status and status subtrees, adds the part which is owned by the robot (removing manually created items in this section), and leaves the rest alone in case it is a status subtree. Yet, there are open points:
This solution looks promising to me. The concept of the federation layer describes pretty clearly which properties are owned by the cloud and which by the robots. This could be handled pretty well by patching CRs, so I don't see much benefit in messing around with keys that are not completely synced. What do you think?
I might be missing something, but it seems that patching doesn't address the root cause. As I understand it:
Patching avoids the 409s because it means that the status can still be updated after the upstream watch fails. However, the cr-syncer will still fail to sync the spec, as it won't get update events. If the upstream spec changes, the downstream spec will not be updated. This is not so bad for the Robot (as spec changes are rare) but would presumably be bad for WarehouseOrders or ChartAssignments. Do you think that's the case? Or is there a reason that the spec syncing would behave differently after this change?
Last Friday I had some time and finalized the patch version of the cr-syncer for up- and downstream, including fixing the tests and some cleanup. You can find the result here if you would like to have a look at it. First the good news: I finally found the root cause and "solved" it. The reason for all this was a bit unexpected, so let me summarize what went wrong:
That could be a kind of pitfall: any crashing app with an ingress configuration could break the robot communication. Maybe there is a way to change the nginx configuration to mitigate that risk. From my perspective it is worth investing some time here. Even if it was not the root cause, there are some points in the cr-syncer that we might be able to improve:
@drigz What do you think?
Why did "each Prometheus restart and crash leaded to a reload of nginx configuration"? Or was the nginx killed to free up memory due to Prometheus spiking in memory use? |
Amazing work Oliver! I can reproduce this by deleting the prometheus pod in the GKE cluster.
When it crashes, ingress-nginx rewrites the nginx config to return a 503 for /prometheus/. When the pod comes back up, the working config is restored. Reloading the config causes nginx to shut down the old workers, terminating the existing WATCH connections. By testing with curl, it appears to be getting a TLS close notify, so I don't know why client-go doesn't start a new WATCH. Changing the rate limiter and/or resync period to avoid extreme backoff seems sensible but not sufficient in my opinion. Config updates are not rare, and as you say, brief network outages could cause similar behavior. Things I would like to try first:
https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.19.md#changelog-since-v1194: I believe upgrading the Kubernetes client to v1.19.4 will improve the behavior if the network drops out, by setting a timeout of 30s. We might want to apply this config manually as upgrading to v1.19.4 may not be trivial. I still haven't worked out why client-go doesn't notice the graceful close, though:
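If I read the changelog correctly, the 30s timeout refers to the HTTP/2 connection health check that client-go enables by default from v0.19.4 (configurable via environment variables, if I remember correctly). A rough sketch of applying the same mechanism manually to a transport follows; the wiring into the cr-syncer's rest.Config is omitted and the values are the presumed defaults:

```go
// Sketch of the mechanism as I understand it, not the cr-syncer's code:
// with ReadIdleTimeout set, the HTTP/2 layer sends a ping after 30s without
// frames and closes the connection if the ping is not answered, so a dead
// WATCH is detected and restarted instead of hanging forever.
package example

import (
	"net/http"
	"time"

	"golang.org/x/net/http2"
)

func newHealthCheckedClient() (*http.Client, error) {
	transport := &http.Transport{}
	h2, err := http2.ConfigureTransports(transport)
	if err != nil {
		return nil, err
	}
	h2.ReadIdleTimeout = 30 * time.Second // ping after 30s of silence
	h2.PingTimeout = 15 * time.Second     // drop the connection if the ping is unanswered
	return &http.Client{Transport: transport}, nil
}
```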
This limits the effect of a connection dropout by detecting it and reconnecting after ~30s. This is added to Kubernetes in v1.19.4, but I think we need to switch to Go modules before updating to that. Fixes googlecloudrobotics#64. Change-Id: I709b6bcecc35fd28bd904e6047bf479b0609117d
The newer version will gracefully close the connection to the cr-syncer when the config reloads, so this goes some way towards fixing googlecloudrobotics#64 (although the cr-syncer should handle dropped connections better). The Deployment is based on upstream's deploy.yaml, but without RBAC policies, the namespace and the service account, to avoid changing too much. The newer Deployment drops privileges and has more graceful shutdown behavior. https://github.com/kubernetes/ingress-nginx/blob/master/deploy/static/provider/cloud/deploy.yaml This requires adjusting the ingress due to the following breaking changes (which, incidentally, despite the "breaking changes" category in the changelog, were not listed as breaking...): - the secure-backends annotation has been removed - path rewriting now uses regex groups rather than prefixes While we're here, we can stop using the deprecated extensions/v1beta1 Ingress. Change-Id: I24c266f21e489a5c9079e9f884699c84d6be6a0e
Good news, I think I got to the bottom of this. curl behaves differently because it is only doing one streaming request. When two or more streaming requests are going through a single HTTP/2 connection, the ingress shuts down differently: it immediately sends GOAWAY (Go seems to ignore this) and then drops the connection (without a TLS close-notify or TCP FIN) ten seconds later. 30s later, Go's TCP keepalive notices this, but that doesn't seem to propagate through the http2 layer, so client-go doesn't see it and doesn't restart the watch. Upgrading nginx is sufficient to improve the behavior, although that won't help when the connection is dropped for real due to lost connectivity. I'll send the following changes for internal review:
@drigz good work 👍
Couldn't have figured it out without your help, thank you too!
We find quite a number of these errors in our cr-syncer logs.
The cr-syncer does not retry applying the change, as the error message suggests it should. If there are further update events from the downstream cluster, it will get back in sync. But if there are no more change events from downstream for that CR, it remains in an unsynced state.
When a CR is in this state, it does not even sync changes from the upstream cluster anymore.
It remains in this inconsistent state until the cr-syncer is restarted.
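For context, this is the kind of retry the conflict error message hints at; a minimal sketch using client-go's conflict-retry helper with the dynamic client (hypothetical names, not the cr-syncer's actual code path):

```go
// Sketch: re-read the CR and reapply the status whenever the apiserver
// returns 409 Conflict, instead of giving up after the first attempt.
package example

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/util/retry"
)

func updateStatusWithRetry(ctx context.Context, client dynamic.Interface, gvr schema.GroupVersionResource, namespace, name string, status interface{}) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		// Fetch the latest revision so the resourceVersion is current.
		current, err := client.Resource(gvr).Namespace(namespace).Get(ctx, name, metav1.GetOptions{})
		if err != nil {
			return err
		}
		if err := unstructured.SetNestedField(current.Object, status, "status"); err != nil {
			return err
		}
		_, err = client.Resource(gvr).Namespace(namespace).UpdateStatus(ctx, current, metav1.UpdateOptions{})
		return err
	})
}
```

Re-reading the object before each attempt refreshes the resourceVersion, which is exactly what the 409 conflict is complaining about.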