Disconnects from cloud #99
Comments
Running the 0.21.3 branch with the heartbeat code. Connected fine. Ran at DEBUG log level for several hours, then switched to INFO to surface errors. Load on the broker is next to nothing. Workers are polling at 30-second intervals, with ~12 workers connected.
When the cluster reschedule finished, the client came back up and serviced tasks.
The 12:06 and 12:10 events were predicted, so there is a pattern.
Running the client against a local broker in Docker, against a remote broker in the same data centre (AWS US East), or against a K8s cluster via port-forwarding (AWS US East to GKE AU South East) produces none of these errors. Therefore it is either the proxy or the broker config on Camunda Cloud.
It's still disconnecting and requiring a reboot. Try this: if the channel is down for a set amount of time, destroy and recreate the channel.
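A minimal sketch of that recreate-on-stale idea, assuming zeebe-node's `ZBClient` with `topology()` as a health probe and `close()` for teardown; the gateway address, threshold, and probe interval are illustrative, not from this thread:

```typescript
import { ZBClient } from 'zeebe-node'

const GATEWAY = 'my-cluster.zeebe.camunda.io:443' // hypothetical address
const STALE_AFTER_MS = 90_000 // assumed threshold for "down for a set amount of time"
const PROBE_EVERY_MS = 15_000

let zbc = new ZBClient(GATEWAY)
let lastHealthy = Date.now()

// Periodically probe the command channel; if it has been failing for longer
// than the threshold, destroy the client and create a fresh one.
// Any workers created on the old client would need to be recreated as well.
setInterval(async () => {
  try {
    await zbc.topology() // lightweight gateway call used as a health probe
    lastHealthy = Date.now()
  } catch (err) {
    if (Date.now() - lastHealthy > STALE_AFTER_MS) {
      await zbc.close()           // tear down the stale channel
      zbc = new ZBClient(GATEWAY) // recreate it from scratch
      lastHealthy = Date.now()
    }
  }
}, PROBE_EVERY_MS)
```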
It looks like the worker channels are durable, but the client channel used to send commands becomes stale. At the moment you can't reliably inspect the state of the client channel because the worker channel state bubbles up through it. Will change that behaviour in #109.
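For illustration only, a sketch of the kind of per-channel state separation #109 is aiming at; the names are hypothetical and not the library's actual internals:

```typescript
// Hypothetical per-channel state tracking: the command channel and each
// worker channel report their own status, so a stale command channel is
// visible even while the worker channels are healthy.
type ChannelState = 'CONNECTED' | 'DISCONNECTED' | 'UNKNOWN'

class ChannelStateRegistry {
  private states = new Map<string, ChannelState>()

  set(channelId: string, state: ChannelState) {
    this.states.set(channelId, state)
  }

  get(channelId: string): ChannelState {
    return this.states.get(channelId) ?? 'UNKNOWN'
  }
}

const registry = new ChannelStateRegistry()
registry.set('command-channel', 'DISCONNECTED')
registry.set('worker-channel-1', 'CONNECTED')

// The command channel's state is readable on its own, rather than being
// masked by healthy worker channels bubbling their state up.
console.log(registry.get('command-channel')) // 'DISCONNECTED'
```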
Any news on this?
Are you seeing this issue in production? I would be surprised if you see it with Lambdas. It seems to affect long-lived connections.
The Camunda Cloud team are now using this in production for their own systems and are looking into the source of these issues.
Looked into this today with @ColRad. We believe nginx receives the keepalive but doesn't pass it to the backend. Because the nginx <-> backend connection has no data on it, it's killed after 60s by the idle timeout.
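Since an HTTP/2 keepalive ping is answered by the proxy and never reaches the broker, one client-side mitigation is to generate real end-to-end traffic inside the 60s idle window; a minimal sketch, assuming zeebe-node's `topology()` call and an illustrative interval:

```typescript
import { ZBClient } from 'zeebe-node'

const zbc = new ZBClient('my-cluster.zeebe.camunda.io:443') // hypothetical address

// A keepalive ping terminates at nginx, but a real RPC travels the whole
// path and resets the idle timer on the nginx <-> backend connection.
// Probe well inside the 60s idle window.
setInterval(() => {
  zbc.topology().catch(err => {
    console.error('keepalive probe failed:', err.message)
  })
}, 45_000)
```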
Does https://trac.nginx.org/nginx/ticket/1555 sound familiar?
Any chance you could put a debug counter on the requests?
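A sketch of what such a debug counter could look like, as a wrapper around the calls the client makes; the helper and names are illustrative, not an existing zeebe-node feature:

```typescript
// Hypothetical request counter: wrap an async client method so every call
// (and every failure) increments a named counter that is logged periodically.
const counters: Record<string, { calls: number; errors: number }> = {}

function counted<A extends unknown[], R>(
  name: string,
  fn: (...args: A) => Promise<R>
): (...args: A) => Promise<R> {
  counters[name] = { calls: 0, errors: 0 }
  return async (...args: A) => {
    counters[name].calls++
    try {
      return await fn(...args)
    } catch (err) {
      counters[name].errors++
      throw err
    }
  }
}

// Example: count topology probes on a zeebe-node client instance.
// const probeTopology = counted('topology', () => zbc.topology())

setInterval(() => console.log('request counters:', counters), 60_000)
```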
Fixed in 0.23.0.
@jwulf thanks for the fix! What was the root cause?
Since Camunda Cloud went to Zeebe 0.21.1, this happens every day:
The client reports it is connected, but does not retrieve any jobs.
Could this be due to pod rescheduling?