Occasional failures to reconnect #257
cc: @RodneyU215 we probably still have some bugs here |
Thanks @dblock I'll take another look. Can you confirm which concurrency library you saw this occur with? |
Async, thanks |
Didn't debug yet but saw this stack trace when this happened recently:
|
@dblock Thanks for posting those logs. I previously was unable to duplicate the problem, but these logs are super helpful. I'll try to dig back into this soon. I'm working on another release, but I should have more time by Monday. |
@RodneyU215 So in this case the ping worker connection is closing and raising
Maybe @ioquatix can pitch in. What are the expectations of |
So, those are good questions. After looking at it, I wonder how At the moment, it calls The question is, should it (and/or) invoke The normal interpretation of close is probably But in this case, WebSocket has a Happy to change this behaviour. Thoughts? |
I don't know the protocol very well, but it sounds that
|
I think the difference is between a graceful shutdown and a close. They are two different things.

There is one situation where this made sense to me. In HTTP/1, with keep-alive, the remote end might silently close the connection due to timeout. There is generally no way to know if the connection has failed in this way, except that when you send a request and wait for a response, the socket might reach EOF. The problem with this is, let's say you are a proxy, and you are sending a non-idempotent request (e.g. credit card payment POST). If you knew the connection had timed out but it still seemed open, you would close it and make a new connection for the request. But HTTP/1 doesn't have a way for the server to indicate the connection has timed out. So the client sends the request and it basically fails immediately, and because it's non-idempotent, the client doesn't know if the server processed it or not (it could have received the request and failed for some other reason).

HTTP/2 fixes this by having the GOAWAY frame. When the connection times out, GOAWAY is issued and the client knows not to send any more requests. It's still a little bit racy but it's better than nothing.

I guess the point of WebSockets having an explicit close frame is similar to the above issue: HTTP/1 doesn't have any explicit timeout, and people expect WebSockets to live for a long time, potentially. If the HTTP/1 server wants to close the connection, it can do so gracefully. I guess it's less useful for the client to send such a message though, since it can simply close the connection.

In terms of how we handle this, I think it's confusing to have a That all being said, probably the best way to handle this is as follows:
So, something like this:

def handle_connection(websocket)
  # Message loop
  while message = websocket.next_message
    process(message) or break
  end

  # This is invoked if the code is gracefully shutting down:
  websocket.close
end

def accept(peer)
  handle_connection(WebSocket::Driver.new(peer))
ensure
  peer.close
end

There is one thing to consider in this code. In |
Opened another issue with a stack trace, but unsure whether it's fatal: #260. Definitely seeing failures to reconnect around that time. |
I'm seeing production bots not reconnecting again, but no errors. Reopening for now in case anyone else is seeing this and will debug more. |
So, ping is working but the bot is disconnected? |
I can't tell. The volume of connects/disconnects is too high and I couldn't confirm whether the disconnect is detected or not for a team that ends up in a disconnected state. I opened #266 to be able to match up the logs between a connect and disconnect. I can confirm, because customers have complained, that in the last couple of days my largest production instance has gotten several teams into a disconnected state, while other teams are doing fine. |
Oh geez, it's back?!? I'm tied up at Pycon, but I'll see if I can get some others to at least help with #266. As soon as I can, I'll jump back in. |
I do see this still, much rarer now, but still. |
I am seeing a customer report of this once a day now. If someone wants to help, my plan is to put back a ping worker that can identify broken connections and dump their state. Please feel free to jump in. |
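To make the "identify broken connections and dump their state" idea concrete, here is a minimal sketch. It is not slack-ruby-client code: `LivenessTracker`, its `timeout:` and `clock:` parameters, and the shape of the `state` hash are all hypothetical names chosen for illustration. The assumption is that the client calls `touch` whenever it sees a pong (or any inbound frame) and that a ping worker periodically checks `stale?`.

```ruby
# Hypothetical liveness tracker; not part of slack-ruby-client.
# Assumes the caller invokes #touch on every pong or inbound frame.
class LivenessTracker
  def initialize(timeout:, clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @timeout = timeout      # seconds of silence tolerated before flagging
    @clock = clock          # injectable so the logic can be tested without sleeping
    @last_seen = @clock.call
  end

  # Record that the remote end was heard from just now.
  def touch
    @last_seen = @clock.call
  end

  # Seconds since the remote end was last heard from.
  def idle_for
    @clock.call - @last_seen
  end

  # True once the connection has been silent for longer than the timeout.
  def stale?
    idle_for > @timeout
  end

  # Snapshot suitable for logging when a connection looks broken.
  def state
    { idle_for: idle_for, timeout: @timeout, stale: stale? }
  end
end

# Usage with a fake clock so the example runs instantly.
now = 0.0
tracker = LivenessTracker.new(timeout: 30, clock: -> { now })
now = 10.0
fresh = tracker.stale?       # 10s of silence, within the timeout
now = 45.0
stale = tracker.stale?       # 45s of silence, past the timeout
snapshot = tracker.state
tracker.touch
recovered = tracker.stale?   # a pong arrived, so the timer is reset
```

A ping worker built on something like this would log `state` and force a reconnect when `stale?` is true, which would also give the log lines needed to match a disconnect to a specific team.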
Does anyone know if the slack API supports HTTP/2 web sockets? |
Here are the two examples. https://github.com/socketry/async-slack/tree/master/examples/alive The slack level ping ( The websocket level ping ( |
You need to set |
Okay, I've integrated the slack pinger into Would be great to have someone else check this. Especially someone from slack. |
@RodneyU215 and @Roach work at Slack, but I bet they are still at the post-IPO party :) |
We should keep this thread to the implementation in slack-ruby-client. The last changes in #262 made things a lot better, but our code here still doesn't handle keeping the connection alive in all cases, otherwise I wouldn't be seeing disconnects. Again, I didn't debug, but I am happy to outline what I would do if I were committed to fixing this.
|
All I can say is that, based on my testing (which those two example scripts reproduce), using websocket level keep alive does not work. So are you using that or are you using the slack level ping/pong? |
We use Slack ping. But I don't think the answer to this matters because the problem is that we don't reconnect 100% when the connection is dropped from Slack side, not how often this happens. |
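For readers following along, the "always reconnect after a drop" requirement can be sketched as a retry loop with capped exponential backoff. This is only an illustration under stated assumptions, not how slack-ruby-client reconnects: `connect_with_retry`, its keyword arguments, and the choice of `IOError`/`SystemCallError` as the "connection dropped" signals are all hypothetical.

```ruby
# Hypothetical helper: retries the given block with capped exponential backoff.
# The sleeper is injectable so the example runs without real delays.
def connect_with_retry(max_attempts:, base_delay: 1.0, max_delay: 30.0, sleeper: ->(s) { sleep(s) })
  attempt = 0
  begin
    attempt += 1
    yield attempt
  rescue IOError, SystemCallError
    # Give up (re-raising the original error) once attempts are exhausted.
    raise if attempt >= max_attempts
    delay = [base_delay * (2**(attempt - 1)), max_delay].min
    sleeper.call(delay)
    retry
  end
end

# Usage: the first two attempts fail, the third succeeds.
delays = []
result = connect_with_retry(max_attempts: 5, sleeper: ->(s) { delays << s }) do |attempt|
  raise IOError, "connection dropped" if attempt < 3
  :connected
end
```

The important property for this issue is that every path out of the loop is either a live connection or a raised error that gets logged; a silent exit is what leaves a team disconnected with nothing in the logs.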
Okay that looks good, so at least it won’t be the problem I described above. I’m still a bit perplexed why websocket ping won’t work, but clearly the issue is elsewhere. |
Sorry if this came across the wrong way. I tried reworking |
Thanks for jumping in @ioquatix ! |
I want to run the keep-alive example for 48 hours, to see whether it stays connected that entire time or not. Out of curiosity, do you have any operation that could block the run loop for more than 2-3 minutes? That might cause a disconnect: if a ping frame can't be sent for some reason, the connection would be dropped. |
Definitely not. Maybe 1 second max. |
Okay, so I've been running this for almost 2 days now:
Not a single issue, the connection has been rock solid. Bear in mind, I'll leave it running until it bombs. So, I find it hard to believe that in normal circumstances, a bot would disconnect randomly. At least, something serious must go wrong. There either needs to be:
I suggest the simplest way to rule out some of the above is by making a super simple example script using |
One other issue I ran into recently with I don't know if this is an issue with async-websocket as used by this code, but we could try adding a semaphore around the writer, e.g.
This ensures that only one task can write to the network at a time. If you don't have something like this and multiple tasks are writing frames, you might get one frame partially written, then a different frame, then more of the original frame: chaos. |
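The snippet from the original comment did not survive here, so the following is a stand-in sketch of the same idea rather than the author's code. It uses Ruby's stdlib `Thread` and `Mutex` instead of async primitives, and `SerializedWriter`, `write_frame`, and the `<id:payload>` frame format are hypothetical. The point it illustrates is that all chunks of one frame are written while holding the lock, so frames from different writers cannot interleave.

```ruby
require "stringio"

# Hypothetical wrapper that serializes whole-frame writes to a shared IO.
# Stand-in for a semaphore around the websocket writer; not async-websocket API.
class SerializedWriter
  def initialize(io)
    @io = io
    @lock = Mutex.new
  end

  # Writes every chunk of one frame while holding the lock.
  def write_frame(chunks)
    @lock.synchronize do
      chunks.each do |chunk|
        @io.write(chunk)
        Thread.pass # invite a context switch mid-frame to show the lock holds
      end
    end
  end
end

# Usage: four writers share one "socket" (a StringIO here).
io = StringIO.new
writer = SerializedWriter.new(io)
threads = 4.times.map do |i|
  Thread.new { 25.times { writer.write_frame(["<#{i}:", "payload", ">"]) } }
end
threads.each(&:join)

# Every frame should appear intact, with nothing left over between frames.
frames = io.string.scan(/<\d:payload>/)
```

Without the `synchronize` block, another writer could run between chunks and produce output such as `<0:<1:payload>payload>`, which is the kind of corruption described above.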
Now that |
@ioquatix any help will be appreciated, I am just cheerleading |
Anyone still having this problem, please give the code in |
That's awesome! |
Since a few days ago I am having another Reconnecting issue that I didn't experience before:
Is this the same issue and will be solved with 0.14.3? |
When you say "in the end", it stops reconnecting then? |
I mean after retrying a few times and reconnecting the execution is aborted and I get #. I talked to slack and they are also taking a look |
0.14.4 has been running flawlessly on my production bots for a while, closing this |
@MarioRuiz I think your problem is different.
|
I think it was related, since I updated from 0.14.3 to 0.14.4 and the issue seems to be fixed (for the moment). If I see it reproduced I will open a new issue. |
Since I got rid of celluloid and started using async-websocket, the disconnections are still there, but now the library manages to resolve the issues and reconnect successfully |
I have at least one instance where https://github.com/dblock/slack-sup disconnected a team and never reconnected it (nothing in logs). This is with all the fixes in #208 with 0.14.1. Restarting the bot helps obviously.