Occasional failures to reconnect #257

Closed
dblock opened this issue Mar 5, 2019 · 57 comments

@dblock
Collaborator

dblock commented Mar 5, 2019

I have at least one instance where https://github.com/dblock/slack-sup disconnected a team and never reconnected it (nothing in the logs). This is with all the fixes in #208, on 0.14.1. Restarting the bot helps, obviously.

@dblock
Collaborator Author

dblock commented Mar 5, 2019

cc: @RodneyU215 we probably still have some bugs here

@RodneyU215
Collaborator

Thanks @dblock I'll take another look. Can you confirm which concurrency library you saw this occur with?

@dblock
Collaborator Author

dblock commented Mar 6, 2019

> Thanks @dblock I'll take another look. Can you confirm which concurrency library you saw this occur with?

Async, thanks

@dblock
Collaborator Author

dblock commented Apr 1, 2019

Didn't debug yet but saw this stack trace when this happened recently:

E, [2019-04-01T17:29:30.721649 #12318] ERROR -- : listen loop error: Broken pipe (Errno::EPIPE)
E, [2019-04-01T17:29:30.721745 #12318] ERROR -- : /app/vendor/ruby-2.3.1/lib/ruby/2.3.0/openssl/buffering.rb:379:in `syswrite_nonblock'
E, [2019-04-01T17:29:30.721828 #12318] ERROR -- : /app/vendor/ruby-2.3.1/lib/ruby/2.3.0/openssl/buffering.rb:379:in `write_nonblock'
E, [2019-04-01T17:29:30.721853 #12318] ERROR -- : /app/vendor/bundle/ruby/2.3.0/gems/async-io-1.18.5/lib/async/io/generic.rb:141:in `async_send'
E, [2019-04-01T17:29:30.721874 #12318] ERROR -- : /app/vendor/bundle/ruby/2.3.0/gems/async-io-1.18.5/lib/async/io/generic.rb:47:in `block in wrap_blocking_method'
E, [2019-04-01T17:29:30.721910 #12318] ERROR -- : /app/vendor/bundle/ruby/2.3.0/gems/async-websocket-0.7.0/lib/async/websocket/connection.rb:93:in `write'
E, [2019-04-01T17:29:30.721930 #12318] ERROR -- : /app/vendor/bundle/ruby/2.3.0/gems/websocket-driver-0.7.0/lib/websocket/driver/hybi.rb:228:in `send_frame'
E, [2019-04-01T17:29:30.721950 #12318] ERROR -- : /app/vendor/bundle/ruby/2.3.0/gems/websocket-driver-0.7.0/lib/websocket/driver/hybi.rb:191:in `frame'
E, [2019-04-01T17:29:30.721970 #12318] ERROR -- : /app/vendor/bundle/ruby/2.3.0/gems/websocket-driver-0.7.0/lib/websocket/driver/hybi.rb:154:in `close'
E, [2019-04-01T17:29:30.721989 #12318] ERROR -- : /app/vendor/bundle/ruby/2.3.0/gems/async-websocket-0.7.0/lib/async/websocket/connection.rb:97:in `close'
E, [2019-04-01T17:29:30.722009 #12318] ERROR -- : /app/vendor/bundle/ruby/2.3.0/gems/slack-ruby-client-0.14.1/lib/slack/real_time/concurrency/async.rb:62:in `close'
E, [2019-04-01T17:29:30.722029 #12318] ERROR -- : /app/vendor/bundle/ruby/2.3.0/gems/slack-ruby-client-0.14.1/lib/slack/real_time/client.rb:124:in `restart_async'
E, [2019-04-01T17:29:30.722049 #12318] ERROR -- : /app/vendor/bundle/ruby/2.3.0/gems/slack-ruby-client-0.14.1/lib/slack/real_time/client.rb:113:in `rescue in run_ping!'
E, [2019-04-01T17:29:30.722069 #12318] ERROR -- : /app/vendor/bundle/ruby/2.3.0/gems/slack-ruby-client-0.14.1/lib/slack/real_time/client.rb:107:in `run_ping!'
E, [2019-04-01T17:29:30.722089 #12318] ERROR -- : /app/vendor/bundle/ruby/2.3.0/gems/slack-ruby-client-0.14.1/lib/slack/real_time/concurrency/async.rb:25:in `block (2 levels) in start_async'
E, [2019-04-01T17:29:30.722108 #12318] ERROR -- : /app/vendor/bundle/ruby/2.3.0/gems/timers-4.3.0/lib/timers/timer.rb:104:in `fire'
E, [2019-04-01T17:29:30.722128 #12318] ERROR -- : /app/vendor/bundle/ruby/2.3.0/gems/timers-4.3.0/lib/timers/events.rb:49:in `fire'
E, [2019-04-01T17:29:30.722148 #12318] ERROR -- : /app/vendor/bundle/ruby/2.3.0/gems/timers-4.3.0/lib/timers/events.rb:90:in `fire'
E, [2019-04-01T17:29:30.722167 #12318] ERROR -- : /app/vendor/bundle/ruby/2.3.0/gems/timers-4.3.0/lib/timers/group.rb:100:in `fire'
E, [2019-04-01T17:29:30.722187 #12318] ERROR -- : /app/vendor/bundle/ruby/2.3.0/gems/timers-4.3.0/lib/timers/group.rb:85:in `wait'
E, [2019-04-01T17:29:30.722207 #12318] ERROR -- : /app/vendor/bundle/ruby/2.3.0/gems/async-1.15.1/lib/async/reactor.rb:152:in `run'
E, [2019-04-01T17:29:30.722226 #12318] ERROR -- : /app/vendor/bundle/ruby/2.3.0/gems/slack-ruby-client-0.14.1/lib/slack/real_time/concurrency/async.rb:28:in `block in start_async'

@RodneyU215
Collaborator

@dblock Thanks for posting those logs. I previously was unable to duplicate the problem, but these logs are super helpful. I'll try to dig back into this soon. I'm working on another release, but I should have more time by Monday.

@dblock
Collaborator Author

dblock commented Apr 2, 2019

@RodneyU215 So in this case the ping worker connection is closing and raising Errno::EPIPE, killing the ping worker. I see a couple of fixes:

Maybe @ioquatix can pitch in. What are the expectations of ::Async::WebSocket::Client#close raising errors? What will it raise potentially? Should we be capturing and ignoring these?

@ioquatix
Contributor

ioquatix commented Apr 3, 2019

So, those are good questions.

After looking at it, I wonder how ::Async::WebSocket::Client#close should work.

At the moment, it calls WebSocket::Driver#close, and the way that works is that it tries to write data to the network; but we are already at the point where the connection is dead, so it blows up.

The question is: should it also (or instead) invoke @socket.close?

The normal interpretation of close is probably @socket.close only.

But in this case, WebSocket has a close frame to gracefully close the connection... It's advisory, so we could probably ignore failures.

Happy to change this behaviour.

Thoughts?

@dblock
Collaborator Author

dblock commented Apr 3, 2019

I don't know the protocol very well, but it sounds like WebSocket::Driver#close is trying to write to the network to inform the server in a "nice to have" way. If failing to inform the server according to the protocol has no impact on the actual closed state of the driver, then it shouldn't blow up on the network write, and it makes sense to ignore failures.

@ioquatix

  1. which library should be doing the above for a proper fix?
  2. even if (1) is fixed in a dependency, maybe slack-ruby-client should handle EPIPE to solve the user problem (see the sketch below)? wdyt?
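
For illustration, here is a minimal sketch of what (2) could look like: a hypothetical close_quietly helper wrapping whatever close call the ping-recovery path makes. The helper name, the exception list and the logger are assumptions, not the library's actual code.

def close_quietly(connection, logger)
  connection.close
rescue Errno::EPIPE, Errno::ECONNRESET, IOError => e
  # The socket is already dead; the close frame is only advisory, so log and move on.
  logger.warn("ignoring #{e.class} while closing a dead connection: #{e.message}")
end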

@ioquatix
Contributor

ioquatix commented Apr 4, 2019

I think the difference is between a graceful shutdown and a close. They are two different things.

There is one situation where this made sense to me. In HTTP/1, with keep-alive, the remote end might silently close the connection due to a timeout. Generally, there is no way to know the connection has failed in this way, except that when you send a request and wait for a response, the socket might reach EOF. The problem with this is: let's say you are a proxy, and you are sending a non-idempotent request (e.g. a credit card payment POST). If you knew the connection had timed out, even though it still seems open, you would close it and make a new connection for the request. But HTTP/1 doesn't have a way for the server to indicate the connection has timed out. So the client sends the request and it basically fails immediately, and because it's non-idempotent, the client doesn't know whether the server processed it or not (it could have received the request and failed for some other reason).

HTTP/2 fixes this by having the GOAWAY frame. When the connection times out, GOAWAY is issued and the client knows not to send any more requests. It's still a little bit racy, but it's better than nothing.

I guess the point of WebSockets having an explicit close frame is similar to the above issue - HTTP/1 doesn't have any explicit timeout, and people expect WebSockets to live for a long time, potentially. If the HTTP/1 server wants to close the connection, it can do so gracefully. I guess, it's less useful for the client to send such a message though, since it can simply close the connection.

In terms of how we handle this, I think it's confusing to conflate a CloseEvent and the close(fd) system call, as hopefully now you can see they serve different purposes, and I don't even know if it's something that should be done automatically - e.g. GOAWAY may be sent explicitly but the connection is kept open to service existing requests (HTTP/2 multiplexing), while no new requests are accepted, etc. The same might apply to WebSocket - a CloseEvent sent by one end might not actually signal the complete termination of the connection, just an intention to do so. Because the network has latency, the client may still be sending messages before receiving the close event...

That all being said, probably the best way to handle this is as follows:

  • If the user explicitly wants to close the connection, invoke WebSocket::Driver#close (via Async::WebSocket::Connection#close), then exit the event handling loop, which naturally has an ensure socket.close somewhere that closes the underlying IO.
  • If an error occurs, the user shouldn't be invoking WebSocket::Driver#close; instead, the error propagation should hit ensure socket.close, which closes the file descriptor only.

So, something like this:

def handle_connection(websocket)
	# Message loop
	while message = websocket.next_message
		process(message) or break
	end
	
	# This is invoked if the code is gracefully shutting down:
	websocket.close
end

def accept(peer)
	handle_connection(WebSocket::Driver.new(peer))
ensure
	peer.close
end

There is one thing to consider in this code. In Async, if an exception propagates up to accept, it is logged and the task is finished. So, it's not a problem if websocket.close fails due to some kind of error. The most important thing is to call peer.close, because otherwise a file descriptor would be leaked. Whether websocket.close works or not is less important.

@dblock
Collaborator Author

dblock commented Apr 5, 2019

Opened another issue with a stack trace, but unsure whether it's fatal: #260. Definitely seeing failures to reconnect around that time.

@dblock
Collaborator Author

dblock commented Apr 5, 2019

I PRed #261; honestly, I am not sure it's a good idea. I would appreciate a better fix from you @ioquatix, but it's above my pay grade I think ;)

I'm also no longer observing EPIPE errors in my bots.

dblock added a commit to dblock/slack-ruby-client that referenced this issue Apr 8, 2019
@RodneyU215
Collaborator

@dblock I'm just getting caught back up on this. @ioquatix's PR looks great! 😍🎉

dblock added a commit that referenced this issue Apr 8, 2019
@dblock
Collaborator Author

dblock commented Apr 8, 2019

> @dblock I'm just getting caught back up on this. @ioquatix's PR looks great! 😍🎉

Someone at SlackHQ should send @ioquatix some socks at least!

@dblock
Collaborator Author

dblock commented Apr 25, 2019

I'm seeing production bots not reconnecting again, but no errors. Reopening for now in case anyone else is seeing this and will debug more.

@dblock dblock reopened this Apr 25, 2019
@ioquatix
Contributor

So, ping is working but the bot is disconnected?

@dblock
Collaborator Author

dblock commented Apr 25, 2019

I can't tell. The volume of connect/disconnects is too high and I couldn't confirm whether the disconnect is detected or not for a team that ends up in a disconnected state. I opened #266 to be able to match up the logs between a connect and disconnect.

I can confirm that in the last couple of days my largest production instance has ended up with several teams in a disconnected state, while other teams are doing fine; I know because customers have complained.

@RodneyU215
Collaborator

Oh geez, it's back?!? I'm tied up at Pycon, but I'll see if I can get some others to at least help with #266. As soon as I can, I'll jump back in.

@dblock
Collaborator Author

dblock commented May 6, 2019

> Oh geez, it's back?!? I'm tied up at Pycon, but I'll see if I can get some others to at least help with #266. As soon as I can, I'll jump back in.

I do still see this; it's much rarer now, but still.

@dblock
Collaborator Author

dblock commented May 17, 2019

I am seeing a customer report of this once a day now.

If someone wants to help, my plan is to put back a ping worker that can identify broken connections and dump their state. Please feel free to jump in.
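
For anyone picking this up, here is a hedged sketch of what such an external watchdog might look like. Everything in it is an assumption for illustration: the bots collection, its token/user_id/name/rtm_client accessors, the presence check via the Web API, and the one-minute interval; it is not existing slack-ruby-bot-server code.

require 'slack-ruby-client'
require 'logger'

# Hypothetical watchdog: for each running bot, ask the Web API whether the bot user still
# looks active, and dump the real-time client's state when it doesn't.
def watch(bots, logger: Logger.new(STDOUT), interval: 60)
  loop do
    bots.each do |bot|
      begin
        web = Slack::Web::Client.new(token: bot.token)
        presence = web.users_getPresence(user: bot.user_id).presence
        logger.warn("#{bot.name}: presence=#{presence}, rtm=#{bot.rtm_client.inspect}") unless presence == 'active'
      rescue Slack::Web::Api::Errors::SlackError => e
        logger.error("#{bot.name}: #{e.message}")
      end
    end
    sleep(interval)
  end
end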

@ioquatix
Contributor

Does anyone know if the slack API supports HTTP/2 web sockets?

@ioquatix
Contributor

Here are the two examples.

https://github.com/socketry/async-slack/tree/master/examples/alive

The Slack-level ping (slack.rb) will run indefinitely.

The WebSocket-level ping (websocket.rb) will systematically die after 3 minutes with a pretty nasty failure (EOF). Normally for HTTP/2, the GOAWAY frame is used to gracefully close the underlying connections, but this is pretty much just a close(fd) on the remote end.

@ioquatix
Contributor

You need to set SLACK_TOKEN to a valid token to run those examples. Please try them and report back (everyone included!).

@ioquatix
Contributor

Okay, I've integrated the slack pinger into async-slack. Maybe that works around the problem, at least in my case:

https://github.com/socketry/async-slack/blob/2f32b081e0f80122d5de2ca35e0085bb1c18a255/lib/async/slack/real_time.rb#L47-L57

It would be great to have someone else check this, especially someone from Slack.
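
For reference, the linked keep-alive boils down to something like the sketch below. This is a paraphrase rather than a copy; in particular, connection.write taking a hash and serializing it to a JSON text frame is an assumption about async-slack's internals.

require 'async'

# Slack-level keep-alive: send {type: "ping"} periodically and let the server answer with
# {type: "pong"}; if the write ever fails, the connection is known to be dead.
def run_pinger(connection, interval: 30, parent: Async::Task.current)
  parent.async do |task|
    sequence = 0
    loop do
      connection.write(type: 'ping', id: sequence += 1)
      task.sleep(interval)
    end
  end
end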

@dblock
Collaborator Author

dblock commented Jun 22, 2019

@RodneyU215 and @Roach work at Slack, but I bet they are still at the post-IPO party :)

@dblock
Collaborator Author

dblock commented Jun 22, 2019

We should keep this thread to the implementation in slack-ruby-client. The last changes in #262 made things a lot better, but our code here still doesn't handle keeping the connection alive in all cases, otherwise I wouldn't be seeing disconnects. Again, I didn't debug, but I am happy to outline what I would do if I were committed to fixing this.

  • Get a production, long running bot
  • Add a million lines of logging in async.rb and client.rb in slack-ruby-client to trace the connection, especially here.
  • Add an external worker ping: modify slack-ruby-bot-server#103 ("Revert #93, bring back the ping worker") so that it does not close the connection when the ping dies, in order to identify the moment the problem occurs (slack-ruby-client thinks everything is cool, but Slack status says otherwise).
  • When the external worker says that the client has been disconnected for, say, a minute, sound alarms, look at the logs and try to make sense of it.

@ioquatix
Contributor

All I can say is that, based on my testing (which those two example scripts reproduce), using WebSocket-level keep-alive does not work.

So are you using that, or are you using the Slack-level ping/pong?

@dblock
Collaborator Author

dblock commented Jun 22, 2019

We use Slack ping. But I don't think the answer to this matters, because the problem is that we don't reconnect 100% of the time when the connection is dropped from the Slack side, not how often this happens.

@ioquatix
Contributor

Okay that looks good, so at least it won’t be the problem I described above. I’m still a bit perplexed why websocket ping won’t work, but clearly the issue is elsewhere.

@ioquatix
Contributor

ioquatix commented Jun 22, 2019

> We should keep this thread to the implementation in slack-ruby-client.

Sorry if this came across the wrong way. I tried reworking slack-ruby-client, but the design is just too complicated and I gave up. I wanted to confirm where the problem was by a process of reduction - it only took an hour to whack out the examples, and now at least we've got confirmation that using Slack's ping/pong messages should be sufficient, and that using WebSocket ping/pong frames is definitely not sufficient and/or buggy.

@dblock
Collaborator Author

dblock commented Jun 22, 2019

Thanks for jumping in @ioquatix !

@ioquatix
Contributor

I want to run the keep-alive example for 48 hours. I want to see if it stays connected that entire time or not.

Out of curiosity, do you have any operation that could block the run loop for more than 2-3 minutes? Because maybe that could cause the disconnect: if the ping frame can't be sent for some reason, then it would disconnect.
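
To illustrate the concern, here is a hedged, self-contained sketch (the 5-second ping and the 180-second blocking call are made-up numbers):

require 'async'

Async::Reactor.run do |task|
  task.async do |pinger|
    loop do
      puts "ping #{Time.now}"
      pinger.sleep(5)   # cooperative sleep: other tasks keep running
    end
  end

  task.async do
    sleep(180)          # Kernel#sleep blocks the whole reactor: no pings go out for
                        # three minutes, and the server is likely to drop the connection
  end
end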

@dblock
Collaborator Author

dblock commented Jun 23, 2019

> Out of curiosity, do you have any operation that could block the run loop for more than 2-3 minutes?

Definitely not. Maybe 1 second max.

@ioquatix
Contributor

Okay, so I've been running this for almost 2 days now:

2821m11s     info: main [pid=11301]
                 | Sending Slack Ping Frame 33827...
2821m11s     info: main [pid=11301]
                 | message: {:type=>"pong", :reply_to=>33827}
2821m16s     info: main [pid=11301]
                 | Sending Slack Ping Frame 33828...
2821m16s     info: main [pid=11301]
                 | message: {:type=>"pong", :reply_to=>33828}

Not a single issue; the connection has been rock solid. Bear in mind, async-slack is almost an entirely fresh implementation: it's not using websocket-driver and it's not running over HTTP/1.

I'll leave it running until it bombs.

So, I find it hard to believe that, in normal circumstances, a bot would disconnect randomly. At the very least, something serious must go wrong. There needs to be either:

  1. Some kind of network failure - e.g. the TCP connection is interrupted.
  2. Some kind of framing failure - the WebSocket connection is corrupted somehow. Maybe there is a bug in websocket-driver, for example.
  3. Some kind of application logic failure.

I suggest the simplest way to rule out some of the above is to make a super simple example script using websocket-driver that just pings the Slack service. Maybe some protocol-level difference is causing the problem. At least if you rule that out, you know the problem is somewhere in the application layer of the code.
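
Something along these lines, perhaps. This is a rough, hedged sketch rather than a tested reproduction: it assumes SLACK_TOKEN in the environment and a 30-second Slack-level ping, the TLS read loop is simplified, and there is almost no error handling.

require 'socket'
require 'openssl'
require 'json'
require 'net/http'
require 'websocket/driver'

token = ENV.fetch('SLACK_TOKEN')
rtm = JSON.parse(Net::HTTP.get(URI("https://slack.com/api/rtm.connect?token=#{token}")))
uri = URI(rtm.fetch('url'))

tcp = TCPSocket.new(uri.host, 443)
tls = OpenSSL::SSL::SSLSocket.new(tcp)
tls.hostname = uri.host
tls.connect

# websocket-driver only needs an object that responds to #url and #write.
Endpoint = Struct.new(:url, :io) do
  def write(data)
    io.write(data)
  end
end

driver = WebSocket::Driver.client(Endpoint.new(uri.to_s, tls))
id = 0

driver.on(:open)    { puts 'connected' }
driver.on(:message) { |event| puts "message: #{event.data}" }
driver.on(:close)   { |event| puts "closed: #{event.code} #{event.reason}"; exit }

driver.start

last_ping = Time.now
loop do
  driver.parse(tls.readpartial(4096)) if IO.select([tls], nil, nil, 1)
  if Time.now - last_ping > 30
    driver.text(JSON.dump(type: 'ping', id: id += 1)) # Slack-level ping, answered with a pong
    last_ping = Time.now
  end
end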

@ioquatix
Contributor

One other issue I ran into recently with async-http was the lack of a semaphore around writing frames. Because of that, it became possible for multiple tasks to try and write frames and for those frames to become interleaved.

I don't know if this is an issue with async-websocket as used by this code, but we could try adding a semaphore around the writer, e.g.

require 'async/semaphore'

# Only one task may hold the semaphore, so frames are written to the socket one at a time.
@write_guard = Async::Semaphore.new(1)

def write_frame(frame)
  @write_guard.acquire do
    super(frame)
  end
end

This ensures that only one task can write to the network at a time.

If you don't have something like this and multiple tasks are writing frames, you might get one frame partially written, then a different frame, then the rest of the original frame: chaos.

@ioquatix
Contributor

Now that async-websocket has stabilised, I can probably update the implementation here.

@dblock
Collaborator Author

dblock commented Jun 25, 2019

@ioquatix any help will be appreciated, I am just cheerleading

@dblock
Collaborator Author

dblock commented Jul 20, 2019

Alright, I was able to get a very simple repro of at least one obvious scenario by pulling the network cable in #280. A naive fix is in #279; I can't think of anything wrong with it, but maybe @ioquatix could share his $.02?

@dblock
Collaborator Author

dblock commented Jul 21, 2019

Anyone still having this problem, please give the code in master a try. I will be releasing it as 0.14.3.

@ioquatix
Contributor

That's awesome!

@MarioRuiz

MarioRuiz commented Jul 26, 2019

For the past few days I have been having another reconnecting issue that I didn't experience before:

I, [2019-07-26T13:32:23.720843 #3377]  INFO -- : Connection closing, exiting. 2019-07-26 13:32:23 +0000
I, [2019-07-26T13:32:23.721218 #3377]  INFO -- : Connection has been disconnected. 2019-07-26 13:32:23 +0000
I, [2019-07-26T13:32:25.245729 #3377]  INFO -- : Successfully connected, welcome 'smart-bot'. 
I, [2019-07-26T13:34:23.807918 #3377]  INFO -- : Connection closing, exiting. 2019-07-26 13:34:23 +0000
I, [2019-07-26T13:34:23.808461 #3377]  INFO -- : Connection has been disconnected. 2019-07-26 13:34:23 +0000
I, [2019-07-26T13:34:25.312489 #3377]  INFO -- : Successfully connected, welcome 'smart-bot'.
I, [2019-07-26T13:36:23.910178 #3377]  INFO -- : Connection closing, exiting. 2019-07-26 13:36:23 +0000
I, [2019-07-26T13:36:23.910503 #3377]  INFO -- : Connection has been disconnected. 2019-07-26 13:36:23 +0000
I, [2019-07-26T13:36:25.401463 #3377]  INFO -- : Successfully connected, welcome 'smart-bot'.
I, [2019-07-26T13:39:24.049872 #3377]  INFO -- : Connection closing, exiting. 2019-07-26 13:39:24 +0000
I, [2019-07-26T13:39:24.050466 #3377]  INFO -- : Connection has been disconnected. 2019-07-26 13:39:24 +0000
I, [2019-07-26T13:39:25.396382 #3377]  INFO -- : Successfully connected, welcome 'smart-bot'.

Is this the same issue, and will it be solved with 0.14.3?
It is regularly disconnecting and connecting again.
In the end, I get on_data: #<struct WebSocket::Driver::CloseEvent code=1001, reason="server closed connection">

@dblock
Collaborator Author

dblock commented Jul 26, 2019

When you say "in the end", it stops reconnecting then?

@MarioRuiz

I mean that after retrying a few times and reconnecting, the execution is aborted and I get #. I talked to Slack and they are also taking a look.

@dblock
Collaborator Author

dblock commented Jul 29, 2019

0.14.4 has been running flawlessly on my production bots for a while, closing this

@dblock dblock closed this as completed Jul 29, 2019
@dblock
Collaborator Author

dblock commented Jul 29, 2019

@MarioRuiz I think your problem is different.

  1. Upgrade to 0.14.4 so we can talk about the same code.
  2. Open a new issue when you see it reproduced with all the latest logs.

@MarioRuiz

I think it was related, since I updated from 0.14.3 to 0.14.4 and the issue seems to be fixed (for the moment); if I see it reproduced, I'll open a new issue.
I talked to Slack and they didn't see any disconnections in their logs, but they did see the connections being made again; so in the end, the previous connection was still open, and that is what caused the rate limits to be reached. Is it necessary to force a disconnection when the close event is fired?

@MarioRuiz

Since I got rid of Celluloid and started using async-websocket, the disconnections are still there, but now the library manages to resolve the issue and reconnect successfully.
