Fixing error handling for socket timeouts that occur due to a race-li… #87

Open
wants to merge 1 commit into
base: master
from

Conversation

StabbyCutyou

…ke condition

@bpot So, bear with me here...

During testing, we found that if the connection idles for long periods of time, you can hit an exception that is not handled correctly. It appears that, in the time between IO.select and @socket.write (and I'm assuming read as well, because why not), the socket actually times out. This causes an Errno::ETIMEDOUT to be raised, but not rescued.
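To make the race concrete, here is a minimal sketch of the pattern, not the actual code from this PR. The method name `write_with_timeout` and the error class `ConnectionFailedError` are hypothetical stand-ins; the only assumption carried over from the description is that the write is guarded by `IO.select` and that `Errno::ETIMEDOUT` can still surface from the write itself.

```ruby
require "socket"

# Hypothetical error class, standing in for whatever the client
# raises when a connection is no longer usable.
class ConnectionFailedError < StandardError; end

# Hypothetical helper: wait until the socket is writable, then write.
def write_with_timeout(socket, data, timeout)
  # IO.select returns nil if nothing became ready within the timeout.
  ready = IO.select(nil, [socket], nil, timeout)
  raise ConnectionFailedError, "timed out waiting for a writable socket" unless ready

  # The socket can still fail between the select above and this write,
  # so the Errno errors have to be rescued here as well.
  socket.write(data)
rescue Errno::ETIMEDOUT, Errno::ECONNRESET, Errno::EPIPE => e
  raise ConnectionFailedError, "socket failed during write: #{e.class}"
end
```

With a wrapper like this, the caller sees one error type for both the `IO.select` timeout and the late socket failure, and can decide in one place whether to reconnect and retry.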

I tried to run your integration specs but kept getting a file-not-found error after supplying the directory of my Kafka installation.

@StabbyCutyou
Author

I have a really dumb looking test script that I'm able to reliably reproduce the issue with.

https://gist.github.com/StabbyCutyou/e0050d3b8b12c7c42736

I seem to be able to get it every 3rd run of the loop, but your mileage may vary. You'll know it has happened when the Errno::ETIMEDOUT stack trace shows up. I have another branch with some extra logging that I could link you to; it dumps some info from the connection.rb class during each attempt to publish, and I used it to verify what was happening.

Again - super weird case, but one that I'm able to reproduce.

EDIT

The script switches from one message every twenty minutes to 100 messages, one every 100ms, to try to reproduce a behavior others had seen where the connection remained in a bad state for several writes, but I couldn't reproduce that. The script is definitely the result of some random testing approaches.

@coveralls

Coverage Status

Coverage remained the same at 92.35% when pulling 0b00f74 on Tapjoy:fix/missing_timeout_exception_handling into dd74d94 on bpot:master.
