frequent socket hangups #116

Closed
tj opened this issue Nov 13, 2012 · 51 comments

@tj
Contributor

tj commented Nov 13, 2012

<3 node

Error: socket hang up
    at createHangUpError (http.js:1264:15)
    at CleartextStream.socketCloseListener (http.js:1315:23)
    at CleartextStream.EventEmitter.emit [as emit] (events.js:126:20)
    at SecurePair.destroy (tls.js:938:22)
    at process.startup.processNextTick.process._tickCallback [as _tickCallback] (node.js:244:9)
---------------------------------------------
    at registerReqListeners (/home/vagrant/projects/thumbs/node_modules/knox/lib/client.js:38:7)
    at Client.Client.putStream [as putStream] (/home/vagrant/projects/thumbs/node_modules/knox/lib/client.js:264:3)
    at Client.putFile (/home/vagrant/projects/thumbs/node_modules/knox/lib/client.js:232:20)
    at Object.oncomplete (fs.js:297:15)

looking into it, seems ridiculous to accuse s3 here, but it wouldn't surprise me either way

@domenic
Contributor

domenic commented Nov 13, 2012

Yeah we've run into this very sporadically as well. Not really sure where the blame goes either.

@tj
Contributor Author

tj commented Nov 13, 2012

my gut says node, because this would be completely unacceptable availability for such a service, 1/5 concurrent requests fails, but if it really is s3's backlog denying connections or similar then.. wtf.. lol I know some kernels will silently drop denied sockets without any notice so it could be that

@domenic
Contributor

domenic commented Dec 25, 2012

I just found a "solution" on the nodejs mailing list that I'd never seen before:

https://groups.google.com/d/msg/nodejs/kYnfJZeqGZ4/uHVOfFneroAJ

If someone gets this reproducibly it's worth trying some of those fixes.

@tj
Contributor Author

tj commented Dec 25, 2012

forgot about this, closing until we're running this portion in prod because it might just be my crappy local canadian connection haha

@tj tj closed this as completed Dec 25, 2012
@tj
Contributor Author

tj commented Feb 8, 2013

still a problem in prod, either knox is busted, or node is busted. I'll try and take a closer look at the packets soon and see wtf is going on

@tj tj reopened this Feb 8, 2013
@domenic
Contributor

domenic commented Feb 15, 2013

Some of the recent changes in Node 0.8.20 look related; might be worth giving it a shot.

@tj
Contributor Author

tj commented Feb 15, 2013

oh really? which ones?

@domenic
Contributor

domenic commented Feb 15, 2013

From http://blog.nodejs.org/2013/02/15/node-v0-8-20-stable/

http: Do not let Agent hand out destroyed sockets (isaacs)
http: Raise hangup error on destroyed socket write (isaacs)

Hmm not sure.

@tj
Contributor Author

tj commented Feb 15, 2013

hmm, worth a try, I'll update our node

@tj
Contributor Author

tj commented Feb 15, 2013

no dice

@7Ds7

7Ds7 commented Feb 20, 2013

This is not just happening with knox but anything with socket.io.

I am not a node expert by any means, but I can reproduce this error when a client closes the window without notifying the socket.

E.g., on Android v4.0.2, closing a tab that is listening on a socket from the tab manager does not send a disconnect or fire window.onbeforeunload. When a request from another browser then tries to send to that hung-up socket, the server crashes with a "socket hang up" error.

@domenic
Contributor

domenic commented Feb 20, 2013

@7Ds7 that particular case is pretty much expected behavior, as outlined here. If the user hangs up on your socket, you of course will get a socket hang up error.

I'm pretty sure Knox either returns event emitters that you can listen to the "error" event on, or properly catches all "error" events and transforms them into err parameters to the callback. Willing to be proved wrong though.

@danmilon

I'm able to reproduce this in our code. Didn't have time to isolate in a test case yet.
Apparently S3 is not very fond of keeping idle connections alive.

< <Error><Code>RequestTimeout</Code><Message>Your socket connection to the server was not read from or written to within the timeout period. Idle connections will be closed.</Message><RequestId>AFB17A5C4F0A0B56</RequestId><HostId>zOLEoTDEZg8QAY9ZVzBTPNpFNvFBxXd5J1E62slzLuollhMHpLztnK0Z2aHuXi40</HostId></Error>

@rauchg
Contributor

rauchg commented Feb 23, 2013

Wow interesting find Dan


@nickfishman

I've encountered this as well, with the same timeout error reported by @danmilon.

My original use case was piping an HTTP response directly into S3 using putStream. I'd been using that for ~2 months and have never seen this issue. I ran into this for the first time today after I switched to putFile (I needed to add some local pre-processing so I now write to a temp file first). Not sure if there's a difference between putFile and putStream or it's purely coincidence.

I'm using somewhat old versions of knox (0.4.2) and Node (0.8.6). I'll update to the latest and let you all know if I see this again.

@kof
Contributor

kof commented Mar 18, 2013

I can reproduce this error reliably using node v0.8.21.

And I think I have nailed the issue: it happens if the maxSockets of the agent is lower than the number of requests we are doing.

If I set https.globalAgent.maxSockets = 50; and do 50 parallel requests - after some seconds the error will be there.

If I do 40 parallel requests - I am able to download thousands of files from S3.

Possible solutions:

  1. I think first of all it is a documentation issue in node as well as knox. Both of them should mention that the default Agent has maxSockets == 5. Node should mention this not only where the maxSockets option is described but also in the 3-4 other places where users read how to create requests.
  2. Knox could set the maxSockets value for its engine to something much higher than 5, e.g. 500, because knox will often be used with multiple connections per host. Knox could also expose and document a maxSockets option which is then forwarded to the Agent.
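
A minimal sketch of the second suggestion, using only Node core: give the S3 traffic its own https.Agent with an explicit maxSockets instead of relying on the global default. The limit of 50, the host pattern and the helper name s3RequestOptions are placeholders for illustration, and whether knox would accept an agent this way is an assumption, not its documented API. Requests built like this are also unsigned, so this only illustrates the pooling part.

```javascript
const https = require('https');

// A dedicated connection pool for S3 traffic; 50 is an arbitrary illustrative limit.
const s3Agent = new https.Agent({ maxSockets: 50 });

// Hypothetical helper: builds the options object one would pass to https.request().
function s3RequestOptions(bucket, key) {
  return {
    host: bucket + '.s3.amazonaws.com',
    path: '/' + encodeURI(key),
    method: 'GET',
    agent: s3Agent // requests share this pool rather than https.globalAgent
  };
}
```

Requests beyond the limit are queued by the agent until a socket frees up, so the value mainly controls how many connections to the same host are open at once.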

@kof
Contributor

kof commented Mar 18, 2013

Possibly node could throw something more meaningful than socket hang up in this special case?

@kof
Contributor

kof commented Mar 18, 2013

or the queueing logic is somewhere wrong in node ...?

@danmilon

@kof, could you share the code to reproduce this?

@kof
Contributor

kof commented Mar 19, 2013

It's a script with some dependencies on the main project. I need to reduce it to a purely reproducible snippet, but I can post the original script if somebody wants to play with it.

@kof
Contributor

kof commented Mar 19, 2013

setting agent=false solves the issue for me too.

@domenic
Contributor

domenic commented Mar 19, 2013

setting agent=false? Can you explain?

@substack says the problem can be solved by calling https.request with { pool: false }. That might be the way to go for Knox? Or at least make it an option that is on by default?

@kof
Contributor

kof commented Mar 19, 2013

I mean agent=false on request options:

https://github.com/LearnBoost/knox/blob/master/lib/client.js#L139

http://nodejs.org/docs/v0.8.21/api/all.html#all_http_request_options_callback

There is no documented option pool=false, but agent=false will do exactly this thing:

"false: opts out of connection pooling with an Agent, defaults request to Connection: close."

This will fix the issue, but it also makes it possible to open an unlimited number of sockets to the same host, so one can run into OS limits.

I suppose this is the reason why the maxSockets option exists.

  1. I don't understand why Node's default is that low (5)
  2. It seems like there is an error in Node's pool logic

It looks like the right way would be to pass our own Agent instance with a good default maxSockets value, one which will work for all OSs, but when the limit is reached, Node's pool logic will be an issue again.
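
As a sketch of what agent=false means at the level of Node core, the helper below copies a set of request options and opts them out of pooling. The names withoutPooling and unpooledRequest are made up for this example and are not part of knox or Node.

```javascript
const https = require('https');

// Returns a copy of the request options with connection pooling disabled.
function withoutPooling(options) {
  return Object.assign({}, options, { agent: false });
}

// Hypothetical wrapper: every request gets its own socket instead of one from the pool.
function unpooledRequest(options, onResponse, onError) {
  const req = https.request(withoutPooling(options), onResponse);
  req.on('error', onError);
  return req; // the caller still has to call req.end()
}
```

The trade-off is the one described above: nothing caps the number of sockets any more, so the caller has to limit concurrency itself.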

@ghost

ghost commented Mar 19, 2013

I misspoke earlier. { agent: false } is the correct thing. You should basically never, ever use anything other than { agent: false } in any program ever. The default value in core is completely absurd.

@andrewrk
Contributor

@substack can you link to more information about this? sounds like you're onto something but obviously the casual reader should look more into it before taking your word at face value

@tj
Contributor Author

tj commented Mar 19, 2013

fwiw we're not using node for this anymore but even increasing the max sockets to a very high number fixed nothing, so it seems like the comment about it being related to the pooling logic could be right

@kof
Contributor

kof commented Apr 19, 2013

I just had the problem again. I wrote a script which has to process 17k images. Now I got it working without hangups. Hangups do still happen, but I just use retry https://npmjs.org/package/retry

Without retry I had up to 80 hangups for 17k requests.

I am not sure if knox might want to use retry in the client. But it might be a good idea.

@domenic
Contributor

domenic commented Apr 19, 2013

@kof @puckey:

I don't understand this statement: "don't overuse a connection." knox creates for every api call a separate connection.

I think Node.js itself maintains a connection pool, and that's part of what the agent: false business disables. So that was a pretty good find, most likely.

@gabceb

That isn't actually the same problem as in this thread, which discusses "socket hang up" errors. See the OP. That is a classic error when you set an incorrect Content-Length; I'd try asking the multiparty maintainers. Maybe you're trying to upload something smaller than 5 MB? (Not sure how multiparty works.)

@addisonj

Sorry to hear you're still having this problem. I wonder if they fixed it in later versions of Node? It'd be worth trying on Node.js 0.10.4 if you can, or at least 0.8.20 per above.

@addisonj

The crash was our fault. We simply forgot to listen for an error event.

I will see if I can get things on 0.8.20, but for now we have something written by one of our guys, https://github.com/jergason/intimidate, which retries with back-off. We hit s3 so often I would not be surprised to see failed requests every so often but this has worked well so far.
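
A hand-rolled sketch of retrying with exponential back-off, for readers who want to see the idea without pulling in a library. This is not the API of the retry or intimidate packages mentioned in this thread; the function names and the options (retries, baseMs, maxMs) are invented for illustration.

```javascript
// Delay before retry number `attempt` (0-based): baseMs * 2^attempt, capped at maxMs.
function backoffDelay(attempt, baseMs, maxMs) {
  return Math.min(maxMs, baseMs * Math.pow(2, attempt));
}

// Calls operation(cb); on error, waits and tries again, up to `retries` extra attempts.
function retryWithBackoff(operation, options, callback) {
  let attempt = 0;

  function run() {
    operation(function (err, result) {
      if (!err) return callback(null, result);
      if (attempt >= options.retries) return callback(err);
      const delay = backoffDelay(attempt, options.baseMs, options.maxMs);
      attempt++;
      setTimeout(run, delay);
    });
  }

  run();
}
```

An upload could then be wrapped as operation, for example a function that calls client.putFile(src, dest, cb), assuming that callback follows the usual (err, res) convention. Note that only errors passed to the callback trigger a retry here; a non-200 response would need its own check.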

@gabceb

gabceb commented Apr 19, 2013

Thanks @domenic. I am following up on this issue on the multiparty repo

@domenic
Contributor

domenic commented Apr 19, 2013

@addisonj that's awesome! Submit a pull request to add it to our "Beyond Knox" section in the readme :).

jergason added a commit to jergason/knox that referenced this issue Apr 19, 2013
Per discussion in issue Automattic#116, adding a blurb about intimidate, a wrapper for retriable uploads with exponential backoff, to the Beyond Knox section of the readme.
@stalbot

stalbot commented Nov 18, 2013

Anyone have a good way to reproduce this? We are getting this issue occasionally (i.e. once every few days on a lightly used production machine), but I can't reproduce deliberately in any environment. I will try some of the fixes here and see if they end the issue, but it would be great to have a way to make sure.

@domenic
Contributor

domenic commented Nov 18, 2013

@stalbot if you can create a reliable reproduction I will jump all over fixing this.

@dweinstein
Contributor

Disabling the http agent seems like a way for some people to shoot themselves in the foot. If you're not careful and make too many requests (e.g., doing a head or headFile request) in a loop without limiting the number of simultaneous connections, you're likely to end up with something like the following result:

Possibly unhandled Error: connect EMFILE
    at errnoException (net.js:901:11)
    at connect (net.js:764:19)
    at net.js:842:9
    at asyncCallback (dns.js:68:16)
    at Object.onanswer [as oncomplete] (dns.js:121:9)

And wonder WTF....

Therefore I made sure the agent was re-enabled (e.g., knoxClient.agent = require('https').globalAgent;) and tweaked the maxSockets field instead.
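
Independent of the agent settings, the EMFILE failure above can be avoided by capping how many requests are in flight at once. The following is a minimal sketch of such a limiter; runLimited is a hypothetical helper, not part of knox, and each task is assumed to be a function that takes a Node-style callback.

```javascript
// Runs tasks (functions taking a callback) with at most `limit` in flight at a time.
// Calls done(err) on the first error, or done(null, results) in task order.
function runLimited(tasks, limit, done) {
  const results = new Array(tasks.length);
  let next = 0;     // index of the next task to start
  let active = 0;   // number of tasks currently in flight
  let finished = 0; // number of tasks that completed successfully
  let failed = false;

  if (tasks.length === 0) return done(null, results);

  function launch() {
    while (!failed && active < limit && next < tasks.length) {
      const index = next++;
      active++;
      tasks[index](function (err, value) {
        active--;
        if (failed) return;
        if (err) { failed = true; return done(err); }
        results[index] = value;
        finished++;
        if (finished === tasks.length) return done(null, results);
        launch(); // a slot freed up, start the next task if any
      });
    }
  }

  launch();
}
```

Each task could wrap one head or headFile call, so a loop over thousands of keys never opens more than limit sockets at a time.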

@domenic
Contributor

domenic commented May 14, 2014

I assume that's on a Mac, which has a pretty horrible global limit on file descriptors?

@dweinstein
Contributor

Yes that was on a Mac.

@jonathanong

@dweinstein just bump your limit for maximum # of sockets

@mtharrison

I can reproduce this by creating 1000 empty text files (touch {1..1000}.txt) and then trying to push them up to S3. 90% of the time there will be a socket hangup. I also get this exact same thing with a Go package I'm working on, putting each request in its own goroutine. Inspecting the tcpdump, I can see Amazon is sending an RST packet and closing the connection, which returns an ECONNRESET. The only way I can think of solving this is baking in retry.

@kof
Contributor

kof commented Jul 12, 2014

is there a limit for concurrency?


@mtharrison

The only limit I've seen documented is the one mentioned above. But that pertains to reusing a connection, which I'm not doing.

@tj
Contributor Author

tj commented Jul 12, 2014

FWIW s3 has a concurrent access limit on like-named prefixes. I can't find the thread but apparently due to how they store things if you have say "foo-{1,10000}" and try say 1500 concurrent requests many will fail, but if you have "{1,100000}-foo" it should be fine. This is screwing us pretty hard right now, looking at replacing s3 altogether and just using Riak for our primary access and storing in s3 as a backup
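
If the prefix behaviour described above applies, one common way to spread keys out is to put a short hash in front of each name so that consecutive names no longer share a prefix. The sketch below swaps the numeric prefix from the comment for an MD5-derived one; spreadKey, the 4-character length and the '-' separator are all arbitrary choices for illustration.

```javascript
const crypto = require('crypto');

// Prepends the first 4 hex characters of the name's MD5 hash, so that
// "foo-1", "foo-2", ... end up under different leading prefixes.
function spreadKey(name) {
  const prefix = crypto.createHash('md5').update(name).digest('hex').slice(0, 4);
  return prefix + '-' + name;
}
```

Because the prefix is derived from the name itself, the stored key can always be recomputed from the original name when reading the object back.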

@riyadhalnur

I found this http://stackoverflow.com/questions/27392923/uploading-to-s3-with-node-knox-socket-hang-up when searching for a solution to this problem. Looks like a lot of people have this problem with our good ol' friend S3.

var req = client.putStream(res, elem._id, headers,function(err,s3res){
    if(err) console.log(err);
    console.log(s3res);
}).end();

This was the last comment. Worked for the guy who posted the solution.

@JamesTheHacker

The following solved it for me:

var http = require('http')
http.globalAgent.maxSockets = 2048

Some suggest 1024, but I was still getting some errors. I upped to 2048 and works fine.

@Albert-IV

Albert-IV commented Aug 3, 2016

I know this isn't StackOverflow, but for anyone coming into this issue in the future, make sure you call .end() on the knoxClient.get() call.

I.e., this will result in a socket hang up...

client.get('s3-key')
.on('response', handleResponse)
.on('error', handleError);

and this will not

client.get('s3-key')
.on('response', handleResponse)
.on('error', handleError)
.end();

A pretty obvious coding error, but one that I usually end up running into when I add Knox to a new project. 😆
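
The same behaviour can be seen with Node core alone, which may help explain the hang up: as far as I can tell, a client request is not sent until end() (or a write) is called, so the server sees an idle connection and eventually drops it. The sketch below is a plain http example against a local address, not knox code; getStatus is a hypothetical helper.

```javascript
const http = require('http');

// Issues a GET to a local server and reports the response status code.
function getStatus(port, callback) {
  const req = http.request({ host: '127.0.0.1', port: port, path: '/', agent: false });

  req.on('response', function (res) {
    res.resume(); // drain the body so the socket can close
    callback(null, res.statusCode);
  });
  req.on('error', callback);

  // Without this call the request is never finished, and the server side
  // eventually closes the idle connection, which surfaces as "socket hang up".
  req.end();
}
```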

@DaGaMs

DaGaMs commented Nov 20, 2016

FWIW, I've had this problem when I was passing the request object into a callback function in Loopback. Long story short, the only thing that "fixed" the problem was to do getFile(url, (err, res) => {callback(null, res)}) - in that way the data is streamed correctly.
