Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Investigate flaky parallel/test-child-process-fork-regr-gh-2847 #8950

Closed
mscdex opened this issue Oct 6, 2016 · 5 comments · Fixed by #8954
Closed

Investigate flaky parallel/test-child-process-fork-regr-gh-2847 #8950

mscdex opened this issue Oct 6, 2016 · 5 comments · Fixed by #8954
Labels
child_process Issues and PRs related to the child_process subsystem. test Issues and PRs related to the tests.

Comments

@mscdex
Copy link
Contributor

mscdex commented Oct 6, 2016

  • Version: master
  • Platform: aix
  • Subsystem: child_process

This test failure recently occurred on AIX: https://ci.nodejs.org/job/node-test-commit-aix/1317/nodes=aix61-ppc64/console

Output:

not ok 74 parallel/test-child-process-fork-regr-gh-2847
# /home/iojs/build/workspace/node-test-commit-aix/nodes/aix61-ppc64/test/parallel/test-child-process-fork-regr-gh-2847.js:58
#         throw err;
#         ^
# 
# Error: channel closed
#     at ChildProcess.target.send (internal/child_process.js:540:16)
#     at Worker.send (cluster.js:65:28)
#     at Socket.<anonymous> (/home/iojs/build/workspace/node-test-commit-aix/nodes/aix61-ppc64/test/parallel/test-child-process-fork-regr-gh-2847.js:33:14)
#     at Socket.g (events.js:291:16)
#     at emitNone (events.js:86:13)
#     at Socket.emit (events.js:185:7)
#     at TCPConnectWrap.afterConnect [as oncomplete] (net.js:1075:10)
  ---
  duration_ms: 0.380

/cc @mhdawson ?

@mscdex mscdex added child_process Issues and PRs related to the child_process subsystem. test Issues and PRs related to the tests. labels Oct 6, 2016
@gibfahn
Copy link
Member

gibfahn commented Oct 6, 2016

cc/ me, @gireeshpunathil

@santigimeno
Copy link
Member

Can reproduce it on FreeBSD too

@gireeshpunathil
Copy link
Member

This has a long history, and to put it short: the second send() call is vulnerable due to race condition in the send() function with the server close. Depending on the timing, we will see failure in establishing a connection (line 32), or failure in sending data (line33).

The former seem to have been taken care(line 57), while the failure at send is not.

@santigimeno , what are your thoughts?

@santigimeno
Copy link
Member

santigimeno commented Oct 6, 2016

@gireeshpunathil I've seen a couple of different errors:

  • It can happen that the error is not ECONNREFUSED but channel closed. That's the error I think you're describing, and it should be allowed.
  • If the second socket connects before the first socket, as the first socket has not defined callback, it will emit an error instead of using the callback, thus causing a failure. This I have tried to fix in test: fix test-child-process-fork-regr-gh-2847 #8954.

I'll add a fixup for the first case to that PR.

@gireeshpunathil
Copy link
Member

OK, I missed to see your PR. Yes - your point 1 is what I think has happened in this case. If you are accommodating in the PR great, thanks!

santigimeno added a commit to santigimeno/node that referenced this issue Oct 8, 2016
It's not guaranteed that the first socket that tries to connect is the
first that succeeds so the rest of assumptions made in the test are not
correct.
Fix it by making sure the second socket does not try to connect until
the first has succeeded.
The IPC channel can already be closed when sending the second socket. It
should be allowed.
Also, don't start sending messages until the worker is online.

Fixes: nodejs#8950
PR-URL: nodejs#8954
Reviewed-By: Michael Dawson <[email protected]>
Reviewed-By: Brian White <[email protected]>
Reviewed-By: James M Snell <[email protected]>
Reviewed-By: Colin Ihrig <[email protected]>
jasnell pushed a commit that referenced this issue Oct 10, 2016
It's not guaranteed that the first socket that tries to connect is the
first that succeeds so the rest of assumptions made in the test are not
correct.
Fix it by making sure the second socket does not try to connect until
the first has succeeded.
The IPC channel can already be closed when sending the second socket. It
should be allowed.
Also, don't start sending messages until the worker is online.

Fixes: #8950
PR-URL: #8954
Reviewed-By: Michael Dawson <[email protected]>
Reviewed-By: Brian White <[email protected]>
Reviewed-By: James M Snell <[email protected]>
Reviewed-By: Colin Ihrig <[email protected]>
Fishrock123 pushed a commit that referenced this issue Oct 11, 2016
It's not guaranteed that the first socket that tries to connect is the
first that succeeds so the rest of assumptions made in the test are not
correct.
Fix it by making sure the second socket does not try to connect until
the first has succeeded.
The IPC channel can already be closed when sending the second socket. It
should be allowed.
Also, don't start sending messages until the worker is online.

Fixes: #8950
PR-URL: #8954
Reviewed-By: Michael Dawson <[email protected]>
Reviewed-By: Brian White <[email protected]>
Reviewed-By: James M Snell <[email protected]>
Reviewed-By: Colin Ihrig <[email protected]>
MylesBorins pushed a commit that referenced this issue Nov 22, 2016
It's not guaranteed that the first socket that tries to connect is the
first that succeeds so the rest of assumptions made in the test are not
correct.
Fix it by making sure the second socket does not try to connect until
the first has succeeded.
The IPC channel can already be closed when sending the second socket. It
should be allowed.
Also, don't start sending messages until the worker is online.

Fixes: #8950
PR-URL: #8954
Reviewed-By: Michael Dawson <[email protected]>
Reviewed-By: Brian White <[email protected]>
Reviewed-By: James M Snell <[email protected]>
Reviewed-By: Colin Ihrig <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
child_process Issues and PRs related to the child_process subsystem. test Issues and PRs related to the tests.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants