child_process: emit .send() errors in next tick #4277

Closed
phillipj wants to merge 1 commit

Conversation

phillipj
Member

Triaging a couple of FIXMEs from @bnoordhuis in internal/child_process.

Should this be considered a fix for the next major release as it changes from emitting errors synchronously, to async in the next tick?
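
To illustrate the behaviour difference (a minimal sketch, not code from this PR; the self-fork and the message are made up):

'use strict';
const { fork } = require('child_process');

if (process.send) {
  // Child: stay alive briefly, then exit.
  setTimeout(() => process.exit(0), 100);
} else {
  const child = fork(__filename);
  child.disconnect();   // close the IPC channel so the next send() must fail
  child.send('hello');  // fails with a "channel closed" error

  // With the current synchronous emit, the 'error' event fires inside send()
  // itself, before this listener exists, so the process dies with an unhandled
  // 'error' event. With the next-tick emit proposed here, the listener below
  // is registered in time and handles the error.
  child.on('error', (err) => console.error('send failed:', err.message));
}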

@@ -510,7 +510,7 @@ function setupChannel(target, channel) {
     if (typeof callback === 'function') {
       process.nextTick(callback, ex);
     } else {
-      this.emit('error', ex); // FIXME(bnoordhuis) Defer to next tick.
+      process.nextTick(emitErrorNT, this, ex);
Member

wondering if it's worthwhile to create the separate emitErrorNT function.

This ought to work:

process.nextTick(()=>this.emit('error', ex)); 

Member Author

Sure, more explicit 👍 Saw this in a lot of other modules, so thought it was the preferred pattern.

Member

I think we're starting to move in the direction of using arrow functions where possible to improve readability and, in some cases, performance. Either approach would work here, but the arrow function just seems more explicit.
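
For reference, the two variants under discussion side by side (a sketch; assuming emitErrorNT simply forwards to emit):

// (a) a named helper, invoked via process.nextTick with explicit arguments
function emitErrorNT(self, err) {
  self.emit('error', err);
}
process.nextTick(emitErrorNT, this, ex);

// (b) an inline arrow function, keeping the emit visible at the call site
process.nextTick(() => this.emit('error', ex));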

@mscdex mscdex added the child_process Issues and PRs related to the child_process subsystem. label Dec 14, 2015
@Fishrock123 Fishrock123 added the semver-major PRs that contain breaking changes and should be released in the next major version. label Dec 21, 2015
@Fishrock123
Contributor

Should this be considered a fix for the next major release as it changes from emitting errors synchronously, to async in the next tick?

Yep.

@@ -672,7 +672,6 @@ function setupChannel(target, channel) {
return control;
}


Contributor

unnecessary whitespace change.

@trevnorris
Contributor

One comment but LGTM otherwise if CI is happy.

CI: https://ci.nodejs.org/job/node-test-pull-request/1095/

@phillipj
Member Author

Updated, thanks for reviewing!

CI didn't pass, and I couldn't find any hints in the Jenkins console output. Any other places to look for hints?

@trevnorris
Contributor

@phillipj Here's one of the errors (from https://ci.nodejs.org/job/node-test-binary-windows/416/RUN_SUBSET=0,VS_VERSION=vs2015,label=win10/tapTestReport/test.tap-15/)

  not ok 15 test-child-process-send-err.js
# c:\workspace\node-test-binary-windows\RUN_SUBSET\0\VS_VERSION\vs2015\label\win10\test\parallel\test-child-process-send-err.js:27
# channel.open(stream);
# ^
# 
# Error: EINVAL: invalid argument, uv_pipe_open
# at Error (native)
# at Object.<anonymous> (c:\workspace\node-test-binary-windows\RUN_SUBSET\0\VS_VERSION\vs2015\label\win10\test\parallel\test-child-process-send-err.js:27:9)
# at Module._compile (module.js:408:26)
# at Object.Module._extensions..js (module.js:415:10)
# at Module.load (module.js:354:32)
# at Function.Module._load (module.js:311:12)
# at Function.Module.runMain (module.js:440:10)
# at startup (node.js:141:18)
# at node.js:997:3

I checked previous CI runs and this wasn't failing in those. So I'm going to assume it's related, but not sure why.

@phillipj
Member Author

Thanks @trevnorris. I'll get a win10 machine set up and see if I can reproduce that failure.

@phillipj
Member Author

phillipj commented Jan 7, 2016

Alright, so luckily there was no problem reproducing the Windows failure. uv_pipe_open in libuv/win/pipe.c is quite different from its libuv/unix/pipe.c equivalent.

Other than banging my head against the wall, I haven't made much progress. I've tried to figure out what exactly the uv_file parameter of uv_pipe_open() expects. In test-child-process-send-err.js I'm currently sending in a common.ArrayStream, which is obviously not a file handle. I have tried sending in an actual file handle, created with fs.openSync(), which sadly doesn't pass the asserts in win/pipe.c either.

Being a C/C++ rookie and not knowing Visual Studio well doesn't help either. I've got VS set up building node from source, but I haven't figured out how to run test-child-process-send-err.js while setting a breakpoint in win/pipe.c - which is a breeze in Xcode.

Would appreciate any pointers in the right direction here! E.g. what I should pass as the uv_file argument, which is probably really simple, or how to get Visual Studio to behave :)

@jasnell
Member

jasnell commented Jan 15, 2016

@phillipj ... any luck on this so far? @trevnorris @bnoordhuis ... any ideas?

@phillipj
Member Author

@jasnell afraid not. Just pushed an update to clarify what I was trying to do with fs.openSync() to get a hold of a file descriptor which would pass the assertions in libuv/win/pipe.c - no luck so far.

@saghul
Member

saghul commented Jan 15, 2016

What uv_pipe_open gets must be a valid pipe, or it won't pass the Windows handle duplication code, IIRC.

@phillipj
Member Author

Thanks @saghul! I'll give it a shot

@phillipj
Member Author

Finally got the tests passing locally on Windows 🎆

After @saghul's comment, and hints in the libuv code that a parent process has to exist on Windows for this to pass, I refactored the test to actually get valid IPC channels created. Long story short, the tests now use child_process.fork() and run the assertions inside the children.

Could you do a CI run again if it looks good, @trevnorris?
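
Roughly the shape of the test refactor described above (a sketch under assumptions, not the actual test-child-process-send-err.js; the disconnect() trigger and the assertions are illustrative):

'use strict';
const assert = require('assert');
const { fork } = require('child_process');

if (process.argv[2] === 'child') {
  // The child gets a real IPC channel from fork(), so there is no need to
  // fake a stream or hand-roll a pipe fd on Windows.
  let sendReturned = false;
  process.once('error', (err) => {
    assert.ok(err instanceof Error);
    // Passes only if the error is emitted on the next tick, i.e. after
    // send() has already returned.
    assert.ok(sendReturned, 'expected the error to be emitted asynchronously');
    process.exit(0);
  });
  process.disconnect();   // close the channel so the following send() fails
  process.send('ping');
  sendReturned = true;
} else {
  const child = fork(__filename, ['child']);
  child.on('exit', (code) => assert.strictEqual(code, 0));
}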

@trevnorris
Contributor

@phillipj
Member Author

Linting errors, that's just embarrassing - sorry about that. Synced against master and fixed style issues against the updated ruleset.

The ARM test which failed (https://ci.nodejs.org/job/node-test-binary-arm/680/RUN_SUBSET=4,nodes=pi1-raspbian-wheezy/tapTestReport/test.tap-55/) seems to have a flaky history; see #4055. @Trott, have you seen this test being flaky after your PR got merged?

@Trott
Member

Trott commented Jan 21, 2016

@phillipj No recent failures that I'm aware of. I'm going to go back and look at the change that test was written for and see if there's a way to test it that doesn't involve the "if it takes longer than one second, the bug is there" approach, which is going to be inherently flaky.

@Trott
Member

Trott commented Jan 21, 2016

@phillipj Honestly, I think that test isn't actually valuable and should probably just be removed. But I'll take that up in a separate issue and/or PR.

Trott added a commit to Trott/io.js that referenced this pull request Jan 21, 2016
test-http-exit-delay has a flaky history. Examination of the bug it was
written to find suggests that the test should be removed.

* The test is trying to find a delay of up to 1 second, but the delay
may also be much smaller than that. So the test will not catch the bug
all the time. It is therefore flaky when the bug exists.

* Experience has shown that the test is flaky as well when the bug is
absent in the code because it can sometimes take slower devices (such as
the Raspberry Pi devices that we have in continuous integration) more
than the allowed one second to run. Increasing the timeout for those
devices will make the test pass but will also mean that the test isn't
really testing anything because the delay it is trying to catch was a
delay of up to one second.

I don't think this is an appropriate test to run once on CI. If there is
to be a test for the issue in question, it should be a benchmark test
that is run a large number of times. We don't really have such tests in
CI yet

I would argue that this test is actively a problem. It does not reliably
catch the issue it is supposed to catch, nor can it likely be made to do
so. (To do so would likely require converting it to a benchmarking test
as previously described. We don't run those in CI, at least not at this
time.)

Because this test will have both false positives and false negatives,
especially on the slower devices, it contributes to a culture of
dismissing failed tests. It does not reliably identify an issue nor does
it reliably pass on a working code base. This test should be removed.

Ref: nodejs#4277
@phillipj
Member Author

Thanks for opening a PR deleting the test at lightning speed, @Trott 👍

Trott added a commit that referenced this pull request Jan 22, 2016
Ref: #4277
PR-URL: #4786
Reviewed-By: James M Snell <[email protected]>
Reviewed-By: Brian White <[email protected]>
MylesBorins pushed a commit that referenced this pull request Feb 17, 2016
MylesBorins pushed a commit that referenced this pull request Feb 18, 2016
@phillipj
Member Author

What's your best advice on triggering flaky tests, @Trott? I finally got a native win10 setup, which made things a lot easier to work with, but I haven't been able to reproduce that previous cluster failure yet. So far I've repeated these two commands quite a lot by hand:

  1. python tools/test.py --mode=release message parallel sequential -J
  2. python tools/test.py --mode=release parallel/test-cluster* -J

but I'm sure there are better ways of doing it.

@Trott
Member

Trott commented Feb 19, 2016

@phillipj On Windows, I've used two different inelegant techniques:

  • Write a batch script to run the job over and over. I may have referred to http://stackoverflow.com/a/32152981/436641 or something like that. It's been a while and I am not Windows savvy.
  • Put 10 or 100 copies of the test in its own directory and edit vcbuild.bat so there's a command that runs the test in that directory 10 or 100 times. (If you do 10 copies of the test 10 times, you get 100 consecutive runs.)

It may be worthwhile to run a stress test on CI against your branch to confirm that the test still fails there.

@jasnell
Member

jasnell commented Feb 19, 2016

LGTM

@phillipj
Member Author

Thanks @Trott! I was not able to reproduce any failures with the script below :/ Would appreciate a CI run to see if it pops up again.

@echo off
setlocal enabledelayedexpansion

for /l %%x in (1, 1, 100) do (
   Release\node.exe test\parallel\test-cluster-shared-leak.js
   rem use delayed expansion so the exit code is read on each iteration,
   rem not expanded once when the loop block is parsed
   echo Run #%%x: !errorlevel!
)

@Trott
Member

Trott commented Feb 22, 2016

@phillipj So far, 2 failures in 590 runs at https://ci.nodejs.org/job/node-stress-single-test/521/nodes=win10-1p/console. Key piece of info that might explain why you're not seeing the issue on your own machine: That's a single-processor host. Same exact thing on a multiprocessor host has 1580 runs so far with 0 failures.

Failures look like this:

not ok 1 test-cluster-shared-leak.js
# events.js:155
#       throw er; // Unhandled 'error' event
#       ^
# 
# Error: write EPIPE
#     at exports._errnoException (util.js:859:11)
#     at ChildProcess.target._send (internal/child_process.js:611:20)
#     at ChildProcess.target.send (internal/child_process.js:507:19)
#     at sendHelper (cluster.js:723:8)
#     at send (cluster.js:518:5)
#     at cluster.js:493:7
#     at SharedHandle.add (cluster.js:86:3)
#     at queryServer (cluster.js:485:12)
#     at Worker.onmessage (cluster.js:433:7)
#     at ChildProcess.<anonymous> (cluster.js:737:8)

I'm stress testing the code from your branch in this PR. I suppose I could/should run a stress test against master too. If the issue already exists there, then maybe we mark it as flaky (again, insert heavy sigh here) and move on. If the issue does not exist in master, then OMG, please don't land this. :-)

@phillipj
Member Author

@Trott nice! I don't have access to CI because of the security lockdown nowadays, but my fingers are crossed this is not introduced by this branch :)

If these failures don't show up on CI against master, I could of course try to reproduce on a single-CPU host, as long as you give me some hints about what the CI stress testing actually means/performs.

@Trott
Member

Trott commented Feb 22, 2016

Now stress testing against master: https://ci.nodejs.org/job/node-stress-single-test/522/nodes=win10-1p/console

@Trott
Member

Trott commented Feb 22, 2016

@phillipj 0 failures on master branch so far. 1500 runs and counting. (I set it for 9999 total.)

In the meantime, I'm not sure how you might best emulate the stress test in your local environment. @joaocgreis may have some ideas or suggestions.

@Trott
Member

Trott commented Feb 22, 2016

6197 runs without a failure on current master. Either this PR makes the test flaky or something odd is going on (like there's an issue on one Azure host that isn't on the other Azure host).

@phillipj
Member Author

Sigh, this is a tough nut to crack. I'll have to assume it's this branch making the test flaky. I'll try to get my hands on a single-CPU machine and hopefully reproduce the failure.

@joaocgreis
Member

@phillipj the best way I've found so far is with a virtual machine. You can set the number of processors, and with VirtualBox you can even set a CPU cap while it's running. I've had a test running in a loop and made it fail and pass by setting the cap to 10% and back.

I could give you the exact script node-stress-single-test uses if CI weren't down, but it's pretty much a refined version of what you pasted above. The only real difference is that it uses the test runner to run the test, something like python tools/test.py parallel/test-cluster-shared-leak (without test/ and .js).
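
A local stand-in for that kind of loop, driving the test runner from Node itself (hypothetical helper, not the actual node-stress-single-test script; run it from the repository root):

'use strict';
const { spawnSync } = require('child_process');

// Run the test through the test runner repeatedly and stop at the first
// failure, which is usually enough to start investigating.
for (let i = 1; i <= 100; i++) {
  const result = spawnSync(
    'python', ['tools/test.py', 'parallel/test-cluster-shared-leak'],
    { stdio: 'inherit' });
  console.log(`Run #${i}: exit code ${result.status}`);
  if (result.status !== 0) break;
}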

@joaocgreis
Member

I confirmed on my Windows 10 VM (running single core, no cap) that this makes the test flaky. It is not a problem with the slaves.

@phillipj
Member Author

@joaocgreis much appreciated!

@Trott
Member

Trott commented Feb 23, 2016

@joaocgreis You confirmed that @phillipj's branch makes it flaky? If it's not too much trouble and if you didn't already do it, can you confirm that it's not flaky on master? Don't want @phillipj to have to do a bunch of work if it's not his code causing the issue...

@joaocgreis
Member

It is not flaky on master: it ran 10k times without failing. For this PR's head, it failed twice in 1000 runs.

MylesBorins pushed a commit that referenced this pull request Mar 2, 2016
(same test-http-exit-delay commit as above)
This changes the behaviour of error events emitted when process.send()
fails, from synchronously to asynchronously.
@phillipj phillipj force-pushed the emit-cp-errors-nt branch from ab5589e to 3c38e87 Compare March 9, 2016 21:45
@phillipj
Member Author

I assume this PR should be closed in favor of #5251, which fixes a lot more of these inconsistent emits. Though as I mentioned in #5251 (comment), it would be nice to ensure this exact change doesn't merge flakiness into master.

Capping the CPU as @joaocgreis suggested worked like a charm. It was easy to reproduce while in sync with master from around 21 January, with 922d538 as the parent commit. Being in sync with today's master seems to fix the issue, though, as I've not been able to reproduce it after rebasing on top of master.

I've updated the PR and would appreciate a new stress test to verify it's okay now, as a means of removing the uncertainty from #5251. @Trott would you be so kind?

@Trott
Member

Trott commented Mar 11, 2016

@phillipj Are you asking for a stress test for this PR on Windows?

@phillipj
Member Author

@Trott Yes. On second thought, I could start the stress test myself now. I'll bug you on IRC for build details if that's okay.

@phillipj
Member Author

Stress test passed: https://ci.nodejs.org/job/node-stress-single-test/nodes=win10-1p/553/

Closing this in favor of #5251. Will reopen if that PR doesn't land.

@phillipj phillipj closed this Mar 12, 2016
@thefourtheye
Contributor

@phillipj Cool. Thanks for confirming that this indeed works as expected :-)

scovetta pushed a commit to scovetta/node that referenced this pull request Apr 2, 2016
Labels
child_process  Issues and PRs related to the child_process subsystem.
semver-major  PRs that contain breaking changes and should be released in the next major version.
9 participants