investigate flaky test-cpu-prof #27611
I think this is caused by the function finishing too fast, before it gets sampled in the profile. I have a rewritten version of the test here which may be more stable than the current one, though it runs longer (the current one takes < 1 second on my laptop while the new one takes about 3 seconds), because it intentionally keeps the profiled function running for much longer than the sampling interval (1000x) to make sure it shows up in the profile. I will see if https://ci.nodejs.org/job/node-test-commit/28455/ passes before opening a PR.
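(For illustration, a minimal sketch of that idea; the function name, the interval value, and the 1000x multiplier below are assumptions for the example, not the actual test fixture.)

```js
'use strict';

// Keep the profiled function busy for many multiples of the sampling
// interval so the profiler is very likely to catch it in a sample.
// V8's default sampling interval is roughly 1ms, so 1000x is about 1s.
const kSamplingIntervalMs = 1;
const kMultiplier = 1000;

function workload() {
  const end = process.hrtime.bigint() +
    BigInt(kSamplingIntervalMs * kMultiplier) * 1000000n;  // in nanoseconds
  let acc = 0;
  while (process.hrtime.bigint() < end) {
    acc += Math.sqrt(acc + 1);  // busy work that keeps this frame on the stack
  }
  return acc;
}

workload();
```

Running this with `node --cpu-prof fixture.js` should then reliably produce `workload` frames in the generated `.cpuprofile`.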
Although, on the other hand, another approach would be to not look for a specific function in the profile at all, but instead only make sure the profile exists and can be parsed. But this is kind of like the tick processor test: we want to make sure the generated profile can actually be analyzed by users and that the functionality doesn't get broken randomly by V8.
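(A rough sketch of what that more lenient check could look like; the helper name and directory handling are illustrative, not the actual test code.)

```js
'use strict';
const fs = require('fs');
const path = require('path');
const assert = require('assert');

// Lenient variant: only assert that a profile was written and that it
// parses as a CPU profile with some nodes and samples, without looking
// for any specific target function.
function assertValidProfiles(dir) {
  const files = fs.readdirSync(dir).filter((f) => f.endsWith('.cpuprofile'));
  assert.notStrictEqual(files.length, 0, 'expected at least one .cpuprofile');
  for (const file of files) {
    const profile = JSON.parse(fs.readFileSync(path.join(dir, file), 'utf8'));
    assert(Array.isArray(profile.nodes) && profile.nodes.length > 0);
    assert(Array.isArray(profile.samples) && profile.samples.length > 0);
  }
}
```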
From https://ci.nodejs.org/job/node-test-commit/28459/ there is an alarming DCHECK; see the failures on test-digitalocean-ubuntu1604_sharedlibs_container-x64-7:
And at a glance, I think the test failures in release builds all have frames where lineNumber = -1, so it might be an upstream bug. EDIT: it looks like the -1s are just for synthetic nodes, which makes sense.
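(For context, a small sketch of how such synthetic frames could be skipped when scanning a profile; the set of meta frame names is an assumption based on typical V8 profiles, and treating lineNumber === -1 as synthetic follows the observation above.)

```js
'use strict';

// V8 adds meta nodes such as (root) and (program) to every profile;
// they carry no script location, which shows up as lineNumber === -1.
const kSyntheticNames = new Set(
  ['(root)', '(program)', '(idle)', '(garbage collector)']);

function realFrames(profile) {
  return profile.nodes
    .map((node) => node.callFrame)
    .filter((frame) => !kSyntheticNames.has(frame.functionName) &&
                       frame.lineNumber !== -1);
}
```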
10:18:23 not ok 2340 sequential/test-cpu-prof
10:18:23 ---
10:18:23 duration_ms: 480.79
10:18:23 severity: fail
10:18:23 exitcode: -15
10:18:23 stack: |-
10:18:23 timeout
10:18:23 ...
And again... I wondered if these particular failures are just because the debug build is slow, but nope: on a successful run, this test takes 4 seconds or so on the debug build.
The test is failing consistently on debug builds in CI. Last success was about 36 hours ago. /cc @joyeecheung @addaleax (based on
This takes 29 seconds to finish on a debug build for me locally... maybe we should split it into multiple tests instead.
Split test-cpu-prof.js into multiple files for different test cases so it's easier to find the problematic one if it flakes. Also move the split tests into parallel. PR-URL: #28170 Refs: #27611 Reviewed-By: Gireesh Punathil <[email protected]> Reviewed-By: Anna Henningsen <[email protected]> Reviewed-By: Yongsheng Zhang <[email protected]> Reviewed-By: Colin Ihrig <[email protected]> Reviewed-By: Sam Roberts <[email protected]> Reviewed-By: Trivikram Kamat <[email protected]>
@Trott just a minor follow-up to your notes in IRC: all of the test-softlayer-ubuntu1604_sharedlibs_container-x64-? Jenkins nodes are running on the same single host server, test-softlayer-ubuntu1604-docker-x64-1 (which you have access to if you're brave enough to try to debug locally). If it's true that you're only getting failures on test-softlayer-ubuntu1604_sharedlibs_container-x64-? machines and not the other sharedlibs machines, then it could be a local problem. But you mentioned test-softlayer-ubuntu1604_sharedlibs_container-x64-7, which isn't a thing; there's test-digitalocean-ubuntu1604_sharedlibs_container-x64-7, but that's on one of the two DigitalOcean hosts that run these containers (one has the even numbers and the other has the odd numbers). We also have one running on Joyent, so that may show up occasionally.
I did a quick scan, and it seems that after #28170 the flake has only shown up on FreeBSD, due to the target function not being found in the samples, e.g. https://ci.nodejs.org/job/node-test-commit-freebsd/26826/nodes=freebsd11-x64/consoleFull
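(For reference, the strict check that flakes here boils down to something like the sketch below; the helper name and error message are illustrative, not the code in the actual test files.)

```js
'use strict';
const fs = require('fs');
const assert = require('assert');

// Strict variant: the profile must contain at least one frame for the
// function the fixture spends its time in. This is the assertion that
// fails when the sampler never happens to catch that frame.
function assertTargetFrame(profilePath, functionName) {
  const profile = JSON.parse(fs.readFileSync(profilePath, 'utf8'));
  const hits = profile.nodes.filter(
    (node) => node.callFrame.functionName === functionName);
  assert.notStrictEqual(hits.length, 0,
                        `expected frames for ${functionName} in ${profilePath}`);
}
```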
FreeBSD tends to be our most sensitive host in terms of "failures caused by running a bunch of processes all at once rather than just a few or even one at a time". I'm still on Team Move-It-To-Sequential, but I seem to be the only one. :-| I can get on Team More-Lenient-Checking though.
I opened #28210 to move them to sequential. If the flake ever shows up again, we should consider relaxing the tests.
The tests still fail after being split into multiple files (2 out of 30 runs in roughly 48 hours), and the causes are missing target frames in the samples. This patch moves them to sequential to observe whether the flakiness can be fixed when the tests are run on a system with less load. If the flake ever shows up again even after the tests are moved to sequential, we should consider making the test conditions more lenient - that is, we would only assert that there are *some* frames in the generated CPU profile but not look for the target function there. PR-URL: #28210 Refs: #27611 Reviewed-By: Rich Trott <[email protected]> Reviewed-By: Michael Dawson <[email protected]> Reviewed-By: James M Snell <[email protected]>
I just saw this failure: https://ci.nodejs.org/job/node-test-commit-linuxone/nodes=rhel72-s390x/15156/testReport/junit/(root)/test/sequential_test_cpu_prof_dir_worker/
flaky on lots of platforms, it seems:
Refs: #27611 (comment) PR-URL: #32828 Reviewed-By: Anna Henningsen <[email protected]> Reviewed-By: Colin Ihrig <[email protected]> Reviewed-By: Myles Borins <[email protected]> Reviewed-By: Rich Trott <[email protected]>
https://ci.nodejs.org/job/node-test-commit-freebsd/26122/nodes=freebsd11-x64/console
test-digitalocean-freebsd11-x64-2
@joyeecheung