Skip to content
This repository has been archived by the owner on Apr 22, 2023. It is now read-only.

deps: don't busy loop in v8 cpu profiler thread #25268

Conversation

tunniclm
Copy link

@tunniclm tunniclm commented May 8, 2015

Backport 6964a9e0685fa186d9d9b7907be17505e839db1a from upstream v8.

Original commit message:

Make CPU profiler do not hog 100% of CPU.

Tick event processor should not stay in a tight loop
when there's nothing to do. It can go sleep until next sample event.

LOG=N
BUG=v8:3967
Committed: https://crrev.com/6964a9e0685fa186d9d9b7907be17505e839db1a
Cr-Commit-Position: refs/heads/master@{#28211}

Fixes #25137
Related: #9439, #8789

Backport 6964a9e0685fa186d9d9b7907be17505e839db1a from upstream v8.

Original commit message:

  Make CPU profiler do not hog 100% of CPU.

  Tick event processor should not stay in a tight loop
  when there's nothing to do. It can go sleep until next sample event.

  LOG=N
  BUG=v8:3967
  Committed: https://crrev.com/6964a9e0685fa186d9d9b7907be17505e839db1a
  Cr-Commit-Position: refs/heads/master@{#28211}

Fixes nodejs#25137
Related: nodejs#9439, nodejs#8789
@misterdjules
Copy link

LGTM, thank you!

@misterdjules
Copy link

@tunniclm Testing this PR, I cannot even reproduce the 100% CPU issue in the first place. I've tried to reproduce it by running:

$ node --prof -e "setInterval(function(){}, 10)"

without success. How do you reproduce this problem?

@tunniclm
Copy link
Author

@misterdjules The problem is with the CpuProfiler API, which is a native V8 API. The problem can be reproduced via a Node add on that starts the profiler in this way. The --prof option is not affected as the code path is different with that option (--prof writes events/samples to file, whereas CpuProfiler does runtime processing of the events between samples to keep an up-to-date mapping of code addresses.)

@tunniclm
Copy link
Author

An example addon to reproduce the problem (for Node 0.12 / v8 3.28), start_profiler.cc:

#include "node.h"
#include "v8.h"
#include "v8-profiler.h"

using namespace v8;

void Start(const FunctionCallbackInfo<Value>& args) {
    Isolate* isolate = Isolate::GetCurrent();
    HandleScope scope(isolate);
    isolate->GetCpuProfiler()->StartProfiling(String::NewFromUtf8(isolate, "OurProfile"));
}

void Initialize(Handle<Object> exports) {
    NODE_SET_METHOD(exports, "start", Start);
}

NODE_MODULE(start_profiler, Initialize)

With binding.gyp:

{
    "targets": [
        {
            "target_name": "start_profiler",
            "sources": [ "start_profiler.cc" ]
        }
    ]
}

And test.js:

require('./build/Release/start_profiler').start();
setInterval(function(){}, 10);

@mhdawson
Copy link
Member

lgtm

@misterdjules
Copy link

@tunniclm I just noticed that there is no regression test in the changes from V8 upstream. I think it would be helpful if we could come up with a regression test, as it would prevent us from breaking that again.

We would probably need to write it in a way similar to the tests in test/addons. If we can come up with something that works well, then we might also consider adding the addons tests suite to the tests suite that are run by most of the CI jobs (like for instance node-accept-pull-request).

I'm OK to land this without a regression test for now as such a regression would not be a severe issue, but I think it would help us to have a regression test sooner than later.

@misterdjules
Copy link

@tunniclm Also, thank you very much for the more detailed information. I was able to reproduce the problem and verify the fix in this PR, so LGTM.

@misterdjules
Copy link

Triggered node-accept-pull-request to land this PR, so it should be merged in v0.12 soon. Thank you @tunniclm!

misterdjules pushed a commit to misterdjules/node that referenced this pull request May 12, 2015
Backport 6964a9e0685fa186d9d9b7907be17505e839db1a from upstream v8.

Original commit message:

  Make CPU profiler do not hog 100% of CPU.

  Tick event processor should not stay in a tight loop
  when there's nothing to do. It can go sleep until next sample event.

  LOG=N
  BUG=v8:3967
  Committed: https://crrev.com/6964a9e0685fa186d9d9b7907be17505e839db1a
  Cr-Commit-Position: refs/heads/master@{#28211}

Fixes nodejs#25137
Related: nodejs#9439, nodejs#8789

PR: nodejs#25268
PR-URL: nodejs#25268
Reviewed-By: Michael Dawson <[email protected]>
Reviewed-By: Julien Gilli <[email protected]>
@misterdjules
Copy link

@joyent/node-coreteam The node-accept-pull-request Jenkins job keeps failing because test/simple/test-tls-server-verify.js takes more than 60 seconds to run on our Windows 2012 server Jenkins agent. It doesn't fail on that machine, it takes around 80 seconds and thus times out.

I made the decision to move forward with this and land this PR anyway since it was the only test failing, and I was able to make sure the test actually passes on two different Windows systems (2012 server and Windows 7).

As a result, landed in 80cdae8 and created #25279.

@misterdjules
Copy link

FWIW, also added a reference to this floating patch in the documentation on our V8 upgrade process in the Wiki.

@tunniclm
Copy link
Author

@misterdjules I have a test that could work, but it currently has a dependency on an npm module for measuring CPU usage. Do the tests in test/addons need to be self-contained, without external dependencies?

@misterdjules
Copy link

@tunniclm Currently the only tests with external dependencies that I'm aware of are the ones in test/timers and test/external. The former has a dependency on the libfaketime library, which is in tools/faketime, the latter has dependencies on several npm modules.

These tests are not run automatically by the CI platform when landing pull requests. I think the reason for that are:

  1. To keep the execution time of the tests suite that is run on every commit reasonable. The current timeout for a single test is 60 seconds, npm install or building libfaketime could take longer.
  2. To prevent errors in the dependencies installation process from making the tests suite fail. For instance, if for some reason a GitHub repository or the npm registry is not available (which is rare), then we don't want to block the CI platform.

What is the npm module that you'd need to use in this case? If we can't manage to write this test without a dependency on a npm module, we could explore different options:

  1. Commit the node_modules directory for this test and if it depends on a binary add-on, run npm rebuild before it runs. We would probably want to do that anyway instead of running npm install every time.
  2. Run that test only when doing a release. That way it doesn't run on every commit, but at least we can catch issues before we publish a new release.

We might need to reorganize the tests directory a bit.

@tunniclm Thoughts?

@tunniclm
Copy link
Author

tunniclm commented Jun 8, 2015

@misterdjules At present, the test relies on the usage module to gather the process CPU usage. I just ran the test on Windows and found usage doesn't support Windows. Having said that, the high CPU issue is not solved on Windows anyway.

@misterdjules
Copy link

@tunniclm Given that the high CPU usage issue is not solved on Windows, I would think that the best solution currently would be to:

  1. Add a regression test for this issue to test/external/v8-profiler-cpu-usage and include the npm dependencies in the repository (the whole node_modules directory).
  2. When running python tools/test.py external, run each test in test/external/ subdirectories and run npm rebuild as a pre-run step.
  3. Add external tests to the tests run by the CI platform during the node-test-commit job (on UNIX and on Windows.
  4. Move test/external/ssl-options to test/ssl/options-matrix. test/external/ssl-options currently takes several minutes to complete, and the tests run for every commit should take only a few seconds at most, with a default timeout of 60 seconds, so this test shouldn't run automatically.

It seems that adding Windows support to https://github.com/arunoda/node-usage could be reasonably simple, so that approach would work on Windows too when node-usage supports Windows CPU usage gathering.

Before moving on though, I would like to get some feedback from @joyent/node-tsc and especially @orangemocha and @mhdawson because that's a slight departure from what has been done in tests run automatically by the CI platform in the past.

@orangemocha
Copy link
Contributor

@misterdjules arguably we need a better way of defining test suites than just using subdirectories. Our test runner already supports some basic predicates to specify the expected test outcomes based on the run environment. We currently use this for defining flaky tests (see https://github.com/joyent/node/blob/v0.12/test/internet/internet.status#L5). One of the possible outcomes is SKIP, which means that the test is not run. By adding an environment var 'runtype' we might be able to use these status files to configure what gets run in CI versus other test run types.

But I think it would be best to defer these changes till after we have reconciled CI with io.js. So the directory restructuring you suggest sounds reasonable as an interim solution.

Regarding the npm rebuild phase, can that not be encapsulated in the test itself?

@misterdjules
Copy link

@orangemocha The npm rebuild phase could be encapsulated in the test itself, but it seems that even if abstracting something like common.npmRebuild(callback) in test/common.js, there would be some sort of duplication in every test that needs this mechanism. Having that in the tests driver would allow for more code reuse.

I don't have a very strong opinion though, if doing the npm rebuild phase in each test works and is implemented cleanly (with reusable code in test/common.js), it's already much better than having no test at all.

Thank you!

@orangemocha
Copy link
Contributor

@misterdjules if it can be implemented in a reusable way in test/common.js, I think that would be the best option. Otherwise we'd be moving test logic into the test runner, which seems less robust.

@mhdawson
Copy link
Member

@misterdjules in respect to node-usage I'd want to check its behaviour on pLinuz, zLinux and AIX

@mhdawson
Copy link
Member

@ misterdjules is there some existing doc which describes the different test "buckets" and when we believe each set should be run ? A table might be most useful that lists the bucket (directory name right now) when it run (part of make test, at PR accept time, or as an extended test etc.)

That would help me to respond to your question

As I understand it now the main difference is that we'd add running test/external to what the CI runs by default ?

@misterdjules
Copy link

@mhdawson There is no documentation about that, it would be great to have it and link it from the contributing guide, both on nodejs.org and in CONTRIBUTING.md in the repository.

Yes, my suggestion is to add tests in test/external to what the CI runs by default. The main goal though is to add tests that require external dependencies but that run quickly to the set of tests run by the node-test-commit jobs on Jenkins. There is more than one way to do that, one of them being to add test/external to the list of tests that are run by this job.

Currently, "quickly" means "in less than 60 seconds". I'm not sure if npm rebuild can run under 60 seconds reliably on all platforms when installing a small number of native dependencies, but I think it's worth experimenting instead of leaving these tests out of the default CI tests suite from the start.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants