Prototype for running a subset of benchmarks on a pull request #58
Conversation
Great, this is exactly what I would like to see. Perhaps we can estimate the running time and schedule the job accordingly. That would allow for the daily regression runs. |
I think we want to be able to test the current head, so can we make PULL_ID optional? In terms of a comparison point, maybe we want to make the BASE (currently BRANCH) and TARGET (currently PULL_REQUEST) more flexible so they can be a git hash. |
…ripts in node/benchmark
@gareth-ellis thanks for spending your time on this. |
Right, I started to write a response, but decided it sounded too hacky. I've reworked it so there are two current use cases. They're described in the usage; in short: CATEGORY = a category of tests to run (folders in benchmark). Use case 1: we want to compare a pull request against a base branch. Use case 2: we want to compare two branches, tags or commits. The following are optional across both use cases: RUNS and FILTER. |
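As a rough illustration of the two modes (a sketch only: the script name, the variable names for use case 2, and all values are placeholders based on this discussion, not taken from the actual script):

```bash
# Use case 1: benchmark a pull request against a base branch
# (values and script name are hypothetical)
BRANCH=master PULL_ID=12345 CATEGORY=timers RUNS=10 ./run-benchmarks.sh

# Use case 2: compare two branches, tags or commits directly
BASE=v6.x TARGET=master CATEGORY=timers FILTER=immediate ./run-benchmarks.sh
```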
I thought it would be a good idea to do a run and time each benchmark folder/category to give us an idea of how long this may take. It's not run on the benchmark machine, but it will give us a ballpark number so far (the test is still running). I will update as more of the tests complete; however, we will need to make sure, at a minimum, that only a small subset of the buffer tests is run, otherwise we'd risk having the benchmark machine miss the regular regression runs. |
Here is a list of the time usage for each benchmark that I made some time ago: Note that this is for a single run and not comparing, so you need to multiply by 60. |
@gareth-ellis what is the status of this? I got reminded by nodejs/node#8873. |
I believe we're waiting for the OK from @mhdawson, and then it will need to be added to the Jenkins CI by someone with access. |
I was waiting to see the full list mentioned by @gareth-ellis, and then we can decide what's reasonable to let people run. To start with, I don't think we want to let people start 30-hour runs... I'm thinking a limit of something like an hour makes sense. Is there a way to shorten the ones that run longer yet still get useful data? |
@mhdawson please see https://gist.github.com/AndreasMadsen/c6bb866f9bba318bdcc10869d25e8e34 for the full list. Just multiply by 60 for default parameters. Edit: the list can be generated with:
|
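One way to produce such a timing list (a sketch, assuming a node core checkout with a built ./node binary and the standard benchmark/ layout) is to time each benchmark file per category:

```bash
#!/usr/bin/env bash
# Sketch: time every benchmark file, grouped by category (folder under benchmark/).
# Run from the root of a node core checkout that has a built ./node binary.
cd benchmark
for category in */; do
  for bench in "$category"*.js; do
    [ -e "$bench" ] || continue
    start=$(date +%s)
    ../node "$bench" > /dev/null
    end=$(date +%s)
    echo "$bench $((end - start))"   # prints: <file> <seconds>
  done
done
```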
Ok, when I find time I'll add the numbers together in https://gist.github.com/AndreasMadsen/c6bb866f9bba318bdcc10869d25e8e34 to get the totals for the individual groups. Still interested in any input on this question: |
Here are the aggregated results: https://gist.github.com/AndreasMadsen/9f9739514116488c27c06ebaff813af1 - remember to multiply by 60 for default parameters. We could also add a mode for node benchmark/run.js $FILTER -- $CATEGORY that just lists all the executed files / parameters. This could then be cross-referenced with some estimated time list. Of course, such a time estimate could be far off if something introduced a deoptimization or the default parameters were changed. |
Thanks for putting together the overview. I like the idea of a --dry-run that provides a running time estimate; we could then start off by limiting runs to something like 4 hours. I know people could launch multiple runs, but I'm hoping people would avoid abusing the job. Assuming the aggregate numbers are in seconds, a multiplier of 60 still seems like we'd get very long runs for some of the groupings. I'm assuming that often people will want to run one of the groupings that they think might be affected by a change. Would some shorter number be possible while still getting useful results? |
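As an illustration of the dry-run idea (a sketch only; the timings.txt format is an assumption, e.g. the output of the timing loop sketched earlier), the estimate could simply sum the recorded per-file timings for whatever the filter matches:

```bash
#!/usr/bin/env bash
# Sketch: estimate the total running time of a filtered benchmark run.
# timings.txt is assumed to hold "<benchmark-file> <seconds>" per line,
# and RUNS defaults to 60 (30 runs x 2 node binaries when comparing builds).
FILTER=${1:-.}
RUNS=${2:-60}
awk -v filter="$FILTER" -v runs="$RUNS" '
  $1 ~ filter { total += $2 * runs }
  END { printf "estimated time: %.1f hours\n", total / 3600 }
' timings.txt
```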
My "person memory" is quite poor, but I think I talked to @mcollina and @thealphanerd about getting more resources for benchmarking at NodeConf EU.
All correct.
For some groupings yes, for others no. The groupings are based on each node core module; however, an optimization rarely affects all parts of a module. Thus a subset of a grouping is typically used.
I would guess that some of the benchmarks perform many more iterations than they actually need. But generally speaking, it is impossible to say how many runs are actually needed. I chose 30 (60 = 30 runs * 2 node versions) originally because that is where the error between a t-distribution and a normal distribution becomes fairly small (opinions vary, but this is the most optimistic number I have heard). When this error is small, having non-normally distributed benchmark values doesn't affect the statistical calculations that much. In plain English: 30 is the most optimistic threshold between not safe and maybe safe. |
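To put rough numbers on that (standard textbook values, not taken from this thread): at a 95% confidence level, the two-sided critical value of the t-distribution with 29 degrees of freedom is about 2.045 versus 1.960 for the normal distribution, so at n = 30 the confidence intervals differ by only about 4% in width; at n = 10 the t value is about 2.262, a difference of roughly 15%.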
Bump. @mhdawson, what do we need to do next to get this available for use? I appreciate we need to make sure we don't end up colliding with other benchmark jobs, but having the facility available (even if only to a small group of people to start with who are able to launch comparison jobs) would be a good start. If we find ourselves needing more time than is available, then perhaps we should be looking at possibilities to expand capacity. cc @AndreasMadsen |
@gareth-ellis I talked to some CTC members at NodeConf EU; they thought we should just get a dedicated server for this purpose. |
+1 for a dedicated server |
I've created a request in the CTC group: nodejs/CTC#55 |
The running time is only going to get longer as more benchmarks are added. Maybe we can figure out a reasonable time limit for new benchmarks added in the future and put this limit in the docs as a suggestion (and possibly try to migrate the old ones too)? I too think that some benchmarks might perform more iterations than actually needed. And if they do that, people tend to cut down the |
So yes, the length of time will increase as more benchmarks are added; however, the "issue" is the length of time it would take to run an entire suite. I say issue, but really the benchmarks would be run as part of targeted testing, so if we find ourselves running the whole suite, then this isn't really ideal (see my comment from 4th September - I had to stop it after a full weekend of running!). Perhaps we could also add a select subset for general testing, but to start with I think we need to make sure that we are narrowing down the benchmarks to be run sufficiently, rather than reducing the number of iterations per test or trying to limit the amount of time an individual benchmark can spend running. If the benchmark gives us useful data in an efficient way, then that should be most important. Once we've got this running, I may look at putting together a calculator for a given set of settings - I can't imagine it being too tricky. |
@gareth-ellis I agree that we should narrow down the subset of benchmarks first, but IMO having a soft limit can help to avoid things getting out of hand. And if the whole benchmark suite takes longer than 24 hours to run, then it would be hard to do daily regression runs, wouldn't it? |
Indeed, but the idea of this item, to start with at least, is to have the benchmarks available for running against, for example, a proposed PR to verify potentially performance-risky changes. I agree the next step once this works (and perhaps we learn more...) would then be to implement a subset for regular regression runs. |
@gareth-ellis Ah yes, sorry if I've gone off topic too far, just haven't seen the issue about running time brought up anywhere else :P |
Just a note that I've been told Intel will be donating a machine through Nearform. Once we get that, we can use the new machine to restart work related to this PR. |
We are currently in the process of getting that wired up to the network. It's taking a while, but it's out of our hands (physical things), as we are in the process of changing provider. |
@mcollina any update on the progress of getting access to the machine? |
@gareth-ellis do you have any cycles to work on this now? We just got 2 new machines set up, so if you have time to get together we could probably review/land this PR, and then set up a CI job that runs the scripts targeted at those machines. |
I potentially have some time next week, failing that the week after. I read the background; it seems the machine is being put in as a test machine at the moment, so we can assume it should have all the tools we need (we might need to install Rscript, which is used for some of the statistics). Then it will be a matter of creating the job in Jenkins - is there any chance of getting access to configure the benchmark jobs? I tried to view the config for a benchmark job last week but don't seem to be able to see it, so I guess this means I don't have the right permissions. |
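For reference, on an Ubuntu 16.04 host like the one mentioned below, Rscript comes with the base R package, so installing it would typically just be:

```bash
# Rscript is part of the r-base package on Ubuntu
sudo apt-get update && sudo apt-get install -y r-base
```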
Hi @mhdawson, when would be a good time to look at this? |
New machines have been donated by Intel which we can use to run these tests without interfering with the regular benchmarking jobs. Next steps would be to land this PR and then set up a CI job that runs the scripts on the new machines. |
@gareth-ellis I created this job for you to edit/configure to run the micro-benchmarks and set it so that it will only run on one of the 2 new machines from Intel: https://ci.nodejs.org/view/Node.js%20benchmark/job/benchmark-node-micro-benchmarks/ For now it will only run on: test-nearform_intel-ubuntu1604-x64-1. Later on we can configure it to run on either of the 2 new machines if that's appropriate. I think you should have everything you need to proceed. You can land the PR and then work on configuring the Jenkins job. Let me know if there is anything else you need. |
@nodejs/benchmarking anyone else want to take a look before I land? Would be good to get a few more LGTMs |
I have a couple of PRs I would like to try this on. |
Landed in 3d788c2 |
opened #127 to track progress |
fyi @mhdawson @AndreasMadsen
Following on from nodejs/node#8157, here is a prototype for running a subset of benchmarks on two builds of node.
We'll need to have the ability to set:
- BRANCH (mandatory)
- PULL_ID (mandatory)
- CATEGORY (mandatory)
- RUNS (optional)
- FILTER (optional)

in Jenkins - the script will then use those environment variables to do the right thing.
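For illustration, a Jenkins shell build step might look something like this (a sketch; the script name and all parameter values are placeholders, not taken from this PR):

```bash
#!/usr/bin/env bash
# Hypothetical Jenkins build step: the job exposes these as build parameters
# and the prototype script reads them from the environment.
export BRANCH=master        # base branch to compare against (mandatory)
export PULL_ID=12345        # pull request to benchmark (mandatory, placeholder)
export CATEGORY=timers      # benchmark folder to run (mandatory)
export RUNS=10              # runs per benchmark (optional)
export FILTER=immediate     # only run benchmarks matching this pattern (optional)

./run-benchmarks.sh         # placeholder name for the script added in this PR
```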
I can't add these myself, due to access permissions on Jenkins.
I'd suggest we give this a go at least, and have agreement regarding the amount of work the benchmark machine can reasonably do - I believe the priority should be the daily regression runs, though expanding these to run benchmarks from node source should also be a priority. I believe running this prototype will give us some insight into how the runs perform on the benchmark machine, and how best to implement the run.
I should also add some clean up work - perhaps archiving away the output files for future use.
Thoughts?