This repository has been archived by the owner on Nov 28, 2020. It is now read-only.

Prototype for running a subset of benchmarks on a pull request #58

Merged
merged 5 commits into nodejs:master on Aug 3, 2017

Conversation

gareth-ellis
Member

fyi @mhdawson @AndreasMadsen
Following on from nodejs/node#8157, here is a prototype for running a subset of benchmarks on two builds of node.

We'll need the ability to set the following in Jenkins:
mandatory BRANCH
mandatory PULL_ID
mandatory CATEGORY
optional RUNS
optional FILTER

The script will then use those environment variables to do the right thing.

I can't add these myself, due to access permissions on Jenkins.

I'd suggest we at least give this a go, and have agreement on how much work the benchmark machine can reasonably do - I believe the priority should be the daily regression runs, though expanding these to run benchmarks from the node source should also be a priority. Running this prototype will give us some insight into how the runs perform on the benchmark machine, and into how best to implement them.

I should also add some clean-up work - perhaps archiving the output files for future use.

Thoughts?

@gareth-ellis changed the title from "Pr benchmark" to "Prototype for running a subset of benchmarks on a pull request" on Aug 23, 2016
@AndreasMadsen
Member

Great, this is exactly what I would like to see. Perhaps we can estimate the running time and schedule the job accordingly. That would allow for the daily regression runs.

@mhdawson
Member

I think we want to be able to test the current head, so can we make PULL_ID optional? In terms of a comparison point, maybe we want to make BASE (currently BRANCH) and TARGET (currently PULL_REQUEST) more flexible so they can be a git hash.

@AndreasMadsen
Member

AndreasMadsen commented Sep 2, 2016

@gareth-ellis thanks for spending your time on this.

@gareth-ellis
Member Author

gareth-ellis commented Sep 2, 2016

Right,

I started to write a response, but decided it sounded too hacky. I've reworked it so there are now two use cases. They're described in the usage, but here they are:

This script has two use cases:

Use case 1: We want to test the impact of a PR on a branch.
To run this, declare:

CATEGORY = a category of tests to run - folders in benchmark
BRANCH = the branch the test should be based off, e.g. master
PULL_ID = the pull request that contains the changes to test

Use case 2: We want to compare two branches, tags or commits.
To run this, declare:

CATEGORY = a category of tests to run - folders in benchmark
BASE = the branch/tag/commit the test should be based off, e.g. master
TARGET = the branch/tag/commit to compare against BASE

The following are optional across both use cases:
RUNS = defaults to empty
FILTER = defaults to empty
MACHINE_THREADS = used for building node; defaults to all threads on the machine
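
Roughly, the Jenkins shell step would then end up doing something like the sketch below. This is only a sketch - ./node-base and ./node-target are placeholder build directories, and the exact wiring onto node core's benchmark/compare.js is an assumption rather than anything fixed by this PR:

# Build the two trees selected by BASE/TARGET (or BRANCH/PULL_ID) into
# ./node-base and ./node-target first, then compare them with node core's tooling.
RUNS_ARG=${RUNS:+--runs $RUNS}
FILTER_ARG=${FILTER:+--filter $FILTER}
node benchmark/compare.js --old ./node-base/node --new ./node-target/node \
  $RUNS_ARG $FILTER_ARG $CATEGORY > output.csv
cat output.csv | Rscript benchmark/compare.R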

@gareth-ellis
Member Author

gareth-ellis commented Sep 4, 2016

I thought it would be a good idea to do a run and time each benchmark folder/category, to give us an idea of how long this may take. It's not run on the benchmark machine, but it will give us a ballpark number:

So far (the test is running still):
domain took 17 seconds
crypto took 9360 seconds
buffers took 107026 seconds (yes, nearly 30 hours...)
querystring took 1926 seconds
streams took 1898 seconds
assert took 2336 seconds
dgram took 21892 seconds
path took 1954 seconds
arrays took 538 seconds
net took 13691 seconds
doc_img took 0 seconds
module took 1446 seconds
process took 220 seconds

I will update this as more of the tests complete. However, we will need to make sure that, at a minimum, only a small subset of the buffer tests are run; otherwise we'd risk having the benchmark machine miss the regular regression runs.
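
For the buffers case, node core's benchmark/compare.js has a --filter option, so a job could be restricted to a handful of files; the binary paths and the "swap" pattern below are only illustrative:

# Compare only benchmark/buffers files whose names match "swap", 30 runs each.
node benchmark/compare.js --old ./node-base/node --new ./node-pr/node \
  --runs 30 --filter swap buffers > compare.csv
cat compare.csv | Rscript benchmark/compare.R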

@AndreasMadsen
Member

Here is a list of the time usage for each benchmark that I made some time ago:
https://gist.github.com/AndreasMadsen/c6bb866f9bba318bdcc10869d25e8e34

Note that this is for a single run and not comparing, so you need to multiply by 60.

@AndreasMadsen
Member

AndreasMadsen commented Oct 1, 2016

@gareth-ellis what is the status of this? I got reminded by nodejs/node#8873.

@gareth-ellis
Member Author

I believe we're waiting for the OK from @mhdawson, and then it will need adding to the Jenkins CI by someone with access.

@mhdawson
Member

mhdawson commented Oct 3, 2016

I was waiting to see the full list mentioned by Gareth, and then we can decide what's reasonable to let people run. To start with, I don't think we want to let people launch 30-hour runs... I'm thinking a limit of something like an hour makes sense. Is there a way to shorten the ones that run longer yet still get useful data?

@AndreasMadsen
Member

AndreasMadsen commented Oct 3, 2016

@mhdawson please see https://gist.github.com/AndreasMadsen/c6bb866f9bba318bdcc10869d25e8e34 for the full list. Just multiply by 60 for default parameters.

edit: the list can be generated with:

node benchmark/run.js --format csv -- \
  arrays assert buffers child_process crypto dgram domain \
  es events fs http misc module net path process querystring \
  streams string_decoder timers tls url util

@mhdawson
Member

mhdawson commented Oct 4, 2016

OK, when I find time I'll add the numbers together in https://gist.github.com/AndreasMadsen/c6bb866f9bba318bdcc10869d25e8e34 to get the totals for the individual groups. Still interested in any input on this question:
Is there a way to shorten the ones that run longer yet still get useful data?

@AndreasMadsen
Member

Here are the aggregated results: https://gist.github.com/AndreasMadsen/9f9739514116488c27c06ebaff813af1

Remember to multiply by 60 for default parameters. We can also add a --dry-run option to benchmark/run.js if that helps. That way you could run:

node benchmark/run.js $FILTER -- $CATEGORY

to get a list of all the executed files / parameters. This could then be cross-referenced with some estimated time list. Of course such a time estimate could be far off if something introduced a deoptimization or the default parameters were changed.
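
A rough sketch of how that cross-referencing could look, assuming the proposed --dry-run prints one benchmark file per line and that we keep a CSV of "file,seconds" estimates (neither the flag nor the CSV exists yet - both are part of the proposal):

# List what would run, then sum the per-file estimates; multiply by 60 to go
# from one default run to a 30-runs-per-binary comparison.
node benchmark/run.js --dry-run $FILTER -- $CATEGORY | sort > planned.txt
sort -t, -k1,1 estimated-times.csv > estimates.csv
join -t, planned.txt estimates.csv \
  | awk -F, '{ total += $2 } END { print "estimated seconds:", total * 60 }'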

@mhdawson
Member

Thanks for putting together the overview. I like the idea of a --dry-run that provides a running time estimate; we could then start off by limiting runs to something like 4 hours. I know people could launch multiple runs, but I'm hoping people would avoid abusing the job.

Assuming the aggregate numbers are in seconds, a multiplier of 60 still seems like it would give very long runs for some of the groupings. I'm assuming that often people will want to run one of the groupings that they think might be affected by a change. Would some shorter number be possible while still getting useful results?

@AndreasMadsen
Member

AndreasMadsen commented Oct 20, 2016

we could then start off by limiting runs to something like 4 hours.

My "person memory" is quite poor, but I think I talked to @mcollina and @thealphanerd about getting more resources for benchmarking at NodeConf EU.

Assuming the aggregate numbers are in seconds a multiplier of 60 still seems like we'd get very long runs for some of the groupings.

All correct.

I'm assuming that often people will want to run for one of the groupings that they think might be affected by change.

For some groupings yes, for others no. The groupings are based on each node core module, but an optimization rarely affects all parts of a module. Thus a subset of a grouping is typically used.

Would some shorter number be possible while still getting useful results ?

I would guess that some of the benchmarks perform many more iterations than they actually need to. But generally speaking it is impossible to say how many runs are actually needed.

I chose 30 (60 = 30 runs * 2 node versions) originally because that is where the error between a t-distribution and a normal distribution becomes fairly small (opinions vary, but this is the most optimistic number I have heard). When this error is small, having non-normally distributed benchmark values doesn't affect the statistical calculations that much.

In plain English: 30 is the most optimistic threshold between not safe and maybe safe.
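
To put rough numbers on that (my own illustration): a two-sided 95% confidence interval is

\bar{x} \pm t_{n-1,\,0.975} \cdot \frac{s}{\sqrt{n}}

and t_{29, 0.975} ≈ 2.045 versus z_{0.975} ≈ 1.960, so at n = 30 the interval is only about 4% wider than the normal approximation would suggest. At n = 10, t_{9, 0.975} ≈ 2.262, about 15% wider, which is why cutting the run count much below 30 quickly becomes risky.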

@gareth-ellis
Member Author

Bump.

@mhdawson, what do we need to do next to get this available for use? I appreciate we need to make sure we don't end up colliding with other benchmark jobs, but it would be good to have the facility available, even if only to a small group of people who are able to launch comparison jobs to start with. If we find ourselves needing more time than is available, then perhaps we should be looking at possibilities to expand capacity.

cc @AndreasMadsen

@AndreasMadsen
Member

AndreasMadsen commented Dec 25, 2016

@gareth-ellis I talked to some CTC members at NodeConf EU; they thought we should just get a dedicated server for this purpose.

@mscdex

mscdex commented Jan 4, 2017

+1 for a dedicated server

@AndreasMadsen
Member

I've created a request in the CTC group: nodejs/CTC#55

@joyeecheung
Member

The running time is only going to get longer as more benchmarks are added. Maybe we can figure out a reasonable time limit for new benchmarks added in the future and put that limit in the docs as a suggestion (and possibly try to migrate the old ones too)? I too think that some benchmarks might perform more iterations than actually needed. And when they do, people tend to cut down the n or the --runs when running them so they can finish the benchmarks in a reasonable amount of time anyway, and that defeats the whole point of all those iterations...

@gareth-ellis
Member Author

So yes, the length of time will increase as more benchmarks are added; however, the "issue" is the length of time it would take to run the entire suite. I say issue, but really the benchmarks would be run as part of targeted testing, so if we find ourselves running the whole suite then something isn't really ideal (see my comment from 4 September - I had to stop it after a full weekend of running!). Perhaps we could also add a select subset for general testing, but to start with I think we need to make sure that we narrow down the benchmarks to be run sufficiently, rather than reducing the number of iterations per test or trying to limit the amount of time an individual benchmark can spend running. If the benchmark gives us useful data in an efficient way, that should be what matters most.

I wonder, once we've got this running, whether I should look at putting together a calculator for a given set of settings - I can't imagine it being too tricky...

@joyeecheung
Member

joyeecheung commented Jan 4, 2017

@gareth-ellis I agree that we should narrow down the subset of benchmarks first, but IMO having a soft limit can help avoid things getting out of hand. And if the whole benchmark suite takes longer than 24 hours to run, then it would be hard to do daily regression runs, wouldn't it?

@gareth-ellis
Member Author

Indeed, but the idea of this item, to start with at least, is to have the benchmarks available to run against, for example, a proposed PR, in order to verify potentially performance-risky changes.

I agree the next step, once this works (and perhaps we learn more...), would be to implement a subset for regular regression runs.

@joyeecheung
Member

@gareth-ellis Ah yes, sorry if I've gone too far off topic; I just haven't seen the issue of running time brought up anywhere else :P

@mhdawson
Member

Just a note that I've been told Intel will be donating a machine through Nearform. Once we get that, we can use the new machine to re-start work related to this PR.

@mcollina
Member

mcollina commented May 18, 2017

We are currently in the process of getting that wired up to the network. It's taking a while, but it's out of our hands (physical things), as we are also in the process of changing providers.

@mhdawson
Member

@mcollina any update on the progress of getting access to the machine?

@mhdawson
Member

@gareth-ellis do you have any cycles to work on this now? We just got 2 new machines set up, so if you have time to get together we could probably review/land this PR, and then set up a CI job that runs the scripts targeted at those machines.

@gareth-ellis
Member Author

I potentially have some time next week; failing that, the week after.

I read the background; it seems the machine is being set up to run tests at the moment, so we can assume it should have all the tools we need (though we might need to install Rscript, which is used for some of the statistics).

Then it will be a matter of creating the job in Jenkins - is there any chance of getting access to configure the benchmark jobs? I tried to view the config for a benchmark last week, but I don't seem to be able to see it; I guess this means I don't have the right permissions.
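
For reference, getting Rscript onto the Ubuntu 16.04 machine should just be something like the following (package names are my assumption; the optional R packages are used by some of the benchmark plotting scripts and may not be needed by compare.R itself):

# Ubuntu 16.04: r-base provides Rscript.
sudo apt-get update && sudo apt-get install -y r-base
# Optional R packages (assumption - only needed for the plotting scripts):
sudo Rscript -e "install.packages(c('ggplot2','plyr'), repos='https://cloud.r-project.org')"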

@gareth-ellis
Member Author

Hi @mhdawson ,

When would be a good time to look at this?


@mhdawson
Member

mhdawson commented Aug 1, 2017

New machines have been donated by Intel which we can use to run these tests without interfering with the regular benchmarking jobs.

Next steps would be:

  1. Land this PR
  2. Michael to create the new job
  3. Gareth to set up the job to run the script and experiment with how best to make the different options available.

@mhdawson
Member

mhdawson commented Aug 2, 2017

@gareth-ellis I created this job for you to edit/configure to run the micro-benchmarks, and set it so that it will only run on one of the 2 new machines from Intel:

https://ci.nodejs.org/view/Node.js%20benchmark/job/benchmark-node-micro-benchmarks/

For now it will only run on:

test-nearform_intel-ubuntu1604-x64-1

Later on we can configure it to run on either of the 2 new machines if that's appropriate.

I think you should have everything you need to proceed. You can land the PR and then work on configuring the Jenkins job. Let me know if there is anything else you need.

@gareth-ellis
Member Author

@nodejs/benchmarking does anyone else want to take a look before I land? It would be good to get a few more LGTMs.

@mcollina
Member

mcollina commented Aug 3, 2017

I have a couple of PRs I would like to try this on.

@gareth-ellis merged commit 3d788c2 into nodejs:master on Aug 3, 2017
@gareth-ellis
Member Author

Landed in 3d788c2

@gareth-ellis
Member Author

Opened #127 to track progress.
