-
Notifications
You must be signed in to change notification settings - Fork 30k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
benchmark: refactor buffer benchmarks #26418
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we should be removing some of the parameter values, especially ones that can trigger different code paths.
In general, I think instead of making these changes, it'd be better to allow for a list of benchmark file filters in CI so we can exclude specific benchmarks we know do not exercise code paths being tested.
@mscdex I went through the results on a CI machine and checked if the outcome was mainly identical or not. Most of the cases are redundant and do not provide a actual gain. If anyone wants to execute specific parameters that are not provided it's also always possible to add those locally. If the outcome is substantial, it's always possible to add them back to the code base. The filter as you suggested them does not seem to solve the general problem that we have a lot of entries that are pretty much identical. E.g. a lot of buffer functions have optional length or encoding options but without actually changing much on any benchmark. I really tried hard to only remove ones that seemed well covered, so I kindly ask you to go through the removed entries to outline what entries you believe should still be kept and why. The problem does not only manifest itself on the CI. More important to me, it's an issue running these locally. I would like to switch from "benchmark most code branches of a specific function" to "cover most functions with your most common feature set". |
One problem with that is not everyone making changes to runtime code will necessarily know to add/check those (missing) cases that may be affected. The people who have committed benchmarks typically are much more knowledgeable about that sort of thing because they were making performance-related changes in the same PR when the benchmark (or new parameters) was added. |
In addition to being able to specify multiple filters in CI, I think being able to somehow specify groups of configurations for a particular benchmark file would be beneficial because some parameters do not need to be tried more than once or something they don't make sense when combined with other parameter values. |
@mscdex I absolutely agree that mostly the person who wrote the benchmark knows best what cases are effected. That's also the crux: we have a lot of benchmarks that only check edge cases and cases that were important for specific code that changes over time and might not follow the original intent anymore. Relying on the existent ones as a generic measure is not ideal if we test each parameter. Instead, we could just ask to add another benchmark if we feel like the code change should be tested and is not yet covered by the existent benchmarks. I believe that we gain most if the benchmarks focus on the generic feature without distinguishing each parameter. We really have a lot of redundant cases here. |
That would indeed be good. I'll open a help wanted issue for it. |
@mscdex would you be so kind and propose a way to reduce the overall runtime of the benchmarks? They are currently clearly very far away from "micro" benchmarks (the buffer benchmarks take more than a whole day!). We test a whole lot of cases that are super edge cases and only specific to a single code path that is rarely taken. We could go ahead and have a "generic" folder in which we have the benchmarks that are more generic and have other benchmarks that test a whole lot of other cases separate. I personally do not believe that we really need these though as it should be pretty straight forward to add the extra cases when required (it just requires adding some parameters). |
@nodejs/benchmarking PTAL |
@BridgeAR I think having the grouped configurations will go a long way to helping reduce the time taken without having to remove useful default parameter values. |
@mscdex I don't think that it'll have enough influence. It would be great to have the buffer benchmarks run in a reasonable time again, independent of the grouping. Especially as it will take quite some work setting up the groups. |
@nodejs/tsc PTAL. I think this is quite an important PR: our current If we only reduce the iterations for the current variations, we'd have to reduce it so significantly, that the benchmark might not have a high statistical significance anymore. Thus the combination of reducing the iteration count and removing some variations (mostly just edge cases). I think it would be good to have a recommendation what cases we should handle in our benchmarks (I suggest no or rarely edge cases) and what the maximum runtime of our benchmarks (I suggest an average of around 1-2 second per variation and to try to limit the variations to the most common ones). Therefore I would like the TSC to decide what should be done to reduce the overall runtime and if this PR is OK as it is or what alternatives should be used instead. |
Currently the buffer benchmarks take significantly too long to complete. This drastically reduces the overall runtime by removing obsolete checked variations and reducing the iteration count. It also improves the benchmarks by removing the deprecated `new Buffer(size)` usage and some other small improvements.
2e89053
to
f0b3656
Compare
FWIW, filtering already exists sort of but is completely not obvious. You sort of need to trick Jenkins into applying the necessary command-line parameters by sneaking them into seemingly unrelated boxes that happen to just be put into the command line at the right place. You can sneak in |
I definitely can see both sides of this and haven't yet formed a solid opinion. I don't think our default benchmarks should take forever to run, but I also think great care and consideration should be taken in adjusting them to run faster. I'm not saying that care and consideration hasn't been taken here. It's more like I don't know the appropriate level of care and consideration when I see it, so I don't know. |
I do not think it's possible to come to a conclusion on this. In order to provide an informed opinion, we would have to backtrack why every single variation was added, as this worsened over time. This is a gargantuan task, considering that this is changing 23 files. Microbenchmarks are useful only if they can detect a regression. However, they take so long that actually they are not worth running, essentially negating their purpose. Therefore, I'm in favor for landing this change. |
I think we have similar problems for other areas of the benchmarks as well. Instead of removing would it make any sense to start with a "quick" option that runs a subset for each category with a target of 1 hour or less for each category? That might be something people would be more likely to run. Once that was in place we could then decide if "quick" or "full" should be the default. |
Even if the filters would be more obvious: I don't think it's a good idea to have the default run too long. What is important to be able to run arbitrary configurations without having to change any code and that is indeed still possible with our CI (and that's great!).
Some hints: our number of iterations is almost always too high. A benchmark should indicate relevant results after significantly less iterations than we normally do. As a ball park: How many iterations would a user use as in a very tight loop? Increase that by factor ten and hopefully reach the CPU limit. For most benchmarks each of such variation should take a few hundred milliseconds but not more. I'll go through the changes to outline the main differences as I see them. |
6 days ago in #26418 (comment), @sam-github suggested that a vote was premature and encouraged further work towards consensus. But no one has responded since. We can encourage conversation (as @sam-github did) but we can't force it. Our options (other than "do nothing" or "close this PR") are to continue to encourage conversation or else to have a TSC vote. @mscdex @BridgeAR Any thoughts on Sam's comments? @sam-github If there is no further participation in conversation, would you be inclined to resume the vote? Or is people simply opting out of a discussion insufficient grounds to move to a TSC vote in your opinion? |
No strong opinion either way. |
@Trott I've already described alternatives to this PR that I believe would be better. As far as running benchmarks on a regular interval outside of PRs goes, that is something that could be considered outside of this current topic, as it's not a replacement for checking benchmarks before changes (which may appear to have performance implications) get merged in the first place. Also about default iterations, the reason they were/are so high is that you need to ensure you give V8 enough time to properly optimize/inline/deoptimize/etc. functions and code because chances are that is what would be happening in the real world in a long-lived node process. |
I think at least going through the PR and commenting on the specific things you do and don't object to would be helpful. That would allow landing all the non-controversial changes, and it would be easier to understand what your specific objections are (for bystanders, now being asked to vote :-(). At this point, I don't know what specifically you don't want removed. Also, the suggestion of a whole new config system for benchmarks is interesting, but seems pretty tangential to this. Wouldn't it just move code to config, and if the default config had less coverage, it doesn't seem like it would change your objections to removing "some parameter values". As is, this PR is too large to understand what is or is not controversial, and since benchmarks can always be added back again in the future, it looks like if it comes to a vote it will land as-is. |
I believe we always try and do that and no one is going to argue against that! At least to me it also seemed like we did that? The PR is open for almost three month and we discussed a couple of possibilities in the meanwhile (sadly without finding an agreement).
Absolutely.
There are definitely benchmark suites that test lots of cases and that require a very long run time to finish. I just do not see that being the case for any of our microbenchmarks that we currently have in our code base. Being able to run the individual benchmark quickly and often allows to detect regressions on PRs that touch this code base. But that is currently not possible due to the extreme run time. I have also not found any indication that the high iteration count and the big argument matrix brought as any benefit so far.
The V8 team has own benchmarks and tests that verify the behavior after significantly less iterations. The code should be optimized and inlined after a few iterations since there is no other code in-between. @nodejs/v8 @psmarshall I guess you could shed some light here? A couple cases that I removed would only hit the arguments default code path which would look like: function foo(input, encoding, callback) {
if (typeof encoding === 'function') {
callback = encoding;
encoding = 'utf8';
}
actualHeavyAPICall(input, encoding, callback);
}
foo('bar', function callback() {});
foo('bar', 'utf8', function callback() {}); I do not think we actually have to verify each argument variation all the time, since the main API is not changed by any of them. Instead, I would like to only test the main API with a few different input cases by default. On top of that we are always able to run more cases by explicitly providing the some arguments on the command line. Right now we do not run our buffer benchmarks at all because they take so long. By reducing the iterations (which makes them more like in a real world application) and removing edge cases, I believe that we make them more valuable. |
@nodejs/tsc is there anything I should / can do to move this forward? |
Ping @nodejs/tsc PTAL @apapirovski @fhinkel @Fishrock123 @joyeecheung @mhdawson @thefourtheye |
I'd easily +1 if we had a default and long option for the benchmarks with the default being what is proposed in this PR and the long option being what it was before. I'm not going to block but I can't convince myself to +1 either so I'll abstain. |
@Trott we agreed in the meeting that we should complete a vote. Do you want to manage completing the voting process? |
With 10 TSC members abstaining, 5 members voting in favor, and 3 members not (yet) voting, this passes and can land if CI is green. |
Currently the buffer benchmarks take significantly too long to complete. This drastically reduces the overall runtime by removing obsolete checked variations and reducing the iteration count. It also improves the benchmarks by removing the deprecated `new Buffer(size)` usage and some other small improvements. PR-URL: nodejs#26418 Reviewed-By: Matteo Collina <[email protected]> Reviewed-By: Gireesh Punathil <[email protected]> Reviewed-By: Peter Marshall <[email protected]> Reviewed-By: Franziska Hinkelmann <[email protected]> Reviewed-By: Anatoli Papirovski <[email protected]> Reviewed-By: Rich Trott <[email protected]>
Landed in 7b80268 |
Currently the buffer benchmarks take significantly too long to complete. This drastically reduces the overall runtime by removing obsolete checked variations and reducing the iteration count. It also improves the benchmarks by removing the deprecated `new Buffer(size)` usage and some other small improvements. PR-URL: #26418 Reviewed-By: Matteo Collina <[email protected]> Reviewed-By: Gireesh Punathil <[email protected]> Reviewed-By: Peter Marshall <[email protected]> Reviewed-By: Franziska Hinkelmann <[email protected]> Reviewed-By: Anatoli Papirovski <[email protected]> Reviewed-By: Rich Trott <[email protected]>
Currently the buffer benchmarks take significantly too long to
complete. This drastically reduces the overall runtime by removing
obsolete checked variations and reducing the iteration count.
It also improves the benchmarks by removing the deprecated
new Buffer(size)
usage and some other small improvements.Checklist
make -j4 test
(UNIX), orvcbuild test
(Windows) passes