speed up analyze-outcomes stage #144
Conversation
Signed-off-by: Dave Rodgman <[email protected]>
Force-pushed from f6500ab to c662496
OK to switch to grep, but only if the limitation is documented.
Not ok to use more threads than we have cores.
vars/analysis.groovy (outdated):

    # Compress the failure list if it is large (for some value of large)
    if [ "$(wc -c <failures.csv)" -gt 99999 ]; then
    -    xz failures.csv
    +    xz -0 -T8 failures.csv
Why -T8? AFAIK we don't have 8-core executors on the CI.

Beware of insisting on too much parallelism: this can create a load spike that slows down other executors, resulting in worse performance overall. You can't measure the impact just by running one job.

> Using -T8 (8 threads) reduces it further to 2.6s locally (expect ~5s on CI). There's no penalty for going too big on thread count, but no benefit beyond about 8-16.

Measuring on a machine with a different number of cores is not useful for guessing the performance on the CI.
Well, -T1000 does not degrade performance locally, so my assumption is that -T8 is fine and will benefit if the CI improves in the future. It looks like xz caps threads at the CPU count (locally, I see improvement up to -T8), so my expectation is that this will work well on the CI.

I am aware of the load-spike concern, but as this is a serial stage, it is a win to steal CPU capacity from the parallel stages.
I still don't see a reason to use -T8. Why not -T0?

As far as I understand, the actual number of threads is limited by memory consumption (which I don't expect to matter at -0), by the -T value, and by the file size, but not by the number of cores. Looking at the xz source code, lzma_cputhreads() is called to determine what hardware_threads_get() will return when using -T0, and then the encoder relies only on hardware_threads_get() and not lzma_cputhreads().

Since 8 is an arbitrary number here, I'd want to see a benchmark that justifies it. Otherwise please use -T0 (the number of cores seems like a sensible default choice).
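As a minimal sketch of the -T0 behaviour discussed here (the payload below is a stand-in, not the real pipeline file):

```shell
# Stand-in payload; the real input is the failures.csv produced by the CI.
head -c 2000000 /dev/zero > failures.csv

# With -T0, xz derives the thread count from the available cores at run
# time, instead of hard-coding a guess such as -T8. -0 is the fast,
# low-compression preset. On success, xz replaces the input file with
# failures.csv.xz.
xz -0 -T0 failures.csv

ls -l failures.csv.xz
```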
This is still unaddressed.
vars/analysis.groovy (outdated):

    # Compress the failure list if it is large (for some value of large)
    if [ "$(wc -c <failures.csv)" -gt 99999 ]; then
    -    xz failures.csv
    +    xz -0 -T8 failures.csv
> lzma, gzip, bzip2, zip are all worse.

I'd expect them to result in worse compression, but maybe we're OK with less compression but less CPU consumption?
The trade-off looked unfavourable in all cases; the closest was probably gzip, which is faster (in single-threaded mode; it doesn't have a multi-threaded mode) but produces output almost twice as big, so xz -T<n> wins overall.
Multithreaded mode may be counterproductive anyway, but have you tried pigz? (In a realistic scenario, not on your machine!)
pigz -1 does look considerably better for both perf (0.36s locally) and size (25 MB, about 2x better).
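A sketch of what the pigz variant would look like. gzip is used below as a stand-in so the snippet runs where pigz is not installed; pigz accepts the same flags and adds -p<n> to set the thread count:

```shell
# Stand-in payload; the real input is the failures.csv from the pipeline.
head -c 1000000 /dev/urandom > sample.csv

# "gzip -1" stands in for "pigz -1" here (same interface, single-threaded).
# -1 is the fastest compression level; -k keeps the input file around so
# both files can be compared.
gzip -1 -k sample.csv

ls -l sample.csv sample.csv.gz
```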
By the way, we need to keep in mind that another element of the compromise is storage. That's the reason we compress, not to save on download time/bandwidth. The vast majority of outcome files are never consumed, but all are stored for some time. I don't know how much we're spending on storage, and in particular how the (compressed) outcome files compare with logs.
We can always up the storage budget if we need to. As it stands, an extra 18 MB / PR run is unlikely to cause big budget issues IMO.
I'm tracking the pigz suggestion (thank you) as a separate follow-up in #150, since that requires additional CI work beyond the scope of this PR.
Signed-off-by: Dave Rodgman <[email protected]>
LGTM
CI is green. @gilles-peskine-arm, are you happy enough with this now to approve?
Signed-off-by: Dave Rodgman <[email protected]>
@gilles-peskine-arm I have updated to use
Looks good to me on code inspection.
Ok to merge with a passing CI run (either release or PR will do).
Improve perf of the analyse-outcomes stage of CI, which especially matters because it's a serial stage of the CI pipeline.
Currently, it takes around 247s to run xz on the CI.
Testing locally, the current xz command takes 130s. With -0, it takes 15s (so expect ~28.5s on CI). This increases the compressed size from 38 MB to 54 MB.
Using -T8 (8 threads) reduces it further to 2.6s locally (expect ~5s on CI). There's no penalty for going too big on thread count, but no benefit beyond about 8-16.
lzma, gzip, bzip2, zip are all worse.
Also, use grep instead of awk to extract failures, which saves a few more seconds.
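A hedged sketch of the grep-for-awk swap. The outcomes field layout below (semicolon-separated with a PASS/FAIL column) is assumed for illustration, not taken from the real pipeline:

```shell
# Hypothetical outcome lines: semicolon-separated, result in field 5.
printf '%s\n' \
  'platform;config;suite;case1;PASS;' \
  'platform;config;suite;case2;FAIL;oops' \
  > outcomes_sample.csv

# awk: splits every line into fields and tests one field exactly.
awk -F';' '$5 == "FAIL"' outcomes_sample.csv > failures_awk.csv

# grep: a plain substring match, faster on large files, at the cost of
# matching ';FAIL;' anywhere in the line -- the limitation the review
# asked to have documented.
grep ';FAIL;' outcomes_sample.csv > failures_grep.csv

cmp failures_awk.csv failures_grep.csv && echo 'same output'
```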
CI run: https://jenkins-mbedtls.oss.arm.com/blue/organizations/jenkins/mbed-tls-pr-head/detail/PR-8530-head/9/pipeline/