
speed up analyze-outcomes stage #144

Merged: daverodgman merged 5 commits into Mbed-TLS:master from compression_perf on Jan 4, 2024

Conversation

@daverodgman (Contributor) commented Jan 2, 2024

Improve the performance of the analyze-outcomes stage of CI. This matters particularly because it is a serial stage of the CI pipeline.

Currently, it takes around 247s to run xz on the CI.

Testing locally, current xz command takes 130s. With -0, it takes 15s (so expect ~28.5s on CI). This increases compressed size from 38 MB to 54 MB.

Using -T8 (8 threads) reduces it further to 2.6s locally (expect ~5s on CI). There's no penalty for setting the thread count too high, but no benefit beyond about 8-16 threads.
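Roughly the kind of local measurement behind these numbers (a sketch, not the exact commands used; `failures.csv` stands in for the large file being compressed, and timings vary by machine):

```sh
# Compare default xz, xz -0, and xz -0 -T8 on the same input.
for opts in "" "-0" "-0 -T8"; do
    echo "== xz $opts =="
    /usr/bin/time -p xz $opts -c failures.csv > failures.csv.xz
    wc -c < failures.csv.xz
done
```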

lzma, gzip, bzip2, zip are all worse.

Also, use grep instead of awk to extract failures, which saves a few more seconds.
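As a sketch (the exact commands aren't quoted in this description; this assumes the usual semicolon-separated outcome format with the PASS/FAIL result in the fifth field), the switch amounts to something like:

```sh
# Before: awk matches the result field exactly, but is slower on a large file.
# (The field number is an assumption about the outcome-file layout.)
awk -F';' '$5 == "FAIL"' outcomes.csv > failures.csv

# After: grep in the C locale is much faster, but matches ";FAIL;" anywhere on
# the line -- the limitation the review below asks to have documented.
LC_ALL=C grep ';FAIL;' outcomes.csv > failures.csv
```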

CI run: https://jenkins-mbedtls.oss.arm.com/blue/organizations/jenkins/mbed-tls-pr-head/detail/PR-8530-head/9/pipeline/

@daverodgman added labels on Jan 2, 2024: enhancement (New feature or request), needs: ci, needs: review, priority-high, size-xs (Estimated task size: extra small, a few hours at most)
@gilles-peskine-arm (Contributor) left a comment:

OK to switch to grep, but only if the limitation is documented.

Not ok to use more threads than we have cores.

vars/analysis.groovy: two further review threads (outdated, resolved)
```diff
 # Compress the failure list if it is large (for some value of large)
 if [ "$(wc -c <failures.csv)" -gt 99999 ]; then
-xz failures.csv
+xz -0 -T8 failures.csv
```
@gilles-peskine-arm (Contributor):

Why -T8? AFAIK we don't have 8-core executors on the CI.

Beware of insisting on too much parallelism: this can create a load spike that slows down other executors, resulting in worse performance overall. You can't measure the impact just by running one job.

> Using -T8 (8 threads) reduces it further to 2.6s locally (expect ~5s on CI). There's no penalty for setting the thread count too high, but no benefit beyond about 8-16 threads.

Measuring on a machine with a different number of cores is not a useful way to estimate performance on the CI.

@daverodgman (Contributor, Author):

Well, -T1000 does not degrade performance locally, so my assumption is that -T8 is fine and will benefit if the CI improves in the future. It looks like xz caps the thread count at the CPU count (locally, I see improvement up to -T8), so my expectation is that this will work well on the CI.

I am aware of the load-spike concern, but as this is a serial stage, it is a win to steal CPU capacity from the parallel stages.

@gilles-peskine-arm (Contributor):

I still don't see a reason to use -T8. Why not -T0?

As far as I understand, the actual number of threads is limited by memory consumption (which I don't expect to matter at -0), by the -T value, and by the file size, but not by the number of cores. Looking at the xz source code, lzma_cputhreads() is only called to determine what hardware_threads_get() will return when using -T0; after that the encoder relies only on hardware_threads_get(), not lzma_cputhreads().

Since 8 is an arbitrary number here, I'd want to see some benchmark that justifies it. Otherwise please use -T0 (the number of cores seems like a sensible default choice).
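For reference, the thread count that -T0 resolves to can be probed with liblzma directly. A minimal C sketch (assuming the liblzma development headers are installed; build with cc probe.c -llzma):

```c
#include <lzma.h>
#include <stdio.h>

int main(void)
{
    /* lzma_cputhreads() is what xz consults for -T0; it returns 0 if the
     * number of CPU threads cannot be determined, in which case xz falls
     * back to a single thread. */
    uint32_t n = lzma_cputhreads();
    if (n == 0)
        printf("CPU thread count unknown; -T0 would use 1 thread\n");
    else
        printf("lzma_cputhreads() = %u\n", (unsigned) n);
    return 0;
}
```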

@gilles-peskine-arm (Contributor):

This is still unaddressed.

```diff
 # Compress the failure list if it is large (for some value of large)
 if [ "$(wc -c <failures.csv)" -gt 99999 ]; then
-xz failures.csv
+xz -0 -T8 failures.csv
```
@gilles-peskine-arm (Contributor):

> lzma, gzip, bzip2, zip are all worse.

I'd expect them to result in worse compression, but maybe we're OK with less compression in exchange for less CPU consumption?

@daverodgman (Contributor, Author):

The trade-off looked unfavourable in all cases. The closest was probably gzip, which is faster (in single-threaded mode; it doesn't have a multi-threaded mode) but produces output almost twice as big, so xz -T<n> wins overall.
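Roughly the comparison behind this conclusion (a sketch; numbers are machine-dependent):

```sh
# gzip: single-threaded and fast, but the output is nearly twice the size.
time gzip -c failures.csv | wc -c
# xz -0 with threads: competitive speed and noticeably smaller output.
time xz -0 -T0 -c failures.csv | wc -c
```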

@gilles-peskine-arm (Contributor):

Multithreaded mode may be counterproductive anyway, but have you tried pigz? (In a realistic scenario, not on your machine!)

@daverodgman (Contributor, Author):

pigz -1 does look considerably better for both perf (0.36s locally) and size (25 MB, about 2x better).
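For comparison, the pigz equivalent would be something along these lines (a sketch; assumes pigz is available on the executors, which is the extra CI work tracked in #150):

```sh
# pigz: parallel gzip; -1 is the fastest preset, -p caps the worker threads.
# Produces failures.csv.gz, readable with ordinary gzip/zcat.
pigz -1 -p "$(nproc)" failures.csv
```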

@gilles-peskine-arm (Contributor):

By the way, we need to keep in mind that another element of the compromise is storage. That's the reason we compress, not to save on download time/bandwidth. A vast majority of outcome files are never consumed, but all are stored for some time. I don't know how much we're spending on storage, and in particular how the (compressed) outcome files compare with logs.

@daverodgman (Contributor, Author):

We can always up the storage budget if we need to. As it stands, an extra 18 MB / PR run is unlikely to cause big budget issues IMO.

@daverodgman (Contributor, Author) commented Jan 2, 2024:

I'm tracking the pigz suggestion (thank you) as a separate follow-up in #150, since that requires additional CI work beyond the scope of this PR.

@tom-cosgrove-arm dismissed their stale review on January 2, 2024, 13:51

LC_ALL in wrong place

Signed-off-by: Dave Rodgman <[email protected]>
@tom-cosgrove-arm (Contributor) left a comment:

LGTM

@tom-cosgrove-arm (Contributor):

CI is green

@gilles-peskine-arm are you happy enough with this now to approve?

Signed-off-by: Dave Rodgman <[email protected]>
@daverodgman (Contributor, Author):

@gilles-peskine-arm I have updated to use -T0, please review
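With that change, the compression step from the hunk above presumably ends up looking like this (a sketch reconstructed from the diff and the -T0 discussion, not copied from the final file):

```sh
# Compress the failure list if it is large (for some value of large)
if [ "$(wc -c <failures.csv)" -gt 99999 ]; then
    xz -0 -T0 failures.csv
fi
```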

@gilles-peskine-arm (Contributor) left a comment:

Looks good to me on code inspection.

Ok to merge with a passing CI run (either release or PR will do).

@gilles-peskine-arm added labels needs: ci and approved (Approved in review. May need additional CI.) and removed needs: review on Jan 3, 2024
@daverodgman merged commit 865d01b into Mbed-TLS:master on Jan 4, 2024 (1 check passed).
@daverodgman deleted the compression_perf branch on January 4, 2024, 14:47.
Labels: approved (Approved in review. May need additional CI.), enhancement (New feature or request), priority-high, size-xs (Estimated task size: extra small, a few hours at most)
Project status: CI perf
3 participants