[plugins] speedup big journal collection #3873

Closed

Contributor

@champtar champtar commented Dec 11, 2024

Instead of having journalctl format all the logs,
estimate the number of lines of logs needed to fill the log_size limit.
On an EL 9.5 server with an SSD and 500MB of formatted logs, the logs plugin runtime goes from 32s to 11s.

Fixes #3615

# journalctl --version
systemd 252 (252-46.el9_5.2.alma.1)
# journalctl --no-pager | wc -c
533252817
# journalctl --disk-usage 
Archived and active journals take up 1.6G in the file system.
# sos report -o logs --batch -vvv
### Before patch
[plugin:logs] collected plugin 'logs' in 32.05628514289856
### After patch
[plugin:logs] collected plugin 'logs' in 10.8226797580719
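
A minimal sketch of the line-count estimation idea described above (illustrative only; the helper name, the sampling strategy, and how the 1.5 overshoot is plumbed in are assumptions, not the exact patch):

```python
import subprocess

def estimate_journal_lines(size_limit, sample_lines=1000, overshoot=1.5):
    """Estimate how many journal lines are needed to fill size_limit bytes.

    Samples the most recent formatted lines to get an average line size,
    then overshoots so the later size-based truncation still has enough
    data to trim down to size_limit.
    """
    sample = subprocess.run(
        ["journalctl", "--no-pager", "-q", "-n", str(sample_lines)],
        capture_output=True, check=True,
    ).stdout
    avg_line = len(sample) / max(sample.count(b"\n"), 1)
    return int(size_limit * overshoot / avg_line)

# The collection pass can then ask journalctl for roughly that many lines
# instead of formatting the entire journal, e.g.:
#   journalctl --no-pager -n <estimate> | tail -c <size_limit>
```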

Please place an 'X' inside each '[]' to confirm you adhere to our Contributor Guidelines

  • Is the commit message split over multiple lines and hard-wrapped at 72 characters?
  • Is the subject and message clear and concise?
  • Does the subject start with [plugin_name] if submitting a plugin patch or a [section_name] if part of the core sosreport code?
  • Does the commit contain a Signed-off-by: First Lastname [email protected]?
  • Are any related Issues or existing PRs properly referenced via a Closes (Issue) or Resolved (PR) line?
  • Are all passwords or private data gathered by this PR obfuscated?


Congratulations! One of the builds has completed. 🍾

You can install the built RPMs by following these steps:

  • sudo yum install -y dnf-plugins-core on RHEL 8
  • sudo dnf install -y dnf-plugins-core on Fedora
  • dnf copr enable packit/sosreport-sos-3873
  • And now you can install the packages.

Please note that the RPMs should be used only in a testing environment.

@champtar champtar marked this pull request as draft December 11, 2024 17:23
Instead of having journalctl format all the logs,
estimate the number of lines of logs needed to fill
the log_size limit.
On an EL 9.5 server with an SSD and 500MB of formatted logs,
the logs runtime goes from 32s to 11s.

Signed-off-by: Etienne Champetier <[email protected]>
@champtar champtar marked this pull request as ready for review December 11, 2024 18:09
Member

@TurboTurtle TurboTurtle left a comment


Line-count limiting is never going to be accurate, and will almost certainly result not just in wildly different collections with regard to any limiting put in place (either the default or end-user options), but also in inconsistency between collections within the same run that use the inherited size limit.

There is no control over what a given line's sizing will be, and averaging over a large number of lines doesn't mitigate that - we could have 1000 extremely small lines and then in the next 100, 200, or whatever, have extremely dense lines.
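
As a made-up numeric illustration of that skew (none of these figures come from a real journal):

```python
# 1000 short lines followed by 200 very dense lines, sizes in bytes (made up).
short_lines = [80] * 1000     # routine one-line daemon messages
dense_lines = [4000] * 200    # e.g. stack traces or large JSON payloads

sample_avg = sum(short_lines) / len(short_lines)      # 80 bytes/line
true_avg = sum(short_lines + dense_lines) / 1200      # ~733 bytes/line

# A line count derived from the 80-byte sample would actually pull in
# roughly 9x more bytes than intended once the dense lines start.
print(sample_avg, true_avg, true_avg / sample_avg)
```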

On top of that, this now causes us to execute 2 journalctl commands for every collection, which is inefficient for execution and memory consumption throughout a given sos run.

Runtimes are something we look to improve when they are controlled entirely within our own code, but if an external tool simply takes a certain amount of time to execute when we call it, we shouldn't be jumping through hoops in the name of runtime "optimization" if it means non-deterministic results. At the end of the day, sos is a tool for which high performance is not a more important requirement than stability and predictability.

@champtar
Contributor Author

> Line-count limiting is never going to be accurate, and will almost certainly result not just in wildly different collections with regard to any limiting put in place (either the default or end-user options), but also in inconsistency between collections within the same run that use the inherited size limit.

That is why I use the *1.5 factor but then still apply the size_limit, so it still returns 100MB of logs. There is no inconsistency most of the time: instead of tailing 500MB and keeping 100MB, we tail ~150MB and still keep 100MB.

With my dev server with an SSD I go from 30s to 10s, but I've shown worse cases going from 10min to 2min (#3615 (comment)).

> There is no control over what a given line's sizing will be, and averaging over a large number of lines doesn't mitigate that - we could have 1000 extremely small lines and then in the next 100, 200, or whatever, have extremely dense lines.

That's correct, and that is why I use *1.5: we can end up with less than 100MB, but 100MB is already arbitrary.
In my case speed is more important than reaching the arbitrary 100MB limit.
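
A rough back-of-the-envelope version of that trade-off, using the figures from this thread and an assumed average line size:

```python
# Illustrative numbers only: 100MB limit, assumed ~150-byte average line.
size_limit = 100 * 1024 * 1024   # 100MB journal size limit
avg_line = 150                   # assumed average formatted line size, in bytes
overshoot = 1.5

lines_requested = int(size_limit * overshoot / avg_line)   # ~1.05M lines
bytes_formatted = lines_requested * avg_line                # ~150MB formatted
# journalctl formats ~150MB instead of the full 500MB journal, and the
# existing size_limit tail still trims the result back down to 100MB.
print(lines_requested, bytes_formatted)
```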

> On top of that, this now causes us to execute 2 journalctl commands for every collection, which is inefficient for execution and memory consumption throughout a given sos run.

It's inefficient, but still 3x more efficient than the current solution (or even 5x in more extreme cases).

> Runtimes are something we look to improve when they are controlled entirely within our own code, but if an external tool simply takes a certain amount of time to execute when we call it, we shouldn't be jumping through hoops in the name of runtime "optimization" if it means non-deterministic results. At the end of the day, sos is a tool for which high performance is not a more important requirement than stability and predictability.

For context, I have customers in the field who refuse to grab an sos report before reboot because it's too slow; having their service up is their concern, having data to troubleshoot is my problem :(

I have higher hopes of improving sos execution speed than of changing my customers' behavior, so I'll keep thinking about solutions ;)

@pmoravec
Contributor

This comment does not aim to assess the particular approach in this PR; it deals with a rather generic / philosophical / high-level point of view.

There are contradicting requirements here for how to limit the logs to be collected:

  • by time ("cover last day")
  • by lines / log records ("collect 100k lines of logs to have some minimal statistically significant sample of system behaviour")
  • by amount of data ("collect at most 100MB as I want small tarball")
  • get some data in reasonably small time

As a support engineer analysing a sosreport, my requirements are "limit by time and/or log records".

As a support engineer or customer generating sosreport, my requirement is "limit amount of data" - and also execution time.

The requirements contradict each other; the current code imho follows these priorities, in this order:

  • limit by amount of data by default
  • optionally limit by time (--since option)
  • try to achieve some reasonable execution time

So we focus much more on the generating part than on the analysing part (while this PR aims to balance the contradicting "time vs. space" requirements of the generating part, plus add more weight to the analysing part - which sounds good in general).

Any strategy balancing the requirements will be just a heuristic, with its corner cases where it behaves badly. Jake's arguments are valid - but my devil's advocate can easily find a counter-example where the current approach slowly collects insufficient data.

Also, @champtar's data come from an experiment on just one system. How generic is that system? How differently would this approach behave on different systems? I don't know.

Currently, we do a size-correct execution of something like journalctl --since=boot | tail -c 100MB. Maybe an alternative would be journalctl --since=boot -r | head -c 100MB | tac? It would need the extra tac, which can be difficult to orchestrate, though. And it is yet another approach whose caveats and generic performance I am unable to assess.
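
A rough sketch of that alternative with the final re-reversal done in Python rather than via an external tac (a sketch only; it assumes truncating on a line boundary is acceptable, and it does not address the timeout and multiline concerns raised further below):

```python
import subprocess

def newest_journal_bytes(size_limit):
    """Collect roughly the newest size_limit bytes of journal output.

    Reads journalctl in reverse order (-r), stops once enough bytes have
    been read, then restores chronological order in Python instead of
    piping through tac.
    """
    kept, collected = [], 0
    with subprocess.Popen(["journalctl", "--no-pager", "-q", "-r"],
                          stdout=subprocess.PIPE) as proc:
        for line in proc.stdout:
            kept.append(line)
            collected += len(line)
            if collected >= size_limit:
                proc.terminate()
                break
    return b"".join(reversed(kept))
```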

@pmoravec
Contributor

Yet another idea (kudos to @jcastill) is to have a --journal-lines option that would cause the --journal-size option to be ignored. A user - knowing the system specifics well - could use this option to speed up journal log collection, at the risk of collecting a random-ish amount of MBs of journal logs.

@jcastill
Member

The idea is that we already have the code in place to collect a number of lines of journalctl output via add_journal(), so we could just add a new command-line option and pass it through in any plugin where we want to implement this limitation.
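
A minimal sketch of what that could look like (the plugin and option names are hypothetical, and the option plumbing is illustrative; the only thing taken from this thread is that add_journal() can already limit collection by line count):

```python
from sos.report.plugins import Plugin, IndependentPlugin, PluginOpt


class MyJournalPlugin(Plugin, IndependentPlugin):
    """Illustrative sketch only, not an actual sos plugin."""

    short_desc = 'Journal collection with an optional line cap'
    plugin_name = 'my_journal_plugin'

    option_list = [
        PluginOpt('journal_lines', default=0,
                  desc='collect at most this many journal lines '
                       '(0 = keep the size-based limit)')
    ]

    def setup(self):
        lines = self.get_option('journal_lines')
        if lines:
            # Line-limited collection: faster, but the resulting size
            # varies with the average line length.
            self.add_journal(lines=lines)
        else:
            # Default behaviour: size-limited collection.
            self.add_journal()
```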

@champtar
Contributor Author

champtar commented Dec 12, 2024

Even if arbitrary, I think a default size limit is best:
For operators, it gives a predictable execution time and archive size.
For devs, it gives as much log as possible while keeping operators happy.

> Also, @champtar's data come from an experiment on just one system. How generic is that system? How differently would this approach behave on different systems? I don't know.

I've provided numbers on 3 other systems long ago in #3615 (comment) (click on details).
TL;DR: journalctl is slow, and formatting 1GB of logs to keep 100MB is obviously wasteful.
Note: if I read the sos code correctly, the tail implementation (AsyncReader) uses 2x sizelimit of memory; we can avoid that.
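
For reference, a sketch of a tail that keeps only about sizelimit bytes in memory at a time (illustrative; not the existing AsyncReader code):

```python
from collections import deque

def tail_bytes(stream, size_limit):
    """Keep only the last ~size_limit bytes of a line-oriented byte stream.

    Old lines are evicted as new ones arrive, so peak memory stays close
    to size_limit rather than a multiple of it.
    """
    kept, kept_size = deque(), 0
    for line in stream:
        kept.append(line)
        kept_size += len(line)
        while kept_size > size_limit and len(kept) > 1:
            kept_size -= len(kept.popleft())
    return b"".join(kept)
```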

> Maybe an alternative would be journalctl --since=boot -r | head -c 100MB | tac

There are 2 issues with this approach, which I suggested some time ago (#3615 (comment)):

  • it doesn't handle timeouts properly (we'd end up with nothing; right now we end up with some old logs, but not the latest ones)
  • it breaks multiline logs

I think I can fix both; I will provide a PR when I have something so we can discuss further.

@champtar
Contributor Author

New version: #3879

@champtar champtar closed this Dec 13, 2024

Successfully merging this pull request may close these issues.

[RFE] speedup journal export