Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow ncclDebugLog to print messages longer than 1024 bytes #1215

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

nzmsv
Copy link

@nzmsv nzmsv commented Mar 6, 2024

Instead of copying the message into the 1KB buffer, use the buffer to create a prefixed format string and then directly pass that to vfprintf. This makes it possible to print larger messages which can be handy for debugging.

@sjeaugey
Copy link
Member

sjeaugey commented Mar 7, 2024

Can you explain concretely why it's better than it was? I.e. what didn't work before that works now, and how it's achieved.

Also, from a quick glance at the code, it looks like when len == sizeof(buffer) we write \0 in buffer[len], which is out of bounds.

@nzmsv
Copy link
Author

nzmsv commented Mar 7, 2024

Thank you for looking at this!

Here's the use case:

char long_msg[1024];
memset(long_msg, 'A', sizeof(long_msg));
long_msg[1023] = 'B';
WARN("This is a long string: %s", long_msg);

Before the patch this would print:

[0] NCCL WARN This is a long string: AAAAAAAAAAA...AA (clipped to 1024 - prefix length)

After the patch this would print:

[0] NCCL WARN This is a long string: AAAAAAAAAAA...AAAAAAB (not clipped)

How is this achieved: long strings are not copied into the on-stack buffer and are passed directly to vfprintf, so there is no limit on the length of the string arguments.

It may seem a bit silly to support such long strings, but I ran into this in a real use case: printing out some JSON.

@nzmsv
Copy link
Author

nzmsv commented Mar 7, 2024

Also, from a quick glance at the code, it looks like when len == sizeof(buffer) we write \0 in buffer[len], which is out of bounds.

For len == sizeof(buffer) the following check will trigger:

if (len > sizeof(buffer)-1) {

so we'll move len back one character.

@sjeaugey
Copy link
Member

sjeaugey commented Mar 8, 2024

Ok, it took me a long time but I now understand what we do.

Instead of developping fmt with the variable arguments inside buffer, we append fmt to the buffer, and replace fmt % arguments by the va args only in the last printf to the output.

That works, but then a lot of the rest of the code seems incorrect. If we keep fmt in the buffer, then truncation could cause issues, as fmt could be missing arguments and break the final printf. Are we sure printf can handle well cases where we provide too many arguments and it's safe?

I think we should fail (or just print an error message) if the Header + fmt doesn't fit within the buffer size; that will simplify the code a lot. We don't want to truncate, rewind, add \n, \0 etc. So I'd want some refactoring of this code if we change our strategy completely, to make it clean and safe.

@nzmsv
Copy link
Author

nzmsv commented Mar 8, 2024

printf can safely handle too many arguments for the format; it's not having enough that's a problem (https://stackoverflow.com/questions/3578970/passing-too-many-arguments-to-printf).

The only weird case would be when we truncate the format string after a % but before the format specifier. This would result in an unsubstituted % in the output.

But I agree that this could be simplified a lot. Let me take another stab at it.

@nzmsv
Copy link
Author

nzmsv commented Mar 12, 2024

Cleaned this up a bit. I believe this version is always safe, even if the format string is too long.

@nzmsv
Copy link
Author

nzmsv commented Dec 16, 2024

Other uses also run into this limitation: #825

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants