Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RTC Address Patching #802

Closed
mondus opened this issue Feb 18, 2022 · 10 comments
Closed

RTC Address Patching #802

mondus opened this issue Feb 18, 2022 · 10 comments

Comments

@mondus
Copy link
Member

mondus commented Feb 18, 2022

RTC implementation currently requires addresses of agent and message vectors to be passed at runtime to avoid recompiling as these addresses change. They change frequently via swaps or whenever they are resized. Baking the raw address into the binary would be much faster. It would allow the compiler to provide a direct lookup in memory and reduce latency of offsetting into the runtime address which must be either cached in registers or looked-up from memory.

This has not previously been explores as RTC compilation and module loading is slooooow. There are two separate issues however. Compilation can be speeded up (#602) AND the resulting ptx could be patched directly and module creation time to embed the address directly.

nvrtc supports this patching with the cuModuleLoadDataEx function which has an CUjit_option. The option allows (amongst other things) CU_JIT_GLOBAL_SYMBOL_NAMES and CU_JIT_GLOBAL_SYMBOL_ADDRESSES. It effectively means that once you have ptx you can create a module by patching symbol addresses. E.g. We can store the compiled ptx and recreate the module with patched values whenever they change.

Jitify supports the use of CUjit_options although the interface for accessing this is unclear.

@Robadob
Copy link
Member

Robadob commented Feb 18, 2022

Jitify supports the use of CUjit_options although the interface for accessing this is unclear.

CUDAKernel is an internal class, best I can see. You would need to reimplement the deserialize code I presume and call CUDAKernel yourself directory (or modify Jitify to expose the options and put in a pull request).

@mondus
Copy link
Member Author

mondus commented Feb 18, 2022

or modify Jitify to expose the options and put in a pull request

This would be the preferred option.

@Robadob
Copy link
Member

Robadob commented Feb 18, 2022

My primary concern for this is how we would represent the arrays as symbols.

__device__ void *my_variable_array; Is still a symbol holding a pointer that has to be looked up.

So really we'd want something like __device__ char my_variable_array[1];, and then read 'out of bounds', knowing it will still be in bounds. As a device symbol array requires a fixed size. Or some other thing, where we make a false symbol point to memory we want, then we reference it and reinterpret the pointer as what we actually want.

I doubt CUDA would report this out of bounds, and prevent us from doing so. But there's always the potential that it would cause the compiler to optimise differently.

@Robadob
Copy link
Member

Robadob commented Feb 21, 2022

Would be fairly trivial to expose CUjit options on deserialize, just need to forward them through the lower 2 items of that list. Have applied that here: https://github.com/Robadob/jitify/tree/expose_cujit

It's niche enough, that they may request a usage example if a PR is created.

Similarly, we'd require some bigger fgpu2 changes, to modify curve to instead deserialize with specific options. Worth doing a smaller proof of concept first.

@Robadob
Copy link
Member

Robadob commented Feb 22, 2022

I started writing a proof of concept here.

When executed, and the call to KernelInstantiation::deserialize() is triggered, CUDA_ERROR_NOT_SUPPORTED is returned by cuModuleLoadDataEx().

image

This isn't among the documented return values, so it's not clear if that's a documentation issue, I've missed something, or I did make a mistake.

There's a sample which passes (different) Cujit options here, and I can't see any mistakes in what I've done at a glance.

Googling one of the enums, the only relevant result (e.g. not documentation or headers) seems to be this comment on an old numba issue. Where someone suggests it's not clearly documented, but actually appears to refer to copying data from host to device. So it's not clear whether anyone else has used this particular cutjit functionality.

Submitting a bug report, with the proof of concept (and stating it gives an undocumented return value), might get you a clearer explanation.

@Robadob
Copy link
Member

Robadob commented Feb 23, 2022

One option I didn't try, is passing -rdc=true, and extern the device symbol. (Extern-ing without rdc is redundant, get a warning saying it will be treated as static)

Nope, just tested it no warning now but same unsupported return code.

@mondus
Copy link
Member Author

mondus commented Feb 24, 2022

Same unsupported return code when doing this directly in nvrtc and outside of jitify: https://github.com/mondus/CUJIT_Experiment

@mattmartineau
Copy link

I have asked internally about this and I'll let you know when I hear back.

If my current approach doesn't work I'll file a bug internally so someone can explain what is happening. If it comes to that do you approve my sharing of the reproducer you discuss above?

@mondus
Copy link
Member Author

mondus commented May 27, 2022

@mattmartineau Of course. No problem.

@Robadob
Copy link
Member

Robadob commented Jul 3, 2023

This was confirmed by @mattmartineau as not possible.

@Robadob Robadob closed this as completed Jul 3, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants