vulkan: copy iq4_nl LUT into shared memory #10409
It seems AMD and Intel suffer from the same issue. I wasn't aware of LDC divergence at all. Do you know why CUDA and ROCm don't run into this issue? Do they move the static LUT into shared memory automatically?
Radeon Pro VII
Before:
After:
Intel Arc A770
Before:
After:
NVIDIA RTX 3090
Before:
After:
|
Interesting, I didn't know AMD/Intel had similar issues.
I'm not sure. I see the LUT uses |
That's probably the case. There is some context here: #4773 (comment) |
Ah yeah, interesting how these kinds of details show up across APIs. |
NVIDIA hardware has performance issues with non-uniform indexing of constant loads (e.g. see https://resources.nvidia.com/en-us-nsight-developer-tools-mc/en-us-nsight-developer-tools/ldc-divergence). The iq4_nl lookup table was suffering from this. This change copies the LUT into shared memory.
I'm guessing this is probably about neutral on AMD/Intel, but I'm not sure.
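For readers unfamiliar with the pattern, here is a rough CPU-side C++ model of the idea. It is a sketch, not the shader code from this PR. The table values are the iq4_nl codebook as I understand it from ggml. The helper names (`copy_lut_to_shared`, `distinct_indices_in_warp`), the 32-wide warp, and the use of "distinct indices per warp" as a stand-in for how many times a divergent constant load gets serialized are all assumptions made for illustration.

```cpp
#include <array>
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// iq4_nl codebook (16 signed 8-bit values), as defined in ggml to my knowledge.
constexpr std::array<int8_t, 16> kvalues_iq4nl = {
    -127, -104, -83, -65, -49, -35, -22, -10, 1, 13, 25, 38, 53, 69, 89, 113};

// Assumption: 32 invocations per warp/subgroup, as on NVIDIA hardware.
constexpr std::size_t kWarpSize = 32;

// Hypothetical model of the cooperative copy. "Thread" t of the workgroup
// copies entries t, t + workgroup_size, ... so every entry is written once.
// In a real shader the destination would be a shared array and a workgroup
// barrier would follow before any invocation reads from it.
std::array<int8_t, 16> copy_lut_to_shared(std::size_t workgroup_size) {
    assert(workgroup_size > 0);
    std::array<int8_t, 16> shared_lut{};
    for (std::size_t t = 0; t < workgroup_size; ++t) {
        for (std::size_t i = t; i < shared_lut.size(); i += workgroup_size) {
            shared_lut[i] = kvalues_iq4nl[i];
        }
    }
    return shared_lut;
}

// Counts how many different LUT indices one warp requests at the same time.
// With a constant-memory LUT, a value above 1 means the lanes do not share a
// single address; this count is used here as a rough proxy for the cost.
std::size_t distinct_indices_in_warp(const std::vector<uint8_t>& nibbles,
                                     std::size_t warp) {
    std::bitset<16> seen;
    const std::size_t begin = warp * kWarpSize;
    for (std::size_t lane = 0;
         lane < kWarpSize && begin + lane < nibbles.size(); ++lane) {
        seen.set(nibbles[begin + lane] & 0xF);  // low 4 bits select the entry
    }
    return seen.count();
}
```

Quantized weights are effectively random 4-bit indices, so a warp will usually touch most of the 16 entries at once. That is the non-uniform case described above, and it is why the lookups are served from the shared-memory copy instead.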