-
-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use direct load and store instead of memcpy in simple cases #23352
Conversation
@nanosoldier |
Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @ararslan |
// If the types are small and simple, use load and store directly. | ||
// Going through memcpy can cause LLVM (e.g. SROA) to create bitcasts between float and int | ||
// that interferes with other optimizations. | ||
if (sz <= 64) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
1-2 cache lines seems to big. Wouldn't 8-16 bytes make more sense?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
1 cache line at most? This is just the max vector size we need to deal with atm.
auto dstel = dstty->getElementType(); | ||
|
||
bool direct = false; | ||
if (srcel->isSized() && srcel->isSingleValueType() && DL.getTypeStoreSize(srcel) == sz) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code of this form is why
These can cause LLVM (e.g. SROA) to emit unnecessary bitcast's that interfere with other optimizations.
happens. The desired type of the slot (float or int) should be an argument to this function, which should also help make the logic much simpler here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAICT it happens because of the pointer bitcast to i8*
. That's why bitcasts are removed on the caller side. On the caller side, after the bitcast to i8*
is removed, the desired types are already in the type of the pointer.
3df2de0
to
1546621
Compare
* Create emit_memcpy wrapper. * Simplify handling of `jl_cgval_t` on the caller side * Do some optimizations to avoid emitting memcpy for simple types. These can cause LLVM (e.g. SROA) to emit unnecessary bitcast's that interfere with other optimizations.
1546621
to
249d629
Compare
Rebased. |
Create emit_memcpy wrapper.
Simplify handling of
jl_cgval_t
on the caller sideDo some optimizations to avoid emitting memcpy for simple types.
These can cause LLVM (e.g. SROA) to emit unnecessary bitcast's that interfere with other optimizations.
This is backported from #23240 where I saw a vectorization regression due to excess
bitcast
in the loop generated bysroa
andinstcombine
. Not sure how this can be triggered otherwise.