Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use direct load and store instead of memcpy in simple cases #23352

Merged
merged 1 commit into from
Sep 19, 2017

Conversation

yuyichao
Copy link
Contributor

  • Create emit_memcpy wrapper.

  • Simplify handling of jl_cgval_t on the caller side

  • Do some optimizations to avoid emitting memcpy for simple types.

    These can cause LLVM (e.g. SROA) to emit unnecessary bitcast's that interfere with other optimizations.

This is backported from #23240 where I saw a vectorization regression due to excess bitcast in the loop generated by sroa and instcombine. Not sure how this can be triggered otherwise.

@yuyichao yuyichao requested review from vtjnash and Keno August 19, 2017 15:21
@yuyichao
Copy link
Contributor Author

@nanosoldier runbenchmarks(ALL, vs=":master")

@nanosoldier
Copy link
Collaborator

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @ararslan

// If the types are small and simple, use load and store directly.
// Going through memcpy can cause LLVM (e.g. SROA) to create bitcasts between float and int
// that interferes with other optimizations.
if (sz <= 64) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1-2 cache lines seems to big. Wouldn't 8-16 bytes make more sense?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 cache line at most? This is just the max vector size we need to deal with atm.

auto dstel = dstty->getElementType();

bool direct = false;
if (srcel->isSized() && srcel->isSingleValueType() && DL.getTypeStoreSize(srcel) == sz) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code of this form is why

These can cause LLVM (e.g. SROA) to emit unnecessary bitcast's that interfere with other optimizations.

happens. The desired type of the slot (float or int) should be an argument to this function, which should also help make the logic much simpler here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAICT it happens because of the pointer bitcast to i8*. That's why bitcasts are removed on the caller side. On the caller side, after the bitcast to i8* is removed, the desired types are already in the type of the pointer.

* Create emit_memcpy wrapper.
* Simplify handling of `jl_cgval_t` on the caller side
* Do some optimizations to avoid emitting memcpy for simple types.

  These can cause LLVM (e.g. SROA) to emit unnecessary bitcast's that interfere with other
  optimizations.
@yuyichao
Copy link
Contributor Author

Rebased.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants