Use direct load and store instead of memcpy in simple cases #23352

yuyichao · 2017-08-19T15:20:59Z

* Create `emit_memcpy` wrapper.
* Simplify handling of `jl_cgval_t` on the caller side.
* Do some optimizations to avoid emitting memcpy for simple types.

These can cause LLVM (e.g. SROA) to emit unnecessary bitcasts that interfere with other optimizations.

This is backported from #23240, where I saw a vectorization regression due to excess bitcasts in the loop generated by SROA and instcombine. I'm not sure how this can be triggered otherwise.
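As a rough source-level analogy (not the codegen change itself), the two ways of copying a small value can be sketched in C++. The helper names `copy_via_memcpy`, `copy_via_load_store` and `same_result` are hypothetical and do not appear in the PR; the actual change operates on LLVM IR emitted by Julia's codegen. Both helpers produce the same result, and the difference is only in how much type information the copy carries.

```cpp
#include <cstddef>
#include <cstring>

// Byte-wise copy: the copy itself only sees untyped bytes.
inline void copy_via_memcpy(void *dst, const void *src, std::size_t nbytes) {
    std::memcpy(dst, src, nbytes);
}

// Typed copy: one load and one store of T, so the value keeps its type.
template <typename T>
inline void copy_via_load_store(T *dst, const T *src) {
    *dst = *src;
}

// Both approaches leave the same value in the destination.
inline bool same_result(double x) {
    double a = 0.0;
    double b = 0.0;
    copy_via_memcpy(&a, &x, sizeof x);
    copy_via_load_store(&b, &x);
    return a == b && a == x;
}
```

Calling `same_result(1.5)` exercises both paths on a `double` and checks that they agree.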

yuyichao · 2017-08-19T15:21:15Z

@nanosoldier runbenchmarks(ALL, vs=":master")

nanosoldier · 2017-08-19T19:00:01Z

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @ararslan

vtjnash · 2017-08-21T16:25:49Z

src/cgutils.cpp

+    // If the types are small and simple, use load and store directly.
+    // Going through memcpy can cause LLVM (e.g. SROA) to create bitcasts between float and int
+    // that interferes with other optimizations.
+    if (sz <= 64) {


1-2 cache lines seems too big. Wouldn't 8-16 bytes make more sense?

1 cache line at most? This is just the max vector size we need to deal with atm.

vtjnash · 2017-08-21T16:38:28Z

src/cgutils.cpp

+        auto dstel = dstty->getElementType();
+
+        bool direct = false;
+        if (srcel->isSized() && srcel->isSingleValueType() && DL.getTypeStoreSize(srcel) == sz) {


Code of this form is why "These can cause LLVM (e.g. SROA) to emit unnecessary bitcasts that interfere with other optimizations." happens. The desired type of the slot (float or int) should be an argument to this function, which should also help make the logic much simpler here.

AFAICT it happens because of the pointer bitcast to i8*. That's why bitcasts are removed on the caller side. On the caller side, after the bitcast to i8* is removed, the desired types are already in the type of the pointer.
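The check under discussion can be modelled outside LLVM as a small predicate. This is a minimal sketch under stated assumptions: `ElemType`, `CopyKind`, `fits_directly`, `choose_copy` and `kMaxDirectBytes` are hypothetical names, the struct fields only stand in for `isSized()`, `isSingleValueType()` and `DL.getTypeStoreSize()`, and the fallback to the destination element type is an assumption, since the quoted diff only shows the source-side condition.

```cpp
#include <cstdint>

// Hypothetical stand-in for the few type properties the quoted check reads.
struct ElemType {
    bool sized;                // models srcel->isSized()
    bool single_value;         // models srcel->isSingleValueType()
    std::uint64_t store_size;  // models DL.getTypeStoreSize(srcel)
};

enum class CopyKind { Memcpy, LoadStoreAsSrc, LoadStoreAsDst };

// Size limit taken from the `sz <= 64` condition in the diff.
constexpr std::uint64_t kMaxDirectBytes = 64;

// True if one load/store of type t covers exactly sz bytes.
constexpr bool fits_directly(const ElemType &t, std::uint64_t sz) {
    return t.sized && t.single_value && t.store_size == sz;
}

// Prefer a typed load/store when either pointer's element type matches the
// copy size; otherwise fall back to memcpy.
// Assumption: trying the destination type second is a guess, not the PR's code.
constexpr CopyKind choose_copy(const ElemType &srcel, const ElemType &dstel,
                               std::uint64_t sz) {
    return sz > kMaxDirectBytes         ? CopyKind::Memcpy
           : fits_directly(srcel, sz)   ? CopyKind::LoadStoreAsSrc
           : fits_directly(dstel, sz)   ? CopyKind::LoadStoreAsDst
                                        : CopyKind::Memcpy;
}
```

In this model an 8-byte scalar on the source side selects a direct load and store, while a 128-byte value or an aggregate on both sides still goes through memcpy.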

yuyichao · 2017-09-18T20:36:10Z

Rebased.

yuyichao requested review from vtjnash and Keno August 19, 2017 15:21

vtjnash requested changes Aug 21, 2017

yuyichao force-pushed the yyc/codegen/memcpy branch from 3df2de0 to 1546621 Compare August 25, 2017 01:21

yuyichao mentioned this pull request Sep 18, 2017

Much more aggressive alloc_elim_pass! #23240

Closed

yuyichao force-pushed the yyc/codegen/memcpy branch from 1546621 to 249d629 Compare September 18, 2017 20:35

yuyichao mentioned this pull request Sep 19, 2017

Fix ccall return value boxing on ARM/AArch64 #23739

Merged

yuyichao merged commit 7457622 into JuliaLang:master Sep 19, 2017

yuyichao deleted the yyc/codegen/memcpy branch September 19, 2017 13:44

pchintalapudi mentioned this pull request Feb 23, 2022

Remove uses of PointerType::getElementType for opaque pointers #44310

Merged
