
refactor(util): make RampingValue vectorizer-friendly #4770

Merged

Conversation

@Swiftb0y (Member)

Previously, invocations of RampingValue::getNext() caused inter-
dependence between loop iterations, making vectorization
impossible. This approach removes the state from RampingValue
and thus also the loop-iteration interdependence. getNext()
has been replaced by getNth(int step).
While multiplication (getNth) is technically more expensive than
addition (getNext), the possibility of vectorization results in a
1.1x to 6x speedup, depending on optimizer aggressiveness.

See this microbenchmark: https://www.quick-bench.com/q/PHqdbeYORuRS_x6hLtN-tBgvV0g
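
For illustration, here is a minimal sketch of the two interfaces (names and semantics are illustrative, not necessarily identical to the actual Mixxx class):

```cpp
// Minimal sketch of the idea behind this change; not the actual Mixxx class.
template<typename T>
class RampingValueSketch {
  public:
    constexpr RampingValueSketch(T initial, T target, int steps)
            : m_start(initial),
              m_increment((target - initial) / static_cast<T>(steps)) {
    }

    // Old style: mutable state creates a dependency between loop iterations,
    // which blocks autovectorization of the calling loop.
    T getNext() {
        m_start += m_increment;
        return m_start;
    }

    // New style: each value depends only on the step index, so the calling
    // loop has no loop-carried state and can be vectorized (and the method
    // can be constexpr).
    constexpr T getNth(int step) const {
        return m_start + m_increment * static_cast<T>(step + 1);
    }

  private:
    T m_start;
    T m_increment;
};

// Example caller: a gain ramp over a buffer of samples.
void applyRamp(float* buffer, int numSamples, float oldGain, float newGain) {
    const RampingValueSketch<float> ramp(oldGain, newGain, numSamples);
    for (int i = 0; i < numSamples; ++i) {
        buffer[i] *= ramp.getNth(i); // no dependence on iteration i - 1
    }
}
```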

@uklotzde (Contributor) left a comment

Nice finding. Please add comments to prevent anyone from undoing these changes in the future! The reason why getNth() is used is not obvious.

Target 2.3 instead of main?

@Be-ing (Contributor) commented May 26, 2022

Interesting... I'm confused how this is vectorizable considering one of the inputs to the calculation changes with every iteration.

@Swiftb0y (Member, Author) commented May 26, 2022

Nice finding. Please add comments to prevent anyone from undoing these changes in the future! The reason why getNth() is used is not obvious.

Target 2.3 instead of main?

Sure.

Interesting... I'm confused how this is vectorizable considering one of the inputs to the calculation changes with every iteration.

If I understand correctly, you are asking why this is vectorizable even though getNth depends on the loop counter? I can't really come up with a good answer myself, to be honest. Most compilers can explain their reasoning for (not) vectorizing; for example, GCC does so with -ftree-vectorizer-verbose=1 or -fopt-info-vec-missed. Using that, you can see that the autovectorizer fails because of m_value += m_increment, since that creates state that depends on the previous loop iteration.
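
To make that concrete, here is a hypothetical pair of loops (not Mixxx code) showing the difference the vectorizer report complains about; without -ffast-math the compiler is not allowed to re-associate the floating-point accumulation of the first loop into the index-based form of the second:

```cpp
// Loop-carried dependence: the value in iteration i needs the value from
// iteration i - 1. Without -ffast-math, GCC cannot legally rewrite the
// floating-point accumulation, and -fopt-info-vec-missed reports the loop
// as not vectorized.
void rampStateful(float* out, int n, float start, float increment) {
    float value = start;
    for (int i = 0; i < n; ++i) {
        value += increment; // state carried across iterations
        out[i] *= value;
    }
}

// No loop-carried state: each iteration is computed from the index alone,
// so the iterations are independent and the loop can be vectorized.
void rampIndexed(float* out, int n, float start, float increment) {
    for (int i = 0; i < n; ++i) {
        out[i] *= start + increment * static_cast<float>(i + 1);
    }
}
```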

@Swiftb0y (Member, Author)

Nice finding. Please add comments to prevent anyone from undoing these changes in the future! The reason why getNth() is used is not obvious.

I already explained the reasoning quite verbosely in the commit message; should I just copy the same into the source file?

@uklotzde (Contributor)

Nice finding. Please add comments to prevent anyone from undoing these changes in the future! The reason why getNth() is used is not obvious.

I already explained the reasoning quite verbosely in the commit message; should I just copy the same into the source file?

Ask yourself: do you read the commit messages for the code you are currently editing? Everything that belongs to the code goes into the code base. Commit messages only describe the transition, not the current state (in terms of a state machine).

@Swiftb0y Swiftb0y force-pushed the constexpr-vectorizer-friendly-rampingvalue branch from 5b90361 to 06e4108 Compare May 26, 2022 22:40
@Swiftb0y Swiftb0y changed the base branch from main to 2.3 May 26, 2022 22:40
@daschuer (Member) left a comment

LGTM

However, none of the loops is vectorized.
A vectorized loop needs to have "++i", and i needs to be a signed int (we use SINT).

I have played a bit with this, but these loops are all too big or contain function calls that cannot be vectorized.

@@ -454,7 +454,7 @@ void MixxxPlateX2::processBuffer(const sample_t* in, sample_t* out, const uint f

     // loop through the buffer, processing each sample
     for (uint i = 0; i + 1 < frames; i += 2) {
-        sample_t mono_sample = send.getNext() * (in[i] + in[i + 1]) / 2;
+        sample_t mono_sample = send.getNth(i / 2) * (in[i] + in[i + 1]) / 2;
         PlateStub::process(mono_sample, decay, &out[i], &out[i+1]);
Member

This loop is not vectorized. Vectorization is not possible here because of the process() call.

@@ -191,12 +191,14 @@ void EchoEffect::processChannel(const ChannelHandle& handle, EchoGroupState* pGr
pGroupState->prev_feedback,
bufferParameters.framesPerBuffer());

int rampIndex = 0;
//TODO: rewrite to remove assumption of stereo buffer
for (SINT i = 0;
Member

not vectorized

@@ -195,13 +195,15 @@ void FlangerEffect::processChannel(const ChannelHandle& handle,
CSAMPLE* delayLeft = pState->delayLeft;
CSAMPLE* delayRight = pState->delayRight;

int rampIndex = 0;
for (SINT i = 0;
Member

not vectorized

@daschuer (Member)

@uklotzde merge?

@Be-ing (Contributor) commented May 26, 2022

A vectorized loop needs to have "++i" and i needs to be a signed int (we use SINT)

That means iterating over samples instead of frames... ugly :/

@Swiftb0y (Member, Author) commented May 26, 2022

However, none of the loops is vectorized.

Yes, none of the loops in the built-in effects are actually vectorized because of other interdependencies. I only changed what I needed so they fit the new API (since the mutable getNext approach was not constexpr friendly).

@daschuer (Member)

A vectorized loop needs to have "++i" and i needs to be a signed int (we use SINT)

That means iterating over samples instead of frames... ugly :/

No, not at all, see:

for (int i = 0; i < numSamples / 2; ++i) {
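
Spelled out as a hypothetical sketch (not code from this PR), the body of such a frame-indexed loop then handles both channels of the frame explicitly:

```cpp
// Hypothetical frame-indexed gain ramp over an interleaved stereo buffer.
void applyGainRamp(float* samples, int numSamples, float start, float increment) {
    const int numFrames = numSamples / 2;
    for (int i = 0; i < numFrames; ++i) {
        const float gain = start + increment * static_cast<float>(i + 1);
        samples[2 * i] *= gain;     // left channel of frame i
        samples[2 * i + 1] *= gain; // right channel of frame i
    }
}
```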

@Swiftb0y (Member, Author)

A vectorized loop needs to have "++i" and i needs to be a signed int (we use SINT)

Can you elaborate? IMO neither of these preconditions is necessary. Iterating using a size_t should be just as valid as iterating over a uint64_t or ptrdiff_t (which is what SINT actually is). Much more important for vectorization is that the data is actually aligned (to the size of the vector registers, not the contained values); otherwise the CPU is either much slower or aborts straight away.

That means iterating over samples instead of frames... ugly :/

There are different trade-offs between iterating over adjacent frames and iterating over samples. It's a similar trade-off to the array-of-structs vs. struct-of-arrays problem encountered in data-driven design. The performance usually depends on the access patterns to the data (because of the cache). The struct-of-arrays approach is less OOP-like but allows for vectorization in many more cases (though it is not strictly required, since modern CPUs can load data into their vector registers with variable offsets). It's also cache-friendlier when iterating over arrays sequentially instead of in parallel. I think the design more common in the audio industry is to have separate arrays of audio data instead of our interleaved design; this would also make it much easier to migrate our engine away from the stereo-audio assumption.
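
As a sketch of the two layouts (hypothetical types, not Mixxx's actual buffer classes):

```cpp
#include <vector>

// Interleaved, "array of structs": channels alternate within one array.
// samples = [L0, R0, L1, R1, L2, R2, ...]
struct InterleavedStereoBuffer {
    std::vector<float> samples; // size == 2 * numFrames
};

// Planar, "struct of arrays": one contiguous array per channel.
// channels[0] = [L0, L1, L2, ...], channels[1] = [R0, R1, R2, ...]
struct PlanarBuffer {
    std::vector<std::vector<float>> channels; // channels[channel][frame]
};
```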

@Be-ing (Contributor) commented May 26, 2022

 for (int i = 0; i < numSamples / 2; ++i) { 

That requires the loop implementation to have a line of code handling each channel, exacerbating the assumption of stereo everywhere.

@daschuer (Member)

Can you elaborate?

In theory you are right. But my experience when optimizing /src/util/sample.cpp was that GCC's pattern recognition works reliably with a signed iterator advancing in single steps. It might be that the compiler has improved since then, but as a rule of thumb it should still be valid.

I think the design which is more present in the audio industry would be to have separate arrays of audio data instead of our interleaved design, this would also make it much easier to migrate our engine away from the stereo-audio assumption.

I think GStreamer changed this at some point in its history for exactly that reason. It will become a topic in parts of Mixxx as well when we introduce N-channel support (stems).

@Swiftb0y (Member, Author)

That requires the loop implementation to have a line of code handling each channel, exacerbating the assumption of stereo everywhere.

If you mean that we would need a buffer[channel][sample] array: not necessarily. If your audio processing code is channel-independent / mono, you simply add another nested loop that iterates over every channel.
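
For example, a hypothetical channel-independent process over a planar buffer[channel][frame] layout:

```cpp
// Hypothetical channel-independent processing over a planar buffer.
// The inner loop is a plain contiguous sweep, which is easy to vectorize.
void applyGain(float** buffer, int numChannels, int numFrames, float gain) {
    for (int ch = 0; ch < numChannels; ++ch) {
        for (int i = 0; i < numFrames; ++i) {
            buffer[ch][i] *= gain;
        }
    }
}
```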

@Swiftb0y (Member, Author)

But my experience when optimizing /src/util/sample.cpp was that GCC's pattern recognition works reliably with a signed iterator advancing in single steps

I wonder which GCC version these observations were made with. I'm sure they no longer apply in GCC 12. At some point it doesn't make sense to keep trying to game the autovectorizer; instead, we should just use SIMD intrinsics (or rather a wrapping library like xsimd to stay portable).
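
For illustration, a sketch of the explicit-SIMD route using raw SSE intrinsics (xsimd would wrap the same idea portably; a real implementation would also need a scalar tail loop for the remaining n % 4 samples):

```cpp
#include <xmmintrin.h> // SSE

// Hypothetical explicit-SIMD gain ramp: four samples per iteration.
void rampSse(float* out, int n, float start, float increment) {
    const __m128 step = _mm_set1_ps(4.0f * increment);
    __m128 gain = _mm_setr_ps(start + 1.0f * increment,
            start + 2.0f * increment,
            start + 3.0f * increment,
            start + 4.0f * increment);
    for (int i = 0; i + 3 < n; i += 4) {
        const __m128 samples = _mm_loadu_ps(out + i);
        _mm_storeu_ps(out + i, _mm_mul_ps(samples, gain));
        gain = _mm_add_ps(gain, step); // advance all four lanes of the ramp
    }
    // Remaining samples (n % 4) would be handled by a scalar tail loop.
}
```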

@daschuer (Member)

Luckily, GCC is able to tell us whether a loop is vectorized when we create new loops.

I have commented all checked functions with
// note: LOOP VECTORIZED.
as a warning not to touch them without rechecking that vectorization is still applied.
According to the comments, this was done with GCC 7.5.

@daschuer (Member)

By the way, originally some of these functions were written in inline assembler using the MMX registers.
This did not scale well with newer instruction sets, which was the reason to let the compiler decide which registers to use.

@Swiftb0y (Member, Author)

I think something in our CI is wrong. The last commit clearly did not build, yet GitHub says "All checks have passed".

@Swiftb0y Swiftb0y force-pushed the constexpr-vectorizer-friendly-rampingvalue branch from 06e4108 to 750bf31 Compare May 27, 2022 00:52
@Swiftb0y (Member, Author)

Merge?

@daschuer (Member) commented Jun 2, 2022

Yes, thank you.

@daschuer daschuer merged commit 55c5e7e into mixxxdj:2.3 Jun 2, 2022
@Swiftb0y Swiftb0y deleted the constexpr-vectorizer-friendly-rampingvalue branch June 2, 2022 22:26
@daschuer daschuer added this to the 2.4.0 milestone Jun 21, 2022