
refactor(util): make RampingValue vectorizer-friendly #4770

Merged

Conversation

@Swiftb0y (Member)

Previously, invocations of RampingValue::getNext() caused inter-
dependence between loop iterations, making vectorization
impossible. This approach removes the state from RampingValue
and thus also the loop-iteration interdependence. getNext()
has been replaced by getNth(int step).
While multiplication (getNth) is technically more expensive than
addition (getNext), the possibility of vectorization results in a
1.1x to 6x speedup, depending on optimizer aggressiveness.

See this microbenchmark: https://www.quick-bench.com/q/PHqdbeYORuRS_x6hLtN-tBgvV0g
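
For illustration, here is a minimal sketch of the two interfaces (names and semantics are illustrative, not necessarily identical to the actual Mixxx class):

```cpp
// Minimal sketch of the idea behind this change; not the actual Mixxx class.
template<typename T>
class RampingValueSketch {
  public:
    constexpr RampingValueSketch(T initial, T target, int steps)
            : m_start(initial),
              m_increment((target - initial) / static_cast<T>(steps)) {
    }

    // Old style: mutable state creates a dependency between loop iterations,
    // which blocks autovectorization of the calling loop.
    T getNext() {
        m_start += m_increment;
        return m_start;
    }

    // New style: each value depends only on the step index, so the calling
    // loop has no loop-carried state and can be vectorized (and the method
    // can be constexpr).
    constexpr T getNth(int step) const {
        return m_start + m_increment * static_cast<T>(step + 1);
    }

  private:
    T m_start;
    T m_increment;
};

// Example caller: a gain ramp over a buffer of samples.
void applyRamp(float* buffer, int numSamples, float oldGain, float newGain) {
    const RampingValueSketch<float> ramp(oldGain, newGain, numSamples);
    for (int i = 0; i < numSamples; ++i) {
        buffer[i] *= ramp.getNth(i); // no dependence on iteration i - 1
    }
}
```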

@uklotzde (Contributor) left a comment

Nice finding. Please add comments to prevent anyone from undoing these changes in the future! The reason why getNth() is used is not obvious.

Target 2.3 instead of main?

@Be-ing (Contributor) commented May 26, 2022

Interesting... I'm confused how this is vectorizable considering one of the inputs to the calculation changes with every iteration.

@Swiftb0y (Member, Author) commented May 26, 2022

Nice finding. Please add comments to prevent anyone from undoing these changes in the future! The reason why getNth() is used is not obvious.

Target 2.3 instead of main?

Sure.

Interesting... I'm confused how this is vectorizable considering one of the inputs to the calculation changes with every iteration.

If I understand correctly, you are asking why this is vectorizable even though getNth depends on the loop counter? I can't really come up with a good answer myself, to be honest. Most compilers can explain their reasoning for (not) vectorizing; for example, GCC does so with -ftree-vectorizer-verbose=1 or -fopt-info-vec-missed. Using that, you can see that the autovectorizer fails because of m_value += m_increment, since that creates state that depends on the previous loop iteration.
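
To make that concrete, here is a hypothetical pair of loops (not Mixxx code) showing the difference the vectorizer report complains about; without -ffast-math the compiler is not allowed to re-associate the floating-point accumulation of the first loop into the index-based form of the second:

```cpp
// Loop-carried dependence: the value in iteration i needs the value from
// iteration i - 1. Without -ffast-math, GCC cannot legally rewrite the
// floating-point accumulation, and -fopt-info-vec-missed reports the loop
// as not vectorized.
void rampStateful(float* out, int n, float start, float increment) {
    float value = start;
    for (int i = 0; i < n; ++i) {
        value += increment; // state carried across iterations
        out[i] *= value;
    }
}

// No loop-carried state: each iteration is computed from the index alone,
// so the iterations are independent and the loop can be vectorized.
void rampIndexed(float* out, int n, float start, float increment) {
    for (int i = 0; i < n; ++i) {
        out[i] *= start + increment * static_cast<float>(i + 1);
    }
}
```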

@Swiftb0y (Member, Author)

Nice finding. Please add comments to prevent anyone from undoing these changes in the future! The reason why getNth() is used is not obvious.

I already explained the reasoning quite verbosely in the commit message; should I just copy the same into the source file?

@uklotzde (Contributor)

Nice finding. Please add comments to prevent anyone from undoing these changes in the future! The reason why getNth() is used is not obvious.

I already explained the reasoning quite verbosely in the commit message; should I just copy the same into the source file?

Ask yourself: do you read the commit messages for the code you are currently editing? Everything that belongs to the code goes into the code base. Commit messages only describe the transition, not the current state (in terms of a state machine).

@Swiftb0y Swiftb0y force-pushed the constexpr-vectorizer-friendly-rampingvalue branch from 5b90361 to 06e4108 Compare May 26, 2022 22:40
@Swiftb0y Swiftb0y changed the base branch from main to 2.3 May 26, 2022 22:40
@daschuer (Member) left a comment

LGTM

However, none of the loops is vectorized.
A vectorized loop needs to have "++i", and i needs to be a signed int (we use SINT).

I have played a bit with this, but these loops are all too big or contain function calls that cannot be vectorized.

@@ -454,7 +454,7 @@ void MixxxPlateX2::processBuffer(const sample_t* in, sample_t* out, const uint f

     // loop through the buffer, processing each sample
     for (uint i = 0; i + 1 < frames; i += 2) {
-        sample_t mono_sample = send.getNext() * (in[i] + in[i + 1]) / 2;
+        sample_t mono_sample = send.getNth(i / 2) * (in[i] + in[i + 1]) / 2;
         PlateStub::process(mono_sample, decay, &out[i], &out[i+1]);
Member

This loop is not vectorized. Vectorization is not possible here because of the process() call.

@@ -191,12 +191,14 @@ void EchoEffect::processChannel(const ChannelHandle& handle, EchoGroupState* pGr
pGroupState->prev_feedback,
bufferParameters.framesPerBuffer());

int rampIndex = 0;
//TODO: rewrite to remove assumption of stereo buffer
for (SINT i = 0;
Member

not vectorized

@@ -195,13 +195,15 @@ void FlangerEffect::processChannel(const ChannelHandle& handle,
CSAMPLE* delayLeft = pState->delayLeft;
CSAMPLE* delayRight = pState->delayRight;

int rampIndex = 0;
for (SINT i = 0;
Member

not vectorized

@daschuer (Member)

@uklotzde merge?

@Be-ing (Contributor) commented May 26, 2022

A vectorized loop needs to have "++i" and i needs to be a signed int (we use SINT)

That means iterating over samples instead of frames... ugly :/

@Swiftb0y (Member, Author) commented May 26, 2022

However, none of the loops is vectorized.

Yes, none of the loops in the built-in effects are actually vectorized because of other interdependencies. I only changed what I needed so they fit the new API (since the mutable getNext approach was not constexpr friendly).

@daschuer (Member)

A vectorized loop needs to have "++i" and i needs to be a signed int (we use SINT)

That means iterating over samples instead of frames... ugly :/

No, not at all, see:

for (int i = 0; i < numSamples / 2; ++i) {
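
Spelled out as a hypothetical sketch (not code from this PR), the body of such a frame-indexed loop then handles both channels of the frame explicitly:

```cpp
// Hypothetical frame-indexed gain ramp over an interleaved stereo buffer.
void applyGainRamp(float* samples, int numSamples, float start, float increment) {
    const int numFrames = numSamples / 2;
    for (int i = 0; i < numFrames; ++i) {
        const float gain = start + increment * static_cast<float>(i + 1);
        samples[2 * i] *= gain;     // left channel of frame i
        samples[2 * i + 1] *= gain; // right channel of frame i
    }
}
```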

@Swiftb0y (Member, Author)

A vectorized loop needs to have "++i" and i needs to be a signed int (we use SINT)

Can you elaborate? IMO neither of these preconditions is necessary. Iterating using a size_t should be just as valid as iterating over a uint64_t or ptrdiff_t (which is what SINT actually is). Much more important for vectorization is that the data is actually aligned (to the size of the vector registers, not the contained values); otherwise the CPU is either much slower or aborts straight away.

That means iterating over samples instead of frames... ugly :/

There are different trade-offs between iterating over adjacent frames and iterating over samples. It's a similar trade-off to the array-of-structs vs. struct-of-arrays problem encountered in data-driven design. The performance usually depends on the access patterns to the data (because of the cache). The struct-of-arrays approach is less OOP-like but allows for vectorization in many more cases (though it is not strictly required, since modern CPUs can load data into their vector registers with variable offsets). It's also cache-friendlier when iterating over arrays sequentially instead of in parallel. I think the design more common in the audio industry is to have separate arrays of audio data instead of our interleaved design; this would also make it much easier to migrate our engine away from the stereo-audio assumption.
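
As a sketch of the two layouts (hypothetical types, not Mixxx's actual buffer classes):

```cpp
#include <vector>

// Interleaved, "array of structs": channels alternate within one array.
// samples = [L0, R0, L1, R1, L2, R2, ...]
struct InterleavedStereoBuffer {
    std::vector<float> samples; // size == 2 * numFrames
};

// Planar, "struct of arrays": one contiguous array per channel.
// channels[0] = [L0, L1, L2, ...], channels[1] = [R0, R1, R2, ...]
struct PlanarBuffer {
    std::vector<std::vector<float>> channels; // channels[channel][frame]
};
```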

@Be-ing (Contributor) commented May 26, 2022

 for (int i = 0; i < numSamples / 2; ++i) { 

That requires the loop implementation to have a line of code handling each channel, exacerbating the assumption of stereo everywhere.

@daschuer (Member)

Can you elaborate?

In theory you are right. But my experience when optimizing /src/util/sample.cpp was that GCC's pattern recognition works reliably with a signed iterator advancing in single steps. It might be that the compiler has improved since then, but as a rule of thumb it should still be valid.

I think the design which is more present in the audio industry would be to have separate arrays of audio data instead of our interleaved design, this would also make it much easier to migrate our engine away from the stereo-audio assumption.

I think GStreamer changed this at some point in its history for exactly that reason. It will become a topic in parts of Mixxx as well when we introduce N-channel support (stems).

@Swiftb0y (Member, Author)

That requires the loop implementation to have a line of code handling each channel, exacerbating the assumption of stereo everywhere.

If you mean that we would need a buffer[channel][sample] array: not necessarily. If your audio processing code is channel-independent / mono, you simply add another nested loop that iterates over every channel.
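
For example, a hypothetical channel-independent process over a planar buffer[channel][frame] layout:

```cpp
// Hypothetical channel-independent processing over a planar buffer.
// The inner loop is a plain contiguous sweep, which is easy to vectorize.
void applyGain(float** buffer, int numChannels, int numFrames, float gain) {
    for (int ch = 0; ch < numChannels; ++ch) {
        for (int i = 0; i < numFrames; ++i) {
            buffer[ch][i] *= gain;
        }
    }
}
```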

@Swiftb0y (Member, Author)

But my experience when optimizing /src/util/sample.cpp was that GCC's pattern recognition works reliably with a signed iterator advancing in single steps

I wonder which GCC version these observations were made with. I'm sure they no longer apply in GCC 12. At some point it doesn't make sense to keep trying to game the autovectorizer; instead, we should just use SIMD intrinsics (or rather a wrapping library like xsimd to stay portable).
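
For illustration, a sketch of the explicit-SIMD route using raw SSE intrinsics (xsimd would wrap the same idea portably; a real implementation would also need a scalar tail loop for the remaining n % 4 samples):

```cpp
#include <xmmintrin.h> // SSE

// Hypothetical explicit-SIMD gain ramp: four samples per iteration.
void rampSse(float* out, int n, float start, float increment) {
    const __m128 step = _mm_set1_ps(4.0f * increment);
    __m128 gain = _mm_setr_ps(start + 1.0f * increment,
            start + 2.0f * increment,
            start + 3.0f * increment,
            start + 4.0f * increment);
    for (int i = 0; i + 3 < n; i += 4) {
        const __m128 samples = _mm_loadu_ps(out + i);
        _mm_storeu_ps(out + i, _mm_mul_ps(samples, gain));
        gain = _mm_add_ps(gain, step); // advance all four lanes of the ramp
    }
    // Remaining samples (n % 4) would be handled by a scalar tail loop.
}
```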

@daschuer (Member)

Luckily, GCC is able to tell us whether a loop is vectorized when we create new loops.

I have commented all checked functions with
// note: LOOP VECTORIZED.
as a warning not to touch them without rechecking that vectorization is still applied.
According to the comments, this was done with GCC 7.5.

@daschuer (Member)

By the way, originally some of these functions were written in inline assembler using the MMX registers.
This did not scale well with newer instruction sets, which was the reason to let the compiler decide which registers to use.

@Swiftb0y (Member, Author)

I think something in our CI is wrong. The last commit clearly did not build, yet GitHub says "All checks have passed".

@Swiftb0y Swiftb0y force-pushed the constexpr-vectorizer-friendly-rampingvalue branch from 06e4108 to 750bf31 Compare May 27, 2022 00:52
@Swiftb0y (Member, Author)

Merge?

@daschuer (Member) commented Jun 2, 2022

Yes, thank you.

@daschuer daschuer merged commit 55c5e7e into mixxxdj:2.3 Jun 2, 2022
@Swiftb0y Swiftb0y deleted the constexpr-vectorizer-friendly-rampingvalue branch June 2, 2022 22:26
@daschuer daschuer added this to the 2.4.0 milestone Jun 21, 2022