refactor(util): make RampingValue vectorizer-friendly #4770
Conversation
Nice finding. Please add comments to prevent anyone from undoing these changes in the future! The reason why getNth() is used is not obvious.
Target 2.3 instead of main?
Interesting... I'm confused how this is vectorizable considering one of the inputs to the calculation changes with every iteration.
Sure.
If I understand correctly, you are asking why this is vectorizable even though
I already explained the reasoning quite verbosely in the commit; should I just copy the same to the source file?
Ask yourself: do you read the commit messages for the code you are currently editing? Everything that belongs to the code goes into the code base. Commit messages only describe the transition, not the current state (in terms of a state machine).
Force-pushed from 5b90361 to 06e4108
LGTM.
However, none of the loops is actually vectorized.
A vectorized loop needs to use `++i`, and `i` needs to be a signed int (we use `SINT`).
I have played a bit with this, but these loops are all too big or contain function calls that cannot be vectorized.
```diff
@@ -454,7 +454,7 @@ void MixxxPlateX2::processBuffer(const sample_t* in, sample_t* out, const uint f
     // loop through the buffer, processing each sample
     for (uint i = 0; i + 1 < frames; i += 2) {
-        sample_t mono_sample = send.getNext() * (in[i] + in[i + 1]) / 2;
+        sample_t mono_sample = send.getNth(i / 2) * (in[i] + in[i + 1]) / 2;
         PlateStub::process(mono_sample, decay, &out[i], &out[i+1]);
```
This loop is not vectorized. Vectorization is not possible because of the `process()` call.
```diff
@@ -191,12 +191,14 @@ void EchoEffect::processChannel(const ChannelHandle& handle, EchoGroupState* pGr
             pGroupState->prev_feedback,
             bufferParameters.framesPerBuffer());

+    int rampIndex = 0;
     //TODO: rewrite to remove assumption of stereo buffer
     for (SINT i = 0;
```
not vectorized
```diff
@@ -195,13 +195,15 @@ void FlangerEffect::processChannel(const ChannelHandle& handle,
     CSAMPLE* delayLeft = pState->delayLeft;
     CSAMPLE* delayRight = pState->delayRight;

+    int rampIndex = 0;
     for (SINT i = 0;
```
not vectorized
@uklotzde merge?
That means iterating over samples instead of frames... ugly :/
Yes, none of the loops in the built-in effects are actually vectorized because of other interdependencies. I only changed what I needed so they fit the new API (since the mutable
No, not at all, see: Line 145 in b0601aa
Can you elaborate? IMO both of these preconditions are not necessary. Iterating using a
There are different trade-offs to iterating over adjacent frames instead of samples. It's a trade-off similar to the array-of-structs vs. struct-of-arrays problem encountered in data-driven design. The performance usually depends on the access patterns to the data (because of caching). The struct-of-arrays approach is less OOP-like but allows for vectorization in many more cases (though this is not universally true, as modern CPUs can load data into their vector registers with variable offsets). It's also cache-friendlier when iterating over arrays sequentially instead of in parallel. I think the design more common in the audio industry would be to have separate arrays of audio data instead of our interleaved design; this would also make it much easier to migrate our engine away from the stereo-audio assumption.
That requires the loop implementation to have a line of code handling each channel, exacerbating the assumption of stereo everywhere.
In theory you are right. But my experience when optimizing /src/util/sample.cpp is that GCC's pattern recognition works reliably with a signed iterator that goes in single steps. It might be that the compiler has improved since then, but as a rule of thumb it should still be valid.
I think GStreamer changed that during its history for exactly that reason. It will become a topic in parts of Mixxx as well when we introduce N-channel (Stems) support.
If you mean that we should use a
I wonder in which GCC version these observations were made. I'm sure they don't apply anymore in GCC 12. At some point it doesn't make sense anymore to try to game the autovectorizer; instead we should just use SIMD intrinsics (or rather a wrapping library like xsimd, to stay portable).
Luckily, GCC is able to tell us whether a loop is vectorized when creating new loops. I have commented all checked functions with
By the way, originally some of these functions were written in inline assembler using the MMX registers.
Previously, invocations of `RampingValue::getNext()` caused interdependence between loop iterations, making vectorization impossible. This approach removes the state from `RampingValue` and thus also the loop-iteration interdependence. `getNext()` has been replaced by `getNth(int step)`. While multiplication (`getNth`) is technically more expensive than addition (`getNext`), the possibility of vectorization results in a 1.1 - 6x speedup depending on optimizer aggressiveness.
I think something in our CI is wrong. The last commit clearly did not build, yet GitHub says "All checks have passed". |
Force-pushed from 06e4108 to 750bf31
Merge?
Yes, thank you.
See this microbenchmark: https://www.quick-bench.com/q/PHqdbeYORuRS_x6hLtN-tBgvV0g