Teach the SIMD metadata group match to defer masking #4595
base: trunk
Conversation
When using a byte encoding for matched group metadata, we need to mask down to a single bit in each matching byte to make iteration over a range of match indices work. In most cases this mask can be folded into the overall match computation, but for Arm Neon there is avoidable overhead from it. Instead, we can defer the mask until we start iterating. Doing more than one iteration is relatively rare, so this doesn't accumulate much waste and makes common paths a bit faster.

For the M1 this makes the SIMD match path about 2-4% faster. That still isn't enough to catch the portable match code path on the M1, though.

For some Neoverse cores the difference here is more significant (>10% improvement), and it makes the SIMD and scalar code paths have comparable latency. It's still not clear which is better: the latency is comparable, and beyond latency the factors are very hard to analyze -- port pressure on different parts of the CPU, etc.

Leaving the selected code path as portable since that's so much better on the M1, and I'm hoping to avoid different code paths for different Arm CPUs for a while.

Co-authored-by: Danila Kutenin <[email protected]>
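To make the idea concrete, here's a minimal sketch (not the actual Carbon implementation; the class name, the 8-byte group width, and the low-bit-per-byte mask value are assumptions) of a byte-encoded match range that defers the per-byte mask until iteration begins:

#include <bit>
#include <cstdint>

// Sketch: each matching byte of the group comes back as 0xFF. Cheap queries
// like "any match?" and "first match?" work directly on that encoding, so the
// reduction to one bit per byte can wait until we actually iterate.
class ByteMatchRange {
 public:
  explicit ByteMatchRange(uint64_t raw_match_bytes) : bits_(raw_match_bytes) {}

  auto empty() const -> bool { return bits_ == 0; }
  // Index of the first matching byte; valid with or without the mask applied.
  auto front() const -> int { return std::countr_zero(bits_) / 8; }

  // Deferred mask: only needed once we start stepping through matches, since
  // pop_front() below relies on exactly one set bit per matching byte.
  auto begin_iteration() -> void { bits_ &= 0x0101'0101'0101'0101ULL; }
  auto pop_front() -> void { bits_ &= bits_ - 1; }

 private:
  uint64_t bits_;
};

The win on Neon comes from skipping the begin_iteration-style masking on the common single-match path, where front() alone is enough.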
Generally LG, though I have some comments because I can't poke at this too much due to lack of a Neon machine.
if constexpr (ByteEncodingMask != 0) {
  // Apply an increment mask to the bits first. This is used with the byte
  // encoding when the mask isn't needed until we begin incrementing.
  static_assert(BitIndexT::ByteEncoding);
I'm looking at this because it's the only use of `BitIndexT::ByteEncoding` (the surrounding code doesn't access it outside the `static_assert`). I believe that because this is inside `if constexpr`, removing it is not a compile failure on most platforms (i.e., I'm on x86, and I can freely revert the `static constexpr bool` addition without a compile error).

Had you considered shifting this to make it a compile error, like a class-level `static_assert(ByteEncodingMask == 0 || BitIndexT::ByteEncoding);` or maybe something with `requires`?
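For concreteness, the two options might look roughly like this (the class and parameter names just mirror the diff; this isn't the actual declaration):

// Option A: class-level static_assert, checked on every instantiation
// regardless of which if-constexpr branches are taken.
namespace class_level_assert {
template <typename BitIndexT, typename BitIndexT::BitsT ByteEncodingMask>
class BitIndexRange {
  static_assert(ByteEncodingMask == 0 || BitIndexT::ByteEncoding,
                "A nonzero byte-encoding mask requires a byte-encoded index.");
  // ...
};
}  // namespace class_level_assert

// Option B: a requires clause, which surfaces the same constraint as part of
// the template's interface and rejects bad instantiations up front.
namespace requires_clause {
template <typename BitIndexT, typename BitIndexT::BitsT ByteEncodingMask>
  requires(ByteEncodingMask == 0 || BitIndexT::ByteEncoding)
class BitIndexRange {
  // ...
};
}  // namespace requires_clause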
template <typename FriendBitIndexT,
          FriendBitIndexT::BitsT FriendByteEncodingMask>
friend class BitIndexRange;
How is this used? Is it something specific to Arm? I was messing around and it causes issues for `requires` due to the different template argument names; I tried deleting it and that worked fine, but maybe it's something needed for `SIMDMatchPresent`? Maybe this is something suitable for a comment and/or something that should cause a cross-platform compilation error?
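For what it's worth, one generic reason for a cross-instantiation friend declaration like this (purely a guess about the intent here, not something the diff confirms) is to let one instantiation reach the private state of another, e.g. when converting between mask parameters:

// Hypothetical illustration only: the friend template lets
// BitIndexRange<T, MaskA> touch the private members of
// BitIndexRange<T, MaskB>, which an ordinary member couldn't do.
template <typename BitIndexT, typename BitIndexT::BitsT ByteEncodingMask>
class BitIndexRange {
 public:
  explicit BitIndexRange(typename BitIndexT::BitsT bits) : bits_(bits) {}

  // Converting across instantiations needs access to `other.bits_`, which is
  // private in a *different* class type; the friend declaration grants that.
  template <typename BitIndexT::BitsT OtherMask>
  explicit BitIndexRange(const BitIndexRange<BitIndexT, OtherMask>& other)
      : bits_(other.bits_) {}

 private:
  template <typename FriendBitIndexT,
            typename FriendBitIndexT::BitsT FriendByteEncodingMask>
  friend class BitIndexRange;

  typename BitIndexT::BitsT bits_;
};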
// Return whichever result we're using. This uses an invoked lambda to deduce
// the type from only the selected return statement, allowing them to be
// different types.
return [&] {
  if constexpr (UseSIMD) {
    return simd_result;
  } else {
    return portable_result;
  }
}();
Given you do this twice, and it's kind of subtle, had you considered a helper function? i.e., something like:
// Return whichever result we're using. This uses an invoked lambda to deduce
// the type from only the selected return statement, allowing them to be
// different types.
template <bool If, typename ThenT, typename ElseT>
inline auto ConstexprTernary(ThenT then_val, ElseT else_val) -> auto {
return [&] {
if constexpr (If) {
return then_val;
} else {
return else_val;
}
}();
}
While thinking about this, I was also wondering whether there was a good template solution, which got me thinking about `requires`. So here's that thought:
// Behaves as a ternary, but allowing different types on the return.
template <bool If, typename ThenT, typename ElseT> requires (If)
inline auto ConstexprTernary(ThenT then_val, ElseT /*else_val*/) -> ThenT {
return then_val;
}
template <bool If, typename ThenT, typename ElseT> requires (!If)
inline auto ConstexprTernary(ThenT /*then_val*/, ElseT else_val) -> ElseT {
return else_val;
}
Allowing (either way):
return ConstexprTernary<UseSIMD>(simd_result, portable_result);