Teach the SIMD metadata group match to defer masking #4595
base: trunk
Conversation
When using a byte encoding for matched group metadata, we need to mask down to a single bit in each matching byte to make iteration over a range of match indices work. In most cases this mask can be folded into the overall match computation, but for Arm Neon there is avoidable overhead from it. Instead, we can defer the mask until we start iterating. Doing more than one iteration is relatively rare, so this doesn't accumulate much waste and makes common paths a bit faster.

For the M1 this makes the SIMD match path about 2-4% faster. That still isn't enough to catch the portable match code path on the M1, though.

For some Neoverse cores the difference here is more significant (>10% improvement), and it makes the SIMD and scalar code paths have comparable latency. It's still not clear which is better: the latency is comparable, and beyond latency the factors are very hard to analyze -- port pressure on different parts of the CPU, etc.

Leaving the selected code path as portable since that's so much better on the M1, and I'm hoping to avoid different code paths for different Arm CPUs for a while.

Co-authored-by: Danila Kutenin <[email protected]>
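To make the idea concrete, here's a minimal sketch (not the actual Carbon implementation; the class name, the 8-byte group width, and the low-bit-per-byte mask value are assumptions) of a byte-encoded match range that defers the per-byte mask until iteration begins:

#include <bit>
#include <cstdint>

// Sketch: each matching byte of the group comes back as 0xFF. Cheap queries
// like "any match?" and "first match?" work directly on that encoding, so the
// reduction to one bit per byte can wait until we actually iterate.
class ByteMatchRange {
 public:
  explicit ByteMatchRange(uint64_t raw_match_bytes) : bits_(raw_match_bytes) {}

  auto empty() const -> bool { return bits_ == 0; }
  // Index of the first matching byte; valid with or without the mask applied.
  auto front() const -> int { return std::countr_zero(bits_) / 8; }

  // Deferred mask: only needed once we start stepping through matches, since
  // pop_front() below relies on exactly one set bit per matching byte.
  auto begin_iteration() -> void { bits_ &= 0x0101'0101'0101'0101ULL; }
  auto pop_front() -> void { bits_ &= bits_ - 1; }

 private:
  uint64_t bits_;
};

The win on Neon comes from skipping the begin_iteration-style masking on the common single-match path, where front() alone is enough.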
Generally LG, though I have some comments because I can't poke at this too much due to lack of a Neon machine.
if constexpr (ByteEncodingMask != 0) {
  // Apply an increment mask to the bits first. This is used with the byte
  // encoding when the mask isn't needed until we begin incrementing.
  static_assert(BitIndexT::ByteEncoding);
I'm looking at this because it's the only use of `BitIndexT::ByteEncoding` (the surrounding code doesn't access it outside the `static_assert`). I believe that because this is inside `if constexpr`, removing it is not a compile failure on most platforms (i.e., I'm on x86, and I can freely revert the `static constexpr bool` addition without a compile error).

Had you considered shifting this to make it a compile error, like a class-level `static_assert(ByteEncodingMask == 0 || BitIndexT::ByteEncoding);` or maybe something with `requires`?
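For concreteness, the two options might look roughly like this (the class and parameter names just mirror the diff; this isn't the actual declaration):

// Option A: class-level static_assert, checked on every instantiation
// regardless of which if-constexpr branches are taken.
namespace class_level_assert {
template <typename BitIndexT, typename BitIndexT::BitsT ByteEncodingMask>
class BitIndexRange {
  static_assert(ByteEncodingMask == 0 || BitIndexT::ByteEncoding,
                "A nonzero byte-encoding mask requires a byte-encoded index.");
  // ...
};
}  // namespace class_level_assert

// Option B: a requires clause, which surfaces the same constraint as part of
// the template's interface and rejects bad instantiations up front.
namespace requires_clause {
template <typename BitIndexT, typename BitIndexT::BitsT ByteEncodingMask>
  requires(ByteEncodingMask == 0 || BitIndexT::ByteEncoding)
class BitIndexRange {
  // ...
};
}  // namespace requires_clause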
template <typename FriendBitIndexT,
          FriendBitIndexT::BitsT FriendByteEncodingMask>
friend class BitIndexRange;
How is this used? Is it something specific to Arm? I was messing around and it causes issues for `requires` due to the different template argument names; I tried deleting it and that worked fine, but maybe it's something needed for `SIMDMatchPresent`? Maybe this is something suitable for a comment and/or something that should cause a cross-platform compilation error?
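For what it's worth, one generic reason for a cross-instantiation friend declaration like this (purely a guess about the intent here, not something the diff confirms) is to let one instantiation reach the private state of another, e.g. when converting between mask parameters:

// Hypothetical illustration only: the friend template lets
// BitIndexRange<T, MaskA> touch the private members of
// BitIndexRange<T, MaskB>, which an ordinary member couldn't do.
template <typename BitIndexT, typename BitIndexT::BitsT ByteEncodingMask>
class BitIndexRange {
 public:
  explicit BitIndexRange(typename BitIndexT::BitsT bits) : bits_(bits) {}

  // Converting across instantiations needs access to `other.bits_`, which is
  // private in a *different* class type; the friend declaration grants that.
  template <typename BitIndexT::BitsT OtherMask>
  explicit BitIndexRange(const BitIndexRange<BitIndexT, OtherMask>& other)
      : bits_(other.bits_) {}

 private:
  template <typename FriendBitIndexT,
            typename FriendBitIndexT::BitsT FriendByteEncodingMask>
  friend class BitIndexRange;

  typename BitIndexT::BitsT bits_;
};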
// Return whichever result we're using. This uses an invoked lambda to deduce
// the type from only the selected return statement, allowing them to be
// different types.
return [&] {
  if constexpr (UseSIMD) {
    return simd_result;
  } else {
    return portable_result;
  }
}();
Given you do this twice, and it's kind of subtle, had you considered a helper function? i.e., something like:
// Return whichever result we're using. This uses an invoked lambda to deduce
// the type from only the selected return statement, allowing them to be
// different types.
template <bool If, typename ThenT, typename ElseT>
inline auto ConstexprTernary(ThenT then_val, ElseT else_val) -> auto {
return [&] {
if constexpr (If) {
return then_val;
} else {
return else_val;
}
}();
}
While thinking about this, I was also wondering whether there was a good template solution, which got me thinking about `requires`. So here's that thought:
// Behaves as a ternary, but allowing different types on the return.
template <bool If, typename ThenT, typename ElseT> requires (If)
inline auto ConstexprTernary(ThenT then_val, ElseT /*else_val*/) -> ThenT {
return then_val;
}
template <bool If, typename ThenT, typename ElseT> requires (!If)
inline auto ConstexprTernary(ThenT /*then_val*/, ElseT else_val) -> ElseT {
return else_val;
}
Allowing (either way):
return ConstexprTernary<UseSIMD>(simd_result, portable_result);