This PR uses the same approach as x86 for OnDemand parsing on ARM.
On an NVIDIA Grace CPU, this results in a 5x speedup for the twitter benchmark and ~3x for citm_catalog.
We use the simdjson `simd8x64` type to obtain a 64-bit mask that lets us operate on 64 characters at a time. Although building the bitmask is expensive and requires several NEON instructions, it lets us process 64 characters per instruction on the resulting bitmaps. If we used the `shrn` instruction instead, we could process only 16 characters per instruction.

This patch also applies this approach in the SVE code path, but using NEON instructions. According to the Neoverse V2 optimization guide, the SVE comparison operation has a latency of 4 cycles and a throughput of 1 instruction per cycle, while for NEON instructions the latency is 2 cycles and the throughput is 4 instructions per cycle.
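For readers unfamiliar with the 64-bit mask idea, here is a minimal sketch of how such a mask can be built and consumed. It is not the code from this PR: `match_mask64` and `first_match` are hypothetical names, the NEON path is modeled on the pairwise-add reduction commonly used for this purpose, and a scalar fallback is included so the sketch compiles on non-ARM targets. `__builtin_ctzll` assumes GCC or Clang.

```cpp
#include <cassert>
#include <cstdint>
#if defined(__aarch64__)
#include <arm_neon.h>
#endif

// Hypothetical helper: bit i of the result is set when p[i] == ch, for i in [0, 64).
inline uint64_t match_mask64(const uint8_t* p, uint8_t ch) {
  assert(p != nullptr);
#if defined(__aarch64__)
  // Each lane keeps one distinct bit per group of 8 lanes.
  static const uint8_t kBits[16] = {0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80,
                                    0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80};
  const uint8x16_t bits = vld1q_u8(kBits);
  const uint8x16_t needle = vdupq_n_u8(ch);
  // Compare four 16-byte chunks; a matching lane is 0xFF, then masked to its bit.
  const uint8x16_t c0 = vandq_u8(vceqq_u8(vld1q_u8(p), needle), bits);
  const uint8x16_t c1 = vandq_u8(vceqq_u8(vld1q_u8(p + 16), needle), bits);
  const uint8x16_t c2 = vandq_u8(vceqq_u8(vld1q_u8(p + 32), needle), bits);
  const uint8x16_t c3 = vandq_u8(vceqq_u8(vld1q_u8(p + 48), needle), bits);
  // Three rounds of pairwise adds fold 64 lanes into 8 bytes (one bit per input byte).
  uint8x16_t s0 = vpaddq_u8(c0, c1);
  const uint8x16_t s1 = vpaddq_u8(c2, c3);
  s0 = vpaddq_u8(s0, s1);
  s0 = vpaddq_u8(s0, s0);
  return vgetq_lane_u64(vreinterpretq_u64_u8(s0), 0);
#else
  // Scalar fallback with the same result, for non-ARM builds.
  uint64_t mask = 0;
  for (int i = 0; i < 64; ++i) {
    mask |= static_cast<uint64_t>(p[i] == ch) << i;
  }
  return mask;
#endif
}

// Hypothetical helper: index of the first match in the 64-byte block, or 64 if none.
inline int first_match(const uint8_t* p, uint8_t ch) {
  const uint64_t mask = match_mask64(p, ch);
  return mask ? __builtin_ctzll(mask) : 64;
}
```

Once the mask exists, a single scalar instruction such as count-trailing-zeros answers a question about all 64 input bytes, which is the trade-off described above: more work up front to build the mask, less work per block afterwards.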
Benchmark results
build/benchmark/bench --benchmark_filter=SonicOnDema
Master branch
This PR
This PR is contributed by NVIDIA