Fast OnDemand parsing for Neoverse #94

emcastillo · 2024-09-11T02:10:45Z

This PR uses the same approach as x86 for OnDemand parsing on ARM.
On an NVIDIA Grace CPU this results in a ~5x speedup for the twitter benchmark and ~3x for citm_catalog.

We use the simdjson simd8x64 type to obtain a 64-bit mask that lets us operate on 64 characters at a time. Although obtaining the bitmask is expensive and requires several NEON instructions, it lets us process 64 characters per instruction using the bitmaps. If we instead used the shrn instruction, we would be able to process only 16 characters per instruction.
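To illustrate how a 64-bit mask lets a container be skipped 64 characters at a time, here is a minimal, portable sketch. It is not the code from this PR: the helper names (`eq_mask64`, `skip_container`) are hypothetical, the mask is built with a scalar loop where the PR uses NEON instructions, and string literals are deliberately ignored, so brackets inside quoted strings would be miscounted.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>

// Hypothetical helper: bit i of the result is set when block[i] == c.
// A scalar loop stands in for the NEON mask construction so the sketch
// compiles and runs on any platform.
static uint64_t eq_mask64(const uint8_t* block, uint8_t c) {
  uint64_t mask = 0;
  for (int i = 0; i < 64; ++i) {
    mask |= static_cast<uint64_t>(block[i] == c) << i;
  }
  return mask;
}

// Portable population count (number of set bits).
static int popcount64(uint64_t x) {
  int n = 0;
  while (x != 0) { x &= x - 1; ++n; }
  return n;
}

// Portable count of trailing zero bits; x must be non-zero.
static int ctz64(uint64_t x) {
  assert(x != 0);
  int n = 0;
  while ((x & 1) == 0) { x >>= 1; ++n; }
  return n;
}

// Hypothetical sketch: `data` starts just after an opening bracket.
// Returns the offset one past the matching closing bracket, or -1 if the
// container is not closed. Simplification: string literals are not handled.
static std::ptrdiff_t skip_container(const std::string& data, char open, char close) {
  int64_t depth = 1;  // we are already inside one container
  for (size_t pos = 0; pos < data.size(); pos += 64) {
    // Copy up to 64 bytes into a zero-padded block; 0 never matches a bracket.
    uint8_t block[64];
    std::memset(block, 0, sizeof(block));
    size_t n = std::min<size_t>(data.size() - pos, 64);
    std::memcpy(block, data.data() + pos, n);

    uint64_t opens = eq_mask64(block, static_cast<uint8_t>(open));
    uint64_t closes = eq_mask64(block, static_cast<uint8_t>(close));

    int closed = 0;  // closing brackets consumed so far in this block
    while (closes != 0) {
      int p = ctz64(closes);                     // position of the next close
      uint64_t before = (uint64_t{1} << p) - 1;  // bits strictly below p
      ++closed;
      // Depth after this close: depth at block start, plus the opens seen
      // before it, minus the closes seen up to and including it.
      if (depth + popcount64(opens & before) - closed == 0) {
        return static_cast<std::ptrdiff_t>(pos) + p + 1;
      }
      closes &= closes - 1;  // clear the lowest set bit
    }
    depth += popcount64(opens) - closed;
  }
  return -1;
}
```

For example, `skip_container("a{b{}}c}tail", '{', '}')` walks a single 64-byte block and stops right after the third `}`. Blocks without any closing bracket cost only two mask computations and a popcount, which is where the per-block approach pays off.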

This patch also uses this approach in the SVE code path, but with NEON instructions. According to the Neoverse V2 optimization guide, the SVE comparison operation has a latency of 4 cycles and a throughput of 1 instruction per cycle, while for NEON instructions the latency is 2 cycles and the throughput is 4 instructions per cycle.

Benchmark results (build/benchmark/bench --benchmark_filter=SonicOnDema):

Master branch

twitter/SonicOnDemand_Normal           111149 ns       111152 ns         6297 bytes_per_second=2.21522Gi/s Normal
citm_catalog/SonicOnDemand_Fronter      33629 ns        33630 ns        20804 bytes_per_second=47.8316Gi/s Fronter
twitter/SonicOnDemand_NotFound         111161 ns       111165 ns         6298 bytes_per_second=2.21496Gi/s NotFound

This PR

twitter/SonicOnDemand_Normal            22625 ns        22624 ns        30718 bytes_per_second=10.8832Gi/s Normal
citm_catalog/SonicOnDemand_Fronter      12861 ns        12862 ns        54399 bytes_per_second=125.067Gi/s Fronter
twitter/SonicOnDemand_NotFound          22423 ns        22422 ns        31349 bytes_per_second=10.9814Gi/s NotFound

This PR is contributed by NVIDIA.

liuq19 · 2024-10-15T00:26:45Z

@emcastillo Thanks. The code needs to be formatted.

emcastillo · 2024-11-06T03:01:22Z

@liuq19 Sorry for the delay. I pushed some formatting changes.
I ran the files through clang-format; I hope that is enough.

emcastillo force-pushed the arm-fast-ondemand branch from a8ea065 to 7ba60c8 on September 11, 2024 02:14

Fast OnDemand for Neoverse

6cb2b07

emcastillo force-pushed the arm-fast-ondemand branch from 7ba60c8 to 6cb2b07 on September 11, 2024 02:21

Use a template for SkipContainer

6b87505

Clang-format

36a56d5

liuq19 approved these changes Nov 7, 2024

liuq19 merged commit 91f84fc into bytedance:master Nov 7, 2024
23 checks passed
