Low-level: discrepancy between field arithmetic performance and elliptic curve performance #446

mratsim · 2024-07-27T20:30:29Z

As mentioned in #445, there is a large discrepancy between the performance measured when benchmarking field arithmetic and that of the elliptic curves built on top, especially on Secp256k1 vs libsecp256k1 and RustCrypto.
We start with a 1.7x advantage on field arithmetic, which shrinks to a 0.85x disadvantage on constant-time elliptic curve code.

There is an unexplained performance bug.

Some possibilities:

- There is a parameter-passing bug similar to "Internal API: in-place vs result" #21 and "Extremely bad codegen on Fp2" #146. However, looking at the assembly with Ghidra, we have 1~2 LEA and 1~3 MOV before function calls, which doesn't seem costly enough to explain such a difference. There is also the regular if adx test, but it should be cached and almost costless on Haswell and later CPUs.
- Unsaturated arithmetic allows for greater ILP (instruction-level parallelism). This seems unlikely, as field arithmetic with unsaturated limbs is 2x slower than my implementation.
- Cache effects. For example, we don't hardcode the prime modulus, and after a long computation it might be evicted from cache.
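To make the cache-effects hypothesis concrete, here is a minimal portable C sketch of a modular addition that reads the modulus through a pointer on every call, as the decompiled calls further down do with Secp256k1_Modulus. It is not Constantine's assembly: the names addmod and SECP256K1_P are illustrative, it assumes both inputs are already below p, and it relies on the unsigned __int128 extension of GCC and Clang.

```c
#include <stdint.h>

#define LIMBS 4

/* secp256k1 prime p = 2^256 - 2^32 - 977, little-endian 64-bit limbs.
   Stored in memory, so every field operation has to load it. */
static const uint64_t SECP256K1_P[LIMBS] = {
    0xFFFFFFFEFFFFFC2FULL, 0xFFFFFFFFFFFFFFFFULL,
    0xFFFFFFFFFFFFFFFFULL, 0xFFFFFFFFFFFFFFFFULL
};

/* r = (a + b) mod p, assuming a < p and b < p.
   Branch-free: both candidates are computed, then one is selected by mask. */
void addmod(uint64_t r[LIMBS], const uint64_t a[LIMBS],
            const uint64_t b[LIMBS], const uint64_t p[LIMBS]) {
    uint64_t sum[LIMBS], diff[LIMBS];
    uint64_t carry = 0, borrow = 0;

    /* sum = a + b, with the carry out of 256 bits kept in `carry` */
    for (int i = 0; i < LIMBS; i++) {
        unsigned __int128 t = (unsigned __int128)a[i] + b[i] + carry;
        sum[i] = (uint64_t)t;
        carry = (uint64_t)(t >> 64);
    }

    /* diff = sum - p; this loop is where the modulus is read from memory */
    for (int i = 0; i < LIMBS; i++) {
        unsigned __int128 t = (unsigned __int128)sum[i] - p[i] - borrow;
        diff[i] = (uint64_t)t;
        borrow = (uint64_t)(t >> 64) & 1;
    }

    /* a + b >= p exactly when the addition carried or the subtraction did not borrow */
    uint64_t use_diff = carry | (borrow ^ 1);
    uint64_t mask = (uint64_t)0 - use_diff;
    for (int i = 0; i < LIMBS; i++) {
        r[i] = (diff[i] & mask) | (sum[i] & ~mask);
    }
}
```

With such a layout, an addition touches 32 bytes of modulus on top of its operands, so if that line has been evicted the load latency can dominate the handful of ADD/SUB instructions.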


mratsim · 2024-07-27T20:53:25Z

CTT_LTO=false CC=clang nimble bench_ec_g1

with only secp256k1 EC G1 Jacobian addition and the if hasADX() checks forced true for secp256k1/Crandall primes
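For context on what a check like hasADX() typically does, the sketch below queries CPUID leaf 7 for the ADX (EBX bit 19) and BMI2 (EBX bit 8, needed for MULX) feature bits and caches the answer. This is an assumption about the general technique, not Constantine's actual implementation; the function names are hypothetical, and the code relies on the <cpuid.h> header shipped with GCC and Clang.

```c
#include <stdbool.h>
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

/* Hypothetical stand-in for a hasADX()-style runtime check. */
bool cpu_has_adx_bmi2(void) {
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    /* Structured extended features: leaf 7, subleaf 0 */
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) {
        return false;
    }
    bool bmi2 = ((ebx >> 8) & 1u) != 0;   /* MULX */
    bool adx  = ((ebx >> 19) & 1u) != 0;  /* ADCX / ADOX */
    return bmi2 && adx;
#else
    return false; /* not an x86 target */
#endif
}

/* Cache the result so the hot path only pays for a load and a predictable branch.
   Not synchronized: concurrent first calls would each run CPUID and store the same value. */
bool cpu_has_adx_bmi2_cached(void) {
    static int cached = -1;
    if (cached < 0) {
        cached = cpu_has_adx_bmi2() ? 1 : 0;
    }
    return cached != 0;
}
```

Forcing the check to true, as done for this benchmark, removes even that load and branch from every field operation.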

LTO ends up inlining everything into the benchmark functions (which have been explicitly tagged {.noinline.} since #445).

It's extremely suspicious that addmod takes more time than multiplication, and that submod takes that much time as well.

Addmod: [profile screenshots: local percents, global percents]

Submod: [profile screenshots: local percents, global percents]

Partial reduce

The Crandall partial reduce seems to need a prefetch around mulx when it operates on rsp-relative temporaries:

[screenshot: mulx over rsp/temporaries]

vs mulx of registers:

[screenshot: mulx of registers]
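For readers unfamiliar with the operation, here is a minimal C sketch of a Crandall partial reduction for the secp256k1 prime p = 2^256 - c with c = 2^32 + 977. It uses the identity 2^256 ≡ c (mod p) to fold the high half of a 512-bit product into the low half. This is a portable illustration, not the mulx-based assembly discussed here; the function name is hypothetical and unsigned __int128 (GCC/Clang) is assumed.

```c
#include <stdint.h>

/* Reduce a 512-bit value t (8 little-endian 64-bit limbs) to 256 bits modulo
   p = 2^256 - c. The result is congruent to t but may still be >= p,
   hence "partial" reduction. */
void crandall_partial_reduce(uint64_t r[4], const uint64_t t[8]) {
    const uint64_t c = 0x1000003D1ULL; /* 2^32 + 977 */
    uint64_t acc[4];
    unsigned __int128 m;
    unsigned __int128 carry = 0;

    /* Fold 1: lo + hi * c. The high limbs are the temporaries that the
       multiplication left behind (on the stack in the profiled code). */
    for (int i = 0; i < 4; i++) {
        m = (unsigned __int128)t[4 + i] * c + t[i] + carry;
        acc[i] = (uint64_t)m;
        carry = m >> 64; /* at most c, so about 33 bits */
    }

    /* Fold 2: the overflow limb is multiplied by c again and added back. */
    m = carry * c; /* below 2^67 */
    for (int i = 0; i < 4; i++) {
        m += acc[i];
        acc[i] = (uint64_t)m;
        m >>= 64;
    }

    /* Fold 3: a final carry out of 256 bits (0 or 1) is worth c.
       After fold 2 wrapped, acc is small, so this cannot carry again. */
    m = (unsigned __int128)((uint64_t)m * c);
    for (int i = 0; i < 4; i++) {
        m += acc[i];
        r[i] = (uint64_t)m;
        m >>= 64;
    }
}
```

The first loop is the part that maps to one mulx per high limb, so whether those limbs arrive from registers or from stack temporaries decides whether each multiplication waits on a load.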

mratsim · 2024-07-27T21:24:47Z

perf stat -B -e cache-references,cache-misses,cycles,instructions,branches,faults,migrations build/bench/bench_ec_g1_clang_asmIfAvailable

Whoops, cache misses it is

perf record -e cache-references,cache-misses,cycles --call-graph dwarf build/bench/bench_ec_g1_clang_asmIfAvailable

Addmod: [perf screenshots: cache misses local, cache misses global]

Submod: [perf screenshots: cache misses local, cache misses global]

mratsim · 2024-07-27T21:42:46Z

Reverse engineering

Each call has some MOV and LEA ceremony, but recent CPUs can issue 4 MOVs per cycle.

Function calls decompiled from assembly

void sum__bench95ec95g49_u1825
               (undefined8 param_1,longlong param_2,undefined8 param_3,undefined8 param_4)

{
  longlong lVar1;
  longlong lVar2;
  longlong unaff_RSI;
  longlong in_FS_OFFSET;
  undefined local_158 [32];
  undefined local_138 [32];
  undefined local_118 [32];
  undefined local_f8 [32];
  undefined local_d8 [32];
  undefined local_b8 [32];
  undefined local_98 [32];
  undefined local_78 [32];
  undefined local_58 [32];
  longlong local_38;
  
  local_38 = *(longlong *)(in_FS_OFFSET + 0x28);
  mulCran_asm_adx__bench95ec95g49_u823
            (Secp256k1_Modulus__OOZconstantineZnamedZconfig95fields95and95curves_u1127);
  lVar1 = param_2 + 0x20;
  mulCran_asm_adx__bench95ec95g49_u823
            (Secp256k1_Modulus__OOZconstantineZnamedZconfig95fields95and95curves_u1127);
  lVar2 = param_2 + 0x40;
  mulCran_asm_adx__bench95ec95g49_u823
            (Secp256k1_Modulus__OOZconstantineZnamedZconfig95fields95and95curves_u1127);
  addmod_asm__bench95ec95g49_u1081
            (Secp256k1_Modulus__OOZconstantineZnamedZconfig95fields95and95curves_u1127,
             unaff_RSI + 0x20,param_3,param_4,param_2);
  addmod_asm__bench95ec95g49_u1081
            (Secp256k1_Modulus__OOZconstantineZnamedZconfig95fields95and95curves_u1127,lVar1);
  mulCran_asm_adx__bench95ec95g49_u823
            (Secp256k1_Modulus__OOZconstantineZnamedZconfig95fields95and95curves_u1127,local_f8);
  addmod_asm__bench95ec95g49_u1081
            (Secp256k1_Modulus__OOZconstantineZnamedZconfig95fields95and95curves_u1127,local_98);
  submod_asm__OOZconstantineZnamedZconstantsZbandersnatch95subgroups_u482
            (Secp256k1_Modulus__OOZconstantineZnamedZconfig95fields95and95curves_u1127,local_f8);
  addmod_asm__bench95ec95g49_u1081
            (Secp256k1_Modulus__OOZconstantineZnamedZconfig95fields95and95curves_u1127,
             unaff_RSI + 0x40);
  addmod_asm__bench95ec95g49_u1081
            (Secp256k1_Modulus__OOZconstantineZnamedZconfig95fields95and95curves_u1127,lVar2);
  mulCran_asm_adx__bench95ec95g49_u823
            (Secp256k1_Modulus__OOZconstantineZnamedZconfig95fields95and95curves_u1127,local_118);
  addmod_asm__bench95ec95g49_u1081
            (Secp256k1_Modulus__OOZconstantineZnamedZconfig95fields95and95curves_u1127,local_b8);
  submod_asm__OOZconstantineZnamedZconstantsZbandersnatch95subgroups_u482
            (Secp256k1_Modulus__OOZconstantineZnamedZconfig95fields95and95curves_u1127,local_118);
  addmod_asm__bench95ec95g49_u1081
            (Secp256k1_Modulus__OOZconstantineZnamedZconfig95fields95and95curves_u1127,
             unaff_RSI + 0x40);
  addmod_asm__bench95ec95g49_u1081
            (Secp256k1_Modulus__OOZconstantineZnamedZconfig95fields95and95curves_u1127,lVar2);
  mulCran_asm_adx__bench95ec95g49_u823
            (Secp256k1_Modulus__OOZconstantineZnamedZconfig95fields95and95curves_u1127,local_138);
  addmod_asm__bench95ec95g49_u1081
            (Secp256k1_Modulus__OOZconstantineZnamedZconfig95fields95and95curves_u1127,local_b8);
  submod_asm__OOZconstantineZnamedZconstantsZbandersnatch95subgroups_u482
            (Secp256k1_Modulus__OOZconstantineZnamedZconfig95fields95and95curves_u1127,local_138);
  addmod_asm__bench95ec95g49_u1081
            (Secp256k1_Modulus__OOZconstantineZnamedZconfig95fields95and95curves_u1127,local_78);
  addmod_asm__bench95ec95g49_u1081
            (Secp256k1_Modulus__OOZconstantineZnamedZconfig95fields95and95curves_u1127,local_118);
  addmod_asm__bench95ec95g49_u1081
            (Secp256k1_Modulus__OOZconstantineZnamedZconfig95fields95and95curves_u1127,local_b8);
  addmod_asm__bench95ec95g49_u1081
            (Secp256k1_Modulus__OOZconstantineZnamedZconfig95fields95and95curves_u1127,local_58);
  addmod_asm__bench95ec95g49_u1081
            (Secp256k1_Modulus__OOZconstantineZnamedZconfig95fields95and95curves_u1127,local_b8);
  addmod_asm__bench95ec95g49_u1081
            (Secp256k1_Modulus__OOZconstantineZnamedZconfig95fields95and95curves_u1127,local_58);
  addmod_asm__bench95ec95g49_u1081
            (Secp256k1_Modulus__OOZconstantineZnamedZconfig95fields95and95curves_u1127,local_58);
  addmod_asm__bench95ec95g49_u1081
            (Secp256k1_Modulus__OOZconstantineZnamedZconfig95fields95and95curves_u1127,local_58);
  addmod_asm__bench95ec95g49_u1081
            (Secp256k1_Modulus__OOZconstantineZnamedZconfig95fields95and95curves_u1127,local_b8);
  submod_asm__OOZconstantineZnamedZconstantsZbandersnatch95subgroups_u482
            (Secp256k1_Modulus__OOZconstantineZnamedZconfig95fields95and95curves_u1127,local_b8);
  addmod_asm__bench95ec95g49_u1081
            (Secp256k1_Modulus__OOZconstantineZnamedZconfig95fields95and95curves_u1127,local_138);
  addmod_asm__bench95ec95g49_u1081
            (Secp256k1_Modulus__OOZconstantineZnamedZconfig95fields95and95curves_u1127,local_58);
  addmod_asm__bench95ec95g49_u1081
            (Secp256k1_Modulus__OOZconstantineZnamedZconfig95fields95and95curves_u1127,local_138);
  addmod_asm__bench95ec95g49_u1081
            (Secp256k1_Modulus__OOZconstantineZnamedZconfig95fields95and95curves_u1127,local_58);
  addmod_asm__bench95ec95g49_u1081
            (Secp256k1_Modulus__OOZconstantineZnamedZconfig95fields95and95curves_u1127,local_58);
  addmod_asm__bench95ec95g49_u1081
            (Secp256k1_Modulus__OOZconstantineZnamedZconfig95fields95and95curves_u1127,local_58);
  mulCran_asm_adx__bench95ec95g49_u823
            (Secp256k1_Modulus__OOZconstantineZnamedZconfig95fields95and95curves_u1127,local_138);
  mulCran_asm_adx__bench95ec95g49_u823
            (Secp256k1_Modulus__OOZconstantineZnamedZconfig95fields95and95curves_u1127,local_98);
  submod_asm__OOZconstantineZnamedZconstantsZbandersnatch95subgroups_u482
            (Secp256k1_Modulus__OOZconstantineZnamedZconfig95fields95and95curves_u1127,local_118);
  mulCran_asm_adx__bench95ec95g49_u823
            (Secp256k1_Modulus__OOZconstantineZnamedZconfig95fields95and95curves_u1127,local_78);
  mulCran_asm_adx__bench95ec95g49_u823
            (Secp256k1_Modulus__OOZconstantineZnamedZconfig95fields95and95curves_u1127,local_158);
  addmod_asm__bench95ec95g49_u1081
            (Secp256k1_Modulus__OOZconstantineZnamedZconfig95fields95and95curves_u1127,local_98);
  mulCran_asm_adx__bench95ec95g49_u823
            (Secp256k1_Modulus__OOZconstantineZnamedZconfig95fields95and95curves_u1127,local_d8);
  mulCran_asm_adx__bench95ec95g49_u823
            (Secp256k1_Modulus__OOZconstantineZnamedZconfig95fields95and95curves_u1127,local_f8);
  addmod_asm__bench95ec95g49_u1081
            (Secp256k1_Modulus__OOZconstantineZnamedZconfig95fields95and95curves_u1127,local_78);
  if (*(longlong *)(in_FS_OFFSET + 0x28) == local_38) {
    return;
  }
                    /* WARNING: Subroutine does not return */
  __stack_chk_fail();
}

mratsim · 2024-08-05T12:33:39Z

Another potential source of slowness is code alignment: https://www.bazhenov.me/posts/2024-02-performance-roulette/

As CPUs fetch data 64B at a time, instruction density can be quite important. If we inline the modulus and load it into registers with MOV, each 64-bit word requires 10 bytes to encode: 1 byte for the REX prefix, 1 for the MOV opcode and 8 for the 64-bit immediate.
https://github.com/mratsim/jitterland/blob/02febc4/jit/jit_x86_64_load_store.nim#L12-L18

func mov*(a: var Assembler[Reg_X86_64], reg: static range[rax..rdi], imm64: pointer) {.inline.} =
  ## Move immediate 64-bit pointer value into register
  a.code.add [
    rex_prefix(w = 1),
    static(0xB8.byte + reg.byte) # Move imm to r
  ]
  a.code.add cast[array[8, byte]](imm64)

So storing as a const and loading from it should be preferred.
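The size argument can be checked with a small C sketch that emits both encodings: MOV r64, imm64 (REX.W, opcode B8+reg, 8-byte immediate) and a RIP-relative load MOV r64, [rip+disp32] (REX.W, opcode 8B, ModRM, 4-byte displacement). The function names are illustrative, only rax..rdi are handled as in the Nim snippet above, and the displacement value in any real code would be supplied by the assembler or linker.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* mov r64, imm64 for rax..rdi (reg 0..7). Returns the encoded length. */
size_t encode_mov_imm64(uint8_t out[10], unsigned reg, uint64_t imm) {
    assert(reg < 8);
    out[0] = 0x48;                    /* REX.W */
    out[1] = (uint8_t)(0xB8 + reg);   /* MOV r64, imm64 */
    for (int i = 0; i < 8; i++) {
        out[2 + i] = (uint8_t)(imm >> (8 * i)); /* little-endian immediate */
    }
    return 10;
}

/* mov r64, [rip + disp32] for rax..rdi. Returns the encoded length. */
size_t encode_mov_rip_load(uint8_t out[7], unsigned reg, int32_t disp) {
    assert(reg < 8);
    uint32_t d = (uint32_t)disp;
    out[0] = 0x48;                          /* REX.W */
    out[1] = 0x8B;                          /* MOV r64, r/m64 */
    out[2] = (uint8_t)((reg << 3) | 0x05);  /* ModRM: mod=00, rm=101 (RIP-relative) */
    for (int i = 0; i < 4; i++) {
        out[3 + i] = (uint8_t)(d >> (8 * i)); /* little-endian displacement */
    }
    return 7;
}
```

For a 4-limb modulus this is 40 bytes of immediates against 28 bytes of loads, and instructions that accept a memory operand can avoid the separate MOV altogether, at the price of depending on the constant staying in cache.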


mratsim added the bug 🪲 and performance 🏁 labels Jul 27, 2024

mratsim added a commit that referenced this issue Jul 31, 2024

Explore prefetching to try to fix #446

cc7eba7

mratsim mentioned this issue Jul 31, 2024

Cache misses: explore prefetching to address #446 #447

Draft

mratsim added a commit that referenced this issue Aug 22, 2024

failed attempt at solving towering / base field perf discrepancy, similar to #446

112ab49

mratsim mentioned this issue Aug 22, 2024

Tentative: Fp2 and towering optimizations #462

Draft
