Low-level: discrepancy between field arithmetic performance and elliptic curve performance #446
with only secp256k1 EC G1 Jacobian addition and the LTO ends up inlining everything into the benchmark function (that are explicitly tagged {.noinline.} since #445).

It's extremely suspicious that addmod takes more time than multiplication, and that submod takes that much time as well.

[Profiler screenshots: Addmod (local percents), Submod (local percents), Partial reduce]

The Crandall partial reduce seems to need a prefetch around mulx over rsp/temporaries:

[Profiler screenshots: Addmod (cache misses local, cache misses global), Submod (cache misses local, cache misses global)]
Reverse engineering

Each call has some MOV and LEA ceremony, but recent CPUs can issue 4 MOVs per cycle.

Function calls decompiled from assembly:

void sum__bench95ec95g49_u1825
(undefined8 param_1,longlong param_2,undefined8 param_3,undefined8 param_4)
{
longlong lVar1;
longlong lVar2;
longlong unaff_RSI;
longlong in_FS_OFFSET;
undefined local_158 [32];
undefined local_138 [32];
undefined local_118 [32];
undefined local_f8 [32];
undefined local_d8 [32];
undefined local_b8 [32];
undefined local_98 [32];
undefined local_78 [32];
undefined local_58 [32];
longlong local_38;
local_38 = *(longlong *)(in_FS_OFFSET + 0x28);
mulCran_asm_adx__bench95ec95g49_u823
(Secp256k1_Modulus__OOZconstantineZnamedZconfig95fields95and95curves_u1127);
lVar1 = param_2 + 0x20;
mulCran_asm_adx__bench95ec95g49_u823
(Secp256k1_Modulus__OOZconstantineZnamedZconfig95fields95and95curves_u1127);
lVar2 = param_2 + 0x40;
mulCran_asm_adx__bench95ec95g49_u823
(Secp256k1_Modulus__OOZconstantineZnamedZconfig95fields95and95curves_u1127);
addmod_asm__bench95ec95g49_u1081
(Secp256k1_Modulus__OOZconstantineZnamedZconfig95fields95and95curves_u1127,
unaff_RSI + 0x20,param_3,param_4,param_2);
addmod_asm__bench95ec95g49_u1081
(Secp256k1_Modulus__OOZconstantineZnamedZconfig95fields95and95curves_u1127,lVar1);
mulCran_asm_adx__bench95ec95g49_u823
(Secp256k1_Modulus__OOZconstantineZnamedZconfig95fields95and95curves_u1127,local_f8);
addmod_asm__bench95ec95g49_u1081
(Secp256k1_Modulus__OOZconstantineZnamedZconfig95fields95and95curves_u1127,local_98);
submod_asm__OOZconstantineZnamedZconstantsZbandersnatch95subgroups_u482
(Secp256k1_Modulus__OOZconstantineZnamedZconfig95fields95and95curves_u1127,local_f8);
addmod_asm__bench95ec95g49_u1081
(Secp256k1_Modulus__OOZconstantineZnamedZconfig95fields95and95curves_u1127,
unaff_RSI + 0x40);
addmod_asm__bench95ec95g49_u1081
(Secp256k1_Modulus__OOZconstantineZnamedZconfig95fields95and95curves_u1127,lVar2);
mulCran_asm_adx__bench95ec95g49_u823
(Secp256k1_Modulus__OOZconstantineZnamedZconfig95fields95and95curves_u1127,local_118);
addmod_asm__bench95ec95g49_u1081
(Secp256k1_Modulus__OOZconstantineZnamedZconfig95fields95and95curves_u1127,local_b8);
submod_asm__OOZconstantineZnamedZconstantsZbandersnatch95subgroups_u482
(Secp256k1_Modulus__OOZconstantineZnamedZconfig95fields95and95curves_u1127,local_118);
addmod_asm__bench95ec95g49_u1081
(Secp256k1_Modulus__OOZconstantineZnamedZconfig95fields95and95curves_u1127,
unaff_RSI + 0x40);
addmod_asm__bench95ec95g49_u1081
(Secp256k1_Modulus__OOZconstantineZnamedZconfig95fields95and95curves_u1127,lVar2);
mulCran_asm_adx__bench95ec95g49_u823
(Secp256k1_Modulus__OOZconstantineZnamedZconfig95fields95and95curves_u1127,local_138);
addmod_asm__bench95ec95g49_u1081
(Secp256k1_Modulus__OOZconstantineZnamedZconfig95fields95and95curves_u1127,local_b8);
submod_asm__OOZconstantineZnamedZconstantsZbandersnatch95subgroups_u482
(Secp256k1_Modulus__OOZconstantineZnamedZconfig95fields95and95curves_u1127,local_138);
addmod_asm__bench95ec95g49_u1081
(Secp256k1_Modulus__OOZconstantineZnamedZconfig95fields95and95curves_u1127,local_78);
addmod_asm__bench95ec95g49_u1081
(Secp256k1_Modulus__OOZconstantineZnamedZconfig95fields95and95curves_u1127,local_118);
addmod_asm__bench95ec95g49_u1081
(Secp256k1_Modulus__OOZconstantineZnamedZconfig95fields95and95curves_u1127,local_b8);
addmod_asm__bench95ec95g49_u1081
(Secp256k1_Modulus__OOZconstantineZnamedZconfig95fields95and95curves_u1127,local_58);
addmod_asm__bench95ec95g49_u1081
(Secp256k1_Modulus__OOZconstantineZnamedZconfig95fields95and95curves_u1127,local_b8);
addmod_asm__bench95ec95g49_u1081
(Secp256k1_Modulus__OOZconstantineZnamedZconfig95fields95and95curves_u1127,local_58);
addmod_asm__bench95ec95g49_u1081
(Secp256k1_Modulus__OOZconstantineZnamedZconfig95fields95and95curves_u1127,local_58);
addmod_asm__bench95ec95g49_u1081
(Secp256k1_Modulus__OOZconstantineZnamedZconfig95fields95and95curves_u1127,local_58);
addmod_asm__bench95ec95g49_u1081
(Secp256k1_Modulus__OOZconstantineZnamedZconfig95fields95and95curves_u1127,local_b8);
submod_asm__OOZconstantineZnamedZconstantsZbandersnatch95subgroups_u482
(Secp256k1_Modulus__OOZconstantineZnamedZconfig95fields95and95curves_u1127,local_b8);
addmod_asm__bench95ec95g49_u1081
(Secp256k1_Modulus__OOZconstantineZnamedZconfig95fields95and95curves_u1127,local_138);
addmod_asm__bench95ec95g49_u1081
(Secp256k1_Modulus__OOZconstantineZnamedZconfig95fields95and95curves_u1127,local_58);
addmod_asm__bench95ec95g49_u1081
(Secp256k1_Modulus__OOZconstantineZnamedZconfig95fields95and95curves_u1127,local_138);
addmod_asm__bench95ec95g49_u1081
(Secp256k1_Modulus__OOZconstantineZnamedZconfig95fields95and95curves_u1127,local_58);
addmod_asm__bench95ec95g49_u1081
(Secp256k1_Modulus__OOZconstantineZnamedZconfig95fields95and95curves_u1127,local_58);
addmod_asm__bench95ec95g49_u1081
(Secp256k1_Modulus__OOZconstantineZnamedZconfig95fields95and95curves_u1127,local_58);
mulCran_asm_adx__bench95ec95g49_u823
(Secp256k1_Modulus__OOZconstantineZnamedZconfig95fields95and95curves_u1127,local_138);
mulCran_asm_adx__bench95ec95g49_u823
(Secp256k1_Modulus__OOZconstantineZnamedZconfig95fields95and95curves_u1127,local_98);
submod_asm__OOZconstantineZnamedZconstantsZbandersnatch95subgroups_u482
(Secp256k1_Modulus__OOZconstantineZnamedZconfig95fields95and95curves_u1127,local_118);
mulCran_asm_adx__bench95ec95g49_u823
(Secp256k1_Modulus__OOZconstantineZnamedZconfig95fields95and95curves_u1127,local_78);
mulCran_asm_adx__bench95ec95g49_u823
(Secp256k1_Modulus__OOZconstantineZnamedZconfig95fields95and95curves_u1127,local_158);
addmod_asm__bench95ec95g49_u1081
(Secp256k1_Modulus__OOZconstantineZnamedZconfig95fields95and95curves_u1127,local_98);
mulCran_asm_adx__bench95ec95g49_u823
(Secp256k1_Modulus__OOZconstantineZnamedZconfig95fields95and95curves_u1127,local_d8);
mulCran_asm_adx__bench95ec95g49_u823
(Secp256k1_Modulus__OOZconstantineZnamedZconfig95fields95and95curves_u1127,local_f8);
addmod_asm__bench95ec95g49_u1081
(Secp256k1_Modulus__OOZconstantineZnamedZconfig95fields95and95curves_u1127,local_78);
if (*(longlong *)(in_FS_OFFSET + 0x28) == local_38) {
return;
}
/* WARNING: Subroutine does not return */
__stack_chk_fail();
}
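Tallying the calls in the decompiled listing above gives 12 `mulCran_asm_adx` calls, 26 `addmod_asm` calls and 5 `submod_asm` calls, 43 in total. The sketch below turns that call mix into a rough cost model. The helper names `estimate_cycles` and `addmod_share_percent` are hypothetical, and the per-call costs are parameters to be filled in from measurements, not values taken from this issue.

```c
#include <assert.h>
#include <stdint.h>

/* Call counts read off the decompiled Jacobian addition listing. */
enum { MUL_CALLS = 12, ADDMOD_CALLS = 26, SUBMOD_CALLS = 5 };

/* Rough total cost of one Jacobian addition, given per-call costs
   (in cycles or any other consistent unit). Call overhead is ignored. */
static uint64_t estimate_cycles(uint64_t mul_cost, uint64_t add_cost,
                                uint64_t sub_cost) {
    return MUL_CALLS * mul_cost
         + ADDMOD_CALLS * add_cost
         + SUBMOD_CALLS * sub_cost;
}

/* Share of the total spent in addmod, as a truncated integer percentage. */
static unsigned addmod_share_percent(uint64_t mul_cost, uint64_t add_cost,
                                     uint64_t sub_cost) {
    uint64_t total = estimate_cycles(mul_cost, add_cost, sub_cost);
    assert(total > 0); /* at least one cost must be non-zero */
    return (unsigned)(100 * ADDMOD_CALLS * add_cost / total);
}
```

Comparing the share predicted from isolated field benchmarks with the share reported by the profiler is one way to check whether addmod is slow per call or simply called often.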
Another potential source of slowness is code alignment: https://www.bazhenov.me/posts/2024-02-performance-roulette/

As CPUs fetch data 64 B at a time, instruction density can be quite important. If we inline the modulus and load it into a register with MOV, each such MOV requires 10 bytes to encode: 1 byte for the REX prefix, 1 for the MOV opcode and 8 for the 64-bit immediate.

func mov*(a: var Assembler[Reg_X86_64], reg: static range[rax..rdi], imm64: pointer) {.inline.} =
  ## Move immediate 64-bit pointer value into register
  a.code.add [
    rex_prefix(w = 1),
    static(0xB8.byte + reg.byte) # Move imm to r
  ]
  a.code.add cast[array[8, byte]](imm64)

So storing the modulus as a const and loading from it should be preferred.
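To make the size argument concrete, here is a minimal C sketch that emits the two encodings side by side: `mov r64, imm64` (10 bytes) and a RIP-relative load `mov r64, [rip + disp32]` (7 bytes). The helper names `encode_mov_imm64` and `encode_mov_rip_load` are hypothetical and are not part of the assembler shown above. Both are restricted to rax..rdi, matching the `range[rax..rdi]` constraint of the Nim code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode `mov r64, imm64` (REX.W + B8+rd) for rax..rdi (reg 0..7).
   Returns the number of bytes written, always 10. */
static size_t encode_mov_imm64(uint8_t *out, unsigned reg, uint64_t imm) {
    assert(reg < 8);                 /* r8..r15 would also need REX.B */
    out[0] = 0x48;                   /* REX prefix with W = 1 */
    out[1] = (uint8_t)(0xB8 + reg);  /* opcode: move imm to r */
    for (int i = 0; i < 8; i++)      /* little-endian 64-bit immediate */
        out[2 + i] = (uint8_t)(imm >> (8 * i));
    return 10;
}

/* Encode `mov r64, [rip + disp32]` (REX.W + 8B /r, ModRM mod=00 rm=101).
   Returns the number of bytes written, always 7. */
static size_t encode_mov_rip_load(uint8_t *out, unsigned reg, int32_t disp) {
    assert(reg < 8);
    uint32_t d = (uint32_t)disp;
    out[0] = 0x48;                          /* REX prefix with W = 1 */
    out[1] = 0x8B;                          /* opcode: move r/m64 to r64 */
    out[2] = (uint8_t)(0x05 | (reg << 3));  /* ModRM: reg field, RIP-relative */
    for (int i = 0; i < 4; i++)             /* little-endian 32-bit displacement */
        out[3 + i] = (uint8_t)(d >> (8 * i));
    return 7;
}
```

For a 4-limb modulus this is 40 bytes of immediates versus 28 bytes of loads, before counting any further savings from using the constant directly as a memory operand.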
As mentioned in #445, there is a large discrepancy between the performance when benchmarking field arithmetic and the elliptic curves built on top, especially on Secp256k1 vs libsecp256k1 and RustCrypto.
We start with a 1.7x advantage on field arithmetic that gets reduced to a 0.85x disadvantage on constant-time code.
There is an unexplained performance bug.
Some possibilities:
- 2 LEA and 13 MOV before function calls, which doesn't seem costly enough for such a difference.
- There is the regular `if adx` test, but it should be cached and almost costless on Haswell and later CPUs.
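For reference, a cached feature test of the kind described above might look like the following C sketch. It is an illustration only, not the library's actual dispatch code: the names `detect_adx_bmi2` and `has_adx` are hypothetical, it assumes GCC or Clang (for `<cpuid.h>`), and the cache is not thread-safe. After the first call the test is a load and a well-predicted branch.

```c
#include <assert.h>
#include <stdbool.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

/* Query CPUID leaf 7, subleaf 0: EBX bit 8 = BMI2 (MULX),
   EBX bit 19 = ADX (ADCX/ADOX). Returns false on non-x86 targets. */
static bool detect_adx_bmi2(void) {
#if defined(__x86_64__) || defined(__i386__)
    unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return false;  /* leaf 7 not supported */
    return ((ebx >> 8) & 1u) && ((ebx >> 19) & 1u);
#else
    return false;
#endif
}

/* Probe once, then answer from the cached result. */
static bool has_adx(void) {
    static int cached = -1;  /* -1 = not probed yet */
    if (cached < 0)
        cached = detect_adx_bmi2() ? 1 : 0;
    assert(cached == 0 || cached == 1);
    return cached == 1;
}
```

A caller would branch on `has_adx()` once per field operation to select the ADX or the generic code path.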