
Add bit packing to NodeImpl #4651

Open · wants to merge 3 commits into trunk

Conversation

@jonmeow (Contributor) commented Dec 7, 2024

Just a small packing optimization. We currently have 222 NodeKinds, so this leaves roughly 30 more we can add before we'd need to pack further. If we did hit that limit, there would be a couple of options for bringing the count back down, such as reusing NodeKinds and disambiguating based on the token kind (the 29 infix operators, for example). Or we could just undo this.
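
To make the packing concrete, here is an illustrative sketch of a layout along these lines; the field names and widths are assumptions for the example, not the actual NodeImpl definition:

    // Sketch only: the kind and error flag share one 32-bit word via bitfields
    // instead of each taking its own byte. Names and widths are illustrative.
    #include <cstdint>

    struct PackedNode {
      uint32_t kind : 8;       // 222 NodeKinds today, so roughly 30 more fit in 8 bits.
      uint32_t has_error : 1;  // packed next to the kind rather than in a separate byte.
      int32_t token_index;     // the token this node corresponds to.
    };
    static_assert(sizeof(PackedNode) == 8, "kind and has_error should share one word");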

I'm expecting this to yield a small improvement. I'll see if I can get better numbers since my machine's not really reliable, but here are some basic values.

Also suggesting we drop the use of ::RawEnumType for TokenKind, since bit packing appears to work without it. I'm hoping the static_assert makes the size of the field easier for people to understand.

With the change:

----------------------------------------------------------------------------------------------------------------------------
Benchmark                                                 Time             CPU   Iterations      Bytes      Lines     Tokens
----------------------------------------------------------------------------------------------------------------------------
BM_CompileAPIFileDenseDecls<Phase::Parse>/256         50399 ns        50359 ns        14336 104.588M/s 3.87217M/s 21.8629M/s
BM_CompileAPIFileDenseDecls<Phase::Parse>/1024       237823 ns       237629 ns         3072 136.721M/s 4.11986M/s 24.2058M/s
BM_CompileAPIFileDenseDecls<Phase::Parse>/4096       997645 ns       996771 ns          768 142.343M/s 4.04105M/s 23.9363M/s
BM_CompileAPIFileDenseDecls<Phase::Parse>/16384     4020308 ns      4018319 ns          192 152.041M/s 4.05966M/s 24.0874M/s
BM_CompileAPIFileDenseDecls<Phase::Parse>/65536    16691390 ns     16683058 ns           48 151.317M/s 3.92374M/s 23.2936M/s
BM_CompileAPIFileDenseDecls<Phase::Parse>/262144   75265735 ns     75233476 ns            8 135.842M/s 3.48421M/s 20.6862M/s

Without the change:

----------------------------------------------------------------------------------------------------------------------------
Benchmark                                                 Time             CPU   Iterations      Bytes      Lines     Tokens
----------------------------------------------------------------------------------------------------------------------------
BM_CompileAPIFileDenseDecls<Phase::Parse>/256         51515 ns        51480 ns        13312 102.312M/s 3.78789M/s  21.387M/s
BM_CompileAPIFileDenseDecls<Phase::Parse>/1024       241040 ns       240900 ns         3072 134.865M/s 4.06392M/s 23.8771M/s
BM_CompileAPIFileDenseDecls<Phase::Parse>/4096       985593 ns       984657 ns          768 144.094M/s 4.09077M/s 24.2308M/s
BM_CompileAPIFileDenseDecls<Phase::Parse>/16384     4109327 ns      4105496 ns          192 148.813M/s 3.97345M/s  23.576M/s
BM_CompileAPIFileDenseDecls<Phase::Parse>/65536    17459655 ns     17446006 ns           48   144.7M/s 3.75215M/s  22.275M/s
BM_CompileAPIFileDenseDecls<Phase::Parse>/262144   80802815 ns     80737489 ns            8 126.581M/s 3.24668M/s  19.276M/s
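
Taking the largest case (262144) as a rough yardstick, wall time drops from about 80.8ms to about 75.3ms with the change, i.e. roughly a 6.9% improvement, with the caveat above that these numbers come from an unreliable machine.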

@jonmeow jonmeow requested a review from chandlerc December 7, 2024 00:34
@github-actions github-actions bot requested a review from geoffromer December 7, 2024 00:34
@jonmeow jonmeow removed the request for review from geoffromer December 7, 2024 00:34
@chandlerc (Contributor) left a comment


I really like the packing! =D A comment inline about the bitfield representation though. I'm also trying to get some benchmark numbers.

Comment on lines +402 to +403
TokenKind kind_;
static_assert(sizeof(kind_) == 1, "TokenKind must pack to 8 bits");
@chandlerc (Contributor) commented:

I think this is changing more than maybe you intended?

Previously, all three fields were part of a single bitfield, allowing them to be loaded and stored in a single memory access. With this change, the kind is a separate field. I think this may make it more expensive to update has_leading_space_ and token_payload_, since writes to those can no longer also write kind_ in the same access.
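
As a sketch of the difference being described (the field names follow the discussion above; the widths and types are assumptions):

    #include <cstdint>

    // Before (roughly): all three fields live in one bitfield word, so updating
    // has_leading_space_ or token_payload_ is a read-modify-write of the same
    // storage unit that also holds kind_.
    struct TokenInfoAllBitfield {
      uint32_t kind_ : 8;
      uint32_t has_leading_space_ : 1;
      uint32_t token_payload_ : 23;
    };

    // After (roughly): kind_ is its own byte-sized member, so a store to the
    // remaining bitfield no longer also writes the byte holding kind_.
    struct TokenInfoSplitKind {
      uint8_t kind_;  // stands in for TokenKind, asserted elsewhere to be 1 byte.
      uint32_t has_leading_space_ : 1;
      uint32_t token_payload_ : 23;
    };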

@chandlerc (Contributor) commented:

Well... This does change quite a lot...

The generated code in the lexer is very different after this change for example. But the change completely baffles me. It is significantly better with this change.

Previously, we would generate code sequences like on AArch64:

       cmp   w25, #0x0
       mov   w9, #0x72                       // TokenKind 0x72
       mov   w10, #0x172                     // TokenKind 0x72 | (1 << 8)
       csel  w9, w10, w9, ne // ne = any
       ldr   x10, [x19, #112]
       adrp  x11, _DYNAMIC+0xdb60
       orr   w9, w9, w24, lsl #9
       ldr   x11, [x11, #352]
       orr   x9, x9, x23, lsl #32
       str   x9, [x10, w8, uxtw #3]

This is pre-merging the constant token kind in this code path with the flag for leading space, then selecting based on whether the flag was set, and then combining the rest... I... I have no idea why the previous bitfield oriented code produces this bizarre generated code.

With this PR, we get much nicer:

       lsl   x9, x23, #32
       bfi   x24, x25, #1, #23
       ldr   x10, [x19, #112]
       adrp  x12, _DYNAMIC+0xd9b8
       mov   w11, #0x72                      // TokenKind 0x72
       orr   x9, x9, x24, lsl #8
       ldr   x12, [x12, #2552]
       orr   x9, x9, x11
       str   x9, [x10, w8, uxtw #3]

Here, x24 holds the value of has_leading_space_, either 0 or 1. Then we do a bfi, or "bitfield insert", to place the payload into bits 1-23. When we or that into the byte index, the lsl #8 shifts it up by 8 more bits, creating space for the token kind, which is then or-ed in.
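
Spelled out as C++, and reading the field positions off the instructions above rather than from the real TokenInfo, the stored value is assembled roughly like this:

    #include <cstdint>

    // Sketch of the bit assembly in the improved sequence; the shifts and widths
    // are inferred from the assembly (bfi ..., #1, #23; lsl #8; lsl #32), not
    // taken from the actual source.
    auto pack_token_info(uint64_t kind, uint64_t has_leading_space,
                         uint64_t payload, uint64_t byte_offset) -> uint64_t {
      uint64_t low = has_leading_space | (payload << 1);  // bfi x24, x25, #1, #23
      return (byte_offset << 32)                          // lsl x9, x23, #32
             | (low << 8)                                 // orr x9, x9, x24, lsl #8
             | kind;                                      // orr x9, x9, x11
    }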

This is quite a bit better code. One instruction fewer, two micro-ops fewer by my reading, etc.

So, I'm quite surprised that the code intended to make it easy for LLVM to emit this well backfires, while this code, which is a bit trickier for the optimizer, still ends up better... But it does! So, yay?

This is a measurable difference in the lexer, about 1.3% on my M1.

Anyways, ignore my original comment on this PR. This change at least seems good. I may send a follow-up to add a comment to the code that surfaces how fragile the generated code quality is here and what kinds of options we've considered.

@chandlerc (Contributor) left a comment

Very surprisingly, this is a pretty clear regression for parsing for me, but a win for lexing.

The lexing improvement is because of the bitfield change. See the inline comment above for the quite surprising analysis of what happened there.

I'm still trying to figure out why parsing is slowing down (and enough to eclipse the win from lexing).


@chandlerc (Contributor) left a comment

I think this LGTM with emplace_back replaced by push_back as I suggested below.

It's a wash performance-wise from what I can tell in parse on an M1. I think the caching layer is just too good there for the size reduction to show up. But I don't see any particularly bad code and it does use less memory, so it seems like a good change.

@@ -107,8 +107,7 @@ class Context {
                    kind != NodeKind::InvalidParseStart &&
                    kind != NodeKind::InvalidParseSubtree),
               "{0} nodes must always have an error", kind);
-  tree_->node_impls_.push_back(
-      {.kind = kind, .has_error = has_error, .token = token});
+  tree_->node_impls_.emplace_back(kind, has_error, token);
@chandlerc (Contributor) commented:

It's (sadly) important to use push_back here. Without that, this is a measurable net regression on parsing on M1.

Suggested change:
-  tree_->node_impls_.emplace_back(kind, has_error, token);
+  tree_->node_impls_.push_back(Tree::NodeImpl(kind, has_error, token));
