
Add bit packing to NodeImpl #4651

Open · wants to merge 3 commits into trunk

Conversation

@jonmeow (Contributor) commented Dec 7, 2024

Just a small packing optimization. We currently have 222 NodeKinds, so this leaves roughly 30 more we can add before we'd need to pack further. If we did hit that limit, there would be a couple of options for bringing the count back down, such as reusing NodeKinds and disambiguating based on the token kind (the 29 infix operators, for example). Or we could just undo this.
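
To make the packing concrete, here is an illustrative sketch of a layout along these lines; the field names and widths are assumptions for the example, not the actual NodeImpl definition:

    // Sketch only: the kind and error flag share one 32-bit word via bitfields
    // instead of each taking its own byte. Names and widths are illustrative.
    #include <cstdint>

    struct PackedNode {
      uint32_t kind : 8;       // 222 NodeKinds today, so roughly 30 more fit in 8 bits.
      uint32_t has_error : 1;  // packed next to the kind rather than in a separate byte.
      int32_t token_index;     // the token this node corresponds to.
    };
    static_assert(sizeof(PackedNode) == 8, "kind and has_error should share one word");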

I'm expecting this to yield a small improvement. I'll see if I can get better numbers since my machine's not really reliable, but here are some basic values.

Also suggesting we drop the use of ::RawEnumType for TokenKind, since bit packing appears to work without it. I'm hoping the static_assert makes the size of the field easier for people to understand.

With the change:

----------------------------------------------------------------------------------------------------------------------------
Benchmark                                                 Time             CPU   Iterations      Bytes      Lines     Tokens
----------------------------------------------------------------------------------------------------------------------------
BM_CompileAPIFileDenseDecls<Phase::Parse>/256         50399 ns        50359 ns        14336 104.588M/s 3.87217M/s 21.8629M/s
BM_CompileAPIFileDenseDecls<Phase::Parse>/1024       237823 ns       237629 ns         3072 136.721M/s 4.11986M/s 24.2058M/s
BM_CompileAPIFileDenseDecls<Phase::Parse>/4096       997645 ns       996771 ns          768 142.343M/s 4.04105M/s 23.9363M/s
BM_CompileAPIFileDenseDecls<Phase::Parse>/16384     4020308 ns      4018319 ns          192 152.041M/s 4.05966M/s 24.0874M/s
BM_CompileAPIFileDenseDecls<Phase::Parse>/65536    16691390 ns     16683058 ns           48 151.317M/s 3.92374M/s 23.2936M/s
BM_CompileAPIFileDenseDecls<Phase::Parse>/262144   75265735 ns     75233476 ns            8 135.842M/s 3.48421M/s 20.6862M/s

Without the change:

----------------------------------------------------------------------------------------------------------------------------
Benchmark                                                 Time             CPU   Iterations      Bytes      Lines     Tokens
----------------------------------------------------------------------------------------------------------------------------
BM_CompileAPIFileDenseDecls<Phase::Parse>/256         51515 ns        51480 ns        13312 102.312M/s 3.78789M/s  21.387M/s
BM_CompileAPIFileDenseDecls<Phase::Parse>/1024       241040 ns       240900 ns         3072 134.865M/s 4.06392M/s 23.8771M/s
BM_CompileAPIFileDenseDecls<Phase::Parse>/4096       985593 ns       984657 ns          768 144.094M/s 4.09077M/s 24.2308M/s
BM_CompileAPIFileDenseDecls<Phase::Parse>/16384     4109327 ns      4105496 ns          192 148.813M/s 3.97345M/s  23.576M/s
BM_CompileAPIFileDenseDecls<Phase::Parse>/65536    17459655 ns     17446006 ns           48   144.7M/s 3.75215M/s  22.275M/s
BM_CompileAPIFileDenseDecls<Phase::Parse>/262144   80802815 ns     80737489 ns            8 126.581M/s 3.24668M/s  19.276M/s
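
Taking the largest case (262144) as a rough yardstick, wall time drops from about 80.8ms to about 75.3ms with the change, i.e. roughly a 6.9% improvement, with the caveat above that these numbers come from an unreliable machine.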

@jonmeow jonmeow requested a review from chandlerc December 7, 2024 00:34
@github-actions github-actions bot requested a review from geoffromer December 7, 2024 00:34
@jonmeow jonmeow removed the request for review from geoffromer December 7, 2024 00:34
@chandlerc (Contributor) left a comment


I really like the packing! =D A comment inline about the bitfield representation though. I'm also trying to get some benchmark numbers.

Comment on lines +402 to +403
TokenKind kind_;
static_assert(sizeof(kind_) == 1, "TokenKind must pack to 8 bits");
@chandlerc (Contributor) commented:

I think this is changing more than maybe you intended?

Previously, all three fields were part of a single bitfield, allowing them to be loaded and stored in a single memory access. With this change, the kind is a separate field. I think this may make it more expensive to update has_leading_space_ and token_payload_, since writes to those can no longer also write kind_ in the same access.
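
As a sketch of the difference being described (the field names follow the discussion above; the widths and types are assumptions):

    #include <cstdint>

    // Before (roughly): all three fields live in one bitfield word, so updating
    // has_leading_space_ or token_payload_ is a read-modify-write of the same
    // storage unit that also holds kind_.
    struct TokenInfoAllBitfield {
      uint32_t kind_ : 8;
      uint32_t has_leading_space_ : 1;
      uint32_t token_payload_ : 23;
    };

    // After (roughly): kind_ is its own byte-sized member, so a store to the
    // remaining bitfield no longer also writes the byte holding kind_.
    struct TokenInfoSplitKind {
      uint8_t kind_;  // stands in for TokenKind, asserted elsewhere to be 1 byte.
      uint32_t has_leading_space_ : 1;
      uint32_t token_payload_ : 23;
    };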

@chandlerc (Contributor) commented:

Well... This does change quite a lot...

The generated code in the lexer is very different after this change for example. But the change completely baffles me. It is significantly better with this change.

Previously, we would generate code sequences like on AArch64:

       cmp   w25, #0x0
       mov   w9, #0x72                       // TokenKind 0x72
       mov   w10, #0x172                     // TokenKind 0x72 | (1 << 8)
       csel  w9, w10, w9, ne // ne = any
       ldr   x10, [x19, #112]
       adrp  x11, _DYNAMIC+0xdb60
       orr   w9, w9, w24, lsl #9
       ldr   x11, [x11, #352]
       orr   x9, x9, x23, lsl #32
       str   x9, [x10, w8, uxtw #3]

This is pre-merging the constant token kind in this code path with the flag for leading space, then selecting based on whether the flag was set, and then combining the rest... I... I have no idea why the previous bitfield oriented code produces this bizarre generated code.

With this PR, we get much nicer:

       lsl   x9, x23, #32
       bfi   x24, x25, #1, #23
       ldr   x10, [x19, #112]
       adrp  x12, _DYNAMIC+0xd9b8
       mov   w11, #0x72                      // TokenKind 0x72
       orr   x9, x9, x24, lsl #8
       ldr   x12, [x12, #2552]
       orr   x9, x9, x11
       str   x9, [x10, w8, uxtw #3]

Here, x24 holds the value of has_leading_space_, either 0 or 1. Then we do a bfi, or "bitfield insert", to place the payload into bits 1-23. When we or that into the byte index, the lsl #8 shifts it up by 8 more bits, creating space for the token kind, which is then or-ed in.
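
Spelled out as C++, and reading the field positions off the instructions above rather than from the real TokenInfo, the stored value is assembled roughly like this:

    #include <cstdint>

    // Sketch of the bit assembly in the improved sequence; the shifts and widths
    // are inferred from the assembly (bfi ..., #1, #23; lsl #8; lsl #32), not
    // taken from the actual source.
    auto pack_token_info(uint64_t kind, uint64_t has_leading_space,
                         uint64_t payload, uint64_t byte_offset) -> uint64_t {
      uint64_t low = has_leading_space | (payload << 1);  // bfi x24, x25, #1, #23
      return (byte_offset << 32)                          // lsl x9, x23, #32
             | (low << 8)                                 // orr x9, x9, x24, lsl #8
             | kind;                                      // orr x9, x9, x11
    }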

This is quite a bit better code. One instruction fewer, two micro-ops fewer by my reading, etc.

So, I'm quite surprised that the code intended to make it easy for LLVM to emit this well backfires, while this code, which is a bit trickier for the optimizer, still ends up better... But it does! So, yay?

This is a measurable difference in the lexer, about 1.3% on my M1.

Anyways, ignore my original comment on this PR. This change at least seems good. I may send a follow-up to add a comment to the code that surfaces how fragile the generated code quality is here and what kinds of options we've considered.

@chandlerc (Contributor) left a comment

Very surprisingly, this is a pretty clear regression for parsing for me, but a win for lexing.

The lexing improvement is because of the bitfield change. See the inline comment above for the quite surprising analysis of what happened there.

I'm still trying to figure out why parsing is slowing down (and enough to eclipse the win from lexing).


@chandlerc (Contributor) left a comment

I think this LGTM with emplace_back replaced by push_back as I suggested below.

It's a wash performance-wise from what I can tell in parse on an M1. I think the caching layer is just too good there for the size reduction to show up. But I don't see any particularly bad code and it does use less memory, so it seems like a good change.

@@ -107,8 +107,7 @@ class Context {
                    kind != NodeKind::InvalidParseStart &&
                    kind != NodeKind::InvalidParseSubtree),
               "{0} nodes must always have an error", kind);
-  tree_->node_impls_.push_back(
-      {.kind = kind, .has_error = has_error, .token = token});
+  tree_->node_impls_.emplace_back(kind, has_error, token);
@chandlerc (Contributor) commented:

It's (sadly) important to use push_back here. Without that, this is a measurable net regression on parsing on M1.

Suggested change:
-  tree_->node_impls_.emplace_back(kind, has_error, token);
+  tree_->node_impls_.push_back(Tree::NodeImpl(kind, has_error, token));
