Add bit packing to NodeImpl #4651
I really like the packing! =D A comment inline about the bitfield representation though. I'm also trying to get some benchmark numbers.
TokenKind kind_;
static_assert(sizeof(kind_) == 1, "TokenKind must pack to 8 bits");
I think this is changing more than maybe you intended?
Previously, all three fields were part of a single bitfield, allowing them to be loaded and stored with a single memory access. With this change, the kind is a separate field. I think this may make it more expensive to update has_leading_space_ and token_payload_, as writes to those cannot also write to kind_.
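To make the two layouts under discussion concrete, here is a minimal sketch. The struct names are made up, and the widths are inferred from the static_assert above and the 23-bit payload visible in the assembly below, so treat this as an illustration rather than the actual Carbon source:

```cpp
#include <cstdint>

enum class TokenKind : uint8_t { /* ... */ };

// Before this PR (sketch): all three members share one bitfield word, so the
// compiler can update any of them with a single read-modify-write of that word.
struct TokenInfoBefore {
  TokenKind kind_ : 8;
  bool has_leading_space_ : 1;
  uint32_t token_payload_ : 23;
};

// With this PR (sketch): kind_ is an ordinary byte-sized member, so a store to
// has_leading_space_ or token_payload_ can no longer rewrite kind_ as part of
// the same bitfield update.
struct TokenInfoAfter {
  TokenKind kind_;
  bool has_leading_space_ : 1;
  uint32_t token_payload_ : 23;
};

static_assert(sizeof(TokenKind) == 1, "TokenKind must pack to 8 bits");
// With Clang/GCC (Itanium ABI) both structs still occupy 4 bytes; the
// difference is which members share a bitfield storage unit.
```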
Well... This does change quite a lot...
The generated code in the lexer is very different after this change for example. But the change completely baffles me. It is significantly better with this change.
Previously, we would generate code sequences like this on AArch64:
cmp w25, #0x0
mov w9, #0x72 // TokenKind 0x72
mov w10, #0x172 // TokenKind 0x72 | (1 << 8)
csel w9, w10, w9, ne // ne = any
ldr x10, [x19, #112]
adrp x11, _DYNAMIC+0xdb60
orr w9, w9, w24, lsl #9
ldr x11, [x11, #352]
orr x9, x9, x23, lsl #32
str x9, [x10, w8, uxtw #3]
This is pre-merging the constant token kind in this code path with the flag for leading space, then selecting based on whether the flag was set, and then combining the rest... I... I have no idea why the previous bitfield-oriented code produces this bizarre generated code.
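To spell out what that old sequence is computing, here is a rough C++ reconstruction from the disassembly above (the names and the 0x72 constant come from the listing, not from the actual source):

```cpp
#include <cstdint>

// Reconstruction of the old sequence: the constant token kind and the
// leading-space flag are pre-merged via a conditional select, then the
// payload and the byte offset are OR'd in on top.
uint64_t PackTokenWordOld(bool has_leading_space, uint32_t payload,
                          uint32_t byte_offset) {
  // mov w9, #0x72 / mov w10, #0x172 / csel w9, w10, w9, ne
  uint32_t kind_and_space = has_leading_space ? (0x72u | (1u << 8)) : 0x72u;
  // orr w9, w9, w24, lsl #9
  uint32_t low_word = kind_and_space | (payload << 9);
  // orr x9, x9, x23, lsl #32
  return low_word | (static_cast<uint64_t>(byte_offset) << 32);
}
```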
With this PR, we get something much nicer:
lsl x9, x23, #32
bfi x24, x25, #1, #23
ldr x10, [x19, #112]
adrp x12, _DYNAMIC+0xd9b8
mov w11, #0x72 // TokenKind 0x72
orr x9, x9, x24, lsl #8
ldr x12, [x12, #2552]
orr x9, x9, x11
str x9, [x10, w8, uxtw #3]
Here, x24 has the value of has_leading_space_, either 0 or 1. Then we do bfi, or "bitfield insert", to put the payload into bits 1-23. And then when we or it with the byte index, we do lsl #8 to shift it up by 8 more, creating space for the token kind, and or that kind in.
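In C++ terms, the word being stored at the end appears to be assembled like this, with the bit positions inferred from the instructions above (again a reconstruction for illustration, not the actual source):

```cpp
#include <cstdint>

// Inferred layout of the stored 64-bit word:
//   bits 0-7:  token kind       bit 8:      has_leading_space
//   bits 9-31: token payload    bits 32-63: byte offset
uint64_t PackTokenWordNew(uint8_t kind, bool has_leading_space,
                          uint32_t payload, uint32_t byte_offset) {
  return (static_cast<uint64_t>(byte_offset) << 32) |        // lsl x9, x23, #32
         (static_cast<uint64_t>(payload & 0x7fffff) << 9) |  // bfi #1, #23, then lsl #8
         (static_cast<uint64_t>(has_leading_space) << 8) |
         static_cast<uint64_t>(kind);                        // orr x9, x9, x11
}
```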
The new sequence is quite a bit better code: one instruction fewer, two micro-ops fewer by my reading, etc.
So, I'm quite surprised that the code intended to make it easy for LLVM to emit this well backfires, while this code, which is a bit trickier for the optimizer, still ends up better... But it does! So, yay?
This is a measurable difference in the lexer, about 1.3% on my M1.
Anyways, ignore my original comment on this PR. This change at least seems good. I may send a follow-up to add a comment to the code that surfaces how fragile the generated code quality is here and what kinds of options we've considered.
Very surprisingly, this is a pretty clear regression for parsing for me, but a win for lexing.
The lexing improvement is because of the bitfield change. See below for the quite surprising analysis on what happened there.
I'm still trying to figure out why parsing is slowing down (and enough to eclipse the win from lexing).
I think this LGTM with emplace_back replaced by push_back as I suggested below.
It's a wash performance-wise from what I can tell in parse on an M1. I think the caching layer is just too good there for the size reduction to show up. But I don't see any particularly bad code and it does use less memory, so it seems like a good change.
@@ -107,8 +107,7 @@ class Context {
                    kind != NodeKind::InvalidParseStart &&
                    kind != NodeKind::InvalidParseSubtree),
                   "{0} nodes must always have an error", kind);
-  tree_->node_impls_.push_back(
-      {.kind = kind, .has_error = has_error, .token = token});
+  tree_->node_impls_.emplace_back(kind, has_error, token);
It's (sadly) important to use push_back here. Without that, this is a measurable net regression on parsing on M1.
Suggested change:
-  tree_->node_impls_.emplace_back(kind, has_error, token);
+  tree_->node_impls_.push_back(Tree::NodeImpl(kind, has_error, token));
Just a small packing optimization. We currently have 222 NodeKinds, so this reduces us to just 30ish more we can add without needing to pack more. However, if we did, there would be a couple of options for bringing the count down by reusing NodeKinds and disambiguating based on the token kind (the 29 infix operators as an example). Or we could just undo this.
I'm expecting this to yield a small improvement. I'll see if I can get better numbers since my machine's not really reliable, but here are some basic values.
Also suggesting to drop the use of ::RawEnumType for TokenKind, since bit packing appears to work without it. Hoping the static_assert makes it easier for people to understand the size of the field.
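As a rough illustration of the kind of packing described here (the member names and widths below are my guesses for illustration, not the actual NodeImpl definition):

```cpp
#include <cstdint>

enum class NodeKind : uint8_t { /* ... 222 kinds today ... */ };

// Hypothetical packed parse node: the kind fits in 8 bits (with ~30 values to
// spare before hitting 256), the error flag takes 1 bit, and the token index
// takes the remaining bits of the word.
struct PackedNodeImpl {
  NodeKind kind : 8;
  bool has_error : 1;
  uint32_t token_index : 23;
};

static_assert(sizeof(NodeKind) == 1, "NodeKind must pack to 8 bits");
```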
With the change:
Without the change: