Slightly faster keyword lookups #1591
Original file line number | Diff line number | Diff line change | ||||
---|---|---|---|---|---|---|
|
@@ -973,3 +973,61 @@ pub const RESERVED_FOR_IDENTIFIER: &[Keyword] = &[ | |||||
Keyword::STRUCT, | ||||||
Keyword::TRIM, | ||||||
]; | ||||||
|
||||||
pub const NA: usize = usize::MAX; | ||||||
|
||||||
#[rustfmt::skip] | ||||||
pub const KEYWORD_LOOKUP_INDEX_ROOT: &[usize; 26] = &[ | ||||||
0, 42, 67, 148, 198, 241, 281, 294, 305, 350, 357, 360, 390, | ||||||
430, 465, 497, 539, 543, 605, 683, 728, 761, 780, 793, 795, 796, | ||||||
]; | ||||||
|
||||||
pub fn lookup(word: &str) -> Keyword { | ||||||
if word.len() < 2 { | ||||||
return Keyword::NoKeyword; | ||||||
} | ||||||
|
||||||
let word = word.to_uppercase(); | ||||||
If we can figure out how to remove this `to_uppercase` call, I think this approach will likely be very fast. I dug around in the Rust API docs, and it seems we could change the comparison to ignore ASCII case. See comment below 🤔 |
||||||
let byte1 = word.as_bytes()[0]; | ||||||
if !byte1.is_ascii_uppercase() { | ||||||
return Keyword::NoKeyword; | ||||||
} | ||||||
|
||||||
let start = KEYWORD_LOOKUP_INDEX_ROOT[(byte1 - b'A') as usize]; | ||||||
|
||||||
let end = if (byte1 + 1) <= b'Z' { | ||||||
KEYWORD_LOOKUP_INDEX_ROOT[(byte1 - b'A' + 1) as usize] | ||||||
} else { | ||||||
ALL_KEYWORDS.len() | ||||||
}; | ||||||
|
||||||
let keyword = ALL_KEYWORDS[start..end].binary_search(&word.as_str()); | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I wonder if we could use something like
Suggested change
https://doc.rust-lang.org/std/primitive.slice.html#method.binary_search_by There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I actually tried to do this approach, and I found the I think we can change it to |
||||||
keyword.map_or(Keyword::NoKeyword, |x| ALL_KEYWORDS_INDEX[x + start]) | ||||||
} | ||||||
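The review comments on `lookup` ask whether the `to_uppercase` allocation can be dropped by comparing case-insensitively inside `binary_search_by`. A minimal sketch of that idea follows. The `KEYWORDS` table, `cmp_ignore_ascii_case`, and `find` are hypothetical stand-ins, not the crate's actual items, and the sketch assumes the keyword table is sorted and stored in upper-case ASCII.

```rust
use std::cmp::Ordering;

// Hypothetical stand-in for the crate's sorted, upper-case keyword table.
const KEYWORDS: &[&str] = &["ABORT", "SELECT", "SET", "TABLE", "TRIM"];

/// Order an upper-case ASCII keyword against a word of arbitrary case,
/// byte by byte, without allocating an upper-cased copy of the word.
fn cmp_ignore_ascii_case(keyword: &str, word: &str) -> Ordering {
    let upper_word = word.bytes().map(|b| b.to_ascii_uppercase());
    // Iterator::cmp is lexicographic, which matches str ordering on bytes.
    keyword.bytes().cmp(upper_word)
}

/// Return the index of `word` in `KEYWORDS`, ignoring ASCII case.
fn find(word: &str) -> Option<usize> {
    KEYWORDS
        .binary_search_by(|kw| cmp_ignore_ascii_case(kw, word))
        .ok()
}
```

With this table, `find("select")` and `find("SELECT")` both resolve to the same index, and `find("selec")` returns `None`. One behavioral difference to keep in mind: `to_ascii_uppercase` leaves non-ASCII bytes untouched, whereas `str::to_uppercase` applies Unicode case mapping, so the two approaches can disagree on non-ASCII input.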
|
||||||
#[cfg(test)] | ||||||
mod tests { | ||||||
use super::*; | ||||||
|
||||||
#[test] | ||||||
fn check_keyword_index_roots() { | ||||||
let mut root_index = Vec::with_capacity(26); | ||||||
root_index.push(0); | ||||||
for idx in 1..ALL_KEYWORDS.len() { | ||||||
assert!(ALL_KEYWORDS[idx - 1] < ALL_KEYWORDS[idx]); | ||||||
let prev = ALL_KEYWORDS[idx - 1].as_bytes()[0]; | ||||||
let curr = ALL_KEYWORDS[idx].as_bytes()[0]; | ||||||
if curr != prev { | ||||||
root_index.push(idx); | ||||||
} | ||||||
} | ||||||
assert_eq!(&root_index, KEYWORD_LOOKUP_INDEX_ROOT); | ||||||
} | ||||||
|
||||||
#[test] | ||||||
fn check_keyword_lookup() { | ||||||
for idx in 0..ALL_KEYWORDS.len() { | ||||||
assert_eq!(lookup(ALL_KEYWORDS[idx]), ALL_KEYWORDS_INDEX[idx]); | ||||||
} | ||||||
} | ||||||
} |
I think this is not super maintainable (needs to be updated manually when adding keywords).
I agree - if we go with this type of table-driven approach, we should have some sort of update script (or build.rs) that builds the table from the enum.
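As one possible way to address the maintainability concern, the first-letter offsets could be derived from the sorted keyword list instead of being typed in by hand. The sketch below is neither an update script nor a `build.rs`, which are the options named above; it swaps in a `const fn` evaluated at compile time. `KEYWORDS`, `build_root_index`, and `ROOTS` are hypothetical names, and the table has 27 entries (one extra end sentinel) rather than the 26 used in the diff, so letters with no keywords get an empty range.

```rust
// Hypothetical stand-in for the crate's sorted, upper-case keyword table.
const KEYWORDS: &[&str] = &["ABORT", "ALL", "BY", "SELECT", "SET", "TABLE"];

/// Build start offsets per first letter: keywords beginning with letter `c`
/// (0 = 'A') occupy `roots[c]..roots[c + 1]` in the sorted table.
const fn build_root_index(keywords: &[&str]) -> [usize; 27] {
    let mut roots = [0usize; 27];

    // Count keywords per first letter, stored one slot to the right.
    let mut i = 0;
    while i < keywords.len() {
        let first = keywords[i].as_bytes()[0];
        assert!(first.is_ascii_uppercase());
        roots[(first - b'A') as usize + 1] += 1;
        i += 1;
    }

    // A running sum turns the counts into start offsets.
    let mut c = 1;
    while c < 27 {
        roots[c] += roots[c - 1];
        c += 1;
    }
    roots
}

// Evaluated at compile time, so adding a keyword updates the table too.
const ROOTS: [usize; 27] = build_root_index(KEYWORDS);
```

Here `ROOTS[18]..ROOTS[19]` covers the keywords starting with `S`, and `ROOTS[26]` equals `KEYWORDS.len()`, which removes the need for a special case on `Z`. The sketch assumes the list is already sorted; it does not check that.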