Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Find keywords using perfect hashing #1590

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

davisp
Copy link
Member

@davisp davisp commented Dec 11, 2024

This updates the keyword lookup algorthm to use the phf crate.

This is one possible approach to solving #1588 at the cost of pulling in a new dependency on phf. I will note that phf is rather small and is being used in a no_std configuration, but it is a new dependency which may be enough to not go this route.

Local measuring with the sqlparser_bench shows roughly a 5% speedup. I'll open a second PR for comparison with my other approach to just scope the binary search to words starting with the same first letter.

This updates the keyword lookup algorthm to use the phf crate.
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for thsi @davisp

];
};
pub fn lookup(keyword: &str) -> Keyword {
let keyword = keyword.to_ascii_uppercase();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we are going to optimize lookup I think we should also avoid this call to to_ascii_upprcase as it allocates / copies the string which I believe is significiantly more costly than some comparisons

https://doc.rust-lang.org/std/ascii/trait.AsciiExt.html#tymethod.to_ascii_uppercase
https://doc.rust-lang.org/std/primitive.str.html#method.to_uppercase

Since all keywords are ASCII, what if we did the binary search character by character

Here is some brain 🤮 :

let first_char = string.first().to_upper() 
// find range of keywords within the
let range = binary_search(keywords, first_char, 0);
// if not unique location, proceed to second character, etc
...

use std::path::Path;

fn read_keywords() -> Vec<(String, Option<String>)> {
let path = Path::new("src").join("keywords.txt");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is quite clever

@alamb
Copy link
Contributor

alamb commented Dec 12, 2024

As much as I love not rewriting code, I am quite loath to take on a new dependency in this crate. My concerns are

  1. The binary / dependency size
  2. (mostly) that every additional crate used results in another places to help maintain (see for example CI Failure in json writer tests after upgrade from lexical-core 1.0.2 to lexical-core 1.0.3: range end index 20 out of range for slice of length 19 arrow-rs#6858)

That being said, phf appears to be widely used and not changing: https://crates.io/crates/phf/reverse_dependencies

Are you wiling to try the "compare without allocating approach" described in #1591 (comment)?

@tobyhede
Copy link
Contributor

Also an open question regarding maintenance of the phf crate rust-phf/rust-phf#318
It does appear to be relatively stable, most of the open issues appear to be feature requests and enhancements rather than fixes.

@alamb alamb marked this pull request as draft December 24, 2024 15:00
@alamb
Copy link
Contributor

alamb commented Dec 24, 2024

Marking as draft as I think this PR is no longer waiting on feedback. Please mark it as ready for review when it is ready for another look

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants