-
Notifications
You must be signed in to change notification settings - Fork 12.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use memchr for str::find(char) #46735
Conversation
I haven't really tested this much, there probably are failures. Will do a second pass at self-review once I know we pass all tests (from travis) |
The #[bench]
fn find_char(b: &mut Bencher) {
let x = test::black_box("Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.");
b.iter(|| test::black_box(x.find('/')));
}
#[bench]
fn find_char_memchr(b: &mut Bencher) {
let x = test::black_box("Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.");
b.iter(|| test::black_box(memchr::memchr(b'/', x.as_bytes())));
} Before:
After:
|
This does not bring improvements for multibyte chars or for For When the thing we're searching for is not ASCII we can still search for the first byte. However for most UTF8 text the first byte will generally be pretty uniform; i.e. if it's Arabic text will usually be 0xD8 or 0xD9, Korean will be 0xEA, 0xEB, 0xEC, or 0xED, Devanagari is usually 0xE0, etc. This means that memchr will have lots of false positives; we'll get lots of hits on the first byte and then have to check the second byte. This amount of stutter will probably make memchr's (minor) fixed overhead significant, and destroy any perf gains which we may get. Searching for the second byte or even better, the last byte, might work better. But I'm not sure if I want to write that code right now, and the tradeoffs are a bit trickier there :) |
93216f1
to
f865164
Compare
876e2a1
to
75c07a3
Compare
Bench numbers do not materially change with the UTF8 changes. I did come up with a pathological case of searching a Devanagari string for I think this pathological case is ok, it will only arise when mixing languages and for very specific characters. I can check some form of these benchmarks into tree if y'all feel it necessary.
Code: #[bench]
fn find_char(b: &mut Bencher) {
let x = test::black_box("Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.");
b.iter(|| test::black_box(x.find('/')));
}
#[bench]
fn find_char_memchr(b: &mut Bencher) {
let x = test::black_box("Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.");
b.iter(|| test::black_box(memchr::memchr(b'/', x.as_bytes())));
}
#[bench]
fn find_multibyte_char_found(b: &mut Bencher) {
let x = test::black_box("Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, ก remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.");
b.iter(|| test::black_box(x.find('ก')));
}
#[bench]
fn find_multibyte_char_notfound(b: &mut Bencher) {
let x = test::black_box("Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.");
b.iter(|| test::black_box(x.find('ก')));
}
#[bench]
fn find_multibyte_string_multibyte_char(b: &mut Bencher) {
let x = test::black_box("जलद कोल्हा आळशी कुत्रा वरुन उडी मारली, जलद कोल्हा आळशी कुत्रा वरुन उडी मारली, जलद कोल्हा आळशी कुत्रा वरुन उडी मारली, जलद कोल्हा आळशी कुत्रा वरुन उडी मारली, जलद कोल्हा आळशी कुत्रा वरुन उडी मारली, जलद कोल्हा आळशी कुत्रा वरुन उडी मारली, जलद कोल्हा आळशी कुत्रा वरुन उडी मारली");
b.iter(|| test::black_box(x.find('ग'))); // not in the string
}
#[bench]
fn find_multibyte_string_pathological(b: &mut Bencher) {
let x = test::black_box("जलद कोल्हा आळशी कुत्रा वरुन उडी मारली, जलद कोल्हा आळशी कुत्रा वरुन उडी मारली, जलद कोल्हा आळशी कुत्रा वरुन उडी मारली, जलद कोल्हा आळशी कुत्रा वरुन उडी मारली, जलद कोल्हा आळशी कुत्रा वरुन उडी मारली, जलद कोल्हा आळशी कुत्रा वरुन उडी मारली, जलद कोल्हा आळशी कुत्रा वरुन उडी मारली");
b.iter(|| test::black_box(x.find('ä'))); // ä's last byte is found often in Devanagari text
} |
If we really care about the pathological case it can be avoided by having some check in the loop that after X false positives falls back to regular "loop on next" behavior. I don't think we should, though. We could also write some monster SSE-enabled memchr that can search for up to 4 byte units. I'm not doing that. |
@bors-servo try |
Use memchr for str::find(char) This is a 10x improvement for searching for characters. This also contains the patches from #46713 . Feel free to land both separately or together. cc @mystor @alexcrichton r? @bluss fixes #46693
☀️ Test successful - status-travis |
@rust-lang/infra could I get a perf.rlo diff result from this try push?
…On Dec 20, 2017 8:49 PM, "bors" ***@***.***> wrote:
☀️ Test successful - status-travis
<https://travis-ci.org/rust-lang/rust/builds/319461824?utm_source=github_status&utm_medium=notification>
State: approved= try=True
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
<#46735 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/ABivSPUEZFw5P1CqObGiqo0cQOlmsL8Oks5tCbkpgaJpZM4RCuKw>
.
|
b6f2d90
to
85919a0
Compare
cc @Mark-Simulacrum ^ |
Searching for the last byte is indeed a better heuristic on UTF-8 than searching for the first byte. You'd be in good company (GNU grep does that). But the last byte is still arbitrary. This is why the regex crate ranks every byte in order of what it believes is rare. Leading UTF-8 bytes are considered common while trailing bytes aren't. But you also get things like " (To be clear, I think the frequency rank stuff is probably overkill for searching a single |
#[inline] | ||
fn next(&mut self) -> SearchStep { | ||
let old_finger = self.finger; | ||
let slice = unsafe { self.haystack.get_unchecked(old_finger..self.haystack.len()) }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do the various bounds check elisions actually help here? I've tried eliding them in my own substring search algorithms and it meets with variable success.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think they do, but I haven't checked and it seemed pretty easy to keep that invariant. I can check if you want.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My general position has been to not elide bounds checks unless I'm pretty sure that it matters. If it were me, I'd remove the unsafe
. :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll do some checking later this week, for now I'll land it.
This LGTM! Nice work @Manishearth :-) |
yay :)
phew that sounds trickier to get right 😄
Yeah, we do an interesting but non-memchry algorithm. I considered retrofitting the existing memchr'd
can this be landed r=you? I've made a small mistake which I need to rectify, aside from that it seems basically ready. Or should we wait for second review? |
@Manishearth Yeah r=me sounds great. |
Perf queued; in the future please ping me directly. |
@bors r=burntsushi |
📌 Commit 5cf5516 has been approved by |
Use memchr for str::find(char) This is a 10x improvement for searching for characters. This also contains the patches from #46713 . Feel free to land both separately or together. cc @mystor @alexcrichton r? @bluss fixes #46693
☀️ Test successful - status-appveyor, status-travis |
rep | ||
} | ||
|
||
/// Return the first index matching the byte `a` in `text`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a
is meant to be x
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, fixing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
someone fixed it already
This is a 10x improvement for searching for characters.
This also contains the patches from #46713 . Feel free to land both separately or together.
cc @mystor @alexcrichton
r? @bluss
fixes #46693