
Add find, position, any, and all #129

Merged: 8 commits into rayon-rs:master on Nov 2, 2016

Conversation

cuviper
Member

@cuviper cuviper commented Oct 29, 2016

This is an independent implementation from #128, just happened to be developed around the same time. I think the lack of unsafe is an advantage here, but performance needs to be compared.

@cuviper
Member Author

cuviper commented Oct 29, 2016

I've included some micro-benchmarks here, and made some comparisons with #128. The biggest differentiator appears to be the choice of whether to check for completion within the folder loop, so I'll show each one both ways. This is running on Fedora 24 on an i7-2600 (4 cores plus HT).

Niko's as-is, without folder checking:

test find::parallel_find_first              ... bench:     179,795 ns/iter (+/- 3,823)
test find::parallel_find_last               ... bench:   2,195,248 ns/iter (+/- 655,387)
test find::parallel_find_middle             ... bench:   1,314,797 ns/iter (+/- 379,793)
test find::parallel_find_missing            ... bench:   5,203,538 ns/iter (+/- 6,236,318)

Niko's modified with folder-checking:

test find::parallel_find_first              ... bench:      16,240 ns/iter (+/- 404)
test find::parallel_find_last               ... bench:   2,767,995 ns/iter (+/- 848,077)
test find::parallel_find_middle             ... bench:   1,612,078 ns/iter (+/- 803,146)
test find::parallel_find_missing            ... bench:   5,449,556 ns/iter (+/- 5,694,417)

Mine modified without folder-checking:

test find::parallel_find_first              ... bench:     179,146 ns/iter (+/- 3,701)
test find::parallel_find_last               ... bench:   2,168,005 ns/iter (+/- 486,373)
test find::parallel_find_middle             ... bench:   1,364,573 ns/iter (+/- 280,549)
test find::parallel_find_missing            ... bench:   5,090,836 ns/iter (+/- 7,171,469)

Mine as-is with folder-checking:

test find::parallel_find_first              ... bench:      15,749 ns/iter (+/- 380)
test find::parallel_find_last               ... bench:   2,723,820 ns/iter (+/- 776,910)
test find::parallel_find_middle             ... bench:   1,672,928 ns/iter (+/- 690,176)
test find::parallel_find_missing            ... bench:   5,472,982 ns/iter (+/- 5,915,631)
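
For illustration, the "with folder-checking" variants above amount to testing a shared completion flag on every item of the chunk being folded. A minimal sketch of that idea, with hypothetical names rather than the PR's actual code:

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// Sketch of "check within the folder loop": each worker folds over its
// chunk, but bails out as soon as any worker has set `found`. The extra
// load per item is the overhead visible in the benchmarks above.
fn fold_find<T: Copy>(
    chunk: &[T],
    found: &AtomicBool,
    pred: impl Fn(&T) -> bool,
) -> Option<T> {
    for item in chunk {
        // Relaxed load: a stale read only delays the exit by one item.
        if found.load(Ordering::Relaxed) {
            return None; // another worker already won
        }
        if pred(item) {
            found.store(true, Ordering::Relaxed);
            return Some(*item);
        }
    }
    None
}
```

The "without folder-checking" variants skip the per-item load and only consult the flag between chunks, trading faster inner loops for slower early exit.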

@cuviper
Member Author

cuviper commented Oct 29, 2016

I guess the searches in these benchmarks are just too simple, so the overhead of an extra check/branch is actually noticeable. I think that's probably still the right thing to do in general, but that's just intuition.

@nikomatsakis
Member

@cuviper my thoughts:

  1. I prefer a name like find_any to indicate that the semantics diverge from ordinary find (doesn't necessarily find the first one). Similarly, I'd prefer another name for position.
    • This is a bit of a tough call but I feel like it should be a principle that if you change from iter to par_iter then it should stop compiling if the semantics are different. (Either because the name is different or because the signature has changed -- the latter is the case with fold.)
  2. I'm torn about the unsafe thing:
    • Of course less unsafe is good, but the *mut is pretty trivial and it feels like copying around larger values (e.g., a Vec or some such) might be a noticeable cost. I guess if I want to state that I had better construct a benchmark. =)
  3. I agree that loading the "am I done" counter each loop iteration seems better, not sure why I didn't write it that way to start.

@cuviper
Member Author

cuviper commented Oct 29, 2016

  1. I don't really agree about naming, but I don't feel strongly if that's what you want. To me, the semantic difference is relatively minor, and unsurprising if one thinks about what parallel find implies.
    • Any suggestion for alt position? find_any is OK, but I don't like the sound of position_any.
  2. Vec is a bad example for your copy concern, because its data is boxed away unmoved. I get what you're saying, but note that the number of copies is at most just the number of splits, something like O(log N). So I think this is a minor overhead compared to the search itself, not worth unsafe hacking.
  3. It can go either way, but for more expensive searches I think we'll want to escape ASAP.

@cuviper
Member Author

cuviper commented Oct 29, 2016

Rebased to solve the merge conflict.

@nikomatsakis
Member

@cuviper

To me, the semantic difference is relatively minor, and unsurprising if one thinks about what parallel find implies.

Yeah, I can't decide how to weigh the convenience that things "just work" versus subtle changes in semantics. Part of my thought process was that it would be useful to also have a find_first that always found the first result -- it would basically keep a counter with the lowest matching index, and cut off searches as appropriate. I'd probably avoid calling that find because it's likely not what you want by default, since it's less efficient. Similarly, one could imagine position_{any,first}, though I agree the names are less good.
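
The `find_first` idea above (a shared lowest-matching-index counter that cuts off doomed searches) might look roughly like this; `search_chunk` and its signature are invented for illustration, and `fetch_min` from today's std stands in for a compare-and-swap loop:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Sketch of `find_first`: `best` holds the lowest matching index seen
// by any worker, so chunks that start at or beyond it can be skipped.
fn search_chunk<T>(
    chunk: &[T],
    offset: usize, // index of chunk[0] within the full data
    best: &AtomicUsize,
    pred: impl Fn(&T) -> bool,
) {
    if offset >= best.load(Ordering::Relaxed) {
        return; // an earlier match already exists; cut off this search
    }
    for (i, item) in chunk.iter().enumerate() {
        if pred(item) {
            // Keep only the smallest index any worker has found.
            best.fetch_min(offset + i, Ordering::Relaxed);
            return;
        }
    }
}
```

As the comment notes, this does strictly more work than an "any" search, which is one argument for not making it the default `find`.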

Vec is a bad example for your copy concern, because its data is boxed away unmoved.

I'd expect your code to perform as well on vecs, but I tossed it out there because I'd want to at least measure something larger than a word. =) I had planned to do some kind of benchmark using [u32; N] for various values of N (lack of const generics means that would require macros, I guess, but oh well). I also thought it'd be nice to have some benchmarks for things like "match occurs N% of the time".

I get what you're saying, but note that the number of copies is at most just the number of splits, something like O(log N). So I think this is a minor overhead compared to the search itself, not worth unsafe hacking.

Yeah, having slept on it a bit, I'm inclined to say let's just take your branch -- if we find the overhead is a problem for some example, we can move to unsafe code later. I do like having the iterators be largely safe code, as long as there isn't a strong performance case.

@cuviper
Member Author

cuviper commented Oct 31, 2016

I think I'd prefer a Mutex<Option<T>> before dipping back into unsafe code. I've been bitten too many times by subtle concurrency memory-ordering issues (in other languages), so I'd avoid the risk, even though your unsafe code does look trivially correct.

@cuviper
Member Author

cuviper commented Oct 31, 2016

(I'd still keep the relaxed atomic bool as a hint for early exits, so there wouldn't be a ton of mutex serialization.)
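
Taken together, the last two comments describe a `Mutex<Option<T>>` result slot plus a relaxed `AtomicBool` early-exit hint. A minimal sketch under those assumptions (all names here are made up; `try_lock` matches the uncontended race-to-the-finish design discussed below):

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Mutex;

// Shared state for a parallel find: a cheap flag that workers poll to
// stop searching, plus a mutex-guarded slot for the winning value.
struct FindState<T> {
    found: AtomicBool,
    result: Mutex<Option<T>>,
}

fn record_match<T>(state: &FindState<T>, value: T) {
    // Relaxed is enough: the flag only ever goes false -> true, and a
    // late reader merely searches a little longer than necessary.
    state.found.store(true, Ordering::Relaxed);
    // try_lock: losers of the race simply drop their value, so no
    // worker ever blocks here, and each calls this at most once.
    if let Ok(mut slot) = state.result.try_lock() {
        if slot.is_none() {
            *slot = Some(value);
        }
    }
}
```
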

@cuviper
Member Author

cuviper commented Nov 1, 2016

I went ahead and tried it, so we can find a better middle ground. :) I now have a Mutex around the shared result location, and the rest of the Consumer and Finder implementations are greatly simplified. I added one more "common" benchmark which looks for any x % 1000 == 999 in the random haystack.

test find::parallel_find_common             ... bench:      20,295 ns/iter (+/- 853)
test find::parallel_find_first              ... bench:      16,082 ns/iter (+/- 412)
test find::parallel_find_last               ... bench:   2,695,487 ns/iter (+/- 947,960)
test find::parallel_find_middle             ... bench:   1,642,146 ns/iter (+/- 631,507)
test find::parallel_find_missing            ... bench:   5,668,282 ns/iter (+/- 6,475,071)

@nikomatsakis
Member

nikomatsakis commented Nov 1, 2016

@cuviper

I think I'd prefer a Mutex<Option<T>> before dipping back into unsafe code.

To me it all comes down to benchmarks, but I suspect I am mildly more tolerant of unsafe code, particularly if it is simple. Using both a mutex and an atomic feels sad to me. =) That said, I do appreciate your wariness! Still, I'd probably lean towards your original version (or mine, if perf were an issue).

One thing about mutexes that's worth noting is that their performance varies dramatically on different operating systems -- what O/S were those measurements taken on?

(Also, you could eliminate the possibility of mutex contention by leveraging the atomic a bit better. e.g., use a seq-cst swap, as I did in my code, and then acquire the mutex only if you know that you will be the winner.)
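
The swap-then-lock scheme suggested here can be sketched as follows (hypothetical names; the point is that a single seq-cst swap guarantees exactly one caller observes `false`, so the blocking lock is taken at most once in the whole run):

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Mutex;

// Contention-free variant: the swap elects a unique winner before the
// mutex is touched, so `lock()` can never actually contend.
fn publish_if_winner<T>(
    found: &AtomicBool,
    result: &Mutex<Option<T>>,
    value: T,
) -> bool {
    // `swap` returns the previous value; only one caller ever sees `false`.
    if !found.swap(true, Ordering::SeqCst) {
        *result.lock().unwrap() = Some(value);
        true
    } else {
        false // someone else already won; our value is dropped
    }
}
```
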

@nikomatsakis
Member

OK @cuviper I went ahead and whipped up some more benchmarks. In particular, I replicated your benchmarks but with data of size 1 (as you had it), 64, and 256. I did 3 runs of each and collected the results into a gist. Honestly, I haven't had time to even analyze them yet!

@nikomatsakis
Member

OK, I think the TL;DR is that it doesn't make much difference what scheme we use. Here is some analysis to show you what I mean. =)

lunch-box. grep find::size256::parallel_find_common *.txt
killme.find-atomic.1.txt:test find::size256::parallel_find_common    ... bench:      40,772 ns/iter (+/- 2,295)
killme.find-atomic.2.txt:test find::size256::parallel_find_common    ... bench:      40,684 ns/iter (+/- 2,272)
killme.find-atomic.3.txt:test find::size256::parallel_find_common    ... bench:      40,860 ns/iter (+/- 2,644)
killme.find-mutex.1.txt:test find::size256::parallel_find_common    ... bench:      40,174 ns/iter (+/- 2,653)
killme.find-mutex.2.txt:test find::size256::parallel_find_common    ... bench:      40,075 ns/iter (+/- 2,149)
killme.find-mutex.3.txt:test find::size256::parallel_find_common    ... bench:      40,648 ns/iter (+/- 2,309)
killme.find-reduce.1.txt:test find::size256::parallel_find_common    ... bench:      39,504 ns/iter (+/- 2,108)
killme.find-reduce.2.txt:test find::size256::parallel_find_common    ... bench:      39,767 ns/iter (+/- 2,459)
killme.find-reduce.3.txt:test find::size256::parallel_find_common    ... bench:      40,136 ns/iter (+/- 2,875)
lunch-box. grep find::size256::parallel_find_first *.txt
killme.find-atomic.1.txt:test find::size256::parallel_find_first     ... bench:      44,298 ns/iter (+/- 3,264)
killme.find-atomic.2.txt:test find::size256::parallel_find_first     ... bench:      43,782 ns/iter (+/- 8,295)
killme.find-atomic.3.txt:test find::size256::parallel_find_first     ... bench:      44,442 ns/iter (+/- 2,801)
killme.find-mutex.1.txt:test find::size256::parallel_find_first     ... bench:      43,112 ns/iter (+/- 1,575)
killme.find-mutex.2.txt:test find::size256::parallel_find_first     ... bench:      43,669 ns/iter (+/- 2,234)
killme.find-mutex.3.txt:test find::size256::parallel_find_first     ... bench:      43,566 ns/iter (+/- 2,593)
killme.find-reduce.1.txt:test find::size256::parallel_find_first     ... bench:      43,555 ns/iter (+/- 2,509)
killme.find-reduce.2.txt:test find::size256::parallel_find_first     ... bench:      43,231 ns/iter (+/- 2,200)
killme.find-reduce.3.txt:test find::size256::parallel_find_first     ... bench:      43,945 ns/iter (+/- 2,632)
lunch-box. grep find::size256::parallel_find_last *.txt
killme.find-atomic.1.txt:test find::size256::parallel_find_last      ... bench:  44,260,038 ns/iter (+/- 10,718,065)
killme.find-atomic.2.txt:test find::size256::parallel_find_last      ... bench:  45,207,785 ns/iter (+/- 12,503,029)
killme.find-atomic.3.txt:test find::size256::parallel_find_last      ... bench:  44,537,781 ns/iter (+/- 14,834,277)
killme.find-mutex.1.txt:test find::size256::parallel_find_last      ... bench:  46,391,296 ns/iter (+/- 10,870,670)
killme.find-mutex.2.txt:test find::size256::parallel_find_last      ... bench:  46,068,558 ns/iter (+/- 14,485,740)
killme.find-mutex.3.txt:test find::size256::parallel_find_last      ... bench:  44,435,522 ns/iter (+/- 13,130,128)
killme.find-reduce.1.txt:test find::size256::parallel_find_last      ... bench:  44,068,219 ns/iter (+/- 13,201,860)
killme.find-reduce.2.txt:test find::size256::parallel_find_last      ... bench:  45,031,003 ns/iter (+/- 13,357,368)
killme.find-reduce.3.txt:test find::size256::parallel_find_last      ... bench:  44,743,122 ns/iter (+/- 11,760,751)
lunch-box. grep find::size256::parallel_find_missing *.txt
killme.find-atomic.1.txt:test find::size256::parallel_find_missing   ... bench:  53,399,440 ns/iter (+/- 158,111)
killme.find-atomic.2.txt:test find::size256::parallel_find_missing   ... bench:  52,843,421 ns/iter (+/- 134,126)
killme.find-atomic.3.txt:test find::size256::parallel_find_missing   ... bench:  52,847,156 ns/iter (+/- 273,653)
killme.find-mutex.1.txt:test find::size256::parallel_find_missing   ... bench:  52,768,235 ns/iter (+/- 165,680)
killme.find-mutex.2.txt:test find::size256::parallel_find_missing   ... bench:  52,778,307 ns/iter (+/- 176,499)
killme.find-mutex.3.txt:test find::size256::parallel_find_missing   ... bench:  52,770,511 ns/iter (+/- 161,761)
killme.find-reduce.1.txt:test find::size256::parallel_find_missing   ... bench:  52,822,863 ns/iter (+/- 87,574)
killme.find-reduce.2.txt:test find::size256::parallel_find_missing   ... bench:  52,792,112 ns/iter (+/- 174,477)
killme.find-reduce.3.txt:test find::size256::parallel_find_missing   ... bench:  52,764,203 ns/iter (+/- 128,669)

@nikomatsakis
Member

They're so close I have to wonder if I screwed it up somehow. But I'm pretty sure I didn't. Anyway, the source is on my branch; you can switch the impl by changing the #[cfg].

@edre
Contributor

edre commented Nov 1, 2016

How many cores are you using? The choice of atomic primitive may have different performance behaviors with higher/lower contention.

Anyway, if the performance is in the noise, I like the reduce implementation best, as the semantics are clearer to me and I think it would have the least contention with many cores.
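
For comparison, the reduce-based scheme favored here can be sketched with no shared result slot at all: each split returns its own `Option<T>`, merged on the way back up. This dependency-free version substitutes `std::thread::scope` for rayon's `join` and omits the early-exit flag for brevity:

```rust
use std::thread;

// Reduce-based find: no shared result location, hence no contention on
// the result itself. The split threshold of 4 is arbitrary for the sketch.
fn reduce_find<T, F>(data: &[T], pred: &F) -> Option<T>
where
    T: Copy + Send + Sync,
    F: Fn(&T) -> bool + Sync,
{
    if data.len() <= 4 {
        // Sequential base case.
        return data.iter().copied().find(|x| pred(x));
    }
    let (left, right) = data.split_at(data.len() / 2);
    let (a, b) = thread::scope(|s| {
        let handle = s.spawn(|| reduce_find(right, pred));
        (reduce_find(left, pred), handle.join().unwrap())
    });
    a.or(b) // prefer the left result; either is a valid "any" match
}
```

The merge step is just `Option::or`, which is what makes the semantics easy to reason about: the only nondeterminism is which half finishes with a match first.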

@nikomatsakis
Copy link
Member

nikomatsakis commented Nov 1, 2016

@edre oh yeah should have clarified; this would be an 8x2-core CPU, I believe, running linux. I also agree that the reduce-based version is the best if perf is the same.

@nikomatsakis
Member

@cuviper so I thought about it some more, and I think we should call it find_any. I would like to maintain the principle that if you use par iters, your code either does the same thing, or doesn't compile, modulo atomics and a few other nitpicky things, like float addition not being commutative.

Basically, it seems like switching to par_iter could otherwise introduce a very subtle bug. position is annoying but I'd go for position_any just for consistency and predictability (any_position is a bit more grammatical though).

One other thought I had is keeping find (and position) but "pre-deprecating" it, and referring people to find_any and (position_any) instead.

@cuviper
Member Author

cuviper commented Nov 1, 2016

I suspect I am mildly more tolerant of unsafe code, particularly if it is simple.

You might well be more tolerant, but you're also more aware than most of the gray areas around unsafe code, Mr. Tootsie Pop. :)

I'm moderately OK with unsafe code for memory tricks, like the ones I used to implement par_iter for vectors, but I'm really wary of hand-written synchronization.

Using both a mutex and an atomic feels sad to me. =)

I suppose, but it's with reason. The mutex is used to gate concurrent access, as mutexes do. The atomic is simply a global flag which doesn't need any synchronization since it's only one-way (false->true), and in a looser language I probably wouldn't even use an atomic construct.

One thing about mutexes that's worth noting is that their performance varies dramatically on different operating systems-- what O/S were those measurements taken on?

Linux - Fedora 24 x86_64. But the performance of this mutex doesn't really matter, as it's only ever called in the race to the finish. At worst it will only be called once in each rayon thread, and even that is uncontended because I only used try_lock.

(Also, you could eliminate the possibility of mutex contention by leveraging the atomic a bit better. e.g., use a seq-cst swap, as I did in my code, and then acquire the mutex only if you know that you will be the winner.)

This is the sort of thing try_lock already does, at least with Linux futexes.

OK, I think the TL;DR is that it doesn't make much difference what scheme we use.

Thanks for measuring! I'm not really surprised that these variations are in the noise -- the reduce version has only a handful of copies as I mentioned before, and the mutex version only ever even tries to lock once per thread. And your version is roughly just an unsafe-hand-rolled equivalent to my mutex version.

@cuviper
Member Author

cuviper commented Nov 1, 2016

So the TODOs:

  • Revert to the reduce-based implementation.
  • Rename to find_any and position_any.
  • Add pre-deprecated find and position so users get a clue.
  • Incorporate the expanded benchmark sizes.

Anything else? I'll work on this more tonight...

@nikomatsakis
Member

@cuviper I think I'd leave off the expanded benchmark sizes; maybe incorporate just the macro. The main reason is that the larger sizes take forever to run, and they didn't prove particularly informative.

Thanks for measuring! I'm not really surprised that these variations are in the noise -- the reduce version has only a handful of copies as I mentioned before, and the mutex version only ever even tries to lock once per thread. And your version is roughly just an unsafe-hand-rolled equivalent to my mutex version.

Indeed. It makes sense, but always good to check.

@nikomatsakis
Member

@cuviper (but yes that looks great!)

@nikomatsakis nikomatsakis merged commit 6ed25a6 into rayon-rs:master Nov 2, 2016
@nikomatsakis
Member

Looks good to me! Nice.
