
Add find, position, any, and all #129

Merged: 8 commits into rayon-rs:master on Nov 2, 2016

Conversation

cuviper
Member

@cuviper cuviper commented Oct 29, 2016

This is an independent implementation from #128, just happened to be developed around the same time. I think the lack of unsafe is an advantage here, but performance needs to be compared.

@cuviper
Member Author

cuviper commented Oct 29, 2016

I've included some micro-benchmarks here, and made some comparisons with #128. The biggest differentiator appears to be the choice of whether to check for completion within the folder loop, so I'll show each one both ways. This is running on Fedora 24 on an i7-2600 (4 cores plus HT).

Niko's as-is, without folder checking:

test find::parallel_find_first              ... bench:     179,795 ns/iter (+/- 3,823)
test find::parallel_find_last               ... bench:   2,195,248 ns/iter (+/- 655,387)
test find::parallel_find_middle             ... bench:   1,314,797 ns/iter (+/- 379,793)
test find::parallel_find_missing            ... bench:   5,203,538 ns/iter (+/- 6,236,318)

Niko's modified with folder-checking:

test find::parallel_find_first              ... bench:      16,240 ns/iter (+/- 404)
test find::parallel_find_last               ... bench:   2,767,995 ns/iter (+/- 848,077)
test find::parallel_find_middle             ... bench:   1,612,078 ns/iter (+/- 803,146)
test find::parallel_find_missing            ... bench:   5,449,556 ns/iter (+/- 5,694,417)

Mine modified without folder-checking:

test find::parallel_find_first              ... bench:     179,146 ns/iter (+/- 3,701)
test find::parallel_find_last               ... bench:   2,168,005 ns/iter (+/- 486,373)
test find::parallel_find_middle             ... bench:   1,364,573 ns/iter (+/- 280,549)
test find::parallel_find_missing            ... bench:   5,090,836 ns/iter (+/- 7,171,469)

Mine as-is with folder-checking:

test find::parallel_find_first              ... bench:      15,749 ns/iter (+/- 380)
test find::parallel_find_last               ... bench:   2,723,820 ns/iter (+/- 776,910)
test find::parallel_find_middle             ... bench:   1,672,928 ns/iter (+/- 690,176)
test find::parallel_find_missing            ... bench:   5,472,982 ns/iter (+/- 5,915,631)
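
For illustration, the "with folder-checking" variants above amount to testing a shared completion flag on every item of the chunk being folded. A minimal sketch of that idea, with hypothetical names rather than the PR's actual code:

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// Sketch of "check within the folder loop": each worker folds over its
// chunk, but bails out as soon as any worker has set `found`. The extra
// load per item is the overhead visible in the benchmarks above.
fn fold_find<T: Copy>(
    chunk: &[T],
    found: &AtomicBool,
    pred: impl Fn(&T) -> bool,
) -> Option<T> {
    for item in chunk {
        // Relaxed load: a stale read only delays the exit by one item.
        if found.load(Ordering::Relaxed) {
            return None; // another worker already won
        }
        if pred(item) {
            found.store(true, Ordering::Relaxed);
            return Some(*item);
        }
    }
    None
}
```

The "without folder-checking" variants skip the per-item load and only consult the flag between chunks, trading faster inner loops for slower early exit.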

@cuviper
Member Author

cuviper commented Oct 29, 2016

I guess the searches in these benchmarks are just too simple, so the overhead of an extra check/branch is actually noticeable. I think that's probably still the right thing to do in general, but that's just intuition.

@nikomatsakis
Member

@cuviper my thoughts:

  1. I prefer a name like find_any to indicate that the semantics diverge from ordinary find (doesn't necessarily find the first one). Similarly, I'd prefer another name for position.
    • This is a bit of a tough call but I feel like it should be a principle that if you change from iter to par_iter then it should stop compiling if the semantics are different. (Either because the name is different or because the signature has changed -- the latter is the case with fold.)
  2. I'm torn about the unsafe thing:
    • Of course less unsafe is good, but the *mut is pretty trivial and it feels like copying around larger values (e.g., a Vec or some such) might be a noticeable cost. I guess if I want to state that I had better construct a benchmark. =)
  3. I agree that loading the "am I done" counter each loop iteration seems better, not sure why I didn't write it that way to start.

@cuviper
Member Author

cuviper commented Oct 29, 2016

  1. I don't really agree about naming, but I don't feel strongly if that's what you want. To me, the semantic difference is relatively minor, and unsurprising if one thinks about what parallel find implies.
    • Any suggestion for alt position? find_any is OK, but I don't like the sound of position_any.
  2. Vec is a bad example for your copy concern, because its data is boxed away unmoved. I get what you're saying, but note that the number of copies is at most just the number of splits, something like O(log N). So I think this is a minor overhead compared to the search itself, not worth unsafe hacking.
  3. It can go either way, but for more expensive searches I think we'll want to escape ASAP.

@cuviper
Member Author

cuviper commented Oct 29, 2016

Rebased to solve the merge conflict.

@nikomatsakis
Member

@cuviper

To me, the semantic difference is relatively minor, and unsurprising if one thinks about what parallel find implies.

Yeah, I can't decide how to weigh the convenience that things "just work" versus subtle changes in semantics. Part of my thought process was that it would be useful to also have a find_first that always found the first result -- it would basically keep a counter with the lowest matching index, and cut off searches as appropriate. I'd probably avoid calling that find because it's likely not what you want by default, since it's less efficient. Similarly, one could imagine position_{any,first}, though I agree the names are less good.
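
The `find_first` idea above (a shared lowest-matching-index counter that cuts off doomed searches) might look roughly like this; `search_chunk` and its signature are invented for illustration, and `fetch_min` from today's std stands in for a compare-and-swap loop:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Sketch of `find_first`: `best` holds the lowest matching index seen
// by any worker, so chunks that start at or beyond it can be skipped.
fn search_chunk<T>(
    chunk: &[T],
    offset: usize, // index of chunk[0] within the full data
    best: &AtomicUsize,
    pred: impl Fn(&T) -> bool,
) {
    if offset >= best.load(Ordering::Relaxed) {
        return; // an earlier match already exists; cut off this search
    }
    for (i, item) in chunk.iter().enumerate() {
        if pred(item) {
            // Keep only the smallest index any worker has found.
            best.fetch_min(offset + i, Ordering::Relaxed);
            return;
        }
    }
}
```

As the comment notes, this does strictly more work than an "any" search, which is one argument for not making it the default `find`.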

Vec is a bad example for your copy concern, because its data is boxed away unmoved.

I'd expect your code to perform as well on vecs, but I tossed it out there because I'd want to at least measure something larger than a word. =) I had planned to do some kind of benchmark using [u32; N] for various values of N (lack of const generics means that would require macros, I guess, but oh well). I also thought it'd be nice to have some benchmarks for things like "match occurs N% of the time".

I get what you're saying, but note that the number of copies is at most just the number of splits, something like O(log N). So I think this is a minor overhead compared to the search itself, not worth unsafe hacking.

Yeah, having slept on it a bit, I'm inclined to say let's just take your branch -- if we find the overhead is a problem for some example, we can move to unsafe code later. I do like having the iterators be largely safe code, as long as there isn't a strong performance case.

@cuviper
Member Author

cuviper commented Oct 31, 2016

I think I'd prefer a Mutex<Option<T>> before dipping back into unsafe code. I've been bitten too many times by subtle concurrency memory-ordering issues (in other languages), so I'd avoid the risk, even though your unsafe code does look trivially correct.

@cuviper
Member Author

cuviper commented Oct 31, 2016

(I'd still keep the relaxed atomic bool as a hint for early exits, so there wouldn't be a ton of mutex serialization.)
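
Taken together, the last two comments describe a `Mutex<Option<T>>` result slot plus a relaxed `AtomicBool` early-exit hint. A minimal sketch under those assumptions (all names here are made up; `try_lock` matches the uncontended race-to-the-finish design discussed below):

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Mutex;

// Shared state for a parallel find: a cheap flag that workers poll to
// stop searching, plus a mutex-guarded slot for the winning value.
struct FindState<T> {
    found: AtomicBool,
    result: Mutex<Option<T>>,
}

fn record_match<T>(state: &FindState<T>, value: T) {
    // Relaxed is enough: the flag only ever goes false -> true, and a
    // late reader merely searches a little longer than necessary.
    state.found.store(true, Ordering::Relaxed);
    // try_lock: losers of the race simply drop their value, so no
    // worker ever blocks here, and each calls this at most once.
    if let Ok(mut slot) = state.result.try_lock() {
        if slot.is_none() {
            *slot = Some(value);
        }
    }
}
```
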

@cuviper
Member Author

cuviper commented Nov 1, 2016

I went ahead and tried it, so we can find a better middle ground. :) I now have a Mutex around the shared result location, and the rest of the Consumer and Finder implementations are greatly simplified. I added one more "common" benchmark which looks for any x % 1000 == 999 in the random haystack.

test find::parallel_find_common             ... bench:      20,295 ns/iter (+/- 853)
test find::parallel_find_first              ... bench:      16,082 ns/iter (+/- 412)
test find::parallel_find_last               ... bench:   2,695,487 ns/iter (+/- 947,960)
test find::parallel_find_middle             ... bench:   1,642,146 ns/iter (+/- 631,507)
test find::parallel_find_missing            ... bench:   5,668,282 ns/iter (+/- 6,475,071)

@nikomatsakis
Member

nikomatsakis commented Nov 1, 2016

@cuviper

I think I'd prefer a Mutex<Option<T>> before dipping back into unsafe code.

To me it all comes down to benchmarks, but I suspect I am mildly more tolerant of unsafe code, particularly if it is simple. Using both a mutex and an atomic feels sad to me. =) That said, I do appreciate your wariness! Still, I'd probably lean towards your original version (or mine, if perf were an issue).

One thing about mutexes that's worth noting is that their performance varies dramatically on different operating systems -- what O/S were those measurements taken on?

(Also, you could eliminate the possibility of mutex contention by leveraging the atomic a bit better. e.g., use a seq-cst swap, as I did in my code, and then acquire the mutex only if you know that you will be the winner.)
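
The swap-then-lock scheme suggested here can be sketched as follows (hypothetical names; the point is that a single seq-cst swap guarantees exactly one caller observes `false`, so the blocking lock is taken at most once in the whole run):

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Mutex;

// Contention-free variant: the swap elects a unique winner before the
// mutex is touched, so `lock()` can never actually contend.
fn publish_if_winner<T>(
    found: &AtomicBool,
    result: &Mutex<Option<T>>,
    value: T,
) -> bool {
    // `swap` returns the previous value; only one caller ever sees `false`.
    if !found.swap(true, Ordering::SeqCst) {
        *result.lock().unwrap() = Some(value);
        true
    } else {
        false // someone else already won; our value is dropped
    }
}
```
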

@nikomatsakis
Member

OK @cuviper I went ahead and whipped up some more benchmarks. In particular, I replicated your benchmarks but with data of size 1 (as you had it), 64, and 256. I did 3 runs of each and collected the results into a gist. Honestly, I haven't had time to even analyze them yet!

@nikomatsakis
Member

OK, I think the TL;DR is that it doesn't make much difference what scheme we use. Here is some analysis to show you what I mean. =)

lunch-box. grep find::size256::parallel_find_common *.txt
killme.find-atomic.1.txt:test find::size256::parallel_find_common    ... bench:      40,772 ns/iter (+/- 2,295)
killme.find-atomic.2.txt:test find::size256::parallel_find_common    ... bench:      40,684 ns/iter (+/- 2,272)
killme.find-atomic.3.txt:test find::size256::parallel_find_common    ... bench:      40,860 ns/iter (+/- 2,644)
killme.find-mutex.1.txt:test find::size256::parallel_find_common    ... bench:      40,174 ns/iter (+/- 2,653)
killme.find-mutex.2.txt:test find::size256::parallel_find_common    ... bench:      40,075 ns/iter (+/- 2,149)
killme.find-mutex.3.txt:test find::size256::parallel_find_common    ... bench:      40,648 ns/iter (+/- 2,309)
killme.find-reduce.1.txt:test find::size256::parallel_find_common    ... bench:      39,504 ns/iter (+/- 2,108)
killme.find-reduce.2.txt:test find::size256::parallel_find_common    ... bench:      39,767 ns/iter (+/- 2,459)
killme.find-reduce.3.txt:test find::size256::parallel_find_common    ... bench:      40,136 ns/iter (+/- 2,875)
lunch-box. grep find::size256::parallel_find_first *.txt
killme.find-atomic.1.txt:test find::size256::parallel_find_first     ... bench:      44,298 ns/iter (+/- 3,264)
killme.find-atomic.2.txt:test find::size256::parallel_find_first     ... bench:      43,782 ns/iter (+/- 8,295)
killme.find-atomic.3.txt:test find::size256::parallel_find_first     ... bench:      44,442 ns/iter (+/- 2,801)
killme.find-mutex.1.txt:test find::size256::parallel_find_first     ... bench:      43,112 ns/iter (+/- 1,575)
killme.find-mutex.2.txt:test find::size256::parallel_find_first     ... bench:      43,669 ns/iter (+/- 2,234)
killme.find-mutex.3.txt:test find::size256::parallel_find_first     ... bench:      43,566 ns/iter (+/- 2,593)
killme.find-reduce.1.txt:test find::size256::parallel_find_first     ... bench:      43,555 ns/iter (+/- 2,509)
killme.find-reduce.2.txt:test find::size256::parallel_find_first     ... bench:      43,231 ns/iter (+/- 2,200)
killme.find-reduce.3.txt:test find::size256::parallel_find_first     ... bench:      43,945 ns/iter (+/- 2,632)
lunch-box. grep find::size256::parallel_find_last *.txt
killme.find-atomic.1.txt:test find::size256::parallel_find_last      ... bench:  44,260,038 ns/iter (+/- 10,718,065)
killme.find-atomic.2.txt:test find::size256::parallel_find_last      ... bench:  45,207,785 ns/iter (+/- 12,503,029)
killme.find-atomic.3.txt:test find::size256::parallel_find_last      ... bench:  44,537,781 ns/iter (+/- 14,834,277)
killme.find-mutex.1.txt:test find::size256::parallel_find_last      ... bench:  46,391,296 ns/iter (+/- 10,870,670)
killme.find-mutex.2.txt:test find::size256::parallel_find_last      ... bench:  46,068,558 ns/iter (+/- 14,485,740)
killme.find-mutex.3.txt:test find::size256::parallel_find_last      ... bench:  44,435,522 ns/iter (+/- 13,130,128)
killme.find-reduce.1.txt:test find::size256::parallel_find_last      ... bench:  44,068,219 ns/iter (+/- 13,201,860)
killme.find-reduce.2.txt:test find::size256::parallel_find_last      ... bench:  45,031,003 ns/iter (+/- 13,357,368)
killme.find-reduce.3.txt:test find::size256::parallel_find_last      ... bench:  44,743,122 ns/iter (+/- 11,760,751)
lunch-box. grep find::size256::parallel_find_missing *.txt
killme.find-atomic.1.txt:test find::size256::parallel_find_missing   ... bench:  53,399,440 ns/iter (+/- 158,111)
killme.find-atomic.2.txt:test find::size256::parallel_find_missing   ... bench:  52,843,421 ns/iter (+/- 134,126)
killme.find-atomic.3.txt:test find::size256::parallel_find_missing   ... bench:  52,847,156 ns/iter (+/- 273,653)
killme.find-mutex.1.txt:test find::size256::parallel_find_missing   ... bench:  52,768,235 ns/iter (+/- 165,680)
killme.find-mutex.2.txt:test find::size256::parallel_find_missing   ... bench:  52,778,307 ns/iter (+/- 176,499)
killme.find-mutex.3.txt:test find::size256::parallel_find_missing   ... bench:  52,770,511 ns/iter (+/- 161,761)
killme.find-reduce.1.txt:test find::size256::parallel_find_missing   ... bench:  52,822,863 ns/iter (+/- 87,574)
killme.find-reduce.2.txt:test find::size256::parallel_find_missing   ... bench:  52,792,112 ns/iter (+/- 174,477)
killme.find-reduce.3.txt:test find::size256::parallel_find_missing   ... bench:  52,764,203 ns/iter (+/- 128,669)

@nikomatsakis
Member

They're so close I have to wonder if I screwed it up somehow. But I'm pretty sure I didn't. Anyway, the source is on my branch; you can switch the impl by changing the #[cfg].

@edre
Contributor

edre commented Nov 1, 2016

How many cores are you using? The choice of atomic primitive may have different performance behaviors with higher/lower contention.

Anyway, if the performance is in the noise, I like the reduce implementation best, as the semantics are clearer to me and I think it would have the least contention with many cores.
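
For comparison, the reduce-based scheme favored here can be sketched with no shared result slot at all: each split returns its own `Option<T>`, merged on the way back up. This dependency-free version substitutes `std::thread::scope` for rayon's `join` and omits the early-exit flag for brevity:

```rust
use std::thread;

// Reduce-based find: no shared result location, hence no contention on
// the result itself. The split threshold of 4 is arbitrary for the sketch.
fn reduce_find<T, F>(data: &[T], pred: &F) -> Option<T>
where
    T: Copy + Send + Sync,
    F: Fn(&T) -> bool + Sync,
{
    if data.len() <= 4 {
        // Sequential base case.
        return data.iter().copied().find(|x| pred(x));
    }
    let (left, right) = data.split_at(data.len() / 2);
    let (a, b) = thread::scope(|s| {
        let handle = s.spawn(|| reduce_find(right, pred));
        (reduce_find(left, pred), handle.join().unwrap())
    });
    a.or(b) // prefer the left result; either is a valid "any" match
}
```

The merge step is just `Option::or`, which is what makes the semantics easy to reason about: the only nondeterminism is which half finishes with a match first.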

@nikomatsakis
Copy link
Member

nikomatsakis commented Nov 1, 2016

@edre oh yeah should have clarified; this would be an 8x2-core CPU, I believe, running linux. I also agree that the reduce-based version is the best if perf is the same.

@nikomatsakis
Member

@cuviper so I thought about it some more, and I think we should call it find_any. I would like to maintain the principle that if you use par iters, your code either does the same thing, or doesn't compile, modulo atomics and a few other nitpicky things, like float addition not being commutative.

Basically, it seems like switching to par_iter could otherwise introduce a very subtle bug. position is annoying but I'd go for position_any just for consistency and predictability (any_position is a bit more grammatical though).

One other thought I had is keeping find (and position) but "pre-deprecating" it, and referring people to find_any and (position_any) instead.

@cuviper
Member Author

cuviper commented Nov 1, 2016

I suspect I am mildly more tolerant of unsafe code, particularly if it is simple.

You might well be more tolerant, but you're also more aware than most of the gray areas around unsafe code, Mr. Tootsie Pop. :)

I'm moderately OK with unsafe code for memory tricks, like the ones I used to implement par_iter for vectors, but I'm really wary of hand-written synchronization.

Using both a mutex and an atomic feels sad to me. =)

I suppose, but it's with reason. The mutex is used to gate concurrent access, as mutexes do. The atomic is simply a global flag which doesn't need any synchronization since it's only one-way (false->true), and in a looser language I probably wouldn't even use an atomic construct.

One thing about mutexes that's worth noting is that their performance varies dramatically on different operating systems-- what O/S were those measurements taken on?

Linux - Fedora 24 x86_64. But the performance of this mutex doesn't really matter, as it's only ever called in the race to the finish. At worst it will only be called once in each rayon thread, and even that is uncontended because I only used try_lock.

(Also, you could eliminate the possibility of mutex contention by leveraging the atomic a bit better. e.g., use a seq-cst swap, as I did in my code, and then acquire the mutex only if you know that you will be the winner.)

This is the sort of thing try_lock already does, at least with Linux futexes.

OK, I think the TL;DR is that it doesn't make much difference what scheme we use.

Thanks for measuring! I'm not really surprised that these variations are in the noise -- the reduce version has only a handful of copies as I mentioned before, and the mutex version only ever even tries to lock once per thread. And your version is roughly just an unsafe-hand-rolled equivalent to my mutex version.

@cuviper
Member Author

cuviper commented Nov 1, 2016

So the TODOs:

  • Revert to the reduce-based implementation.
  • Rename to find_any and position_any.
  • Add pre-deprecated find and position so users get a clue.
  • Incorporate the expanded benchmark sizes.

Anything else? I'll work on this more tonight...

@nikomatsakis
Member

@cuviper I think I'd leave off the expanded benchmark sizes; maybe incorporate just the macro. The main reason is that the larger sizes take forever to run, and they didn't prove particularly informative.

Thanks for measuring! I'm not really surprised that these variations are in the noise -- the reduce version has only a handful of copies as I mentioned before, and the mutex version only ever even tries to lock once per thread. And your version is roughly just an unsafe-hand-rolled equivalent to my mutex version.

Indeed. It makes sense, but always good to check.

@nikomatsakis
Member

@cuviper (but yes that looks great!)

@nikomatsakis nikomatsakis merged commit 6ed25a6 into rayon-rs:master Nov 2, 2016
@nikomatsakis
Member

Looks good to me! Nice.
