Add `find`, `position`, `any`, and `all` #129
Conversation
I've included some micro-benchmarks here, and made some comparisons with #128. The biggest differentiator appears to be whether we check for completion within the folder loop, so I'll show each one both ways. This is running on Fedora 24 on an i7-2600 (4 cores plus HT). Niko's as-is, without folder checking:
Niko's modified with folder-checking:
Mine modified without folder-checking:
Mine as-is with folder-checking:
I guess the searches in these benchmarks are just too simple, so the overhead of an extra check/branch is actually noticeable. I think that's probably still the right thing to do in general, but that's just intuition.
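For reference, here is a minimal standalone sketch (not the PR's actual code; all names are made up) of the two variants being compared: one that re-checks a shared completion flag on every item inside the folder loop, and one that checks it only once per chunk.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// "With folder checking": re-check the shared flag on every item.
fn find_in_chunk_checked<T: Copy>(
    chunk: &[T],
    found: &AtomicBool,
    pred: impl Fn(&T) -> bool,
) -> Option<T> {
    for item in chunk {
        // The extra per-item branch: bail out as soon as any thread has succeeded.
        if found.load(Ordering::Relaxed) {
            return None;
        }
        if pred(item) {
            found.store(true, Ordering::Relaxed);
            return Some(*item);
        }
    }
    None
}

/// "Without folder checking": test the flag once per chunk only.
fn find_in_chunk_unchecked<T: Copy>(
    chunk: &[T],
    found: &AtomicBool,
    pred: impl Fn(&T) -> bool,
) -> Option<T> {
    if found.load(Ordering::Relaxed) {
        return None;
    }
    let hit = chunk.iter().copied().find(|x| pred(x));
    if hit.is_some() {
        found.store(true, Ordering::Relaxed);
    }
    hit
}

fn main() {
    let data: Vec<u32> = (0..1_000).collect();

    let found = AtomicBool::new(false);
    assert_eq!(find_in_chunk_checked(&data, &found, |&x| x == 37), Some(37));

    let found = AtomicBool::new(false);
    assert_eq!(find_in_chunk_unchecked(&data, &found, |&x| x == 37), Some(37));
}
```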
@cuviper my thoughts:
Rebased to solve the merge conflict.
Yeah, I can't decide how to weigh the convenience that things “just work” against subtle changes in semantics. Part of my thought process was that it would be useful to also have a
I'd expect your code to perform as well on vecs, but I tossed it out there because I'd want to at least measure something larger than a word. =) I had planned to do some kind of benchmark using
Yeah, having slept on it a bit, I'm inclined to say let's just take your branch -- if we find the overhead is a problem for some example, we can move to unsafe code later. I do like having the iterators be largely safe code, as long as there isn't a strong performance case.
I think I'd prefer a
(I'd still keep the relaxed atomic bool as a hint for early exits, so there wouldn't be a ton of mutex serialization.)
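A rough sketch of that mutex-plus-relaxed-flag design, as I read it from the comments (hypothetical names, not the PR's implementation): the flag is only a one-way hint for early exit, and the mutex is only ever touched in the race to record the final answer.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Mutex;
use std::thread;

/// Shared search state: a relaxed "found" hint plus a mutex-guarded result slot.
struct FindState<T> {
    found: AtomicBool,        // one-way false -> true hint for early exit
    result: Mutex<Option<T>>, // only locked in the race to the finish
}

impl<T> FindState<T> {
    fn new() -> Self {
        FindState { found: AtomicBool::new(false), result: Mutex::new(None) }
    }

    /// Record a match. Each worker calls this at most once, so contention is rare.
    fn report(&self, value: T) {
        let mut slot = self.result.lock().unwrap();
        if slot.is_none() {
            *slot = Some(value);
        }
        self.found.store(true, Ordering::Relaxed);
    }
}

fn main() {
    let data: Vec<u32> = (0..1_000).collect();
    let state = FindState::new();

    thread::scope(|s| {
        for chunk in data.chunks(250) {
            let state = &state;
            s.spawn(move || {
                for &x in chunk {
                    if state.found.load(Ordering::Relaxed) {
                        return; // someone else already found a match
                    }
                    if x % 417 == 0 && x > 0 {
                        state.report(x);
                        return;
                    }
                }
            });
        }
    });

    println!("found: {:?}", state.result.lock().unwrap());
}
```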
I went ahead and tried it, so we can find a better middle ground. :) I now have a
To me it all comes down to benchmarks, but I suspect I am mildly more tolerant of unsafe code, particularly if it is simple. Using both a mutex and an atomic feels sad to me. =) That said, I do appreciate your wariness! Still, I'd probably lean towards your original version (or mine, if perf were an issue). One thing about mutexes that's worth noting is that their performance varies dramatically on different operating systems -- what O/S were those measurements taken on? (Also, you could eliminate the possibility of mutex contention by leveraging the atomic a bit better. e.g., use a seq-cst
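The comment above is cut off, but one possible reading of the seq-cst suggestion is to let workers "claim" the result with a single swap, so the winner is decided without ever touching a mutex. A purely hypothetical sketch:

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Exactly one caller ever sees `false` back from the swap, so only that
/// caller goes on to record its result; everyone else just stops searching.
fn try_claim(claimed: &AtomicBool) -> bool {
    !claimed.swap(true, Ordering::SeqCst)
}

fn main() {
    let claimed = AtomicBool::new(false);
    assert!(try_claim(&claimed));  // the first claim wins
    assert!(!try_claim(&claimed)); // later claims lose
}
```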
OK @cuviper I went ahead and whipped up some more benchmarks. In particular, I replicated your benchmarks but with data of size 1 (as you had it), 64, and 256. Here are the results, collected into a gist; I did 3 runs of each. Honestly, I haven't had time to even analyze them yet!
OK, I think the TL;DR is that it doesn't make much difference what scheme we use. Here is some analysis to show you what I mean. =)
They're so close I have to wonder if I screwed it up somehow. But I'm pretty sure I didn't. Anyway, the source is on my branch; you can switch the impl by changing the
How many cores are you using? The choice of atomic primitive may have different performance behaviors with higher/lower contention. Anyway, if the performance is in the noise, I like the reduce implementation best, as the semantics are clearer to me and I think it would have the least contention with many cores.
@edre oh yeah, I should have clarified; this would be an 8x2-core CPU, I believe, running Linux. I also agree that the reduce-based version is the best if perf is the same.
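For illustration, the semantics of the reduce-based approach can be sketched over today's public rayon API (the real implementation works at the internal consumer level, where chunks can also stop early; this sketch skips the early exit entirely). Each chunk produces an `Option`, and the options are simply `or`'d together, so there is no shared mutable state to reason about.

```rust
use rayon::prelude::*;

/// Reduce-style find: every chunk yields an `Option`, combined with `or`.
fn find_by_reduce<T, P>(data: &[T], pred: P) -> Option<T>
where
    T: Copy + Send + Sync,
    P: Fn(&T) -> bool + Sync,
{
    data.par_chunks(256)
        .map(|chunk| chunk.iter().copied().find(|x| pred(x)))
        .reduce(|| None, |a, b| a.or(b))
}

fn main() {
    let data: Vec<u32> = (0..100_000).collect();
    println!("{:?}", find_by_reduce(&data, |&x| x % 31_337 == 0 && x > 0));
}
```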
@cuviper so I thought about it some more, and I think we should call it
Basically, it seems like switching to
One other thought I had is keeping
You might well be more tolerant, but you're also more aware than most of the gray areas around unsafe code, Mr. Tootsie Pop. :) I'm moderately OK with unsafe code for memory tricks, like those I used to implement the vector par_iter, but I'm really wary of hand-written synchronization.
I suppose, but it's with reason. The mutex is used to gate concurrent access, as mutexes do. The atomic is simply a global flag which doesn't need any synchronization since it's only one-way (false->true), and in a looser language I probably wouldn't even use an atomic construct.
Linux - Fedora 24 x86_64. But the performance of this mutex doesn't really matter, as it's only ever called in the race to the finish. At worst it will only be called once in each rayon thread, and even that is uncontended because I only used
This is the sort of thing
Thanks for measuring! I'm not really surprised that these variations are in the noise -- the reduce version has only a handful of copies as I mentioned before, and the mutex version only ever even tries to lock once per thread. And your version is roughly just an unsafe-hand-rolled equivalent to my mutex version.
So the TODOs:
Anything else? I'll work on this more tonight...
@cuviper I think I'd leave off the expanded benchmark sizes; maybe incorporate just the macro. The main reason is that the larger sizes take forever to run, and they didn't prove particularly informative.
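The macro in question isn't shown in this thread; purely as a hypothetical sketch, a size-parameterized benchmark macro might look something like this (nightly-only, as in a benchmark file; every name here is made up).

```rust
#![feature(test)]
extern crate test;

// Generate the same benchmark body for several input sizes,
// so adding a new size is a one-line change.
macro_rules! find_benches {
    ($($name:ident: $len:expr,)*) => {
        $(
            #[bench]
            fn $name(b: &mut test::Bencher) {
                let data: Vec<u32> = (0..$len).collect();
                // Worst case: the match is the very last element.
                b.iter(|| data.iter().find(|&&x| x == $len - 1));
            }
        )*
    };
}

find_benches! {
    find_len_1_000: 1_000,
    find_len_100_000: 100_000,
}
```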
Indeed. It makes sense, but always good to check.
@cuviper (but yes that looks great!)
Looks good to me! Nice.
This is an independent implementation from #128, which just happened to be developed around the same time. I think the lack of `unsafe` is an advantage here, but performance needs to be compared.
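For context, a short usage sketch of the methods this PR adds, written against today's rayon API: `any` and `all` kept their names, while the plain `find`/`position` added here were later renamed to `find_any`/`position_any`, since any matching element may win the race.

```rust
use rayon::prelude::*;

fn main() {
    let v: Vec<i32> = (0..10_000).collect();

    // `any` and `all` still exist under these names.
    assert!(v.par_iter().any(|&x| x == 9_999));
    assert!(v.par_iter().all(|&x| x >= 0));

    // Modern names for the `find`/`position` added in this PR.
    assert_eq!(v.par_iter().position_any(|&x| x == 5_000), Some(5_000));
    let hit = v.par_iter().find_any(|&&x| x > 9_000);
    assert!(hit.is_some());
}
```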