-
-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Simplify zip iterator and improve performance in certain cases #27415
Conversation
Very cool! Test failure in |
Yeah, test failures are annoying, but potentially there's still a workaround without the machinery proposed in #27412. For vectorization it seems crucial to call |
It seems this passes the iterators tests Can we tell the compiler somehow the first branch in |
Alright, things get vectorized again. This stuff is extremely fragile, so maybe I should add comments with warnings about spoiling vectorization. It could be way more robust and simpler if we did not have to worry about stateful iterators. edit: I stumbled upon an inference issue in the case of 3+ iterators -- let's see if it can be fixed. |
Furthermore, reorder the instructions for the fast path to enable vectorization in some cases.
n.b. #26765 |
Thanks for the pointer, I wasn't aware of that issue. |
Some observations: https://gist.github.com/haampie/5dd6803b217709c4629d3eeb75c16258 Basically, we can do really well if we specialize for stateless iterators. Type inference seems perfect even with recursion up to 14 calls of I cannot get all these great results to work with the current way stateful iterators are handled. If I find time I'll take a look at how we could use simple traits to work around this. |
I think I might give up on this pr -- at least on the recursive structure. It's really more of an art than a science to get performance right. The results in the comment above look promising, but in turns out it does not generate optimal code for function vectorize_multiple_products(xs::Vector{T}, ys::Vector{T}, zs::Vector{T}) where {T}
s = zero(T)
for (x,y,z) = zip(xs, ys, zs)
s += x * y * z
end
s
end
@benchmark vectorize_multiple_products(xs, ys, zs) = setup(xs = rand(1000); ys = rand(1000); zs = rand(1000);) which is about 60% slower than the current implementation of zip; unless one interchanges the |
This is a separate pr built on top of #27386. It removes the
Zip2
struct, since what used to beZip1
is already returning a 1-tuple and hence is a base case (turns out @StefanKarpinski was right in the comments here: eaf5a95, but maybe only afterZip1
was updated.)Further it fixes a bug in
isdone
where the tail of the zip iterator was never checked (https://github.com/JuliaLang/julia/blob/master/base/iterators.jl#L373). Should still add a test for this.I've built this on top of #27386 to show that we now finally get fully optimized code in some cases -- note the lack of
@inbounds
:On current master:
On this branch:
This is now on par with writing
for (i, w) = enumerate(v)
and with the handwritten loop