Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Simplify zip iterator and improve performance in certain cases #27415

Closed
wants to merge 4 commits into from

Conversation

haampie
Copy link
Contributor

@haampie haampie commented Jun 4, 2018

This is a separate pr built on top of #27386. It removes the Zip2 struct, since what used to be Zip1 is already returning a 1-tuple and hence is a base case (turns out @StefanKarpinski was right in the comments here: eaf5a95, but maybe only after Zip1 was updated.)

Further it fixes a bug in isdone where the tail of the zip iterator was never checked (https://github.com/JuliaLang/julia/blob/master/base/iterators.jl#L373). Should still add a test for this.

I've built this on top of #27386 to show that we now finally get fully optimized code in some cases -- note the lack of @inbounds:

$ ./julia
> using BenchmarkTools
> function f(v)
     s = 0
     for (i, w) = zip(LinearIndices(v), v)
       s += i * w
     end
     s
   end
> @benchmark f(v) setup = (v = rand(Int, 1000))

On current master:

  median time:      546.735 ns (0.00% GC)

On this branch:

  median time:      245.483 ns (0.00% GC)

This is now on par with writing for (i, w) = enumerate(v) and with the handwritten loop

@inbounds for i = 1 : length(v)
  s += i * v[i]
end

@StefanKarpinski
Copy link
Member

Very cool! Test failure in test/iterators looks legitimate though. It would be nice to be able to define enumerate(v) simply as zip(LinearIndices(v), v) and have that be maximally efficient.

@haampie
Copy link
Contributor Author

haampie commented Jun 4, 2018

Yeah, test failures are annoying, but potentially there's still a workaround without the machinery proposed in #27412.

For vectorization it seems crucial to call iterate first on both iterators, and only afterwards check the state. This corresponds to the case isdone(a) === isdone(b) === missing. Otherwise we have at least one stateful iterator, in which case early returns are required.

@haampie
Copy link
Contributor Author

haampie commented Jun 4, 2018

It seems this passes the iterators tests while retaining the performance in the example above :), hm, not really, apparently some last changes spoiled it.

Can we tell the compiler somehow the first branch in zip_iterate is almost always true?

@haampie
Copy link
Contributor Author

haampie commented Jun 4, 2018

Alright, things get vectorized again. This stuff is extremely fragile, so maybe I should add comments with warnings about spoiling vectorization.

It could be way more robust and simpler if we did not have to worry about stateful iterators.

edit: I stumbled upon an inference issue in the case of 3+ iterators -- let's see if it can be fixed.

@bramtayl
Copy link
Contributor

bramtayl commented Jun 5, 2018

n.b. #26765

@haampie
Copy link
Contributor Author

haampie commented Jun 5, 2018

Thanks for the pointer, I wasn't aware of that issue.

@haampie
Copy link
Contributor Author

haampie commented Jun 5, 2018

Some observations: https://gist.github.com/haampie/5dd6803b217709c4629d3eeb75c16258

Basically, we can do really well if we specialize for stateless iterators. Type inference seems perfect even with recursion up to 14 calls of iterate deep (cc @bramtayl); some daunting looking loops that have closed-form expressions get optimized away; and vectorization kicks in.

I cannot get all these great results to work with the current way stateful iterators are handled. If I find time I'll take a look at how we could use simple traits to work around this.

@haampie
Copy link
Contributor Author

haampie commented Jun 6, 2018

I think I might give up on this pr -- at least on the recursive structure. It's really more of an art than a science to get performance right.

The results in the comment above look promising, but in turns out it does not generate optimal code for

function vectorize_multiple_products(xs::Vector{T}, ys::Vector{T}, zs::Vector{T}) where {T}
    s = zero(T)
    for (x,y,z) = zip(xs, ys, zs)
        s += x * y * z
    end
    s
end

@benchmark vectorize_multiple_products(xs, ys, zs) = setup(xs = rand(1000); ys = rand(1000); zs = rand(1000);)

which is about 60% slower than the current implementation of zip; unless one interchanges the tail = iterate(z.tail, state[2]) and head === nothing && return nothing lines, but then it does not recognize it can simplify optimize_away_loop_with_closed_form. In the recursive setting it seems virtually impossible to fix all problems.

@haampie haampie closed this Jun 6, 2018
@haampie haampie deleted the simplify-zip branch June 6, 2018 14:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants