Reading missings is twice as slow as reading values #129

Closed
cstjean opened this issue Dec 19, 2019 · 2 comments


cstjean commented Dec 19, 2019

julia> using DataFrames, Feather, BenchmarkTools

julia> N = 100_000_000;

julia> df1 = DataFrame(x=Union{Float32, Missing}[missing for _ in 1:N]);

julia> df2 = DataFrame(x=Union{Float32, Missing}[1.1 for _ in 1:N]);

julia> Feather.write("test1.feather", df1);

julia> Feather.write("test2.feather", df2);

julia> @btime Feather.materialize("test1.feather");
  1.028 s (436 allocations: 953.69 MiB)

julia> @btime Feather.materialize("test2.feather");
  435.714 ms (421 allocations: 762.96 MiB)

cstjean commented Dec 19, 2019

julia> typeof(Feather.materialize("test2.feather").x)
Array{Float32,1}

I'd forgotten that Feather drops the Missing from the element type when a column contains no missing values, so test2 comes back as a plain Float32 array. That explains it...
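For anyone who wants the two cases to end up with the same element type before comparing them, here is a minimal Base-only sketch of the narrowing and of widening the column back again. It makes no Feather calls; the variable names are just for illustration, and the narrowing step only imitates what the materialized column looks like, not how Feather produces it.

```julia
# Base-only sketch; nothing here calls into Feather.
with_missing = Union{Float32, Missing}[1.1f0 for _ in 1:5]

# Narrow the element type when no missing values are present, which
# imitates the Array{Float32,1} seen after materializing test2.
narrowed = any(ismissing, with_missing) ? with_missing : Vector{Float32}(with_missing)

# Widen it again if downstream code expects a missing-aware column.
widened = convert(Vector{Union{Float32, Missing}}, narrowed)

println(eltype(narrowed))
println(eltype(widened))
```

The widening step allocates a new array, so it should be counted in any timing that is meant to compare like with like.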

Reading a 50% mix is about 40% slower than reading all-missing (1.434 s vs. 1.028 s):

julia> df5 = DataFrame(x=Union{Float32, Missing}[rand()<0.5 ? missing : 1.1 for _ in 1:N]);

julia> Feather.write("test5.feather", df5);

julia> @btime Feather.materialize("test5.feather");
  1.434 s (436 allocations: 953.69 MiB)

However, that could be explained by poor branch prediction, I suppose? Maybe there isn't anything concrete to be done; I know your code is already highly optimized.
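One rough way to probe the branch-prediction guess outside of Feather is to time a loop shaped like a reader's inner loop on a predictable mask and on a random one. This is only a sketch: `fill_column!` is a hypothetical stand-in, not Feather's code, and the compiler may turn the ternary into a branchless select, in which case the two timings would come out about the same.

```julia
using Random

# Hypothetical stand-in for a reader's inner loop: copy a value or write
# `missing` depending on a per-element validity flag.
function fill_column!(out, vals, valid)
    @inbounds for i in eachindex(out, vals, valid)
        out[i] = valid[i] ? vals[i] : missing
    end
    return out
end

n = 1_000_000
vals = rand(Float32, n)
out = Vector{Union{Float32, Missing}}(undef, n)

all_missing = fill(false, n)                      # predictable: always the same path
half_missing = rand(MersenneTwister(1), Bool, n)  # unpredictable: roughly a 50% mix

fill_column!(out, vals, all_missing)              # warm up so compilation is not timed
t_all = @elapsed fill_column!(out, vals, all_missing)
t_half = @elapsed fill_column!(out, vals, half_missing)

println("all missing: ", t_all, " s; 50% mix: ", t_half, " s")
```

If the two timings are close, branch misprediction in a loop like this is probably not the whole story and the difference would have to come from elsewhere in the read path.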


ExpandingMan commented Dec 19, 2019

There is a lot more overhead for reading and writing arrays with missings than for arrays without. This is just because of how the Arrow format works.
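To make the overhead concrete: in the Arrow layout a nullable column carries a validity bitmap next to its values buffer, one bit per element, least-significant bit first, with a set bit meaning "present". The sketch below decodes such a pair into a `Union{T, Missing}` vector. `decode_nullable` is a hypothetical helper written for illustration, not a function from Feather.jl, and it ignores details such as buffer padding and offsets.

```julia
# Hypothetical helper, not part of Feather.jl: decode a nullable column from
# a values buffer plus an Arrow-style validity bitmap (bit = 1 means present).
function decode_nullable(vals::Vector{T}, bitmap::Vector{UInt8}) where {T}
    n = length(vals)
    out = Vector{Union{T, Missing}}(undef, n)
    for i in 1:n
        j = i - 1                                 # zero-based element index
        byte = bitmap[(j >> 3) + 1]               # eight validity bits per byte
        valid = (byte >> (j & 7)) & 0x01 == 0x01  # least-significant bit first
        out[i] = valid ? vals[i] : missing
    end
    return out
end

vals = Float32[1.1, 2.2, 3.3, 4.4]
bits = UInt8[0b00001101]   # elements 1, 3 and 4 present; element 2 missing
col = decode_nullable(vals, bits)
println(col)
```

A column without missings needs none of this: the values buffer can be used as is, which is why that path is so much simpler.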

I wouldn't say anything here is "highly optimized", but I have done lots of basic performance sanity checks (for reading at least). Reading arrays without missings is extremely simple, and is therefore pretty much guaranteed to be maximally efficient. Reading arrays with missings is a lot more complicated, so it's much harder for me to state with any confidence whether it's close to saturating the theoretical upper limit on performance.

I'm not entirely sure why reading all missings is faster than reading a mix, but it may have something to do with the Julia type system (since the eltype here is Union{Missing,T}).
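On the type-system side, one thing that can be checked directly is storage size. Julia stores an array of `Union{Float32, Missing}` inline, with one extra type-tag byte per element, so each element costs 5 bytes instead of 4. The arithmetic below is a sketch using the N from the benchmarks above; it shows that the reported allocation totals are close to twice these per-array figures, but it does not explain where the factor of two comes from.

```julia
T = Union{Float32, Missing}

# true: elements are stored inline, with a one-byte type tag each
println(Base.isbitsunion(T))

n = 100_000_000
plain_bytes = n * sizeof(Float32)        # Vector{Float32}: 4 bytes per element
union_bytes = n * (sizeof(Float32) + 1)  # Vector{Union{Float32,Missing}}: 4 + 1 tag byte

to_mib(b) = b / 2^20

println(to_mib(2 * plain_bytes))  # about 762.94 MiB, vs. 762.96 MiB reported for test2
println(to_mib(2 * union_bytes))  # about 953.67 MiB, vs. 953.69 MiB reported for test1 and test5
```

So a read that yields a `Union{Missing,T}` array moves about 25% more memory for Float32 even before any bitmap handling is counted.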

Of course, I'd always be happy to improve performance if possible; specific suggestions and PRs are welcome. That said, reading arrays with missings will never be as fast as reading those without, so I don't actually see an issue here. Feel free to re-open this if there is a specific performance problem.
