Reading missings is twice as slow as reading values #129

cstjean · 2019-12-19T17:14:48Z

julia> using DataFrames, Feather, BenchmarkTools

julia> N = 100_000_000;

julia> df1 = DataFrame(x=Union{Float32, Missing}[missing for _ in 1:N]);

julia> df2 = DataFrame(x=Union{Float32, Missing}[1.1 for _ in 1:N]);

julia> Feather.write("test1.feather", df1);

julia> Feather.write("test2.feather", df2);

julia> @btime Feather.materialize("test1.feather");
  1.028 s (436 allocations: 953.69 MiB)

julia> @btime Feather.materialize("test2.feather");
  435.714 ms (421 allocations: 762.96 MiB)

cstjean · 2019-12-19T17:47:40Z

julia> typeof(Feather.materialize("test2.feather").x)
Array{Float32,1}

I'd forgotten that Feather drops the Missing from the eltype when a column contains no missings. That explains it...

Reading a 50% mix is about 40% slower than reading all-missing (1.434 s vs. 1.028 s):

julia> df5 = DataFrame(x=Union{Float32, Missing}[rand()<0.5 ? missing : 1.1 for _ in 1:N]);

julia> Feather.write("test5.feather", df5);

julia> @btime Feather.materialize("test5.feather");
  1.434 s (436 allocations: 953.69 MiB)

However, that could be explained by poor branch prediction, I suppose? Maybe there isn't anything concrete to be done; I know your code is already highly optimized.

ExpandingMan · 2019-12-19T17:55:31Z

There is a lot more overhead for reading and writing arrays with missings than arrays without. This is just because of how the Arrow format works.
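
To illustrate where that overhead comes from: Arrow stores a nullable column as a values buffer plus a validity bitmap (one bit per element, least-significant bit first, 1 meaning "valid"), so a reader has to consult a bit for every element instead of just copying the buffer. The following is a minimal sketch of that idea. The helper name `decode_nullable` is hypothetical, and this is not Feather.jl's actual implementation.

```julia
# Hypothetical helper for illustration only; not Feather.jl's real code.
# `values` holds one slot per element (slots for nulls contain arbitrary data).
# `validity` is an Arrow-style bitmap: bit (i-1) is 1 when element i is valid.
function decode_nullable(values::Vector{T}, validity::Vector{UInt8}) where {T}
    n = length(values)
    out = Vector{Union{T,Missing}}(undef, n)
    @inbounds for i in 1:n
        byte = validity[((i - 1) >> 3) + 1]          # byte containing the bit
        bit = (byte >> ((i - 1) & 7)) & 0x01         # LSB-first bit order
        out[i] = bit == 0x01 ? values[i] : missing   # per-element branch
    end
    return out
end

# Elements 1 and 3 are valid, element 2 is null.
result = decode_nullable(Float32[1.1, 2.2, 3.3], UInt8[0b00000101])
```

A column without nulls needs none of this per-element work, which is why it can be read with a plain buffer copy or memory map.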

I wouldn't say anything here is "highly optimized", but I have done lots of basic performance sanity checks (for reading at least). Reading arrays without missings is extremely simple, and is therefore pretty much guaranteed to be maximally efficient. Reading arrays with missings is a lot more complicated, so it's much harder for me to state with any confidence whether it's close to saturating the theoretical upper limit on performance.

I'm not entirely sure why reading all missings is faster, but it may have something to do with the Julia type system (since the eltype here is Union{Missing,T}).
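
One measurable effect of the Union eltype shows up in the allocation figures earlier in this thread. Julia stores an array of a small isbits Union such as Union{Float32,Missing} as the data bytes plus one type-tag byte per element, so each element takes 5 bytes instead of 4. The short check below compares that expected ratio with the reported allocations. It is only arithmetic on the numbers quoted above and says nothing about where the extra time goes.

```julia
# Allocation totals reported by @btime earlier in the thread (MiB).
reported_union = 953.69   # test1.feather, result eltype Union{Float32,Missing}
reported_plain = 762.96   # test2.feather, result eltype Float32

# Bytes per element: an isbits Union array adds one type-tag byte per element.
bytes_plain = sizeof(Float32)
bytes_union = sizeof(Float32) + 1

observed_ratio = reported_union / reported_plain
expected_ratio = bytes_union / bytes_plain

println("observed ratio: ", round(observed_ratio; digits=3))
println("expected ratio: ", round(expected_ratio; digits=3))
```

The two ratios agree closely, which suggests the extra memory is accounted for by the tag bytes rather than by additional temporary buffers.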

I'd always be happy to improve performance if possible; specific suggestions and PRs are welcome. That said, reading arrays with missings will never be as fast as reading those without, so I don't actually see an issue here. Feel free to re-open this if you find a specific performance problem.

ExpandingMan closed this as completed Dec 19, 2019
