gc weirdness #8055
Comments
I have no insight to give here, but it would be interesting to see whether the behaviour goes away if you use #5227.
It is strange that your replacement only reduces the memory allocation by such a tiny fraction. The second code should not allocate anything in these particular lines, whereas the first one clearly does. What happens to this comparison if you comment out all the code corresponding to […]?
Also make sure that your loop is well typed (use code_typed). If it isn't, preallocating doesn't help much.
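For illustration only, a minimal sketch of the kind of check being suggested; the function below is made up, not code from this report:

```julia
# Hypothetical function with a simple loop; code_typed takes the function
# and a tuple of argument types and returns the typed AST.
function sum_loop(v::Vector{Float64})
    s = 0.0
    for k = 1:length(v)
        s += v[k]
    end
    return s
end

# Inspect the output for local variables inferred as Any or Union types,
# which indicate type instability inside the loop.
code_typed(sum_loop, (Vector{Float64},))
```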
Maybe I should have been clearer that the timing and memory consumption are for running the whole program, and that the tiny part I've shown is the locally most memory-consuming line. The reduction in bytes allocated matches the measurements with --track-allocation and amounts to a healthy 10% lower memory consumption for the whole program. Based on earlier optimizations I've made on that code I had expected total time and gc time to follow suit, but here they go in totally the opposite direction. I haven't really learnt how to read the output of code_typed, but the TypeCheck package didn't come up with any complaints.
I understood this, but it is correct that my remark regarding the "only 10%" is meaningless given that I don't know the rest of the code. Hence my suggestion to just run the code with everything corresponding to […] commented out and check the effect of these lines alone. This should be possible for a given initial x; you would just be running 100 iterations of filling f with the same data. Given that […] corresponds to 1000 more lines of code, could it be that these lines contain something that kills the type stability? For example, you could be using the variable name k at some point in your code to represent something different than an Int. This would kill the type stability for k (I think) and result in a lot of overhead for the loop variable of the second code, which is not present in the first code.
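The kind of reuse being described might look like the sketch below. This is a made-up illustration, not the reported code, and it relies on the Julia 0.3 scoping rules discussed here, in which a for-loop variable shares the enclosing function's scope:

```julia
# Under 0.3 scoping, `k` below is a single function-local variable: it is
# an Int while used as the loop index and a Float64 after the later
# assignment, so inference gives it a Union type and the loop pays for it.
function reuse_k(v::Vector{Float64})
    s = 0.0
    for k = 1:length(v)
        s += v[k]
    end
    k = s / 2          # same name, now bound to a Float64
    return k
end
```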
I'd agree with Keno that a type problem is the most likely explanation. TypeCheck is limited to functions with declared types, so there are a lot of loop variable type problems it misses. Glad you like --track-allocation.
A few updates: inspection of code_typed found a type problem elsewhere in the same function, but fixing that had no impact on this issue. Switching to 0.3.0-rc4 (still Windows 32 bit) made no difference to this behavior either. There is something platform specific to this, however: with 0.3.0-rc4 on Linux 64 bit the issue cannot be seen. There the reduction in bytes allocated is accompanied by a reduction in total time and in gc fraction.
Windows 64 bit also behaves well, so it seems to be a 32 bit issue. |
Linux 32 bit misbehaves too, so this definitely looks like a 32 bit issue. Since nobody else will have any chance to reproduce it, I'll close this issue. I can pick it up again if there's something specific to try that doesn't require too much effort.
This is going to be a terrible bug report and should possibly have gone on a mailing list, but the effect is interesting. Skip the background section if you want to get to the meat of the issue.
Background
The code in question is about 1000 lines, ported from Matlab. After fixing some early problems with type instability and globals, the speed went from 10 times slower than Matlab to somewhat faster. Further optimizations, mostly avoidance of temporary arrays and devectorization, brought it to 4 times faster than Matlab. Profiling and @time indicate that much time is still spent on garbage collection. Using --track-allocation from #7464 (great tool!) I identified the code line using the most memory, and rewriting it did in fact reduce the amount of memory used. However, this caused a 10% slowdown overall and substantially more time spent in gc. Unfortunately the code is entirely proprietary, and considering the global nature of gc I suspect that the problem will just go away if I try to minimize it. Thus this very vague description.
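For context, a minimal sketch of the measurement workflow referred to above; the script name, function, and numbers are placeholders, not the actual code:

```julia
# Run the script with per-line allocation tracking, e.g.
#   julia --track-allocation=user script.jl
# After the process exits, per-line allocation counts are written to a
# .mem file next to each source file.

function work(n)
    s = 0.0
    for k = 1:n
        s += sqrt(k)          # stand-in for the real computation
    end
    return s
end

work(10)                      # warm up so compilation is not measured
@time work(10^7)              # reports elapsed time, bytes allocated, gc time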
Issue
Before optimization (jit warmup taken care of):
In order to reduce memory consumption, the construction on that line, a vectorized expression that allocates temporaries, is changed to an explicit loop (with loop variable k) filling a preallocated array f in place; a hedged sketch of both variants follows the description of x below.
Here x is an instance of a mutable type where x.v is a Vector{Float64} and x.p is a Matrix{Int}, in this case of sizes 1436 and 818560×2 respectively. After this optimization the memory consumption does go down, but total time and the fraction of time spent in gc both increase substantially. This is reproducible with only small variations in timings and gc fractions.
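Since the actual lines are not shown, here is a hedged sketch of what the two variants could look like. The field names match the description of x above, but the indexing pattern, the function names, and the assumption that x.p holds indices into x.v are illustrative guesses, not the reported code:

```julia
# Julia 0.3-era definition; on current Julia this would be `mutable struct`.
type Data
    v::Vector{Float64}   # length 1436 in the report
    p::Matrix{Int}       # size 818560×2 in the report
end

# First variant: vectorized indexing.  Both the column slice x.p[:, 1]
# and the gather x.v[...] allocate fresh arrays on every call.
fill_vectorized(x::Data) = x.v[x.p[:, 1]]

# Second variant: f is allocated once by the caller and filled in place
# with an explicit loop, so these lines should not allocate at all.
function fill_devectorized!(f::Vector{Float64}, x::Data)
    for k = 1:size(x.p, 1)
        f[k] = x.v[x.p[k, 1]]
    end
    return f
end
```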
Any suggestions as to what is going on? Is it reasonable for the gc to behave in this way? Is there some effective way for me to dig deeper into the problem?