Interaction between `addprocs` and `--heap-size-hint` #50673
Let me just leave my observations from experience hacking A. N. Other (unnamed) language's GC, on Linux. On Linux it's very difficult to determine how much memory is available, due to its overcommitting behaviour and the unavailability of information about the memory usage of everything else running on the machine. But that is the information the OOM killer uses to decide to act, and it acts on some heuristic of biggest and newest. So it's hard for any single process's GC to second-guess the OOM killer and decide how hard it should work to keep memory down to avoid being killed. User-provided hints are the most reasonable way to tell the GC how hard to work; don't let the GC try to calculate this for itself, because the Linux system just can't provide what it needs to do the calculation (or get a bigger machine and turn the OOM killer off 😁). Also, the OOM killer looks at what to kill by cgroup, so if the workers are in different cgroup(s) from the parent, what gets killed may differ.
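The point about Linux not exposing the information a GC would need can be illustrated with a small probe. This is a Linux-only sketch reading standard procfs entries; note that what they report are estimates and commitment heuristics, not a figure a GC could safely budget against:

```julia
# Linux-only sketch: the numbers a GC would need to self-limit are fuzzy.
# MemAvailable is an estimate, and with overcommit_memory = 0 (the default)
# the kernel grants allocations it may not be able to back later.
for line in eachline("/proc/meminfo")
    occursin(r"^(MemTotal|MemAvailable|CommitLimit|Committed_AS):", line) && println(line)
end
println("overcommit_memory = ", strip(read("/proc/sys/vm/overcommit_memory", String)))
println("oom_score         = ", strip(read("/proc/self/oom_score", String)))
```

Even with all of these in hand, none of them tells a single process how close the machine as a whole is to triggering the OOM killer, which is the commenter's point.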
Far from hacky, that would seem to be the right solution (specify manually): Julia can't know what the user is going to run on the workers and therefore what their memory needs are. Maybe the parent needs lots of memory and the workers will be heavily compute-bound on small amounts of memory, or vice versa.
Thanks. I guess I'm just wondering what the default behavior is. At the very least this seems like a documentation issue. I really have no idea what to expect my workers to do here... For the record, I don't really see this on a real workstation (though I do see memory blow up; it doesn't hit OOM errors yet). The more urgent issue is on GitHub Actions, where my multiprocessing tests can sometimes segfault for inexplicable reasons. My guess is that it's an OOM issue.
I should have been clear, my searches found the number from
This finding would seem to confirm that. So workers just behave as if no size hint exists unless one is set explicitly.
Well, but that is not good, right? I mean, that is not what a user would expect. So it should be improved.
Agreed.
@ufechner7 @oscardssmith improvement is good, but what do you suggest as an improvement?
IMO addprocs should probably use the heap limit unless otherwise overruled. (In a dream world they would share the same heap limit capacity, but that's probably hard.)
I've hit this too; addprocs workers don't automatically share much of the main process's configuration. Since 1.9 they at least share the package environment (#43270), but maybe they should also share (some) exeflags? The minimum fix here is to update the documentation with a warning.
@oscardssmith personally I'm not sure making all exeflags shared is the right move. Or the simplest solution is to leave it as is and, as @evetion says, document the behaviour and note the use of exeflags.
To me it seems like using the same limit is the right default (since that's the best guess we can make). We should also definitely document the way to overrule it.
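The "same limit, evenly split" idea from this thread could be sketched roughly as follows. This is a hypothetical illustration only: `parse_hint` and `per_worker_hint` are made-up helper names, not existing Julia or Distributed API.

```julia
# Hypothetical sketch of dividing a parent --heap-size-hint across workers.
const UNIT = Dict('K' => 2^10, 'M' => 2^20, 'G' => 2^30)

# Parse a hint string like "16G" into bytes.
parse_hint(s::AbstractString) = round(Int, parse(Float64, s[1:end-1]) * UNIT[uppercase(s[end])])

# Even split of the parent's hint across n workers, re-rendered in MiB.
per_worker_hint(hint, n) = string(parse_hint(hint) ÷ n ÷ 2^20, "M")

per_worker_hint("16G", 4)   # "4096M"
# addprocs(4; exeflags="--heap-size-hint=$(per_worker_hint("16G", 4))")
```

Whether an even split is the right policy is exactly the open question above: a compute-bound worker fleet and a memory-hungry parent would want a very different division.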
@MilesCranmer you mention the workaround of adding the heap size hint as an exeflag when doing `addprocs`. It doesn't seem to have an effect, and I'm experiencing a similar issue. Is this the syntax you used?
@JoelNelson it should be `--heap-size-hint`, not `-limit`. My full call was (on a Rocky Linux Slurm cluster):

```julia
julia> using Distributed, ClusterManagers

julia> procs = addprocs_slurm(8 * 64; exeflags=`--threads=1 --heap-size-hint=1G`)
```

which fixed my memory from exploding.
@MilesCranmer thanks!! I was doing dev inside a client VDI, was re-typing it out on my normal PC, and accidentally used limit instead of hint. However, I trimmed the hint down to 2G from 10G and that did the trick! Appreciate the quick response!
It seems this is causing macOS to hang, because the reported free memory is generally very small. xref: #50673
Is anybody working on this? It seems like a major issue with garbage collection in the distributed interface. I run into OOM errors all the time from this unless I manually set the correct `--heap-size-hint` via `exeflags`.
The new GC pacer algorithm is supposed to help with this, as it dynamically allocates more memory to processes that are getting more CPU time, and forces processes that are getting less CPU (e.g. because they are starting to bottleneck on memory and swap) to use less memory.
I think this may be a feature request/issue report rather than a question, so I am posting here rather than on the discourse.

I am basically wondering: do `Distributed.addprocs` and the CLI flag `--heap-size-hint` interact with each other, and, if so, how? If I specify `--heap-size-hint` to the head Julia process, and then dynamically create Julia processes with `addprocs`, how (if at all) does it get split up among each of the processes created?

My current workaround is to pass `--heap-size-hint` to `exeflags` in each `addprocs` call, and divvy up the memory into smaller chunks, but this seems a bit hacky.

Just wondering about this because I've run into various segfaults which I've never been able to track down (#47957). I realized that memory usage seems to explode when using multiple processes which are allocated dynamically, but the memory can be controlled by setting, e.g., `--heap-size-hint` when dynamically creating the workers.
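For reference, the workaround described in this issue looks like this in its simplest form. The worker count and the 2G figure are assumed example values, not recommendations; `exeflags` is the documented `addprocs` keyword used throughout the thread:

```julia
using Distributed

# Each worker gets its own explicit hint via exeflags; without this,
# workers currently start with no heap size hint at all, regardless of
# what --heap-size-hint the parent process was launched with.
addprocs(4; exeflags="--heap-size-hint=2G")
```

The memory budget still has to be divided up by hand, which is the "bit hacky" part the issue asks to improve.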