Interaction between `addprocs` and `--heap-size-hint` #50673
Let me just leave my observations from experience hacking A. N. Other (unnamed) language's GC, on Linux. On Linux it's very difficult to determine how much memory is available, due to its overcommitting behaviour and the unavailability of information about the memory usage of everything else running on the machine. But that is the information the OOM killer uses to decide to act, and it acts on some heuristic of biggest and newest. So it's hard for any single process's GC to second-guess the OOM killer and decide how hard it should work to keep memory down to avoid being killed. User-provided hints are the most reasonable way to tell the GC how hard to work; don't let the GC try to calculate this for itself, because the Linux system just can't provide what it needs to do the calculation (or get a bigger machine and turn the OOM killer off 😁). Also, the OOM killer looks at what to kill by cgroup, so if the workers are in different cgroup(s) from the parent, what gets killed may differ.
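The point about Linux not exposing the information a GC would need can be illustrated with a small probe. This is a Linux-only sketch reading standard procfs entries; note that what they report are estimates and commitment heuristics, not a figure a GC could safely budget against:

```julia
# Linux-only sketch: the numbers a GC would need to self-limit are fuzzy.
# MemAvailable is an estimate, and with overcommit_memory = 0 (the default)
# the kernel grants allocations it may not be able to back later.
for line in eachline("/proc/meminfo")
    occursin(r"^(MemTotal|MemAvailable|CommitLimit|Committed_AS):", line) && println(line)
end
println("overcommit_memory = ", strip(read("/proc/sys/vm/overcommit_memory", String)))
println("oom_score         = ", strip(read("/proc/self/oom_score", String)))
```

Even with all of these in hand, none of them tells a single process how close the machine as a whole is to triggering the OOM killer, which is the commenter's point.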
Far from hacky, that would seem to be the right solution (specify manually): Julia can't know what the user is going to run on the workers and therefore what their memory needs are. Maybe the parent needs lots of memory and the workers will be heavily compute-bound on small amounts of memory, or vice versa.
Thanks. I guess I'm just wondering what the default behavior is. At the very least this seems like a documentation issue. I really have no idea what to expect my workers to do here... For the record, I don't really see this on a real workstation (though I do see memory blow up; it doesn't hit OOM errors yet). The more urgent issue is on GitHub Actions, where my multiprocessing tests can sometimes segfault for inexplicable reasons. My guess is that it's an OOM issue.
I should have been clear, my searches found the number from
This finding would seem to confirm that. So workers just behave as if no size hint exists unless one is set explicitly.
Well, but that is not good, right? I mean, that is not what a user would expect. So it should be improved.
Agreed.
@ufechner7 @oscardssmith improvement is good, but what do you suggest as an improvement?
IMO addprocs should probably use the heap limit unless otherwise overruled. (In a dream world they would share the same heap limit capacity, but that's probably hard.)
I've hit this too; addprocs workers don't automatically share much of the main process's configuration. Since 1.9 they at least share the package environment (#43270), but maybe they should also share (some) exeflags? The minimum fix here is to update the documentation with a warning.
@oscardssmith personally I'm not sure making all exeflags shared is the right move. Or the simplest solution is to leave it as is and, as @evetion says, document the behaviour and note the use of exeflags.
To me it seems like using the same limit is the right default (since that's the best guess we can make). We should also definitely document the way to overrule it.
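The "same limit, evenly split" idea from this thread could be sketched roughly as follows. This is a hypothetical illustration only: `parse_hint` and `per_worker_hint` are made-up helper names, not existing Julia or Distributed API.

```julia
# Hypothetical sketch of dividing a parent --heap-size-hint across workers.
const UNIT = Dict('K' => 2^10, 'M' => 2^20, 'G' => 2^30)

# Parse a hint string like "16G" into bytes.
parse_hint(s::AbstractString) = round(Int, parse(Float64, s[1:end-1]) * UNIT[uppercase(s[end])])

# Even split of the parent's hint across n workers, re-rendered in MiB.
per_worker_hint(hint, n) = string(parse_hint(hint) ÷ n ÷ 2^20, "M")

per_worker_hint("16G", 4)   # "4096M"
# addprocs(4; exeflags="--heap-size-hint=$(per_worker_hint("16G", 4))")
```

Whether an even split is the right policy is exactly the open question above: a compute-bound worker fleet and a memory-hungry parent would want a very different division.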
@MilesCranmer you mention the workaround of adding the heap size hint as an exeflag when doing `addprocs`. It doesn't seem to have an effect, and I'm experiencing a similar issue. Is this the syntax you used?
@JoelNelson it should be `--heap-size-hint`, not `-limit`. My full call was (on a Rocky Linux Slurm cluster):

```julia
julia> using Distributed, ClusterManagers

julia> procs = addprocs_slurm(8 * 64; exeflags=`--threads=1 --heap-size-hint=1G`)
```

which fixed my memory from exploding.
@MilesCranmer thanks!! I was doing dev inside a client VDI, was re-typing it out on my normal PC, and accidentally used limit instead of hint. However, I trimmed the hint down to 2G from 10G and that did the trick! Appreciate the quick response!
It seems this is causing macOS to hang, because the reported free memory is generally very small. xref: #50673
Is anybody working on this? It seems like a major issue with garbage collection in the distributed interface. I run into OOM errors all the time from this unless I manually set the correct `--heap-size-hint` via `exeflags`.
The new GC pacer algorithm is supposed to help with this, as it dynamically allocates more memory to processes that are getting more CPU time, and forces processes that are getting less CPU (e.g. because they are starting to bottleneck on memory and swap) to use less memory.
I think this may be a feature request/issue report rather than a question, so I am posting here rather than on the discourse.

I am basically wondering: do `Distributed.addprocs` and the CLI flag `--heap-size-hint` interact with each other, and, if so, how? If I specify `--heap-size-hint` to the head Julia process, and then dynamically create Julia processes with `addprocs`, how (if at all) does it get split up among each of the processes created?

My current workaround is to pass `--heap-size-hint` to `exeflags` in each `addprocs` call, and divvy up the memory into smaller chunks, but this seems a bit hacky.

Just wondering about this because I've run into various segfaults which I've never been able to track down (#47957). I realized that memory usage seems to explode when using multiple processes which are allocated dynamically, but the memory can be controlled by setting, e.g., `--heap-size-hint` when dynamically creating the workers.
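For reference, the workaround described in this issue looks like this in its simplest form. The worker count and the 2G figure are assumed example values, not recommendations; `exeflags` is the documented `addprocs` keyword used throughout the thread:

```julia
using Distributed

# Each worker gets its own explicit hint via exeflags; without this,
# workers currently start with no heap size hint at all, regardless of
# what --heap-size-hint the parent process was launched with.
addprocs(4; exeflags="--heap-size-hint=2G")
```

The memory budget still has to be divided up by hand, which is the "bit hacky" part the issue asks to improve.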