Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Resources exhausted errors are confusing return the biggest memory consumers. #11523

Closed
wiedld opened this issue Jul 17, 2024 · 5 comments · Fixed by #11949
Closed

Resources exhausted errors are confusing return the biggest memory consumers. #11523

wiedld opened this issue Jul 17, 2024 · 5 comments · Fixed by #11949
Assignees
Labels
enhancement New feature or request

Comments

@wiedld
Copy link
Contributor

wiedld commented Jul 17, 2024

Is your feature request related to a problem or challenge?

When asking for more memory via the memory reservation, the error returned from the underlying memory pool focuses on that specific request. As a result, when debugging a resource exhausted error, we get an error message that looks something like:

Failed to allocate additional 795696944 bytes for RepartitionExec[1]
with 3182787776 bytes already allocated - 
maximum available is 63296515

This^^ error is about what next incremental request failed to get more memory, and not about what is using the most memory. As a result, additional dev time has to be spent to (a) at best, track down the actual high memory consumer and (b) at worst, wasted time chasing the wrong memory consumer.

Describe the solution you'd like

One possible solution is to have the error message returned the Top K memory consumers.

Describe alternatives you've considered

No response

Additional context

No response

@wiedld wiedld added the enhancement New feature or request label Jul 17, 2024
@alamb alamb changed the title Resources exhuasted errors should return the biggest memory consumers. Resources exhuasted errors are confusing return the biggest memory consumers. Jul 17, 2024
@alamb
Copy link
Contributor

alamb commented Jul 17, 2024

I agree this message is quite confusing

There is a related idea here I think #6934

I think @westonpace mentioned something similar when they were debugging OOM issues too

@comphead
Copy link
Contributor

The message is quite confusing and we can probably make it better, mentioning requested, used, allocated and remaining mem. But I'm not sure if there is an easy way to what takes the mem, as memory reservation has no idea about the caller just the mem pool name.

@wiedld
Copy link
Contributor Author

wiedld commented Jul 17, 2024

But I'm not sure if there is an easy way to what takes the mem, as memory reservation has no idea about the caller just the mem pool name.

Well timed comment @comphead . I was just linking a previous (closed) PR which started down the path of tracking memory consumers (for the purposes of debug dumping). I'm not proposing that as the solution per se, rather giving us an idea of prior work.

@alamb
Copy link
Contributor

alamb commented Jul 22, 2024

I believe @wiedld plans to look at this one later this week. We have seen it with our customers and it is quite confusing.

@alamb
Copy link
Contributor

alamb commented Jul 31, 2024

@wiedld has an initial PR to add TrackedMemoryPool in #11665

Here is my proposal for what remains to close this issue

  • Improving the default error message
  • Using the TrackConsumersPool as the default memory pool

Thoughts on error messages

Message today:

Failed to allocate additional {} bytes for {} with {} bytes already allocated - maximum available is {}

@wiedld 's proposal on https://github.com/apache/datafusion/pull/11665/files#r1693465283

Failed to allocate additional {} bytes for {} with {} bytes already allocated for this reservation - {} bytes remain available for the total pool

I think the new proposal is better as it is clearer what is going on

Thoughts on changing the default pool

The the default is set here:

/// This defaults to using [`GreedyMemoryPool`]

I believe we should change the default pool in DataFusion to be a TrackedMemoryPool<GreedyMemoryPool> otherwise the error messages when a memory limit will continue to have the issues that are described in this tickets.

The only potential issue with changing the default is that the tracking has some additional runtime overhead -- therefore we should run benchmark tests to ensure there is no performance regression.

Also, a nice part of @wiedld 's TrackedMemoryPool design is that someone finds the overhead to be too large, they can change back to the existing GreedyMemoryPool

@wiedld wiedld changed the title Resources exhuasted errors are confusing return the biggest memory consumers. Resources exhausted errors are confusing return the biggest memory consumers. Aug 1, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants