-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add TrackedMemoryPool
with better error messages on exhaustion
#11665
Conversation
…ith top K of consumers
…is actually returning
// Test: reports if new reservation causes error | ||
// using the previous set size for other consumers | ||
let mut r5 = MemoryConsumer::new("r5").register(&pool); | ||
let expected = "Failed to allocate additional 150 bytes for r5 with 0 bytes already allocated - maximum available is 5. The top memory consumers (across reservations) are: r1 consumed 50 bytes, r3 consumed 20 bytes, r2 consumed 15 bytes"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With the proposed change in a follow PR, the final error message would read:
Failed to allocate additional 150 bytes for r5 with 0 bytes already allocated for this reservation - 5 bytes remain available for the total pool. The top memory consumers (across reservations) are: r1 consumed 50 bytes, r3 consumed 20 bytes, r2 consumed 15 bytes
Resources exhausted with top memory consumers (across reservations) are: r1 consumed 50 bytes, r3 consumed 20 bytes, r2 consumed 15 bytes. Error: Failed to allocate additional 150 bytes for r5 with 0 bytes already allocated for this reservation - 5 bytes remain available for the total pool
…different reservations
…mer with the same name, but different hash
/// Constructs a resources error based upon the individual [`MemoryReservation`]. | ||
/// | ||
/// The error references the `bytes already allocated` for the reservation, | ||
/// and not the total within the collective [`MemoryPool`], | ||
/// nor the total across multiple reservations with the same [`MemoryConsumer`]. | ||
#[inline(always)] | ||
fn insufficient_capacity_err( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In a follow up PR, I would like to iterate on this error message.
Take the original/current:
Failed to allocate additional {} bytes for {} with {} bytes already allocated - maximum available is {}
And change into something like:
Failed to allocate additional {} bytes for {} with {} bytes already allocated for this reservation - {} bytes remain available for the total pool
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This^^ suggestion is because this error message is inherently about only the reservation itself. Each reservation can request bytes from the pool -- but the pool itself only incr/decrements totals for the pool. It does NOT track each reservation, nor if multiple reservations have the same consumer.
As a result, the simplest solution is to: (a) update this error message, and (b) continue to rely upon each reservation to resize itself to zero upon drop (as it does).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the idea of updating the error message in a follow on PR
|
||
// Test: see error message when no consumers recorded yet | ||
let mut r0 = MemoryConsumer::new(same_name).register(&pool); | ||
let expected = "Failed to allocate additional 150 bytes for foo with 0 bytes already allocated - maximum available is 100. The top memory consumers (across reservations) are: foo consumed 0 bytes"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With the proposed change in a follow PR, the final error message would read:
Failed to allocate additional 150 bytes for foo with 0 bytes already allocated for this reservation - 100 bytes remain available for the total pool. The top memory consumers (across reservations) are: foo consumed 0 bytes
Resources exhausted with top memory consumers (across reservations) are: foo consumed 0 bytes. Error: Failed to allocate additional 150 bytes for foo with 0 bytes already allocated for this reservation - 100 bytes remain available for the total pool
1795da6
to
1b1223f
Compare
// API: multiple registrations using the same hashed consumer, | ||
// will be recognized as the same in the TrackConsumersPool. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As noted in the struct docs, the TrackConsumersPool is all about the MemoryConsumer. Therefore the top K consumers are returned, even if a given consumer has multiple reservations. This is intended to make the error messages (and subsequent debugging) more useful by correctly reflecting the aggregated top K.
} | ||
|
||
#[test] | ||
fn test_tracked_consumers_pool_deregister() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test is about an existing behavior. The current MemoryPool implementations will register/deregister on the MemoryConsumer level; whereas the bytes incr/decrement is on the MemoryReservation level.
This means that a consumer can be deregistered, even while the reservation still holds memory.
The new TrackConsumersPool has nothing to do with this existing behavior. However, this test is added in order to demonstrate what/how the error messages will read.
@alamb -- I cannot reproduce these test failures locally, even with the same flags & tokio multithreaded tests enabled. Can you help? |
I also could not reproduce it with
One way to debug such failures is to add additional debug logging to the tests -- right now the test failure simply says "failed" with a static message, but it doesn't say what the actual result was For example you could potentially update your asserts like this: diff --git a/datafusion/execution/src/memory_pool/pool.rs b/datafusion/execution/src/memory_pool/pool.rs
index 9ce14e41a..0df795a2b 100644
--- a/datafusion/execution/src/memory_pool/pool.rs
+++ b/datafusion/execution/src/memory_pool/pool.rs
@@ -494,12 +494,13 @@ mod tests {
// using the previously set sizes for other consumers
let mut r5 = MemoryConsumer::new("r5").register(&pool);
let expected = "Failed to allocate additional 150 bytes for r5 with 0 bytes already allocated - maximum available is 5. The top memory consumers (across reservations) are: r1 consumed 50 bytes, r3 consumed 20 bytes, r2 consumed 15 bytes";
+ let actual_result = r5.try_grow(150);
assert!(
matches!(
- r5.try_grow(150),
+ actual_result,
Err(DataFusionError::ResourcesExhausted(e)) if e.to_string().contains(expected)
),
- "should provide list of top memory consumers"
+ "should provide list of top memory consumers, got {actual_result:?}",
);
} And the message from your assertion failure should be more helpful for debugging |
given the tests pass on amd64 maybe the difference is related to newlines or something in the message? |
We have a backtrace inserted into the error message. Specifically, the expected is: When running locally, we get that^^ response.
This means I need to construct the concatenated error message differently. Fixing... |
bf07b81
to
09b20d2
Compare
…ror message" This reverts commit 09b20d2.
…he proper error bubbling (msg wrapping)
fdc60de
to
f75764e
Compare
CI was failing due to the concat of error messages. I fixed it two ways -- in order to demonstrate why the 2nd way is better.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @wiedld -- I think this is looking good
I had some comments but overall I think this could be merged as is with the follow ons you identified (improve the messages)
I also think once we have this in, we should consider using this pool as the default (rather than GreedyPool) given its better user experience, but I think it would be good to make that proposal as a follow on PR as well
/// Constructs a resources error based upon the individual [`MemoryReservation`]. | ||
/// | ||
/// The error references the `bytes already allocated` for the reservation, | ||
/// and not the total within the collective [`MemoryPool`], | ||
/// nor the total across multiple reservations with the same [`MemoryConsumer`]. | ||
#[inline(always)] | ||
fn insufficient_capacity_err( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the idea of updating the error message in a follow on PR
} | ||
|
||
/// The top consumers in a report string. | ||
fn report_top(&self) -> String { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since top
is only used for this report maybe it should be a parameter to report_top
rather than a parameter on the pool 🤔
It seems like someone could want both the "top 10" consumers and "all consumers" from the same tracked pool but with the implementation they could only have one or the other
Is this the message you are proposing to fix in a follow on?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The use of TrackConsumersPool for error reporting (when passed as Arc<dyn MemoryPool>
) is constrained by the trait definition. However, we could use the downcasted struct itself for runtime metrics as shown in this added commit. Is this what you were thinking?
|
||
// Test: will accumulate size changes per consumer, not per reservation | ||
r1.grow(20); | ||
let expected = "Resources exhausted with top memory consumers (across reservations) are: foo consumed 30 bytes. Error: Failed to allocate additional 150 bytes for foo with 20 bytes already allocated - maximum available is 70"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is somewhat confusing that in the same message foo is reported to have two different allocations:
foo consumed 30 bytes
for foo with 20 bytes already allocated
Is it possible to rationalize the errors somehow make this less confusing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree this is confusing. Because the error message for foo with 20 bytes already allocated
is the error from the MemoryReservation, which has 20 bytes.
Whereas the MemoryConsumer happens to have 2 reservations -- with a total of 30 bytes allotted (across reservations).
I'm hoping the followup PR (with the message change for the reservation-specific error) will make this more clear. Does the below read better?
Resources exhausted with top memory consumers (across reservations) are: foo consumed 30 bytes. Error: Failed to allocate additional 150 bytes for foo with 20 bytes already allocated for this reservation - 70 bytes remain available for the total pool
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please feel free to suggest better wording for this^^. 🙏🏼
TrackedMemoryPool
that has better error messages on exhaustion
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @wiedld -- I think this looks good to me
I will note as follow on tasks
- Improving the default error message
- Using the
TrackConsumersPool
as the default memory pool
impl<I: MemoryPool> TrackConsumersPool<I> { | ||
/// Creates a new [`TrackConsumersPool`]. | ||
/// | ||
/// The `top` determines how many Top K [`MemoryConsumer`]s to include |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
TrackedMemoryPool
that has better error messages on exhaustionTrackedMemoryPool
with better error messages on exhaustion
Which issue does this PR close?
Part of #11523
Rationale for this change
The OOMing error message returns information about the next incremental request of memory, and not for the biggest consumers of memory. As a result, we have spent time chasing OOMs in the wrong place.
What changes are included in this PR?
Includes a new MemoryPool implementation, which when (optionally) used with an inner MemoryPool will return a list of the top memory consumers on OOM.
This new
TrackConsumersPool
is optional, such that individual users can determine whether or not to use this capability (along with any associated overhead).Are these changes tested?
Yes.
Are there any user-facing changes?
Yes, as a new
TrackConsumersPool
.