
feat: track memory consumers for GreedyMemoryPoolState #9015

Conversation

@asimsedhain (Contributor) commented Jan 27, 2024

Which issue does this PR close?

Closes #6934

Rationale for this change

What changes are included in this PR?

Modified GreedyMemoryPool and FairMemoryPool to keep track of all their consumers so that they can be dumped for debugging.

The debug output looks as follows:

[Screenshots of the debug output, 2024-01-29]
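
For context, a minimal sketch of the general idea -- a pool keeping a per-consumer usage map and rendering it via Display. Names such as TrackedPool and members are placeholders for illustration, not the PR's actual types:

    use std::collections::HashMap;
    use std::fmt;
    use std::sync::Mutex;

    struct TrackedPool {
        capacity: usize,
        // Consumer name -> bytes currently reserved (names assumed unique here).
        members: Mutex<HashMap<String, usize>>,
    }

    impl fmt::Display for TrackedPool {
        fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
            let members = self.members.lock().unwrap();
            let used: usize = members.values().sum();
            let free = self.capacity.saturating_sub(used);

            // Sort consumers by reserved bytes, largest first, so the biggest
            // consumer appears at the top of the dump.
            let mut allocations: Vec<_> = members.iter().collect();
            allocations.sort_by(|(_, a), (_, b)| a.cmp(b).reverse());

            writeln!(
                f,
                "Pool {} allocations, {} used, {} free, {} capacity.\nConsumers:",
                members.len(),
                used,
                free,
                self.capacity
            )?;
            for (name, bytes) in allocations {
                writeln!(f, "{bytes}: {name}")?;
            }
            Ok(())
        }
    }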

Are these changes tested?

Are there any user-facing changes?

@alamb (Contributor) commented Jan 27, 2024

This looks very cool!

@asimsedhain asimsedhain force-pushed the df-mem-pool/6934/debug-dump-memory-pool branch from c39dc03 to a68c353 Compare January 29, 2024 06:37
@asimsedhain asimsedhain marked this pull request as ready for review January 29, 2024 06:39
@asimsedhain asimsedhain force-pushed the df-mem-pool/6934/debug-dump-memory-pool branch from a68c353 to 76ce658 Compare January 29, 2024 16:23
@alamb (Contributor) left a comment

Thank you for this contribution @asimsedhain -- this is a great start. I am slightly worried it will cause a performance regression

I am running some benchmarks now and will report results here

datafusion/execution/src/memory_pool/pool.rs
let free = self.pool_size.saturating_sub(used);

let mut allocations = state.pool_members.iter().collect::<Vec<_>>();
allocations.sort_by(|(_, a), (_, b)| a.cmp(b).reverse());
@alamb (Contributor): 💯

@alamb (Contributor) commented Jan 31, 2024

Benchmark results (I don't really see any noticeable change). Not sure why it looks like this branch is faster in some cases. I'll rerun to double check

--------------------
Benchmark tpch_mem.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃ main_base ┃ df-mem-pool_6934_debug-dump-memory-pool ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │  209.82ms │                                208.29ms │     no change │
│ QQuery 2     │   42.70ms │                                 43.69ms │     no change │
│ QQuery 3     │   76.14ms │                                 76.32ms │     no change │
│ QQuery 4     │   74.19ms │                                 73.08ms │     no change │
│ QQuery 5     │  120.29ms │                                118.37ms │     no change │
│ QQuery 6     │   16.58ms │                                 15.76ms │     no change │
│ QQuery 7     │  310.74ms │                                300.37ms │     no change │
│ QQuery 8     │   74.19ms │                                 74.38ms │     no change │
│ QQuery 9     │  124.90ms │                                123.22ms │     no change │
│ QQuery 10    │  150.40ms │                                147.62ms │     no change │
│ QQuery 11    │   34.56ms │                                 34.43ms │     no change │
│ QQuery 12    │   66.94ms │                                 64.90ms │     no change │
│ QQuery 13    │   83.11ms │                                 84.24ms │     no change │
│ QQuery 14    │   24.58ms │                                 24.65ms │     no change │
│ QQuery 15    │   55.08ms │                                 54.35ms │     no change │
│ QQuery 16    │   45.64ms │                                 45.04ms │     no change │
│ QQuery 17    │  169.01ms │                                158.65ms │ +1.07x faster │
│ QQuery 18    │  462.15ms │                                448.99ms │     no change │
│ QQuery 19    │   63.04ms │                                 58.88ms │ +1.07x faster │
│ QQuery 20    │  114.51ms │                                113.25ms │     no change │
│ QQuery 21    │  350.09ms │                                344.54ms │     no change │
│ QQuery 22    │   31.29ms │                                 29.17ms │ +1.07x faster │
└──────────────┴───────────┴─────────────────────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary                                      ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (main_base)                                 │ 2699.94ms │
│ Total Time (df-mem-pool_6934_debug-dump-memory-pool)   │ 2642.17ms │
│ Average Time (main_base)                               │  122.72ms │
│ Average Time (df-mem-pool_6934_debug-dump-memory-pool) │  120.10ms │
│ Queries Faster                                         │         3 │
│ Queries Slower                                         │         0 │
│ Queries with No Change                                 │        19 │
└────────────────────────────────────────────────────────┴───────────┘

┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃  main_base ┃ df-mem-pool_6934_debug-dump-memory-pool ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │     0.90ms │                                  0.93ms │     no change │
│ QQuery 1     │    91.92ms │                                 86.94ms │ +1.06x faster │
│ QQuery 2     │   193.55ms │                                192.43ms │     no change │
│ QQuery 3     │   202.04ms │                                194.44ms │     no change │
│ QQuery 4     │  2035.67ms │                               2064.17ms │     no change │
│ QQuery 5     │  2634.69ms │                               2710.60ms │     no change │
│ QQuery 6     │    80.50ms │                                 83.28ms │     no change │
│ QQuery 7     │    96.91ms │                                 91.12ms │ +1.06x faster │
│ QQuery 8     │  3141.36ms │                               3072.50ms │     no change │
│ QQuery 9     │  2238.23ms │                               2249.01ms │     no change │
│ QQuery 10    │   778.65ms │                                805.28ms │     no change │
│ QQuery 11    │   851.89ms │                                892.85ms │     no change │
│ QQuery 12    │  2480.74ms │                               2476.18ms │     no change │
│ QQuery 13    │  4782.56ms │                               4655.77ms │     no change │
│ QQuery 14    │  2708.77ms │                               2730.93ms │     no change │
│ QQuery 15    │  2265.00ms │                               2311.53ms │     no change │
│ QQuery 16    │  5628.48ms │                               5469.02ms │     no change │
│ QQuery 17    │  5499.17ms │                               5365.51ms │     no change │
│ QQuery 18    │ 11546.73ms │                              11089.49ms │     no change │
│ QQuery 19    │   160.21ms │                                156.05ms │     no change │
│ QQuery 20    │  2627.86ms │                               2683.35ms │     no change │
│ QQuery 21    │  3367.59ms │                               3493.11ms │     no change │
│ QQuery 22    │  9249.51ms │                               9366.63ms │     no change │
│ QQuery 23    │ 21063.00ms │                              21867.58ms │     no change │
│ QQuery 24    │  1325.68ms │                               1395.85ms │  1.05x slower │
│ QQuery 25    │  1149.54ms │                               1184.26ms │     no change │
│ QQuery 26    │  1461.56ms │                               1491.37ms │     no change │
│ QQuery 27    │  3855.43ms │                               3903.53ms │     no change │
│ QQuery 28    │ 30624.92ms │                              27627.27ms │ +1.11x faster │
│ QQuery 29    │  1061.67ms │                               1031.11ms │     no change │
│ QQuery 30    │  2473.21ms │                               2359.27ms │     no change │
│ QQuery 31    │  3170.14ms │                               3067.15ms │     no change │
│ QQuery 32    │ 16379.80ms │                              15484.19ms │ +1.06x faster │
│ QQuery 33    │ 12534.18ms │                              11905.42ms │ +1.05x faster │
│ QQuery 34    │ 13092.84ms │                              12631.76ms │     no change │
│ QQuery 35    │  3839.99ms │                               3575.27ms │ +1.07x faster │
│ QQuery 36    │   414.06ms │                                434.18ms │     no change │
│ QQuery 37    │   253.10ms │                                245.90ms │     no change │
│ QQuery 38    │   195.16ms │                                191.95ms │     no change │
│ QQuery 39    │  1094.19ms │                               1051.57ms │     no change │
│ QQuery 40    │    84.61ms │                                 86.44ms │     no change │
│ QQuery 41    │    81.51ms │                                 79.40ms │     no change │
│ QQuery 42    │    89.56ms │                                 89.58ms │     no change │
└──────────────┴────────────┴─────────────────────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ Benchmark Summary                                      ┃             ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ Total Time (main_base)                                 │ 176907.09ms │
│ Total Time (df-mem-pool_6934_debug-dump-memory-pool)   │ 171944.20ms │
│ Average Time (main_base)                               │   4114.12ms │
│ Average Time (df-mem-pool_6934_debug-dump-memory-pool) │   3998.70ms │
│ Queries Faster                                         │           6 │
│ Queries Slower                                         │           1 │
│ Queries with No Change                                 │          36 │
└────────────────────────────────────────────────────────┴─────────────┘

@asimsedhain (Contributor, Author) commented Feb 1, 2024

> Benchmark results (I don't really see any noticeable change). Not sure why it looks like this branch is faster in some cases. I'll rerun to double check

I am also surprised this branch is faster.

Thank you for the review! Will address the comments shortly.

@alamb (Contributor) commented Feb 1, 2024

I reran the benchmarks and basically they showed similar results: performance differences that were likely mostly noise

@alamb alamb marked this pull request as draft February 9, 2024 01:58
@alamb (Contributor) commented Feb 9, 2024

Marking as a draft so it is clear this PR is not waiting on reviews for now. I am trying to keep our review backlog down. Please mark this as ready for review when it is ready for another look

@alamb (Contributor) left a comment

Thanks @asimsedhain 🙏

@asimsedhain asimsedhain marked this pull request as ready for review February 13, 2024 06:09
@asimsedhain (Contributor, Author)

@alamb a soft bump on this PR for another review.

@asimsedhain asimsedhain reopened this Feb 27, 2024
@alamb (Contributor) commented Feb 27, 2024

Thanks @asimsedhain -- sorry I was out last week. I will put this on my review queue for tomorrow.

alamb previously approved these changes Feb 29, 2024
@alamb (Contributor) left a comment

Thanks @asimsedhain -- this looks great. My only concern is about using saturating_sub rather than checked_sub as it could potentially mask bugs, but otherwise I think this PR looks good to go.

Sorry again for the delay in review

datafusion/execution/src/memory_pool/pool.rs

assert_eq!(
    format!("{}", greedy_pool),
    "GreedyPool 2 allocations, 421 used, 579 free, 1000 capacity.\nConsumers:\n321: r2\n100: r1\n"
);
@alamb (Contributor): 👍

datafusion/execution/src/memory_pool/pool.rs
if consumer.can_spill {
    let mut state = self.state.lock();
-   state.num_spill = state.num_spill.checked_sub(1).unwrap();
+   state.num_spill = state.num_spill.saturating_sub(1);
@alamb (Contributor): I wonder if there is a reason to use saturating_sub rather than checked_sub?
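
For reference, the behavioural difference between the two on underflow (plain std Rust, independent of the DataFusion code above):

    fn main() {
        let num_spill: usize = 0;

        // saturating_sub silently clamps at zero, so an unregister without a
        // matching register (or a double decrement) goes unnoticed.
        assert_eq!(num_spill.saturating_sub(1), 0);

        // checked_sub returns None on underflow; unwrapping it turns the same
        // accounting bug into an immediate, visible panic.
        assert_eq!(num_spill.checked_sub(1), None);
        // num_spill.checked_sub(1).unwrap(); // this line would panic
    }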

@asimsedhain (Contributor, Author) commented Mar 1, 2024

> Thanks @asimsedhain -- this looks great. My only concern is about using saturating_sub rather than checked_sub as it could potentially mask bugs, but otherwise I think this PR looks good to go.
>
> Sorry again for the delay in review

True, I didn't consider that it could mask potential bugs. Replaced it with checked_sub.
Thank you for the review!

The CI failed. Will check on it in the morning.

@alamb (Contributor) commented Mar 1, 2024

Thanks @asimsedhain

@asimsedhain (Contributor, Author)

I believe I found the reason why the CI is failing after adding the checked_sub calls.
I traced out all the calls issued to the memory pool in the test aggregate_source_with_yielding_with_spill.

    1 Register: GroupedHashAggregateStream[0]
    2 GroupedHashAggregateStream[0] about to grow. additional: 120 before grow used: 0 after grow used: 120
    3 GroupedHashAggregateStream[0] about to shrink. shrink: 120 before shrink used: 120 after shrink used: 0
    4 Unregister: GroupedHashAggregateStream[0]
    5 Register: GroupedHashAggregateStream[0]
    6 Register: GroupedHashAggregateStream[0]
    7 GroupedHashAggregateStream[0] about to grow. additional: 208 before grow used: 0 after grow used: 208
    8 GroupedHashAggregateStream[0] about to grow. additional: 120 before grow used: 0 after grow used: 328
    9 GroupedHashAggregateStream[0] about to shrink. shrink: 48 before shrink used: 328 after shrink used: 280
   10 GroupedHashAggregateStream[0] about to shrink. shrink: 40 before shrink used: 280 after shrink used: 240
   11 GroupedHashAggregateStream[0] about to shrink. shrink: 120 before shrink used: 240 after shrink used: 120
   12 Unregister: GroupedHashAggregateStream[0]
   13 GroupedHashAggregateStream[0] about to grow. additional: 348 before grow used: 0 after grow used: 468
   14 GroupedHashAggregateStream[0] about to grow. additional: 1248 before grow used: 0 after grow used: 1716
   15 GroupedHashAggregateStream[0] about to grow. additional: 120 before grow used: 0 after grow used: 1836
   16 GroupedHashAggregateStream[0] about to shrink. shrink: 36 before shrink used: 1836 after shrink used: 1800
   17 GroupedHashAggregateStream[0] about to grow. additional: 36 before grow used: 0 after grow used: 1836
   18 GroupedHashAggregateStream[0] about to shrink. shrink: 36 before shrink used: 1836 after shrink used: 1800
   19 GroupedHashAggregateStream[0] about to shrink. shrink: 12 before shrink used: 1800 after shrink used: 1788
   20 GroupedHashAggregateStream[0] about to shrink. shrink: 56 before shrink used: 1788 after shrink used: 1732
   21 GroupedHashAggregateStream[0] about to shrink. shrink: 1596 before shrink used: 1732 after shrink used: 136
   22 GroupedHashAggregateStream[0] about to shrink. shrink: 136 before shrink used: 136 after shrink used: 0
   23 Unregister: GroupedHashAggregateStream[0] 

On lines 5 and 6, two consumers register themselves with the same name (these two aggregations happen here).

Seems like the assumption that each consumer has a unique name is not correct.
Another approach I can think of right now is to store a Weak ref to the memory consumer, but I believe that might change the MemoryPool API.
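
(For illustration, a hypothetical sketch of how name-keyed bookkeeping breaks down when two consumers share a name -- not the PR's actual code:)

    use std::collections::HashMap;

    fn main() {
        // Hypothetical name-keyed accounting, for illustration only.
        let mut pool_members: HashMap<String, usize> = HashMap::new();
        let name = "GroupedHashAggregateStream[0]".to_string();

        // Two distinct consumers register and grow under the same name.
        *pool_members.entry(name.clone()).or_insert(0) += 208; // consumer A
        *pool_members.entry(name.clone()).or_insert(0) += 120; // consumer B

        // Consumer A unregisters and its entry is removed wholesale:
        // consumer B's 120 bytes disappear from the books, and a later
        // shrink by B would underflow a checked_sub on the shared counter.
        pool_members.remove(&name);
        assert!(pool_members.get(&name).is_none());
    }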

@alamb could you let me know if you have any suggestions on how to proceed?

@alamb (Contributor) commented Mar 4, 2024

Hi @asimsedhain - I have this PR on my list to review, but I sadly have limited bandwidth this week

@asimsedhain (Contributor, Author)

> Hi @asimsedhain - I have this PR on my list to review, but I sadly have limited bandwidth this week

No worries @alamb. Whenever you get some free time.
I will try to see if I can implement the approach I mentioned above.

@alamb alamb dismissed their stale review March 10, 2024 09:44

CI is having trouble, let's sort that out first

@yjshen (Member) commented Mar 11, 2024

> Seems like the assumption that each consumer has a unique name is not correct.

Introduce an ID field to the memory consumer, track memory usage with consumer ID, and do ID -> Name conversion at the time of usage dump?
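
A rough sketch of that suggestion; the names (ConsumerEntry, PoolMembers, NEXT_CONSUMER_ID) are hypothetical and not DataFusion's actual API:

    use std::collections::HashMap;
    use std::sync::atomic::{AtomicUsize, Ordering};

    // Each consumer receives a process-unique id at registration time.
    static NEXT_CONSUMER_ID: AtomicUsize = AtomicUsize::new(0);

    struct ConsumerEntry {
        // Kept only for display; uniqueness is guaranteed by the id.
        name: String,
        reserved: usize,
    }

    #[derive(Default)]
    struct PoolMembers {
        by_id: HashMap<usize, ConsumerEntry>,
    }

    impl PoolMembers {
        fn register(&mut self, name: &str) -> usize {
            let id = NEXT_CONSUMER_ID.fetch_add(1, Ordering::Relaxed);
            self.by_id.insert(
                id,
                ConsumerEntry {
                    name: name.to_string(),
                    reserved: 0,
                },
            );
            id
        }

        fn unregister(&mut self, id: usize) {
            // Removing by id only affects this consumer, even if another
            // consumer registered under the same human-readable name.
            self.by_id.remove(&id);
        }

        fn dump(&self) -> Vec<(String, usize)> {
            // The id -> name conversion happens only at dump time.
            self.by_id
                .values()
                .map(|e| (e.name.clone(), e.reserved))
                .collect()
        }
    }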

@alamb (Contributor) commented Mar 29, 2024

I am sorry @asimsedhain -- I am struggling to find the contiguous time needed to review this PR, as it makes changes that look potentially disruptive in a very important subsystem (the memory manager).

I actually think it is a really important debugging feature (e.g. @westonpace was just asking me the other day about where the memory was going when a query hit the greedy pool's memory limit)

@westonpace (Member)

Yes, I'll chime in a bit as I did end up finding the root issue. My plan had both SortExec and SortPreservingMergeExec, and I had incorrectly assumed SortPreservingMergeExec was not a spillable operator, so I was confused about why I was getting out-of-memory errors. Having a tool like the one you've shown to visualize memory usage would have allowed me to figure that out more quickly.

Also, we have some customers that want to know, in great detail, why various operations need RAM and what the limits of that RAM will be. I think this could be just a start into greater RAM observability.

github-actions (bot)

Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.

@github-actions github-actions bot added the Stale PR has not had any activity for some time label May 29, 2024
@alamb alamb marked this pull request as draft June 4, 2024 20:44
@alamb (Contributor) commented Jun 4, 2024

Converting to draft -- I am sorry this one never quite made it in.

@alamb alamb closed this Jun 4, 2024
@wiedld (Contributor) commented Jul 17, 2024

Note that the need for some tracking of memory consumers has been raised again. Not sure if we plan on reviving/reusing some of the work done already here by @asimsedhain as a first step.

@alamb (Contributor) commented Jul 17, 2024

My concern with this PR was that I could not quite convince myself / find the time to ensure the complexity was necessary... I feel bad we didn't find time to get it over the line. Maybe we can revive it

Labels: Stale (PR has not had any activity for some time)
Projects: none yet
Linked issue (would close on merge): Debug dump of MemoryManager
5 participants