Spill to disk #4188

Merged: 1 commit merged into master on Oct 5, 2024
Conversation

@benjaminwinger (Collaborator) commented on Sep 4, 2024

Based on #3743

This feature allows column chunk data produced by the partitioner during rel table copies to be written to disk and freed (to save memory) if the ChunkedNodeGroup in question will not be needed again until the end of the copy.
Every full ChunkedNodeGroup (in the InMemChunkedGroupCollection) is passed to the Spiller, which stores it in a locked unordered_set so it can be quickly removed once it is needed again (we could probably use an EvictionQueue instead, but I'm not convinced the performance benefit would be noticeable).

Data starts to be spilled to disk when the buffer pool is full and at least 50% of the buffer pool is used by column chunks (though this calculation includes chunks which aren't in the fullPartitionerGroups set and can't be spilled).

When the spilled data is needed again, it is loaded back into memory and removed from the fullPartitionerGroups set (which may cause other chunks to be spilled if there is not enough memory; ideally we would make sure to use the chunks already in memory first, but that would be complicated). The memory is then freed afterwards: originally all the chunks were held in memory until the end of RelBatchInsert execution, but that is no longer possible, so we now need mutable access to that data and move it out of the partitioner shared state while processing it so that it can be freed.
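
As a rough illustration of the flow described above, here is a simplified sketch; the class shape, member names, and the shouldSpill helper are illustrative reconstructions from this description, not the PR's actual code:

#include <cstdint>
#include <mutex>
#include <unordered_set>

// Stand-in for the real class; only used through pointers in this sketch.
struct ChunkedNodeGroup;

class Spiller {
public:
    // Every full chunked group from the partitioner is registered as spillable.
    void addUnusedChunk(ChunkedNodeGroup* group) {
        std::lock_guard lock{mtx};
        fullPartitionerGroups.insert(group);
    }

    // Called under memory pressure: pick any spillable group, write its column
    // chunk data to the spill file and free the in-memory copy.
    uint64_t spillOne() {
        std::lock_guard lock{mtx};
        if (fullPartitionerGroups.empty()) {
            return 0;
        }
        auto* group = *fullPartitionerGroups.begin();
        fullPartitionerGroups.erase(fullPartitionerGroups.begin());
        (void)group; // write the group's data to disk and free it here
        return 0;    // would return the number of bytes freed
    }

    // When a group is needed again it is removed from the spillable set first,
    // so no other thread can spill it while it is being loaded back.
    void loadBack(ChunkedNodeGroup* group) {
        {
            std::lock_guard lock{mtx};
            fullPartitionerGroups.erase(group);
        }
        // read the group's column chunks back from the spill file here
    }

private:
    std::mutex mtx;
    std::unordered_set<ChunkedNodeGroup*> fullPartitionerGroups;
};

// Spilling is only attempted when the buffer pool is full and at least half of
// it is used by column chunks, mirroring the 50% condition described above.
inline bool shouldSpill(uint64_t usedByColumnChunks, uint64_t bufferPoolUsed,
    uint64_t bufferPoolSize) {
    return bufferPoolUsed >= bufferPoolSize && usedByColumnChunks * 2 >= bufferPoolSize;
}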

Future work

This is currently only enabled when the database is using the LocalFileSystem and is not in in-memory mode. I feel like spilling to disk should not write to other types of filesystems, but it seems reasonable that a database stored on another filesystem could still spill to the local filesystem. That would be more complicated, though, and I don't know whether there will be much demand for it.

@benjaminwinger force-pushed the spill-to-disk branch 4 times, most recently from b76f823 to c6d7bf4, on September 10, 2024 14:55
@ray6080 (Contributor) commented on Sep 10, 2024

A few early comments based on what I've played around with so far.

  • We should remove the copy.ovf file at the end of the query. (Actually I think copy.tmp or just .tmp, considering we might spill other things in the future, is a better name, as we already reserve ovf for the overflow file.)
  • We should provide a configuration option for users to specify the path of the temp file. (We should limit the temp file path to the local file system only for now.)
  • I think we should eventually allow spilling under in-memory mode with a default .tmp file path (though it's hard to say how beneficial this can be, since if the buffer manager runs out of memory, spilling usually is not going to make the copy work in the end), but maybe disable spilling by default, so users have to turn it on manually and are aware that we will use their local file system.

@benjaminwinger force-pushed the spill-to-disk branch 2 times, most recently from 1ae3ae2 to bb58bdb, on September 11, 2024 17:00
codecov bot commented on Sep 11, 2024

Codecov Report

Attention: Patch coverage is 92.22222% with 14 lines in your changes missing coverage. Please review.

Project coverage is 88.21%. Comparing base (88af88b) to head (fe1d109).
Report is 13 commits behind head on master.

Files with missing lines Patch % Lines
src/include/main/settings.h 0.00% 5 Missing ⚠️
src/storage/store/column_chunk_data.cpp 86.95% 3 Missing ⚠️
src/common/in_mem_overflow_buffer.cpp 0.00% 2 Missing ⚠️
src/include/storage/store/column_chunk_data.h 66.66% 1 Missing ⚠️
src/storage/buffer_manager/buffer_manager.cpp 94.44% 1 Missing ⚠️
src/storage/buffer_manager/spiller.cpp 96.87% 1 Missing ⚠️
test/copy/copy_test.cpp 0.00% 1 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff            @@
##           master    #4188    +/-   ##
========================================
  Coverage   88.20%   88.21%            
========================================
  Files        1345     1347     +2     
  Lines       52544    52877   +333     
  Branches     6987     7025    +38     
========================================
+ Hits        46348    46645   +297     
- Misses       6024     6058    +34     
- Partials      172      174     +2     

@benjaminwinger (Collaborator, Author) commented:

> We should remove the copy.ovf file at the end of the query. (Actually I think copy.tmp or just .tmp, considering we might spill other things in the future, is a better name, as we already reserve ovf for the overflow file.)

Done.

> We should provide a configuration option for users to specify the path of the temp file. (We should limit the temp file path to the local file system only for now.)

I've set up the framework to pass this through the buffer manager, but how should it be configured?

> I think we should eventually allow spilling under in-memory mode with a default .tmp file path (though it's hard to say how beneficial this can be, since if the buffer manager runs out of memory, spilling usually is not going to make the copy work in the end), but maybe disable spilling by default, so users have to turn it on manually and are aware that we will use their local file system.

At the very least it would have to be enabled manually with the current defaults, since there would be no database directory to store it in. The only benefit I can see is that after compression the size may be smaller than it was during the copy. But it does seem to me that the easiest way to enable spilling to disk, if you're using in-memory mode and running out of memory, would be to switch to an on-disk database.

@benjaminwinger marked this pull request as ready for review on September 13, 2024 13:52

@ray6080 (Contributor) left a comment

Can you add some benchmark numbers for what you've done so far?

We should also add free-page management in the spiller as a future TODO item. Currently, we always append to the end of the spill file, but there should be cases where we can reuse pages.

src/include/processor/operator/partitioner.h (resolved)
src/include/storage/buffer_manager/spiller.h (resolved, outdated)
src/main/database.cpp (resolved, outdated)
src/processor/operator/persistent/rel_batch_insert.cpp (resolved, outdated)
// support multiple databases spilling at once (can't be the same file), and handle different
// platforms.
if (!main::DBConfig::isDBPathInMemory(databasePath) &&
dynamic_cast<LocalFileSystem*>(vfs->findFileSystem(spillToDiskPath))) {

ray6080 (Contributor):
Is it better to add an interface isSpillable() to vfs?

benjaminwinger (Collaborator, Author):
Maybe. But when would we want to use this feature for another type of filesystem?
I'm also not sure how we would name it. isSpillable sounds like a question of whether spilling is possible, but we're not skipping remote filesystems because it's impossible; we're skipping them because it's impractical, since the performance would be poor.

On the other hand, maybe it's not impractical if you're using fast network storage on your local network, so maybe there's an argument for allowing any filesystem when not using the default path (though I don't know why you wouldn't use something like NFS to treat it as a local filesystem in that case).

ray6080 (Contributor):

Hmm, I didn't make myself clear here. To clarify a bit more: I mainly meant whether that would be a better way to check if the vfs is local or not, compared to the check using dynamic_cast. Maybe the function should be isLocalFS().

benjaminwinger (Collaborator, Author):

I guess the virtual call would actually be faster, but I'm not sure it's really any better in general, particularly since this runs only once and I doubt the difference is significant (it would probably only matter if the inheritance tree were really large).
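
For illustration, a minimal sketch of the two approaches under discussion, using a simplified filesystem hierarchy (the class and method names here are placeholders, not Kuzu's actual VFS API):

// Hypothetical, simplified hierarchy for illustration only.
struct FileSystem {
    virtual ~FileSystem() = default;
    // Option B: a virtual query that states the intent directly.
    virtual bool isLocalFS() const { return false; }
};

struct LocalFileSystem : FileSystem {
    bool isLocalFS() const override { return true; }
};

bool canSpill(const FileSystem& fs) {
    // Option A: the RTTI check discussed above.
    return dynamic_cast<const LocalFileSystem*>(&fs) != nullptr;
    // Option B would instead be: return fs.isLocalFS();
}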

src/include/storage/buffer_manager/memory_manager.h (resolved, outdated)
src/include/storage/buffer_manager/memory_manager.h (resolved, outdated)
src/storage/store/chunked_node_group.cpp (resolved, outdated)

void ChunkedNodeGroup::loadFromDisk(MemoryManager& mm) {
mm.getBufferManager()->getSpillerOrSkip([&](auto& spiller) {
std::unique_lock lock{spillToDiskMutex};

ray6080 (Contributor):
Would it be better to assert on dataInUse here?

benjaminwinger (Collaborator, Author):

No. If the chunk hasn't already been spilled to disk, it's possible for it to be spilled while this is running, if another thread needs the memory.
dataInUse could actually be true here, since the last chunked group in each collection won't ever get marked as unused.

@@ -461,5 +464,36 @@ std::unique_ptr<ChunkedNodeGroup> ChunkedNodeGroup::deserialize(MemoryManager& m
return chunkedGroup;
}

void ChunkedNodeGroup::setUnused(MemoryManager& mm) {
dataInUse = false;

ray6080 (Contributor):
Should this be protected with a lock too?

benjaminwinger (Collaborator, Author):

I don't think it needs to be. The point of the lock is that once it gets added to the set of unused chunks any thread may attempt to spill it to disk, but before that point it should only be accessed by a single thread.

@benjaminwinger (Collaborator, Author) commented:

> Can you add some benchmark numbers for what you've done so far?

I don't have anything specific to hand; I'll update with some numbers later. Benchmarking this is a little tricky, though: apart from comparing against using a swapfile (which I did try once and found to be prohibitively slow by comparison), it's hard to do a meaningful comparison. When running with a restricted buffer pool size to trigger spilling, any dataset with a large hash index will suffer more from having to repeatedly re-read the hash index than from having to spill to disk. But on datasets with a small hash index, I did see a fairly reasonable loss of performance the more I restricted the buffer pool size and made it rely on spilling.

> We should also add free-page management in the spiller as a future TODO item. Currently, we always append to the end of the spill file, but there should be cases where we can reuse pages.

Yes. In practice everything is the same size anyway, so it should be easy enough to make that explicit and maintain a list of free regions which can be written to first.
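
For illustration, a minimal sketch of such a free-region list, assuming all spill regions have the same fixed size (the names and class shape here are hypothetical, not the PR's implementation):

#include <cstdint>
#include <mutex>
#include <vector>

class SpillFileRegions {
public:
    explicit SpillFileRegions(uint64_t regionSize) : regionSize{regionSize} {}

    // Returns a file offset to write to, reusing a freed region when possible,
    // otherwise appending at the end of the spill file.
    uint64_t allocateRegion() {
        std::lock_guard lock{mtx};
        if (!freeOffsets.empty()) {
            auto offset = freeOffsets.back();
            freeOffsets.pop_back();
            return offset;
        }
        auto offset = fileSize;
        fileSize += regionSize;
        return offset;
    }

    // Called once a spilled chunk has been loaded back and its region can be reused.
    void freeRegion(uint64_t offset) {
        std::lock_guard lock{mtx};
        freeOffsets.push_back(offset);
    }

private:
    uint64_t regionSize;
    uint64_t fileSize = 0;
    std::vector<uint64_t> freeOffsets;
    std::mutex mtx;
};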

Benchmark Result

Master commit hash: f1ee4f1ad13137a1a599ec2327f8d3eb6f3e15ab
Branch commit hash: 70f934b8ceee45be2622a00fea11541a48691aa8

Query Group Query Name Mean Time - Commit (ms) Mean Time - Master (ms) Diff
aggregation q24 651.21 659.23 -8.02 (-1.22%)
aggregation q28 15917.19 12069.38 3847.81 (31.88%)
filter q14 128.55 136.93 -8.38 (-6.12%)
filter q15 130.31 142.12 -11.81 (-8.31%)
filter q16 302.31 314.56 -12.25 (-3.89%)
filter q17 454.68 458.65 -3.97 (-0.87%)
filter q18 1958.20 1971.12 -12.92 (-0.66%)
fixed_size_expr_evaluator q07 547.72 557.28 -9.56 (-1.72%)
fixed_size_expr_evaluator q08 764.09 762.72 1.37 (0.18%)
fixed_size_expr_evaluator q09 767.88 766.90 0.98 (0.13%)
fixed_size_expr_evaluator q10 242.54 255.42 -12.88 (-5.04%)
fixed_size_expr_evaluator q11 238.12 249.86 -11.74 (-4.70%)
fixed_size_expr_evaluator q12 238.62 247.20 -8.58 (-3.47%)
fixed_size_expr_evaluator q13 1492.64 1486.02 6.62 (0.45%)
fixed_size_seq_scan q23 123.68 136.17 -12.50 (-9.18%)
join q31 12.40 11.43 0.97 (8.46%)
ldbc_snb_ic q35 464.70 469.77 -5.07 (-1.08%)
ldbc_snb_ic q36 41.20 40.96 0.23 (0.57%)
ldbc_snb_is q32 8.83 8.56 0.27 (3.21%)
ldbc_snb_is q33 15.22 14.79 0.43 (2.91%)
ldbc_snb_is q34 8.77 8.76 0.01 (0.09%)
multi-rel multi-rel-large-scan 1854.34 1873.21 -18.86 (-1.01%)
multi-rel multi-rel-lookup 53.45 76.09 -22.64 (-29.75%)
multi-rel multi-rel-small-scan 40.86 57.63 -16.77 (-29.10%)
order_by q25 134.83 146.79 -11.96 (-8.15%)
order_by q26 454.43 462.39 -7.96 (-1.72%)
order_by q27 1438.76 1417.01 21.75 (1.53%)
scan_after_filter q01 176.34 185.22 -8.89 (-4.80%)
scan_after_filter q02 164.80 172.82 -8.02 (-4.64%)
shortest_path_ldbc100 q37 3412.32 3463.46 -51.13 (-1.48%)
shortest_path_ldbc100 q39 60.18 73.29 -13.12 (-17.90%)
shortest_path_ldbc100 q40 76.65 75.73 0.92 (1.22%)
var_size_expr_evaluator q03 2060.59 2099.78 -39.20 (-1.87%)
var_size_expr_evaluator q04 2276.39 2292.68 -16.29 (-0.71%)
var_size_expr_evaluator q05 2546.83 2637.68 -90.85 (-3.44%)
var_size_expr_evaluator q06 1390.66 1430.13 -39.47 (-2.76%)
var_size_seq_scan q19 1469.47 1499.12 -29.65 (-1.98%)
var_size_seq_scan q20 3166.27 3182.87 -16.60 (-0.52%)
var_size_seq_scan q21 2434.81 2461.22 -26.41 (-1.07%)
var_size_seq_scan q22 130.21 130.79 -0.57 (-0.44%)

@benjaminwinger (Collaborator, Author) commented:

Benchmarks:
Copying the first 194 million relationships in the twitter-2010 dataset (the hash index is 1.1GB)

  • Unrestricted: ~12GB peak memory, ~23s copy time
  • Buffer pool size of 8GB: ~9GB peak memory, ~25s copy time, ~2.1GB spilled
  • Buffer pool size of 4GB: ~6GB peak memory, ~29s copy time, ~6.8GB spilled
  • Buffer pool size of 3GB: ~4GB peak memory, ~30s copy time, ~8GB spilled
  • Buffer pool size of 2GB runs out of memory
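
As a rough usage sketch of how such a restricted-buffer-pool run could be set up through the C++ API (assuming the SystemConfig struct exposes a public bufferPoolSize field and that the amalgamated header is named kuzu.hpp; the database path, table names, and CSV files below are hypothetical):

#include "kuzu.hpp" // assumed header name

using namespace kuzu::main;

int main() {
    // Restrict the buffer pool (e.g. to 4GB) so that a large rel table copy has
    // to spill; an unrestricted run would keep everything in memory.
    SystemConfig config;
    config.bufferPoolSize = 4ull * 1024 * 1024 * 1024; // assumption: bytes
    Database db("twitter-db" /* hypothetical path */, config);
    Connection conn(&db);
    // Hypothetical schema; copying the rel table is what exercises the
    // partitioner and, under memory pressure, the spiller.
    conn.query("CREATE NODE TABLE User(id INT64, PRIMARY KEY(id))");
    conn.query("CREATE REL TABLE Follows(FROM User TO User)");
    conn.query("COPY User FROM 'users.csv'");
    conn.query("COPY Follows FROM 'edges.csv'");
    return 0;
}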

I'm also running into a segfault in the database initialization when trying to load a database which failed to copy. It doesn't seem to be related to this PR.

@benjaminwinger force-pushed the spill-to-disk branch 4 times, most recently from 4c95d6c to ca4b137, on September 26, 2024 14:59
@benjaminwinger force-pushed the spill-to-disk branch 2 times, most recently from fe1c020 to a07e2aa, on October 2, 2024 09:45
github-actions bot commented on Oct 3, 2024

Benchmark Result

Master commit hash: fae65c679902ea73721784378e64b13421ff237a
Branch commit hash: a98545833f0acc2b772b429fe1aab669a4cc3883

Query Group Query Name Mean Time - Commit (ms) Mean Time - Master (ms) Diff
aggregation q24 650.51 761.44 -110.93 (-14.57%)
aggregation q28 11186.54 13960.87 -2774.33 (-19.87%)
filter q14 126.76 130.40 -3.63 (-2.79%)
filter q15 130.25 145.75 -15.49 (-10.63%)
filter q16 305.86 311.83 -5.97 (-1.91%)
filter q17 453.66 532.64 -78.98 (-14.83%)
filter q18 1967.32 2115.02 -147.70 (-6.98%)
fixed_size_expr_evaluator q07 544.31 548.37 -4.06 (-0.74%)
fixed_size_expr_evaluator q08 755.61 750.63 4.98 (0.66%)
fixed_size_expr_evaluator q09 753.74 753.58 0.15 (0.02%)
fixed_size_expr_evaluator q10 247.57 247.81 -0.24 (-0.10%)
fixed_size_expr_evaluator q11 239.95 245.69 -5.74 (-2.34%)
fixed_size_expr_evaluator q12 239.25 238.00 1.25 (0.53%)
fixed_size_expr_evaluator q13 1485.42 1503.25 -17.83 (-1.19%)
fixed_size_seq_scan q23 121.62 135.20 -13.59 (-10.05%)
join q29 657.79 674.37 -16.59 (-2.46%)
join q30 1440.06 1485.53 -45.47 (-3.06%)
join q31 10.31 12.99 -2.69 (-20.68%)
ldbc_snb_ic q35 434.97 460.53 -25.56 (-5.55%)
ldbc_snb_ic q36 41.20 42.29 -1.08 (-2.56%)
ldbc_snb_is q32 9.37 9.45 -0.07 (-0.77%)
ldbc_snb_is q33 18.89 15.72 3.18 (20.20%)
ldbc_snb_is q34 8.60 7.94 0.66 (8.29%)
multi-rel multi-rel-large-scan 1672.73 1898.41 -225.68 (-11.89%)
multi-rel multi-rel-lookup 74.85 71.08 3.77 (5.30%)
multi-rel multi-rel-small-scan 79.04 72.56 6.48 (8.93%)
order_by q25 129.30 138.36 -9.07 (-6.55%)
order_by q26 456.24 459.26 -3.02 (-0.66%)
order_by q27 1415.94 1438.87 -22.93 (-1.59%)
scan_after_filter q01 178.90 181.20 -2.29 (-1.27%)
scan_after_filter q02 164.49 168.73 -4.24 (-2.51%)
shortest_path_ldbc100 q37 3460.09 3792.50 -332.41 (-8.76%)
shortest_path_ldbc100 q38 78.67 68.74 9.93 (14.44%)
shortest_path_ldbc100 q39 242.45 53.25 189.20 (355.29%)
shortest_path_ldbc100 q40 85.04 84.78 0.26 (0.31%)
var_size_expr_evaluator q03 2073.86 2129.77 -55.91 (-2.63%)
var_size_expr_evaluator q04 2282.80 2289.23 -6.43 (-0.28%)
var_size_expr_evaluator q05 2620.16 2559.17 60.99 (2.38%)
var_size_expr_evaluator q06 1387.44 1406.79 -19.35 (-1.38%)
var_size_seq_scan q19 1480.81 1506.96 -26.15 (-1.74%)
var_size_seq_scan q20 3186.58 3061.94 124.64 (4.07%)
var_size_seq_scan q21 2452.38 2511.52 -59.13 (-2.35%)
var_size_seq_scan q22 130.19 133.08 -2.89 (-2.17%)

@andyfengHKU merged commit 4ba09ea into master on Oct 5, 2024 (24 of 25 checks passed)
@andyfengHKU deleted the spill-to-disk branch on October 5, 2024 02:35
ted-wq-x pushed a commit to ted-wq-x/kuzu that referenced this pull request Nov 14, 2024