Spill to disk #4188

Merged: 1 commit merged into master on Oct 5, 2024
Conversation

@benjaminwinger (Collaborator) commented on Sep 4, 2024

Based on #3743

This feature allows column chunk data produced by the partitioner during rel table copies to be written to disk and freed (to save memory) if the ChunkedNodeGroup in question will not be needed again until the end of the copy.
Every full ChunkedNodeGroup (in the InMemChunkedGroupCollection) is passed to the Spiller, which stores it in a locked unordered_set so it can be quickly removed once it is needed again (we could probably use an EvictionQueue instead, but I'm not convinced the performance benefit would be noticeable).

Data starts to be spilled to disk when the buffer pool is full and at least 50% of the buffer pool is used by column chunks (though this calculation includes chunks which aren't in the fullPartitionerGroups set and can't be spilled).

When the spilled data is needed again, it is loaded back into memory and removed from the fullPartitionerGroups set (which may cause other chunks to be spilled if there is not enough memory; ideally we would make sure to use the chunks already in memory first, but that would be complicated). The memory is then freed afterwards: originally all the chunks were held in memory until the end of RelBatchInsert execution, but that is no longer possible, so we now need mutable access to that data and move it out of the partitioner shared state while processing it so that it can be freed.
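
As a rough illustration of the flow described above, here is a simplified sketch; the class shape, member names, and the shouldSpill helper are illustrative reconstructions from this description, not the PR's actual code:

#include <cstdint>
#include <mutex>
#include <unordered_set>

// Stand-in for the real class; only used through pointers in this sketch.
struct ChunkedNodeGroup;

class Spiller {
public:
    // Every full chunked group from the partitioner is registered as spillable.
    void addUnusedChunk(ChunkedNodeGroup* group) {
        std::lock_guard lock{mtx};
        fullPartitionerGroups.insert(group);
    }

    // Called under memory pressure: pick any spillable group, write its column
    // chunk data to the spill file and free the in-memory copy.
    uint64_t spillOne() {
        std::lock_guard lock{mtx};
        if (fullPartitionerGroups.empty()) {
            return 0;
        }
        auto* group = *fullPartitionerGroups.begin();
        fullPartitionerGroups.erase(fullPartitionerGroups.begin());
        (void)group; // write the group's data to disk and free it here
        return 0;    // would return the number of bytes freed
    }

    // When a group is needed again it is removed from the spillable set first,
    // so no other thread can spill it while it is being loaded back.
    void loadBack(ChunkedNodeGroup* group) {
        {
            std::lock_guard lock{mtx};
            fullPartitionerGroups.erase(group);
        }
        // read the group's column chunks back from the spill file here
    }

private:
    std::mutex mtx;
    std::unordered_set<ChunkedNodeGroup*> fullPartitionerGroups;
};

// Spilling is only attempted when the buffer pool is full and at least half of
// it is used by column chunks, mirroring the 50% condition described above.
inline bool shouldSpill(uint64_t usedByColumnChunks, uint64_t bufferPoolUsed,
    uint64_t bufferPoolSize) {
    return bufferPoolUsed >= bufferPoolSize && usedByColumnChunks * 2 >= bufferPoolSize;
}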

Future work

This is currently only enabled when the database is using the LocalFileSystem and is not in in-memory mode. I feel like spilling to disk should not write to other types of filesystems, but it seems reasonable that a database stored on another filesystem could still spill to the local filesystem. That would be more complicated, though, and I don't know whether there will be much demand for it.

@benjaminwinger force-pushed the spill-to-disk branch 4 times, most recently from b76f823 to c6d7bf4, on September 10, 2024 14:55
@ray6080 (Contributor) commented on Sep 10, 2024

A few early comments based on what I've played around with so far.

  • We should remove the copy.ovf file at the end of the query. (Actually I think copy.tmp or just .tmp, considering we might spill other things in the future, is a better name, as we already reserve ovf for the overflow file.)
  • We should provide a configuration option for users to specify the path of the temp file. (We should limit the temp file path to the local file system only for now.)
  • I think we should eventually allow spilling under in-memory mode with a default .tmp file path (though it's hard to say how beneficial this can be, since if the buffer manager runs out of memory, spilling usually is not going to make the copy work in the end), but maybe disable spilling by default, so users have to turn it on manually and are aware that we will use their local file system.

@benjaminwinger force-pushed the spill-to-disk branch 2 times, most recently from 1ae3ae2 to bb58bdb, on September 11, 2024 17:00
codecov bot commented on Sep 11, 2024

Codecov Report

Attention: Patch coverage is 92.22222% with 14 lines in your changes missing coverage. Please review.

Project coverage is 88.21%. Comparing base (88af88b) to head (fe1d109).
Report is 13 commits behind head on master.

Files with missing lines Patch % Lines
src/include/main/settings.h 0.00% 5 Missing ⚠️
src/storage/store/column_chunk_data.cpp 86.95% 3 Missing ⚠️
src/common/in_mem_overflow_buffer.cpp 0.00% 2 Missing ⚠️
src/include/storage/store/column_chunk_data.h 66.66% 1 Missing ⚠️
src/storage/buffer_manager/buffer_manager.cpp 94.44% 1 Missing ⚠️
src/storage/buffer_manager/spiller.cpp 96.87% 1 Missing ⚠️
test/copy/copy_test.cpp 0.00% 1 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff            @@
##           master    #4188    +/-   ##
========================================
  Coverage   88.20%   88.21%            
========================================
  Files        1345     1347     +2     
  Lines       52544    52877   +333     
  Branches     6987     7025    +38     
========================================
+ Hits        46348    46645   +297     
- Misses       6024     6058    +34     
- Partials      172      174     +2     

@benjaminwinger (Collaborator, Author) commented:

> We should remove the copy.ovf file at the end of the query. (Actually I think copy.tmp or just .tmp, considering we might spill other things in the future, is a better name, as we already reserve ovf for the overflow file.)

Done.

> We should provide a configuration option for users to specify the path of the temp file. (We should limit the temp file path to the local file system only for now.)

I've set up the framework to pass this through the buffer manager, but how should it be configured?

> I think we should eventually allow spilling under in-memory mode with a default .tmp file path (though it's hard to say how beneficial this can be, since if the buffer manager runs out of memory, spilling usually is not going to make the copy work in the end), but maybe disable spilling by default, so users have to turn it on manually and are aware that we will use their local file system.

At the very least it would have to be enabled manually with the current defaults, since there would be no database directory to store it in. The only benefit I can see is that after compression the size may be smaller than it was during the copy. But it does seem to me that the easiest way to enable spilling to disk, if you're using in-memory mode and running out of memory, would be to switch to an on-disk database.

@benjaminwinger marked this pull request as ready for review on September 13, 2024 13:52

@ray6080 (Contributor) left a comment

Can you add some benchmark numbers for what you've done so far?

We should also add free-page management in the spiller as a future TODO item. Currently, we always append to the end of the spill file, but there should be cases where we can reuse pages.

src/include/processor/operator/partitioner.h (resolved)
src/include/storage/buffer_manager/spiller.h (resolved, outdated)
src/main/database.cpp (resolved, outdated)
src/processor/operator/persistent/rel_batch_insert.cpp (resolved, outdated)
// support multiple databases spilling at once (can't be the same file), and handle different
// platforms.
if (!main::DBConfig::isDBPathInMemory(databasePath) &&
dynamic_cast<LocalFileSystem*>(vfs->findFileSystem(spillToDiskPath))) {

ray6080 (Contributor):
Is it better to add an interface isSpillable() to vfs?

benjaminwinger (Collaborator, Author):
Maybe. But when would we want to use this feature for another type of filesystem?
I'm also not sure how we would name it. isSpillable sounds like a question of whether spilling is possible, but we're not skipping remote filesystems because it's impossible; we're skipping them because it's impractical, since the performance would be poor.

On the other hand, maybe it's not impractical if you're using fast network storage on your local network, so maybe there's an argument for allowing any filesystem when not using the default path (though I don't know why you wouldn't use something like NFS to treat it as a local filesystem in that case).

ray6080 (Contributor):

Hmm, I didn't make myself clear here. To clarify a bit more: I mainly meant whether that would be a better way to check if the vfs is local or not, compared to the check using dynamic_cast. Maybe the function should be isLocalFS().

benjaminwinger (Collaborator, Author):

I guess the virtual call would actually be faster, but I'm not sure it's really any better in general, particularly since this runs only once and I doubt the difference is significant (it would probably only matter if the inheritance tree were really large).
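
For illustration, a minimal sketch of the two approaches under discussion, using a simplified filesystem hierarchy (the class and method names here are placeholders, not Kuzu's actual VFS API):

// Hypothetical, simplified hierarchy for illustration only.
struct FileSystem {
    virtual ~FileSystem() = default;
    // Option B: a virtual query that states the intent directly.
    virtual bool isLocalFS() const { return false; }
};

struct LocalFileSystem : FileSystem {
    bool isLocalFS() const override { return true; }
};

bool canSpill(const FileSystem& fs) {
    // Option A: the RTTI check discussed above.
    return dynamic_cast<const LocalFileSystem*>(&fs) != nullptr;
    // Option B would instead be: return fs.isLocalFS();
}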

src/include/storage/buffer_manager/memory_manager.h (resolved, outdated)
src/include/storage/buffer_manager/memory_manager.h (resolved, outdated)
src/storage/store/chunked_node_group.cpp (resolved, outdated)

void ChunkedNodeGroup::loadFromDisk(MemoryManager& mm) {
mm.getBufferManager()->getSpillerOrSkip([&](auto& spiller) {
std::unique_lock lock{spillToDiskMutex};

ray6080 (Contributor):
Would it be better to assert on dataInUse here?

benjaminwinger (Collaborator, Author):

No. If the chunk hasn't already been spilled to disk, it's possible for it to be spilled while this is running, if another thread needs the memory.
dataInUse could actually be true here, since the last chunked group in each collection won't ever get marked as unused.

@@ -461,5 +464,36 @@ std::unique_ptr<ChunkedNodeGroup> ChunkedNodeGroup::deserialize(MemoryManager& m
return chunkedGroup;
}

void ChunkedNodeGroup::setUnused(MemoryManager& mm) {
dataInUse = false;

ray6080 (Contributor):
Should this be protected with a lock too?

benjaminwinger (Collaborator, Author):

I don't think it needs to be. The point of the lock is that once it gets added to the set of unused chunks any thread may attempt to spill it to disk, but before that point it should only be accessed by a single thread.

@benjaminwinger (Collaborator, Author) commented:

> Can you add some benchmark numbers for what you've done so far?

I don't have anything specific to hand; I'll update with some numbers later. Benchmarking this is a little tricky, though: apart from comparing against using a swapfile (which I did try once and found to be prohibitively slow by comparison), it's hard to do a meaningful comparison. When running with a restricted buffer pool size to trigger spilling, any dataset with a large hash index will suffer more from having to repeatedly re-read the hash index than from having to spill to disk. But on datasets with a small hash index, I did see a fairly reasonable loss of performance the more I restricted the buffer pool size and made it rely on spilling.

> We should also add free-page management in the spiller as a future TODO item. Currently, we always append to the end of the spill file, but there should be cases where we can reuse pages.

Yes. In practice everything is the same size anyway, so it should be easy enough to make that explicit and maintain a list of free regions which can be written to first.
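
For illustration, a minimal sketch of such a free-region list, assuming all spill regions have the same fixed size (the names and class shape here are hypothetical, not the PR's implementation):

#include <cstdint>
#include <mutex>
#include <vector>

class SpillFileRegions {
public:
    explicit SpillFileRegions(uint64_t regionSize) : regionSize{regionSize} {}

    // Returns a file offset to write to, reusing a freed region when possible,
    // otherwise appending at the end of the spill file.
    uint64_t allocateRegion() {
        std::lock_guard lock{mtx};
        if (!freeOffsets.empty()) {
            auto offset = freeOffsets.back();
            freeOffsets.pop_back();
            return offset;
        }
        auto offset = fileSize;
        fileSize += regionSize;
        return offset;
    }

    // Called once a spilled chunk has been loaded back and its region can be reused.
    void freeRegion(uint64_t offset) {
        std::lock_guard lock{mtx};
        freeOffsets.push_back(offset);
    }

private:
    uint64_t regionSize;
    uint64_t fileSize = 0;
    std::vector<uint64_t> freeOffsets;
    std::mutex mtx;
};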

Benchmark Result

Master commit hash: f1ee4f1ad13137a1a599ec2327f8d3eb6f3e15ab
Branch commit hash: 70f934b8ceee45be2622a00fea11541a48691aa8

Query Group Query Name Mean Time - Commit (ms) Mean Time - Master (ms) Diff
aggregation q24 651.21 659.23 -8.02 (-1.22%)
aggregation q28 15917.19 12069.38 3847.81 (31.88%)
filter q14 128.55 136.93 -8.38 (-6.12%)
filter q15 130.31 142.12 -11.81 (-8.31%)
filter q16 302.31 314.56 -12.25 (-3.89%)
filter q17 454.68 458.65 -3.97 (-0.87%)
filter q18 1958.20 1971.12 -12.92 (-0.66%)
fixed_size_expr_evaluator q07 547.72 557.28 -9.56 (-1.72%)
fixed_size_expr_evaluator q08 764.09 762.72 1.37 (0.18%)
fixed_size_expr_evaluator q09 767.88 766.90 0.98 (0.13%)
fixed_size_expr_evaluator q10 242.54 255.42 -12.88 (-5.04%)
fixed_size_expr_evaluator q11 238.12 249.86 -11.74 (-4.70%)
fixed_size_expr_evaluator q12 238.62 247.20 -8.58 (-3.47%)
fixed_size_expr_evaluator q13 1492.64 1486.02 6.62 (0.45%)
fixed_size_seq_scan q23 123.68 136.17 -12.50 (-9.18%)
join q31 12.40 11.43 0.97 (8.46%)
ldbc_snb_ic q35 464.70 469.77 -5.07 (-1.08%)
ldbc_snb_ic q36 41.20 40.96 0.23 (0.57%)
ldbc_snb_is q32 8.83 8.56 0.27 (3.21%)
ldbc_snb_is q33 15.22 14.79 0.43 (2.91%)
ldbc_snb_is q34 8.77 8.76 0.01 (0.09%)
multi-rel multi-rel-large-scan 1854.34 1873.21 -18.86 (-1.01%)
multi-rel multi-rel-lookup 53.45 76.09 -22.64 (-29.75%)
multi-rel multi-rel-small-scan 40.86 57.63 -16.77 (-29.10%)
order_by q25 134.83 146.79 -11.96 (-8.15%)
order_by q26 454.43 462.39 -7.96 (-1.72%)
order_by q27 1438.76 1417.01 21.75 (1.53%)
scan_after_filter q01 176.34 185.22 -8.89 (-4.80%)
scan_after_filter q02 164.80 172.82 -8.02 (-4.64%)
shortest_path_ldbc100 q37 3412.32 3463.46 -51.13 (-1.48%)
shortest_path_ldbc100 q39 60.18 73.29 -13.12 (-17.90%)
shortest_path_ldbc100 q40 76.65 75.73 0.92 (1.22%)
var_size_expr_evaluator q03 2060.59 2099.78 -39.20 (-1.87%)
var_size_expr_evaluator q04 2276.39 2292.68 -16.29 (-0.71%)
var_size_expr_evaluator q05 2546.83 2637.68 -90.85 (-3.44%)
var_size_expr_evaluator q06 1390.66 1430.13 -39.47 (-2.76%)
var_size_seq_scan q19 1469.47 1499.12 -29.65 (-1.98%)
var_size_seq_scan q20 3166.27 3182.87 -16.60 (-0.52%)
var_size_seq_scan q21 2434.81 2461.22 -26.41 (-1.07%)
var_size_seq_scan q22 130.21 130.79 -0.57 (-0.44%)

@benjaminwinger (Collaborator, Author) commented:

Benchmarks:
Copying the first 194 million relationships in the twitter-2010 dataset (the hash index is 1.1GB)

  • Unrestricted: ~12GB peak memory, ~23s copy time
  • Buffer pool size of 8GB: ~9GB peak memory, ~25s copy time, ~2.1GB spilled
  • Buffer pool size of 4GB: ~6GB peak memory, ~29s copy time, ~6.8GB spilled
  • Buffer pool size of 3GB: ~4GB peak memory, ~30s copy time, ~8GB spilled
  • Buffer pool size of 2GB runs out of memory
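
As a rough usage sketch of how such a restricted-buffer-pool run could be set up through the C++ API (assuming the SystemConfig struct exposes a public bufferPoolSize field and that the amalgamated header is named kuzu.hpp; the database path, table names, and CSV files below are hypothetical):

#include "kuzu.hpp" // assumed header name

using namespace kuzu::main;

int main() {
    // Restrict the buffer pool (e.g. to 4GB) so that a large rel table copy has
    // to spill; an unrestricted run would keep everything in memory.
    SystemConfig config;
    config.bufferPoolSize = 4ull * 1024 * 1024 * 1024; // assumption: bytes
    Database db("twitter-db" /* hypothetical path */, config);
    Connection conn(&db);
    // Hypothetical schema; copying the rel table is what exercises the
    // partitioner and, under memory pressure, the spiller.
    conn.query("CREATE NODE TABLE User(id INT64, PRIMARY KEY(id))");
    conn.query("CREATE REL TABLE Follows(FROM User TO User)");
    conn.query("COPY User FROM 'users.csv'");
    conn.query("COPY Follows FROM 'edges.csv'");
    return 0;
}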

I'm also running into a segfault in the database initialization when trying to load a database which failed to copy. It doesn't seem to be related to this PR.

@benjaminwinger force-pushed the spill-to-disk branch 4 times, most recently from 4c95d6c to ca4b137, on September 26, 2024 14:59
@benjaminwinger force-pushed the spill-to-disk branch 2 times, most recently from fe1c020 to a07e2aa, on October 2, 2024 09:45
github-actions bot commented on Oct 3, 2024

Benchmark Result

Master commit hash: fae65c679902ea73721784378e64b13421ff237a
Branch commit hash: a98545833f0acc2b772b429fe1aab669a4cc3883

Query Group Query Name Mean Time - Commit (ms) Mean Time - Master (ms) Diff
aggregation q24 650.51 761.44 -110.93 (-14.57%)
aggregation q28 11186.54 13960.87 -2774.33 (-19.87%)
filter q14 126.76 130.40 -3.63 (-2.79%)
filter q15 130.25 145.75 -15.49 (-10.63%)
filter q16 305.86 311.83 -5.97 (-1.91%)
filter q17 453.66 532.64 -78.98 (-14.83%)
filter q18 1967.32 2115.02 -147.70 (-6.98%)
fixed_size_expr_evaluator q07 544.31 548.37 -4.06 (-0.74%)
fixed_size_expr_evaluator q08 755.61 750.63 4.98 (0.66%)
fixed_size_expr_evaluator q09 753.74 753.58 0.15 (0.02%)
fixed_size_expr_evaluator q10 247.57 247.81 -0.24 (-0.10%)
fixed_size_expr_evaluator q11 239.95 245.69 -5.74 (-2.34%)
fixed_size_expr_evaluator q12 239.25 238.00 1.25 (0.53%)
fixed_size_expr_evaluator q13 1485.42 1503.25 -17.83 (-1.19%)
fixed_size_seq_scan q23 121.62 135.20 -13.59 (-10.05%)
join q29 657.79 674.37 -16.59 (-2.46%)
join q30 1440.06 1485.53 -45.47 (-3.06%)
join q31 10.31 12.99 -2.69 (-20.68%)
ldbc_snb_ic q35 434.97 460.53 -25.56 (-5.55%)
ldbc_snb_ic q36 41.20 42.29 -1.08 (-2.56%)
ldbc_snb_is q32 9.37 9.45 -0.07 (-0.77%)
ldbc_snb_is q33 18.89 15.72 3.18 (20.20%)
ldbc_snb_is q34 8.60 7.94 0.66 (8.29%)
multi-rel multi-rel-large-scan 1672.73 1898.41 -225.68 (-11.89%)
multi-rel multi-rel-lookup 74.85 71.08 3.77 (5.30%)
multi-rel multi-rel-small-scan 79.04 72.56 6.48 (8.93%)
order_by q25 129.30 138.36 -9.07 (-6.55%)
order_by q26 456.24 459.26 -3.02 (-0.66%)
order_by q27 1415.94 1438.87 -22.93 (-1.59%)
scan_after_filter q01 178.90 181.20 -2.29 (-1.27%)
scan_after_filter q02 164.49 168.73 -4.24 (-2.51%)
shortest_path_ldbc100 q37 3460.09 3792.50 -332.41 (-8.76%)
shortest_path_ldbc100 q38 78.67 68.74 9.93 (14.44%)
shortest_path_ldbc100 q39 242.45 53.25 189.20 (355.29%)
shortest_path_ldbc100 q40 85.04 84.78 0.26 (0.31%)
var_size_expr_evaluator q03 2073.86 2129.77 -55.91 (-2.63%)
var_size_expr_evaluator q04 2282.80 2289.23 -6.43 (-0.28%)
var_size_expr_evaluator q05 2620.16 2559.17 60.99 (2.38%)
var_size_expr_evaluator q06 1387.44 1406.79 -19.35 (-1.38%)
var_size_seq_scan q19 1480.81 1506.96 -26.15 (-1.74%)
var_size_seq_scan q20 3186.58 3061.94 124.64 (4.07%)
var_size_seq_scan q21 2452.38 2511.52 -59.13 (-2.35%)
var_size_seq_scan q22 130.19 133.08 -2.89 (-2.17%)

@andyfengHKU merged commit 4ba09ea into master on Oct 5, 2024 (24 of 25 checks passed)
@andyfengHKU deleted the spill-to-disk branch on October 5, 2024 02:35
ted-wq-x pushed a commit to ted-wq-x/kuzu that referenced this pull request Nov 14, 2024