Spill to disk #4188
A few early comments here based on what I've played around with so far.
Codecov Report

Attention: Patch coverage is

@@            Coverage Diff            @@
##           master    #4188    +/-   ##
=========================================
  Coverage   88.20%   88.21%
=========================================
  Files        1345     1347       +2
  Lines       52544    52877     +333
  Branches     6987     7025      +38
=========================================
+ Hits        46348    46645     +297
- Misses       6024     6058      +34
- Partials      172      174       +2

☔ View full report in Codecov by Sentry.
Done.
I've set up the framework to pass this through the buffer manager, but how should it be configured?
At the very least it would have to be configured manually with the current defaults, since there would be no database directory to store it in. The only benefit I can see is that after compression the size may be smaller than during the copy. But it does seem to me like the easiest way to enable spilling to disk if you're using in-memory mode and running out of memory would be to switch to an on-disk database.
Can you add some benchmark numbers for what you've done so far?
We should also add free-page management in the spiller as a future TODO item. Currently we always append to the end of the spill file, but there should be cases where we can reuse pages.
// support multiple databases spilling at once (can't be the same file), and handle different
// platforms.
if (!main::DBConfig::isDBPathInMemory(databasePath) &&
    dynamic_cast<LocalFileSystem*>(vfs->findFileSystem(spillToDiskPath))) {
Is it better to add an interface isSpillable() to vfs?
Maybe. But when would we want to use this feature for another type of filesystem?
I'm also not sure how we would describe it. isSpillable sounds more like a question of whether it's possible, but we're not skipping this for remote filesystems because it's impossible; we're skipping it because it's impractical, since the performance is poor.
On the other hand, maybe it's not impractical if you're using fast network storage on your local network, so maybe there's an argument for allowing any filesystem when not using the default path (though I don't know why you wouldn't use something like NFS to treat it as a local filesystem in that case).
Hmm, I didn't make myself clear here. To clarify a bit more: I mainly meant whether that is a better way to check if the vfs is local or not, compared to the check using dynamic_cast. Maybe the function should be isLocalFS().
I guess the virtual call would actually be faster, but I'm not sure it's really any better in general, particularly since this only runs once and I doubt the difference would be significant (it would probably only matter if the inheritance tree were really large).
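To make the trade-off concrete, here is a minimal sketch of what the virtual-method alternative could look like. The FileSystem/LocalFileSystem stand-ins and the isLocalFileSystem()/canSpillTo() names below are illustrative only, not the actual kuzu VFS API:

```cpp
// Illustrative stand-ins for the VFS classes discussed above.
struct FileSystem {
    virtual ~FileSystem() = default;
    // Defaults to false; only local filesystems override it.
    virtual bool isLocalFileSystem() const { return false; }
};

struct LocalFileSystem : FileSystem {
    bool isLocalFileSystem() const override { return true; }
};

bool canSpillTo(const FileSystem& fs) {
    // Capability query via a virtual call...
    return fs.isLocalFileSystem();
    // ...versus the dynamic_cast form used in the PR:
    // return dynamic_cast<const LocalFileSystem*>(&fs) != nullptr;
}
```

Either way the check runs once at startup, so the choice is mostly about readability rather than performance.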
void ChunkedNodeGroup::loadFromDisk(MemoryManager& mm) {
    mm.getBufferManager()->getSpillerOrSkip([&](auto& spiller) {
        std::unique_lock lock{spillToDiskMutex};
Would it be better to assert on dataInUse here?
No. If the chunk hasn't already been spilled to disk, it's possible for it to be spilled while this is running, if another thread needs the memory. dataInUse could actually be true here, since the last chunked group in each collection won't ever get marked as unused.
@@ -461,5 +464,36 @@ std::unique_ptr<ChunkedNodeGroup> ChunkedNodeGroup::deserialize(MemoryManager& m
    return chunkedGroup;
}

void ChunkedNodeGroup::setUnused(MemoryManager& mm) {
    dataInUse = false;
Should this be protected with a lock too?
I don't think it needs to be. The point of the lock is that once it gets added to the set of unused chunks any thread may attempt to spill it to disk, but before that point it should only be accessed by a single thread.
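To summarize the model being described, here is a sketch only: the field and method names mirror the snippets above, but the bodies are illustrative, not the actual implementation.

```cpp
#include <mutex>

// Illustrative skeleton of the locking model discussed above; real buffer and
// file I/O handling is omitted.
class ChunkedNodeGroupSketch {
public:
    // Called by the owning thread before the group is published to the
    // spiller's set of unused chunks, so no other thread can race with it yet
    // and no lock is needed.
    void setUnused() { dataInUse = false; }

    // May be called concurrently by any thread that needs memory once the
    // group is in the unused set, hence the lock.
    void spillToDisk() {
        std::unique_lock lock{spillToDiskMutex};
        if (!dataInUse && !spilled) {
            // ... write the column chunk data to the spill file and free it ...
            spilled = true;
        }
    }

    // Called when the data is needed again. It takes the same lock because a
    // concurrent spillToDisk may be racing with it. No assertion on dataInUse:
    // it can legitimately be true here, since the last chunked group in each
    // collection is never marked as unused.
    void loadFromDisk() {
        std::unique_lock lock{spillToDiskMutex};
        if (spilled) {
            // ... read the column chunk data back from the spill file ...
            spilled = false;
        }
    }

private:
    std::mutex spillToDiskMutex;
    bool dataInUse = true;
    bool spilled = false;
};
```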
I don't have anything specific to hand; I'll update with some later. Benchmarking this is a little tricky since, with the exception of comparing to using a swapfile (which I did try once and found to be prohibitively slow by comparison), it's hard to do a meaningful comparison. When running with a restricted buffer pool size to trigger this, any dataset with a large hash index will suffer more from having to repeatedly re-read the hash index than it will from having to spill to disk. But on datasets with a small hash index I did see a fairly reasonable loss of performance the more I restricted the buffer pool size and made it rely on spilling.
Yes. In practice everything is the same size anyway, so it should be easy enough to make that explicit and maintain a list of free regions which can be written to first.
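Assuming fixed-size spill regions as described above, the future free-list could be as simple as the following sketch; FreeRegionList and its interface are hypothetical and not part of this PR.

```cpp
#include <cstdint>
#include <mutex>
#include <vector>

// Hypothetical free-region tracker for a spill file made of fixed-size regions.
class FreeRegionList {
public:
    explicit FreeRegionList(uint64_t regionSize) : regionSize{regionSize} {}

    // Reuse a previously freed region if one exists, otherwise append to the
    // end of the file (the current behaviour of the spiller).
    uint64_t allocate(uint64_t& fileEndOffset) {
        std::unique_lock lock{mtx};
        if (!freeOffsets.empty()) {
            auto offset = freeOffsets.back();
            freeOffsets.pop_back();
            return offset;
        }
        auto offset = fileEndOffset;
        fileEndOffset += regionSize;
        return offset;
    }

    // Called once spilled data has been loaded back and its region is free.
    void release(uint64_t offset) {
        std::unique_lock lock{mtx};
        freeOffsets.push_back(offset);
    }

private:
    uint64_t regionSize;
    std::vector<uint64_t> freeOffsets;
    std::mutex mtx;
};
```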
Benchmark Result
Master commit hash:
Benchmarks:
I'm also running into a segfault in the database initialization when trying to load a database which failed to copy. It doesn't seem to be related to this PR.
Benchmark Result
Master commit hash:
(cherry picked from commit 4ba09ea)
Based on #3743
This feature allows column chunk data from the partitioner during rel table copies to be written to disk and freed (to save memory) if the ChunkedNodeGroup in question will not be needed again until the end of the copy.
Every full ChunkedNodeGroup (in the InMemChunkedGroupCollection) gets passed to the Spiller, which stores them in a locked unordered_set so they can be quickly removed once they are needed again (we could probably use an EvictionQueue to store them instead, but I'm not convinced the performance benefit would be noticeable).
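A rough sketch of that bookkeeping is below. The method names and the exact claim semantics are assumptions for illustration; the real Spiller also handles the spill-file I/O.

```cpp
#include <mutex>
#include <unordered_set>

class ChunkedNodeGroup; // only used by pointer in this sketch

// Illustrative sketch of the spill-candidate set described above.
class SpillerSketch {
public:
    // Register a full chunked group as a spill candidate.
    void addUnusedChunk(ChunkedNodeGroup* group) {
        std::unique_lock lock{mtx};
        fullPartitionerGroups.insert(group);
    }

    // Remove a group once it is needed again, so it won't be spilled while
    // it is being read.
    void clearUnusedChunk(ChunkedNodeGroup* group) {
        std::unique_lock lock{mtx};
        fullPartitionerGroups.erase(group);
    }

    // Pick a candidate to spill under memory pressure. Whether it stays in
    // the set after being spilled is an implementation detail not shown here.
    ChunkedNodeGroup* claimChunkToSpill() {
        std::unique_lock lock{mtx};
        if (fullPartitionerGroups.empty()) {
            return nullptr;
        }
        auto it = fullPartitionerGroups.begin();
        auto* group = *it;
        fullPartitionerGroups.erase(it);
        return group;
    }

private:
    std::mutex mtx;
    std::unordered_set<ChunkedNodeGroup*> fullPartitionerGroups;
};
```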
Data starts to be spilled to disk when the buffer pool is full and at least 50% of the buffer pool is used by column chunks (though this calculation includes chunks which aren't in the fullPartitionerGroups set and can't be spilled).
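Read concretely, the trigger looks roughly like the following; the function name and the exact accounting of the two quantities are assumptions based on the sentence above.

```cpp
#include <cstdint>

// Sketch of the trigger: spill only when the buffer pool is full and column
// chunks account for at least half of it.
bool shouldStartSpilling(uint64_t usedMemory, uint64_t bufferPoolSize,
    uint64_t memoryUsedByColumnChunks) {
    constexpr double kSpillThreshold = 0.5; // "at least 50% of the buffer pool"
    return usedMemory >= bufferPoolSize &&
           memoryUsedByColumnChunks >= kSpillThreshold * bufferPoolSize;
}
```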
When it comes time for the spilled data to be used, it gets loaded back into memory and removed from the fullPartitionerGroups set (which could cause other chunks to be spilled if there is not enough memory; ideally we would make sure to use all the chunks already in memory first, but that would be complicated). The memory is then freed afterwards. Originally all the chunks would be held in memory until the end of RelBatchInsert execution, but that's no longer possible, so we now need mutable access to that data and move it out of the partitioner shared state when processing it so that it can be freed.
Future work
This is currently only enabled when the database is using the LocalFileSystem and it is not in in-memory mode. I feel like spilling to disk should not write to other types of filesystems, but it seems reasonable that this could spill to the local filesystem for databases stored on other filesystems. That will be more complicated though, and I don't know if there will be much demand for it.