Clearing page cache on client machines #666

Closed
zcalusic opened this issue Nov 10, 2016 · 9 comments

@zcalusic
Member

I use restic to make backups of Linux system partitions. I noticed that restic is clearing the page cache on my client machines, and that causes a lot of trouble in many cases. Libraries, databases, everything that was cached before the restic run has to be pulled from disk and brought back into memory again after the backup passes.

The worst example is a machine where I have lots of RRD files. They're notorious for write storms, but at least they cache pretty well, so they don't have to be read from disk again and again. Now, with restic installed on that machine, a combination of read and write storms ensues after restic has passed, which is really bad.

The culprits are the two invocations of unix.Fadvise(..., unix.FADV_DONTNEED) in src/restic/fs/file_linux.go. So it seems only Linux is hit by this, and why Linux, which has the best memory management of them all?

If these calls need to stay, I would ask that we at least provide a switch to turn that really bad behavior OFF. If I buy memory to have a big page cache, I'd like to use it. I hope we all know that free memory on Linux is WASTED memory; I might as well pull out those DIMMs if they're not going to be used. :( Linux has very sophisticated algorithms for its page cache, which keep the most-needed file blocks in memory and get rid of those that are referenced only once (search for active/inactive lists). Right now, restic is basically defeating this nice mechanism with two superfluous calls. Let's fix that!
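
For context, the kind of call in question looks roughly like this (a minimal standalone sketch using golang.org/x/sys/unix, not restic's actual code; the file path is made up). FADV_DONTNEED tells the kernel it may evict the file's cached pages, and that affects every user of the file on the machine, not just the calling process:

```go
package main

import (
	"io"
	"log"
	"os"

	"golang.org/x/sys/unix"
)

// readAndDropCache reads a file and then advises the kernel to drop its
// cached pages, which is the behavior this issue complains about.
func readAndDropCache(path string) error {
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()

	// Stand-in for restic reading the data to be backed up.
	if _, err := io.Copy(io.Discard, f); err != nil {
		return err
	}

	// offset=0, length=0 means "the whole file".
	return unix.Fadvise(int(f.Fd()), 0, 0, unix.FADV_DONTNEED)
}

func main() {
	if err := readAndDropCache("/var/lib/example.rrd"); err != nil {
		log.Fatal(err)
	}
}
```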

@zcalusic
Member Author

This is issue #666. Yeah, it's that bad. :D

@wscott

wscott commented Nov 10, 2016

Yes, FADV_DONTNEED is not what we want. It tells the kernel not only that we don't need this piece of data again, but that no one will. For big-footprint server machines with lots of memory, this is a total performance killer.

Do a search: bup, attic, rsync and others have all had patches to add DONTNEED, and they were either never integrated or were removed later.

@fd0
Member

fd0 commented Nov 10, 2016

I agree that we at least need a switch to disable fadvise for reading data.

At the moment, restic also uses fadvise for data written to a local repository, e.g. during backup. This data will usually not be needed any more, and I think the Linux kernel can figure that out. Do we need to handle this case differently?
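
As a very rough sketch of what such a switch could look like (hypothetical code and option, not restic's actual implementation): a small file wrapper that only issues FADV_DONTNEED when the caller explicitly asked for the cache to be dropped.

```go
package fs

import (
	"os"

	"golang.org/x/sys/unix"
)

// File wraps an *os.File and remembers whether the caller asked for the
// page cache to be dropped when the file is closed.
type File struct {
	*os.File
	dropCache bool // would be wired to a hypothetical command-line option
}

// Close optionally drops the file's cached pages, then closes it.
func (f *File) Close() error {
	if f.dropCache {
		// Advisory only; ignore the error on a best-effort basis.
		_ = unix.Fadvise(int(f.Fd()), 0, 0, unix.FADV_DONTNEED)
	}
	return f.File.Close()
}
```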

@fd0 added the "category: backup" and "type: feature enhancement" (improving existing features) labels Nov 10, 2016
@zcalusic
Member Author

Thanks for the words of wisdom, @wscott, I really appreciate it. For a moment there it felt like I was a lone warrior for this important cause. :)

@fd0, the case of backend files is definitely not as problematic (compared to the client side), so I have no strong opinion about that.

But maybe there's still a slight preference to leave the cache alone there, too. By monitoring restic-server debug logs I've seen that blobs are accessed left and right, basically at random, and each of them is accessed many times for certain commands. So caching definitely helps this case, too. Think of keeping the index cached in memory, so that a backup of another machine (or several) can get to it faster, and so on. The Linux kernel's autotuning mechanisms are always there to help with the dirty work (reclaiming memory, etc.), and the kernel will always know better what to keep and what to throw out, because it sees the machine as a whole, while restic has no option but to guess.

I've already disabled the fadvise() calls in restic-server and measured a small improvement in both CPU utilization and access time when doing repeated backups.

@wscott

wscott commented Nov 10, 2016

I believe you will have people who run incremental backups hourly (or even more often), or machines that are dedicated backup machines. In both cases, flushing your data from the cache is probably a bad idea.

Just leave the kernel to do its thing.

@fd0 mentioned this issue Nov 10, 2016
@fd0
Member

fd0 commented Nov 10, 2016

I've decided to remove the fadvise code completely (see #670). Turns out, @zcalusic is completely right, and the Linux kernel is indeed much better at managing the fs cache.

Without the fadvise code, restic performs well, and as far as I can see the data that is read during backup and written to a (local) repo is only added to the Inactive File Page Cache, which is easily purged when new memory is required (in contrast to the Active File Page Cache).
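
A quick way to observe this on a running system (a small standalone sketch, not part of restic) is to print the Active(file) and Inactive(file) lines from /proc/meminfo before and after a backup run and compare them:

```go
package main

import (
	"bufio"
	"fmt"
	"log"
	"os"
	"strings"
)

func main() {
	f, err := os.Open("/proc/meminfo")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// Print only the two page cache lists that matter here.
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		line := scanner.Text()
		if strings.HasPrefix(line, "Active(file):") || strings.HasPrefix(line, "Inactive(file):") {
			fmt.Println(line)
		}
	}
	if err := scanner.Err(); err != nil {
		log.Fatal(err)
	}
}
```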

I also investigated what happened in the past that led me down this path of advising the kernel what to do with the file cache:

At the beginning of the project I decided to implement the chunking code (which splits files into chunks based on their content) in a way that would read a file and return the next chunk, but without retaining the data. Later on, restic read the data from the file again, according to the offsets of the chunks. This meant that the kernel saw two read requests for all data that was to be backed up, so it decided that this must be especially important data and cached it in the Active File Page Cache, which slowly filled up. This was the behavior I saw and which other users reported. Apparently, at the time, the machines I tested this on reacted by swapping out other memory in order to make more room for the Active File Page Cache. This caused problems, as you can imagine.

After a while I figured out that reading the data twice is a bad idea, mostly because the data may have changed between computing the chunk boundaries and saving the data to the repo. Retaining the just-read data after computing the chunks and their IDs was implemented in 77d85ce in February 2016. Shortly afterwards, in March, I added the code that uses fadvise to drop the file cache (008337a). My mistake was probably not checking again which page cache type the kernel used for the data; I only used htop and saw that the cache grew.
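
For illustration, the single-pass approach looks roughly like this, using the standalone github.com/restic/chunker package along the lines of its README example (not the historical restic code; the input file name is made up). Each chunk's data comes back from the chunker and is hashed immediately, so every byte is read from the file only once:

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"io"
	"log"
	"os"

	"github.com/restic/chunker"
)

func main() {
	f, err := os.Open("largefile.dat")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// The polynomial is the example value from the chunker documentation.
	c := chunker.New(f, chunker.Pol(0x3DA3358B4DC173))
	buf := make([]byte, 8*1024*1024)

	for {
		chunk, err := c.Next(buf)
		if err == io.EOF {
			break
		}
		if err != nil {
			log.Fatal(err)
		}
		// chunk.Data holds the bytes that were just read, so the ID can be
		// computed without reading this region of the file a second time.
		id := sha256.Sum256(chunk.Data)
		fmt.Printf("chunk at %d, length %d, id %02x\n", chunk.Start, chunk.Length, id)
	}
}
```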

So, long story short, the fadvise code will be gone shortly. I've learned a lot (especially: do more thorough research and benchmarks). Thanks @zcalusic for pointing it out and being so persistent. :)

@fd0 closed this as completed in #670 Nov 10, 2016
@zcalusic
Member Author

Thank you for taking this into consideration and for the quick fix!

And apologies if I was too persistent at times; it was always with good intentions. :)

The thing is, restic is way too nice a project to let it do things poorly, even in corner cases.

Thank you for all your great work on it, and of course for sharing it with us. 👍

@fd0
Member

fd0 commented Nov 11, 2016

I've just discovered https://github.com/Feh/nocache, which people can use to minimize cache effects for programs. Unfortunately, it relies on LD_PRELOAD to hook libc function calls, so it doesn't work for Go binaries (which do not use libc).

@zcalusic
Member Author

Yeah, right before submitting the ticket I tried to solve the issue with the LD_PRELOAD trick, but as you correctly noticed, it unfortunately fails for a Go program. That's when I became desperate. :)
