Interactions with MGLRU #122
Hi @LifeIsStrange
Well, feel free to either: 1) not spend the energy to answer, as you do not owe me anything (and I actually owe you many avoided complete system freezes induced by IntelliJ IDEA :)), or 2) give a quick and approximate/imperfect answer, or 3) write an entire article LOL
I will try to give responses gradually, as far as I have the strength.
Any time, no worries, it's no big deal :)
MGLRU is a revolutionary change. Revolutionary, because it changes the basics: the basics of vmscan and VM shrinking. I confirm that MGLRU allows you to increase performance under memory pressure. I've seen this happen in at least one of my experiments: I filled a list with random numbers until about half (or more) of the swap space (swap on zram) was full, then overwrote the elements of the list with new values [1]. This is a CPU-bound task that runs under memory pressure. In swap-intensive and energy-intensive tasks there is indeed 15-30% less energy consumption. To be continued (I have not described the concerns yet).
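The experiment described above can be sketched in a few lines of Python. This is a rough reconstruction, not the original script: `churn_under_pressure` is a hypothetical name, and the sizes here are tiny placeholders, whereas the real test filled roughly half of a zram-backed swap.

```python
import random

def churn_under_pressure(n_items, rounds):
    """Fill a list with random numbers, then repeatedly overwrite
    the elements with new values. With n_items chosen to exceed
    free RAM, this becomes a CPU-bound task running under memory
    pressure, as anonymous pages get pushed out to swap/zram."""
    data = [random.random() for _ in range(n_items)]
    for _ in range(rounds):
        for i in range(n_items):
            data[i] = random.random()  # touch every page again
    return data

# Demo with a tiny, safe size; the real experiment would pick
# n_items so the working set overflows into swap.
result = churn_under_pressure(1000, 2)
```

The interesting part is not the code itself but the access pattern: every element (hence every page) is touched on every round, which stresses the kernel's reclaim policy.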
Thanks, great answer :) BTW this Russian forum looks very interesting, thanks for sharing!
@hakavlad A cache is a performance-critical algorithm that is relatively simple to change and has a tremendous impact. http://highscalability.com/blog/2016/1/25/design-of-a-modern-cache.html I'm pretty sure someone could fairly easily make a C implementation out of this C++ implementation.
@mandreyel friendly ping, if you think my last comment makes sense: you might be able to make an extremely significant contribution to the Linux kernel, one that would increase Linux throughput under memory/swap pressure and simultaneously reduce energy consumption for billions of users, for the century.
@ben-manes sorry, I couldn't resist pinging you :) cf. my last comment. TL;DR: this is about replacing the Linux kernel LRU.
A few years back, I looked into Linux's page cache (mm/workingset.c) and tried to simulate its policy, DClock. The policy's configuration is a guess based on the total system memory and, from discussing with the authors, is pretty much arbitrary. They believe anyone who has paging issues can add more memory, so they only cared about and optimized for the case where the page cache is already 90%+ effective. The simulations show that there is healthy room for improvement. The authors of MGLRU say that "Multigeneration lru is not related to mm/workingset.c. The filename might be a bit misleading because it doesn't really track resident set size (in memory), only a recent eviction history." I'm not sure what the relationship is between the two policies; the Linux code is scattered and confusing.
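For readers unfamiliar with this kind of work, "simulating a policy" just means replaying an access trace through a model of the cache and counting hits. Here is a minimal sketch using plain LRU for brevity; `lru_hit_ratio` is an illustrative helper, not code from the simulator mentioned above.

```python
from collections import OrderedDict

def lru_hit_ratio(trace, capacity):
    """Replay an access trace through a plain LRU cache model and
    report the hit ratio -- the basic measurement a policy
    simulator produces for each eviction policy under test."""
    cache = OrderedDict()
    hits = 0
    for key in trace:
        if key in cache:
            hits += 1
            cache.move_to_end(key)          # promote to MRU position
        else:
            cache[key] = None
            if len(cache) > capacity:
                cache.popitem(last=False)   # evict the LRU entry
    return hits / len(trace)

# A looping scan of 6 keys through a 4-entry cache: a classic LRU
# pathology where every single access misses.
scan = list(range(6)) * 10
print(lru_hit_ratio(scan, 4))  # → 0.0
```

Running different policies over the same recorded traces is what makes claims like "healthy room for improvement" quantifiable.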
Very interesting, @ben-manes!
I unfortunately have no clue :m, and we are not alone
LIRS is an excellent algorithm, but the papers are very confusing, making it difficult to write a correct implementation. It is not a concurrent policy, but their ClockPro algorithm is. There are LWN articles on the kernel's attempts at ClockPro, but it seems a lack of a champion and code complexity led to them inventing their own inspired policy. Code maintainability is more important there than performance, so their choices make sense given what they were comfortable with. In none of their attempts did they seem to simulate beforehand, so code dumps from transcribing the reference material do seem very spooky.

Caffeine avoids the concurrency problems by using a lesser-known trick of replay buffers (also from the LIRS authors). That makes it much simpler to reason about and optimize the algorithms for a single-threaded context. The choice of W-TinyLFU vs LIRS was mostly a toss-up where maintainability made the choice: LIRS was very confusing to write/debug and it was harder to fix its deficiencies, whereas improving TinyLFU appeared to be straightforward engineering iteration. (Those problems were later fixed by the authors in LIRS2.) The Java simulator has my reimplementations, and the various papers are fun to read.

A lot of design choices come down to what the author is comfortable supporting, so if the algorithm is too magical or poorly described then it won't get much attention. Caffeine's approaches are easy to explain, implement, and debug, so it's gotten more adoption.
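The replay-buffer trick mentioned above can be sketched roughly as follows. This is a simplified single-file illustration under the assumption that reads merely record the key and a later batch replay updates the LRU order; real implementations (the BP-Wrapper approach, to my understanding, and Caffeine's buffers) use striped lock-free ring buffers and tolerate dropped reads. `BufferedLRU` is a hypothetical name.

```python
import threading
from collections import OrderedDict, deque

class BufferedLRU:
    """Sketch of the replay-buffer idea: a read records its key in a
    bounded buffer instead of reordering the LRU list under a lock;
    buffered accesses are then replayed into the policy in one batch,
    so the policy itself stays single-threaded and simple to reason
    about. A deque plus a lock keeps this sketch short."""

    def __init__(self, capacity, buffer_limit=16):
        self.capacity = capacity
        self.buffer_limit = buffer_limit
        self.cache = OrderedDict()
        self.read_buffer = deque()
        self.lock = threading.Lock()

    def get(self, key):
        value = self.cache.get(key)
        if value is not None:
            self.read_buffer.append(key)   # cheap: no reordering here
            if len(self.read_buffer) >= self.buffer_limit:
                self._drain()
        return value

    def put(self, key, value):
        with self.lock:
            self.cache[key] = value
            self.cache.move_to_end(key)
            if len(self.cache) > self.capacity:
                self.cache.popitem(last=False)   # evict LRU entry

    def _drain(self):
        with self.lock:                    # replay the whole batch at once
            while self.read_buffer:
                key = self.read_buffer.popleft()
                if key in self.cache:
                    self.cache.move_to_end(key)
```

The point of the design is that the hot read path only appends to a buffer; the expensive policy bookkeeping is amortized over a batch and never contended per-access.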
Great comment, as always.
Well, I have to somewhat disagree on that. Generally yes, but it shouldn't be a false dilemma: the Linux kernel should receive enough funding/human resources to maintain a SOTA cache implementation (it doesn't even have to be too complex, as Caffeine shows, but it could and should be afforded). This artificial scarcity of engineering talent (very few engineers read academic papers like you do, or are fluent in CPU intrinsics/SIMD) is, as said, artificial. It is a cultural issue and a hiring issue. All GAFAs depend on the critical hot paths of the kernel, and the Linux Foundation has a $177 million budget, of which only a mere 3% is allocated to the kernel. So, as said, it is an artificial, accidental scarcity that affects end users, but this phenomenon happens everywhere (major advances in NLP are ignored thanks to NIH syndrome; in medicine, many diseases including the aging process have found potent treatments since the 90s and keep being ignored, etc.).
You can try reading the paper and source code. The paper wasn't detailed enough, and I did not try to port their C++ code to the Java simulator. It would be nice to have that, though, as there should be useful insights to learn from it. At the time I ran a few of my tests through it after using the rewriter utility to convert formats. That led to their last commit, as it was originally unusable due to slow pruning. From the email thread with the authors, it matched on DS1, and I found it competitive in most workloads. Unfortunately, it failed on my adaptive stress test. They claimed to have a workload where Caffeine underperformed, but were bound by an NDA, and the data holders unfortunately refused to grant me access.

The ML-based policies are very slow, at best measuring per-hit operations in microseconds rather than a few nanoseconds. They do not test against a wide range of data, do not compare against modern non-ML policies, and have high resource overhead. In general they need to be trained, which makes them brittle to application evolution. Given the complexity, the lack of code, and the targeting of a non-engineering community, I don't have much of an opinion on their success. In that HN thread I mentioned the RC-Cache paper, which is the closest to TinyLFU and seems promising. In that case I think it would be an NN showing that one can make an admission policy smarter (e.g. latency-aware), and then a simpler algorithmic solution could be found that would likely outperform it. To me, ML caches are a good way to find out if the computer can brute-force its way to a smarter algorithm and then let engineers design one, but they are not what engineers would deploy directly.
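As background on the TinyLFU side of this comparison: its admission decision rests on a compact frequency sketch with periodic aging, so a newly evictable victim is only replaced by a candidate that has historically been hotter. Below is a minimal, untuned sketch of that idea; the class, parameters, and `admit` helper are illustrative, not Caffeine's actual API.

```python
import hashlib

class FrequencySketch:
    """Minimal count-min sketch with periodic halving ("aging"), the
    kind of compact popularity estimate a TinyLFU-style admission
    policy is built on. Sizes and hash choice are illustrative."""

    def __init__(self, width=1024, depth=4, sample=10_000):
        self.width = width
        self.depth = depth
        self.tables = [[0] * width for _ in range(depth)]
        self.additions = 0
        self.sample = sample              # how often to age the counters

    def _indexes(self, key):
        # Derive one column per row from slices of a single hash.
        digest = hashlib.sha256(repr(key).encode()).digest()
        for row in range(self.depth):
            chunk = int.from_bytes(digest[row * 4:row * 4 + 4], "little")
            yield row, chunk % self.width

    def increment(self, key):
        for row, col in self._indexes(key):
            self.tables[row][col] += 1
        self.additions += 1
        if self.additions >= self.sample:
            self._age()

    def estimate(self, key):
        # Count-min: the minimum over rows bounds the true count above.
        return min(self.tables[row][col] for row, col in self._indexes(key))

    def _age(self):
        """Halve every counter so stale popularity decays over time."""
        for table in self.tables:
            for col in range(self.width):
                table[col] //= 2
        self.additions //= 2

def admit(sketch, candidate, victim):
    """TinyLFU-style admission: keep whichever key is historically hotter."""
    return sketch.estimate(candidate) > sketch.estimate(victim)
```

The aging step is what keeps the history "recent": without it, a once-hot key would block admission forever.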
I can call @yuzhaogoogle. @yuzhaogoogle, what do you think about the possibility of replacing the LRU with W-TinyLFU in Linux?
@LifeIsStrange It's easy to ask any questions at linux-mm and LKML. Feel free to do it!
I'm not a programmer. I would like to propose this mission to @yuzhaogoogle. Probably Yu Zhao can explain why this is a bad or hard-to-implement idea.
It's not crazy: we set up a fresh environment just to rerun those CPU-intensive benchmarks with the latest kernel/patchset and found no actual regressions.
I don't want to discourage anybody, but to be honest, it's unrealistic given the amount of engineering effort required.
Hi @hakavlad, I have always much appreciated your various works for improving memory usage, system responsiveness (latency), and throughput.
That is why I was wondering: what are your thoughts, and maybe concerns, about the MGLRU work?
https://www.phoronix.com/scan.php?page=news_item&px=MGLRU-July-2022-Performance
As you can see from this slide
https://www.phoronix.net/image.php?id=2022&image=mglru_v6
Are all of their claims true? I mean, I doubt they compared their solution against combinations of your user-space solutions.
Also, I was wondering whether you'd be interested in developing or contributing to kernel-space solutions to the memory problem? Maybe you have feedback for them, or competing ideas?
Most importantly, I was wondering about the interactions, and whether MGLRU will make nohang and your other solutions (which ones?) obsolete on newer kernels with MGLRU?
Overall this seems like a great development and should improve the situation on the Linux desktop :) (as unfortunately distros were too mediocre to integrate your solutions for the mainstream)
Note: it's crazy that they claim no known regressions https://openbenchmarking.org/embed.php?i=2206299-NE-MGLRUBENC78&sha=82868b37d91bc50bf73b05b09e80a54e5877c702&p=2