tst_huge hangs with memory over 1GB. #1049
Comments
Not sure if this is the exact same one, but I am getting this stack trace when running tst-huge.so as of commit 97fe8aa with a manually applied patch to fix the >1GB issue:
Same with -m1.02G
OK, watch these two numbers in this run with 1.01G, with free_initial_memory_range calls:
The fault is at 0xffff800040a0d000, which is one page below the end of memory. Every time I run it I get the same stack trace, same page fault address, same other numbers - it's super repeatable.
Some kind of edge-condition bug in memory::page_range_allocator?
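For what it's worth, here is a generic sketch of the kind of off-by-one-page edge condition that could explain a fault pinned to the very last page of memory. This is not OSv's actual allocator code - the struct, function, and all constants except the page size are hypothetical:

```cpp
#include <cstdint>
#include <cstdio>
#include <cstdlib>

constexpr uintptr_t page_size = 0x1000;

// Hypothetical free-list header stored at the start of each free range.
struct page_range {
    uintptr_t size;  // length of the free range in bytes
};

// Sketch of the suspected pattern: the allocator computes the address of
// the last page of a free range (e.g. to merge or split it) and touches
// it. If `size` overshoots the truly mapped memory by one page, the touch
// lands on an unmapped page - and since the memory layout is identical on
// every boot, the fault address is identical on every run, matching the
// perfect repeatability observed above.
char* last_page_of(page_range* pr) {
    return reinterpret_cast<char*>(pr) + pr->size - page_size;
}

int main() {
    // Simulate a 16-page memory range with an ordinary heap buffer.
    void* mem = aligned_alloc(page_size, 16 * page_size);
    auto* pr = static_cast<page_range*>(mem);
    pr->size = 16 * page_size;   // correct size: last page is in bounds
    *last_page_of(pr) = 0;       // fine
    pr->size = 17 * page_size;   // off-by-one-page size: out of bounds
    // *last_page_of(pr) = 0;    // would touch memory past the range
    printf("last page at %p\n", static_cast<void*>(last_page_of(pr)));
    free(mem);
}
```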
@wkozaczuk your last comment describes exactly #1050, which I opened too. In #1050 I noticed that the crash happens before the application runs - it doesn't have anything to do with tst_huge. It may have the same cause - or a different cause - I don't know. Please check out #1050.
@nyh I did see it, but I was not sure which issue it belonged to. Another piece of evidence is that it started happening right after we applied my patch that changed arch_setup_free_memory() to start using memory below the kernel, which more and more suggests that both 1049 and 1050 started appearing at the same time. I actually added some detail to the email - https://groups.google.com/d/msg/osv-dev/N5Knl4HE25o/dl27EFM-BAAJ - which was related to both 1049 and 1050. Sorry I created more mess.
This issue was discovered by @wkozaczuk and reported on the mailing list. It is likely a recent regression, or an old bug which for an unknown reason got exposed by recent changes to the memory layout.
Running tst-huge.so with 1GB of memory works fine:
However, giving it more memory, like the default 2GB, results in a hang.
With gdb we can see the test thread hung on a deadlock:
The deadlock happens because map_anon (frame 33) takes vma_list_mutex.for_write(), then we get a page fault, and page_fault (frame 13) takes the same rwlock for reading. Apparently our rwlock implementation isn't recursive - which maybe we should reconsider, but that isn't the real bug here. The real bug is why we got this page fault in the first place. The page fault happens on address 0xffff800041dffff8:
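For illustration, here is a minimal self-contained sketch of that deadlock pattern, using C++17's std::shared_mutex as a stand-in for OSv's rwlock (both are non-recursive). The function names are hypothetical and only mirror the frames in the backtrace:

```cpp
#include <shared_mutex>

std::shared_mutex vma_list_mutex;  // stand-in for OSv's vma_list_mutex rwlock

void fake_page_fault() {
    // Like page_fault (frame 13): take the same rwlock for reading.
    // The lock is not recursive, so this waits forever behind the write
    // lock already held by this very thread - a self-deadlock.
    std::shared_lock read_guard(vma_list_mutex);
}

void map_anon_like() {
    // Like map_anon (frame 33): take the rwlock for writing.
    std::unique_lock write_guard(vma_list_mutex);
    // ... touching a faulting address here invokes the fault handler ...
    fake_page_fault();
    // never reached
}

int main() {
    map_anon_like();  // hangs, mirroring the gdb backtrace
}
```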
How is this address calculated?
So the problematic address is pr_end. It is calculated by taking pr and adding 1066004480 to it, which is almost one gigabyte. Does this make sense? Do we really have such a huge contiguous (in physical memory...) allocation? We need to debug this further.
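As a quick sanity check of that arithmetic: the names pr and pr_end and both constants come from the debug session above; everything else is just a worked computation.

```cpp
#include <cstdint>
#include <cstdio>

int main() {
    uintptr_t pr_end = 0xffff800041dffff8;  // the faulting address, i.e. pr_end
    uintptr_t size   = 1066004480;          // the suspiciously large size
    uintptr_t pr     = pr_end - size;       // implied start of the page range
    printf("pr   = %#lx\n", (unsigned long)pr);
    printf("size = %lu bytes = %.3f GB\n",
           (unsigned long)size, size / (1024.0 * 1024 * 1024));
    // Prints pr = 0xffff800002560ff8 and size = 0.993 GB: the allocator
    // believes it holds a single physically contiguous free range of
    // nearly a full gigabyte, which is what needs to be verified next.
}
```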
Another interesting observation: running tst-huge.so with -m2G or even -m1.1G fails as above. But if I run it with sizes only a tiny bit over 1G, it fails in more spectacular ways:
I'll open a new issue about this, because it happens before running the test, so it's not at all specific to tst-huge.so - although possibly it will end up that both issues stem from the same bug.