Contention on vma_list_mutex during tomcat benchmark #310

Closed
tgrabiec opened this issue May 20, 2014 · 8 comments

@tgrabiec
Member

I can see some slight contention on vma_list_mutex during the tomcat benchmark:

prof-lock trace over a 3-second period:

698.74 ms (100.00%, #4307) All
 lockfree::mutex::lock():87
 |-- 636.25 ms (91.06%, #867) lock_guard:414
 |    |-- 618.06 ms (88.45%, #508) lock_guard_for_with_lock:75
 |    |    |-- 487.12 ms (69.71%, #101) mmu::vm_fault(unsigned long, exception_frame*):1159
 |    |    |    page_fault:36
 |    |    |    ex_pf:87
 |    |    |    |-- 233.90 ms (33.47%, #52) 0x10000103720c

Trapping with GDB shows it comes from places like:

#1  0x0000000000368f05 in page_fault (ef=0xffffc0000d741038) at /data/tgrabiec/osv/arch/x64/mmu.cc:36
#2  <signal handler called>
#3  0x0000100000f7579a in Monitor::wait(bool, long, bool) ()
#4  0x000010000112eebc in VMThread::execute(VM_Operation*) ()
#5  0x0000100000fcb2f1 in ParallelScavengeHeap::mem_allocate(unsigned long, bool*) ()
#6  0x00001000010f6ee4 in typeArrayKlass::allocate_common(int, bool, Thread*) ()
#7  0x000010000103732a in OptoRuntime::new_array_C(klassOopDesc*, int, JavaThread*) ()

This comes from code that triggers a minor collection.
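
For illustration, the pattern the trace points at is roughly the following. This is a minimal sketch with hypothetical names, not the actual code in core/mmu.cc: every page fault takes one global mutex around the VMA lookup, so concurrent faults (for example from GC worker threads) serialize on it.

// Minimal sketch only -- not the actual OSv code in core/mmu.cc. It shows the
// shape of the contention: every page fault takes one global mutex around the
// VMA lookup, so concurrent faults from many CPUs serialize on it.
#include <cstdint>
#include <map>
#include <mutex>

struct vma { uintptr_t start, end; };

std::mutex vma_list_mutex;                 // one lock for the whole address space
std::map<uintptr_t, vma> vma_list;         // hypothetical container of mappings

void vm_fault(uintptr_t addr)
{
    std::lock_guard<std::mutex> guard(vma_list_mutex);   // the contended lock
    auto it = vma_list.upper_bound(addr);                 // find the covering VMA
    (void)it;
    // ... validate the access and populate the page tables, still holding
    // vma_list_mutex ...
}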

@gleb-cloudius
Contributor

Probably the same one that Pekka sees with Cassandra. Try adding
-XX:+UseMembar and see if it helps.

        Gleb.

@tgrabiec
Member Author

That seems to reduce contention significantly. prof-lock over a 10-second period:

 |    |    |-- 1.80 ms (0.95%, #7) mmu::vm_fault(unsigned long, exception_frame*)
 |    |    |    page_fault
 |    |    |    ex_pf

From:

#0  mmu::vm_fault (addr=<optimized out>, addr@entry=32440322752, ef=ef@entry=0xffffc0003b2f4038)
    at /data/tgrabiec/osv/core/mmu.cc:1160
#1  0x000000000036926a in page_fault (ef=0xffffc0003b2f4038) at /data/tgrabiec/osv/arch/x64/mmu.cc:37
#2  <signal handler called>
#3  0x0000100001012de9 in oopDesc* PSPromotionManager::copy_to_survivor_space<false>(oopDesc*) ()
#4  0x0000100001012458 in PSPromotionManager::drain_stacks_depth(bool) ()
#5  0x0000100000baa2e4 in CardTableExtension::scavenge_contents_parallel(ObjectStartArray*, MutableSpace*, HeapWord*, PSPromotionManager*, unsigned int, unsigned int) ()
#6  0x0000100001015aeb in OldToYoungRootsTask::do_it(GCTaskManager*, unsigned int) ()
#7  0x0000100000d31c02 in GCTaskThread::run() ()
#8  0x0000100000fa9a02 in java_start(Thread*) ()

@penberg
Contributor

penberg commented May 20, 2014

IIRC, @elcallio mentioned at some point that UseMembar has its own set of problems. Calle?

@glommer
Contributor

glommer commented May 20, 2014

Since we control which options we pass to the JVM, I wonder what the
effect of always passing this would be. It is certainly a lot easier
than fighting our vma list lock.

@gleb-cloudius
Contributor

I have a patchset that drops vma_list_mutex from the SIGSEGV path. I asked to
check UseMembar just to verify that this is the same problem.

        Gleb.

@elcallio
Contributor

So, the UseMembar flag affects how the HotSpot JVM does thread state transitions. Normally, with the flag off, each transition does a "pseudo-membar" by writing to a dedicated memory page which, when stopping threads, is temporarily protected to ensure that stopper and stoppee have the same (or close enough) idea of the thread's state.
UseMembar does exactly what it says: instead, each transition issues a membar equivalent (in the x86 case a locked add), which normally should be way more expensive, simply because transitions that happen while threads are being stopped should be in the overwhelming minority. (Note that on a single-CPU system the fence is skipped, however, so there it might be a good option...)

If you are seeing a really significant number of page faults in these transitions, it would seem to indicate that you are either GCing too much, or have some very bad behaviour with biased locks (which need to stop other threads in various contended cases). That you get better results on a multi-core system with fences enabled instead is somewhat scary.
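
For reference, a heavily simplified sketch of the two mechanisms, not HotSpot source; serialization_page, use_membar, publish_thread_state and vm_thread_stop_threads are made-up names. With -XX:-UseMembar a transition is a plain store plus a dummy write to a shared page, and the VM thread write-protects that page to force stragglers to fault; with -XX:+UseMembar the dummy write is replaced by a real fence.

// Heavily simplified illustration, not HotSpot code.
#include <atomic>
#include <sys/mman.h>

static void* serialization_page;   // shared page, mapped once at startup
static bool  use_membar;           // corresponds to -XX:+/-UseMembar

void publish_thread_state(std::atomic<int>& state, int new_state)
{
    state.store(new_state, std::memory_order_relaxed);
    if (use_membar) {
        // +UseMembar: a real full fence (on x86, a locked add).
        std::atomic_thread_fence(std::memory_order_seq_cst);
    } else {
        // -UseMembar: the "pseudo-membar" -- touch the shared page. Normally a
        // cheap store, but if the VM thread has write-protected the page this
        // traps, which is the page fault showing up in the backtraces above.
        *static_cast<volatile int*>(serialization_page) = 1;
    }
}

void vm_thread_stop_threads()
{
    if (!use_membar) {
        // Force any thread still inside publish_thread_state() to fault so its
        // state store becomes visible before we inspect it (4 KiB page assumed).
        mprotect(serialization_page, 4096, PROT_READ);
        mprotect(serialization_page, 4096, PROT_READ | PROT_WRITE);
    }
}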

@gleb-cloudius
Contributor

From what I see there are not a lot of SIGSEGVs, but because the SIGSEGV path takes
the same lock as #PF handling, and #PF handling may sleep while holding the lock
(yeah, we need to fix that too), those that we do have are very slow.

        Gleb.
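
To make that concrete, here is a hypothetical before/after sketch, not the actual patchset; lookup_vma and page_in_from_file are assumed helpers. If the fault handler can block on I/O while holding vma_list_mutex, even the rare SIGSEGV queues behind it; copying what is needed out of the VMA and dropping the lock before any blocking work keeps the critical section short.

// Hypothetical before/after sketch, not the actual OSv patchset.
#include <cstdint>
#include <mutex>

extern std::mutex vma_list_mutex;

struct vma_info { bool file_backed; };
vma_info lookup_vma(uintptr_t addr);          // assumed helper: walk the VMA list
void page_in_from_file(uintptr_t addr);       // assumed helper: may block on I/O

// Problem: the blocking page-in happens under the global lock, so even a rare
// SIGSEGV (e.g. the JVM's serialization page) waits behind disk I/O.
void handle_fault_slow(uintptr_t addr)
{
    std::lock_guard<std::mutex> guard(vma_list_mutex);
    vma_info v = lookup_vma(addr);
    if (v.file_backed)
        page_in_from_file(addr);              // sleeps while holding the mutex
}

// Mitigation: keep only the lookup under the lock and do the blocking work
// after releasing it (re-validating afterwards as needed).
void handle_fault_fast(uintptr_t addr)
{
    vma_info v;
    {
        std::lock_guard<std::mutex> guard(vma_list_mutex);
        v = lookup_vma(addr);                 // short, non-blocking critical section
    }
    if (v.file_backed)
        page_in_from_file(addr);              // blocking work outside the lock
}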

@gleb-cloudius
Contributor

Tomek, I think we can close this one.

        Gleb.
