The run time of mutex_lock() and mutex_unlock() is dominated by a single instruction, "lock xadd", which is generated by std::atomic::fetch_add().
On a single vCPU, the "lock" prefix isn't needed. Because the host is SMP, the hardware cannot simply ignore this prefix, but when the guest has a single vCPU we know it is unnecessary. If we drop the "lock" prefix and use the ordinary increment instruction, the mutex becomes much faster: an uncontended lock/unlock pair drops from 22ns to just 9ns. When mutexes are heavily used (e.g., in memcached they take as much as 20% of the run time), this can bring a noticeable improvement.
What we should do is remember where in the code we have the "lock" prefix (the single byte 0xf0), and when booting on a single vCPU, replace each one with a "nop" (0x90). Linux has such a mechanism (see asm/alternative.h): "LOCK_PREFIX" generates the "lock" instruction but also saves the address of this lock in a ".smp_locks" section, and any time the number of CPUs grows beyond 1 or shrinks to 1, the code iterates over these locations and changes them to 0xf0 or 0x90, respectively.
Doing the above would be easy if we implemented our own "fetch_add" and "compare_exchange" operations. However, we currently use C++11's std::atomic, and it would be a shame to lose its advantages (like working on any processor, not just x86). Perhaps there's a solution, though: GCC implements std::atomic in terms of the builtins __atomic_fetch_add and friends (see http://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html), so if we re-implement those, it may be enough. I tried to redefine this function and got some strange compilation errors, but maybe by re-"#define"-ing it before including <atomic>, or some other ugly trick, we can force our own implementation.
A different approach we can consider (though it will probably be more complex) is to remove the lock prefix from all code in a certain function or section. This would be hard and risky, though: we need to know where instructions begin and end, and what is code and what is data. It would be safer to limit this transformation to single functions (such as lockfree_mutex_lock()) which are known not to be problematic in this regard.