While trying to fix issue #1131, I discovered another weird symptom that ultimately led me to write this issue, which, by the way, has far more profound implications than just ZFS. In essence, after implementing the access flag fault handling logic described in one of the comments on #1131, I would see the build process of a larger ZFS image hang. Then I remembered seeing similar behavior when I originally tested ZFS support on ARM.
This time I kept digging until I ultimately pinpointed the problem to the ARC code in arc.c.
The problem would manifest in the ZFS ARC subsystem (which I do not understand much about, though :-)) getting into a state where anon_size (which is a function of arc_anon->arcs_size) would become so huge that the arc_tempreserve_space() function would repeatedly fail with the ERESTART error (I identified this spot by chasing where the ERESTART error - the symptom - was originating from). That led me to question how arc_anon->arcs_size became so huge. To that end, I added printouts in a few places in arc.c where arc_anon->arcs_size (which is an atomic) is modified, and I saw this interesting pattern (just a fragment):
arc_change_state: dec arc_anon->arcs_size=17121280 by -4096
arc_change_state: NEW arc_anon->arcs_size=17117184
arc_change_state: dec arc_anon->arcs_size=17117184 by -4096
arc_change_state: NEW arc_anon->arcs_size=17113088
arc_change_state: dec arc_anon->arcs_size=17113088 by -16384
arc_change_state: NEW arc_anon->arcs_size=17096704
arc_change_state: dec arc_anon->arcs_size=17096704 by -16384
arc_change_state: NEW arc_anon->arcs_size=17080320
arc_change_state: dec arc_anon->arcs_size=17080320 by -2048
arc_change_state: NEW arc_anon->arcs_size=17078272
arc_change_state: inc arc_anon->arcs_size=17078272 by 2048
arc_change_state: NEW arc_anon->arcs_size=17080320
arc_change_state: inc arc_anon->arcs_size=17080320 by 16384
-> arc_change_state: NEW arc_anon->arcs_size=34177024 // 34177024 = 2 * 17080320 + 16384 !WRONG!
arc_change_state: dec arc_anon->arcs_size=34177024 by -16384
...
arc_change_state: NEW arc_anon->arcs_size=34201600
arc_change_state: dec arc_anon->arcs_size=34201600 by -4096
arc_change_state: NEW arc_anon->arcs_size=68399104 // 68399104 = 2 * 34201600 - 4096 !WRONG!
arc_change_state: dec arc_anon->arcs_size=68399104 by -4096
arc_change_state: NEW arc_anon->arcs_size=136794112 // 136794112 = 2 * 68399104 - 4096 !WRONG!
arc_change_state: dec arc_anon->arcs_size=136794112 by -16384
...
arc_change_state: NEW arc_anon->arcs_size=1094135808
arc_change_state: dec arc_anon->arcs_size=1094135808 by -2048
arc_change_state: NEW arc_anon->arcs_size=2188269568
arc_change_state: inc arc_anon->arcs_size=2188269568 by 16384
...
As you can see above, arcs_size gets updated correctly by the delta most of the time, but every so often it roughly doubles, until it becomes really humongous. More precisely, I noticed a pattern where the new value of arcs_size becomes two times the old value plus/minus the delta, instead of simply the old value plus/minus the delta. And this would happen on a single CPU.
The arcs_size (a 64-bit atomic) gets updated by statements like this:
atomic_add_64(&arc_anon->arcs_size, delta);
Now, atomic_add_64 is actually a macro defined in bsd/aarch64/machine/atomic.h.
After analyzing the code in arc.c and eliminating all kinds of possibilities having to do with the weak memory model, logical bugs, etc., I landed on trying to understand why we get this buggy pattern and what might be causing it.
Then I focused on the atomic_fetchadd_long inlined assembly, which seemed to look right. In essence, it implements the typical old-school optimistic locking strategy to atomically add a value val to an 8-byte variable in memory addressed by p.
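The loop is structured roughly like this (an illustrative sketch using the register names from the example below, not the exact source):

```asm
1: ldaxr x2, [x0]     // load *p into x2 with acquire semantics, arm the exclusive monitor
   add   x1, x2, x1   // x1 initially holds val; the sum overwrites it
   stlxr w3, x1, [x0] // try to store the sum with release semantics; w3 = 0 on success
   cbnz  w3, 1b       // on failure, jump back to label 1 and retry
```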
The ldaxr instruction loads the value from the address given by its 2nd operand into the register given by its 1st operand with acquire semantics, and arms the exclusive monitor for the relevant fragment of memory.
The add adds the values of its 2nd and 3rd operands and stores the result in its 1st one.
The stlxr stores the value of its 2nd operand to the same address as in the 1st line with release semantics, and writes the success (0) or failure (1) status into its 1st operand.
If the stlxr above fails, it jumps back to the 1st line and tries again.
So what is wrong with this? It became clear to me after staring at the generated assembly for hours (:-))).
Let us run a simple example with *p containing 1000 and val equal to 16:
(A - no collision, success on 1st try, most common scenario)
The ldaxr loads 1000 into x2.
The add ends up storing 1016 (=1000 + 16) into x1 (x1 contained the original val).
The stlxr succeeds in storing 1016 in *p.
The loop stops.
(B - collision, failure on 1st try and success on 2nd, rare scenario)
The ldaxr loads 1000 into x2.
The add ends up storing 1016 (=1000 + 16) into x1 (x1 contained the original val).
The stlxr fails to store 1016 in *p.
The loop goes back to the 1st line.
The ldaxr loads 1000 into x2 again (the failed store changed nothing).
The add adds x2 to x1, but x1 no longer contains the original val - it holds the sum from the 1st attempt. So this time x1 ends up with 2016 (=1000 + 1016). !!! WRONG !!!
The stlxr succeeds in storing 2016 in *p.
The loop stops.
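To make the failure mode concrete, here is a plain-C simulation of the register dataflow in the buggy loop (a hypothetical model for illustration, not the real atomic_fetchadd_long; x1/x2 mirror the registers in the example, and "retries" models how many times the stlxr fails before succeeding):

```c
#include <assert.h>

/* Simulate the buggy retry loop: the sum is written into the same
 * "register" (x1) that holds val, so a retry re-adds the stale sum. */
unsigned long buggy_fetchadd(unsigned long *p, unsigned long val, int retries)
{
    unsigned long x1 = val; /* x1 holds val, but also receives the sum */
    unsigned long x2;
    do {
        x2 = *p;            /* ldaxr x2, [p]; *p is unchanged after a failed store */
        x1 = x2 + x1;       /* add x1, x2, x1; clobbers the original val */
    } while (retries-- > 0); /* a failed stlxr sends us back to the load */
    *p = x1;                /* the stlxr that finally succeeds */
    return x2;              /* fetchadd returns the old value */
}
```

With *p = 1000 and val = 16, zero retries leave *p at the expected 1016, but a single retry leaves it at 2016 (= 1000 + 1016) - exactly the near-doubling pattern seen in the log.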
So the bug is that we use the wrong register to hold the sum we are about to store with stlxr, and we only pay for it when a collision triggers a retry.
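A corrected version writes the sum into a scratch register so that val survives a retry (again an illustrative sketch, not the exact patch):

```asm
1: ldaxr x2, [x0]     // load *p with acquire semantics
   add   x3, x2, x1   // sum goes into scratch x3; x1 keeps the original val
   stlxr w4, x3, [x0] // attempt the store
   cbnz  w4, 1b       // retrying is now safe: x1 is unmodified
```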
Similarly, another inlined assembly function, atomic_fetchadd_int, has exactly the same problem.
This bug affects not only the ZFS code (beyond arc.c) but also some places in the networking stack that use those macros/functions either directly or indirectly. For example, refcount_release() uses atomic_fetchadd_int() and is very likely the real culprit behind issue #1190.
Now, beyond the obvious logical bug, both atomic_fetchadd_long and atomic_fetchadd_int seem to use a memory order that is too strong - the ldaxr/stlxr pair operates with acquire/release semantics. For simple atomic counting, it should be enough to use ldxr/stxr, as the current version of FreeBSD does. But maybe that should be addressed separately.
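For reference, a relaxed variant along the lines of what current FreeBSD does would simply drop the acquire/release qualifiers (a sketch under the same assumptions as above):

```asm
1: ldxr  x2, [x0]     // plain exclusive load, no acquire
   add   x3, x2, x1   // sum in a scratch register
   stxr  w4, x3, [x0] // plain exclusive store, no release
   cbnz  w4, 1b
```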
Relatedly, I have also discovered that there is an atomic_add_64() in bsd/sys/cddl/compat/opensolaris/kern/opensolaris_atomic.cc which does not seem to be used, because the declarations in bsd/sys/cddl/compat/opensolaris/sys/atomic.h are shadowed by the macros of the same name in the included machine/atomic.h. But this might be a separate issue, and I am also not sure of the exact intention behind the functions in opensolaris_atomic.cc.
I think our bsd/aarch64/machine/atomic.h came directly from FreeBSD as it was in 2014. I wonder if FreeBSD fixed this bug long ago, and whether we could fix it by updating to their more recent implementation.
Except that the atomic.h in the bsd/aarch64/machine directory does not come from FreeBSD directly. Instead, based on my archeology (look at the author and year in the comments), it is a copy of the x64 version of the same file in bsd/x64/machine. It looks like our aarch64 port pre-dates the FreeBSD one - the initial commit adding bsd/aarch64/machine is from Feb 24, 2014, whereas the initial commit adding atomic.h on the FreeBSD side (by the way, it seems to have never been under a machine directory) dates back to Mar 23, 2015. I think this applies to other headers under bsd/aarch64/machine as well. Finally, please note that the OSv aarch64 port did not have an aarch64 version of musl until I added it 2 years ago by upgrading musl (I think back in 2014 there was simply no aarch64 version of musl).
So I think we need to look at the current version of atomic.h in FreeBSD (it looks quite different in terms of structure), compare, and manually fix these functions as we find them wrong. For now, I will just fix these two inlined assembly functions.