
aarch64: atomic_fetchadd_int() and atomic_fetchadd_long() in machine/atomic.h sometimes yield incorrect results #1189

wkozaczuk commented Apr 18, 2022

While trying to fix issue #1131, I discovered another weird symptom that ultimately led me to open this issue, which by the way has far more profound implications than just ZFS. In essence, after implementing the access flag fault handling logic described in one of the comments of #1131, I would see the build process of a larger ZFS image hang. Then I remembered that I had seen similar behavior when I originally tested ZFS support on ARM.
This time I kept digging until I ultimately pinpointed the problem to this part of the ZFS code in arc.c:

3824 int
3825 arc_tempreserve_space(uint64_t reserve, uint64_t txg)
3826 {
3827         int error;
3828         uint64_t anon_size;
....
3849         anon_size = MAX((int64_t)(arc_anon->arcs_size - arc_loaned_bytes), 0);
....
3867         if (reserve + arc_tempreserve + anon_size > arc_c / 2 &&
3868             anon_size > arc_c / 4) {
3869                 dprintf("failing, arc_tempreserve=%lluK anon_meta=%lluK "
3870                     "anon_data=%lluK tempreserve=%lluK arc_c=%lluK\n",
3871                     arc_tempreserve>>10,
3872                     arc_anon->arcs_lsize[ARC_BUFC_METADATA]>>10,
3873                     arc_anon->arcs_lsize[ARC_BUFC_DATA]>>10,
3874                     reserve>>10, arc_c>>10);
3875                 return (ERESTART);
3876         }
3877         atomic_add_64(&arc_tempreserve, reserve);
3878         return (0);
3879 }

The problem would manifest in the ZFS ARC subsystem (which I do not understand much about :-)) getting into a state where anon_size (a function of arc_anon->arcs_size) would become so huge that arc_tempreserve_space() would repeatedly fail with the ERESTART error (I ended up identifying this place by chasing where the ERESTART error, i.e. the symptom, was originating from). That led me to question how arc_anon->arcs_size became so huge. To that end, I added printouts in a few places in arc.c where arc_anon->arcs_size (which is an atomic) is modified, and I saw this interesting pattern (just a fragment):

arc_change_state: dec arc_anon->arcs_size=17121280 by -4096
arc_change_state: NEW arc_anon->arcs_size=17117184
arc_change_state: dec arc_anon->arcs_size=17117184 by -4096
arc_change_state: NEW arc_anon->arcs_size=17113088
arc_change_state: dec arc_anon->arcs_size=17113088 by -16384
arc_change_state: NEW arc_anon->arcs_size=17096704
arc_change_state: dec arc_anon->arcs_size=17096704 by -16384
arc_change_state: NEW arc_anon->arcs_size=17080320
arc_change_state: dec arc_anon->arcs_size=17080320 by -2048
arc_change_state: NEW arc_anon->arcs_size=17078272
arc_change_state: inc arc_anon->arcs_size=17078272 by 2048
arc_change_state: NEW arc_anon->arcs_size=17080320
arc_change_state: inc arc_anon->arcs_size=17080320 by 16384
-> arc_change_state: NEW arc_anon->arcs_size=34177024 // 34177024 = 2 * 17080320  + 16384 !WRONG!
arc_change_state: dec arc_anon->arcs_size=34177024 by -16384
...
arc_change_state: NEW arc_anon->arcs_size=34201600
arc_change_state: dec arc_anon->arcs_size=34201600 by -4096
arc_change_state: NEW arc_anon->arcs_size=68399104 // 68399104 = 2 * 34201600 - 4096 !WRONG!
arc_change_state: dec arc_anon->arcs_size=68399104 by -4096
arc_change_state: NEW arc_anon->arcs_size=136794112 // 136794112 = 2 * 68399104 - 4096 !WRONG!
arc_change_state: dec arc_anon->arcs_size=136794112 by -16384
... 
arc_change_state: NEW arc_anon->arcs_size=1094135808
arc_change_state: dec arc_anon->arcs_size=1094135808 by -2048
arc_change_state: NEW arc_anon->arcs_size=2188269568
arc_change_state: inc arc_anon->arcs_size=2188269568 by 16384
...

As you can see above, arcs_size gets updated correctly by the delta most of the time, but every so often it almost doubles, until it eventually becomes really humongous. More precisely, there is a pattern where the new value of arcs_size becomes two times the old value plus/minus the delta, instead of simply the old value plus/minus the delta. And this would happen on a single CPU.

The arcs_size field (a 64-bit atomic) gets updated by statements like:

atomic_add_64(&arc_anon->arcs_size, delta);

Now atomic_add_64 is actually a macro defined in bsd/aarch64/machine/atomic.h:

static __inline u_long atomic_fetchadd_long(volatile u_long *p, u_long val)
{
    u_long result;
    u_int status;
    __asm __volatile("1: ldaxr %0, %1 ; "
                     "   add   %2, %2, %0 ; "
                     "   stlxr %w3, %2, %1 ; "
                     "   cbnz  %w3, 1b ; "
                     : "=&r"(result), "+Q"(*p), "+r"(val), "=&r"(status));

    return result;
}
...
static __inline void atomic_add_long(volatile u_long *p, u_long val)
{
    (void)atomic_fetchadd_long(p, val);
}
...
#define atomic_add_64           atomic_add_long
...

After analyzing the code in arc.c and eliminating all kinds of possibilities having to do with the weak memory model, logical bugs, and so on, I landed on trying to understand why we see this buggy pattern and what might be causing it.
Then I focused on the atomic_fetchadd_long inline assembly, which seemed to look right. In essence, it implements the typical old-school optimistic locking strategy to atomically add a value val to an 8-byte variable in memory addressed by p:

  1. The ldaxr instruction loads the value from the address given by the 2nd operand (*p) into the register given by the 1st operand (result) with acquire semantics, and sets the exclusive monitor for the relevant memory region.
  2. The add sums the values of its 2nd and 3rd operands and stores the result in its 1st operand.
  3. The stlxr stores the sum computed by the add to the same address as in the 1st step with release semantics, and writes success (0) or failure (1) into its status operand (w3 here).
  4. If the stlxr failed, the cbnz jumps back to the 1st step to try again.
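
For reference, the intended semantics are a plain fetch-and-add that returns the old value. Here is a minimal equivalent sketch in terms of the GCC/Clang __atomic builtins, only to pin down the expected behavior (the BSD header is, of course, not written this way):

#include <sys/types.h>  /* for u_long */

static __inline u_long fetchadd_long_ref(volatile u_long *p, u_long val)
{
    /* Atomically perform *p += val and return the value *p held before the add. */
    return __atomic_fetch_add(p, val, __ATOMIC_ACQ_REL);
}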

So what is wrong with this? It became clear to me only after staring for hours (:-)) at the generated assembly:

130     static __inline void atomic_add_long(volatile u_long *p, u_long val)
131     {
132         (void)atomic_fetchadd_long(p, val);
   0x0000000000000020 <+0>:     02 fc 5f c8     ldaxr   x2, [x0]
   0x0000000000000024 <+4>:     21 00 02 8b     add     x1, x1, x2
   0x0000000000000028 <+8>:     01 fc 03 c8     stlxr   w3, x1, [x0]
   0x000000000000002c <+12>:    a3 ff ff 35     cbnz    w3, 0x20 <test_atomic_add_64(uint64_t volatile*, int64_t)>

107         return result;
   0x0000000000000030 <+16>:    c0 03 5f d6     ret

Let us run a simple example with *p containing 1000 and val equal to 16:
(A - no collision, success on 1st try, most common scenario)

  1. The ldaxr loads 1000 into x2.
  2. The add ends up storing 1016 (=1000 + 16) into x1 (x1 contained original val).
  3. The stlxr succeeds to store 1016 in the *p.
  4. The loop stops.

(B - collision, failure on 1st and success on 2nd try, rare scenario)

  1. The ldaxr loads 1000 into x2.
  2. The add ends up storing 1016 (=1000 + 16) into x1 (x1 contained original val).
  3. The stlxr fails to store 1016 in *p.
  4. The loop goes back to 1.
  5. The ldaxr loads 1000 into x2 again.
  6. The add adds x2 to x1, which no longer contains the original val but rather the sum from the 1st attempt. So this time x1 ends up with 2016 (=1016 + 1000). !!! WRONG !!!
  7. The stlxr succeeds to store 2016 in the *p.
  8. The loop stops.

So the bug is that we use the wrong register to accumulate the sum that we are about to store with stlxr, and we only pay for it when a collision triggers a retry.
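
To make the failure mode concrete, here is a hypothetical two-thread reproducer sketch (aarch64 only, and assuming the atomic_fetchadd_long quoted above is in scope, e.g. by including the affected machine/atomic.h). Correct code always ends with the expected total; with the bug, any stlxr retry folds the previously computed sum into val and the counter overshoots:

#include <pthread.h>
#include <stdio.h>
#include <sys/types.h>

#define ITERS 1000000UL

static volatile u_long counter;

static void *worker(void *arg)
{
    (void)arg;
    for (u_long i = 0; i < ITERS; i++)
        (void)atomic_fetchadd_long(&counter, 1);  /* the buggy function above */
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter=%lu (expected %lu)\n", counter, 2 * ITERS);
    return 0;
}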

This assembly should work fine:

1:
ldaxr   x2, [x0]
add     x2, x2, x1
stlxr   w3, x2, [x0]
cbnz    w3, 1b
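
Folded back into the header, one possible corrected version of the whole function could look as follows (a sketch: it keeps the original's acquire/release flavor but accumulates the sum in a scratch register so that val survives a retry):

static __inline u_long atomic_fetchadd_long(volatile u_long *p, u_long val)
{
    u_long result, tmp;
    u_int status;
    __asm __volatile("1: ldaxr %0, %1 ; "      /* result = *p (acquire, exclusive) */
                     "   add   %2, %0, %4 ; "  /* tmp = result + val; val untouched */
                     "   stlxr %w3, %2, %1 ; " /* try to store tmp back (release) */
                     "   cbnz  %w3, 1b ; "     /* on failure, retry from a fresh load */
                     : "=&r"(result), "+Q"(*p), "=&r"(tmp), "=&r"(status)
                     : "r"(val));
    return result;
}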

Similarly, the inline assembly function atomic_fetchadd_int has the exact same problem.

This bug affects not only the ZFS code (beyond arc.c); some places in the networking stack also use these macros/functions, either directly or indirectly. For example, refcount_release() uses atomic_fetchadd_int() and is very likely the real culprit of issue #1190.

Now, beyond the obvious logical bug, both atomic_fetchadd_long and atomic_fetchadd_int seem to use a stronger memory order than necessary: ldaxr/stlxr operate with acquire/release semantics. For simple atomic counting, it should be enough to use ldxr/stxr, as the current version of FreeBSD does. But maybe that should be addressed separately.
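
For illustration, the weaker ordering is easy to express with the compiler builtins (a sketch, not something I am proposing to commit as-is):

static __inline u_long atomic_fetchadd_long_relaxed(volatile u_long *p, u_long val)
{
    /* __ATOMIC_RELAXED gives atomicity without acquire/release ordering;
     * compilers lower it on aarch64 to an ldxr/stxr loop (or an LSE ldadd
     * when available), which is all that simple counting needs. */
    return __atomic_fetch_add(p, val, __ATOMIC_RELAXED);
}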

Relatedly, I have also discovered that there is an atomic_add_64() in bsd/sys/cddl/compat/opensolaris/kern/opensolaris_atomic.cc which does not seem to be used, because the declarations in bsd/sys/cddl/compat/opensolaris/sys/atomic.h are overridden by the macros of the same name in the included machine/atomic.h. But this might be a separate issue, and I am also not sure of the exact intention behind the functions in opensolaris_atomic.cc.


nyh commented Apr 19, 2022

Interesting but sad :-(

I think our bsd/aarch64/machine/atomic.h came directly from FreeBSD as it was in 2014. I wonder whether FreeBSD fixed this bug long ago, and whether we could fix it by updating to their more recent implementation.

wkozaczuk commented

Except that the atomic.h in the bsd/aarch64/machine directory does not come from FreeBSD directly. Based on my archeology (look at the author and year in the comments), it is a copy of the x64 version of the same file in bsd/x64/machine. It looks like our aarch64 port pre-dates the FreeBSD one: the initial commit to add bsd/aarch64/machine is from Feb 24, 2014, whereas the initial commit to add atomic.h on the FreeBSD side (which, by the way, seems to have never been under a machine directory) dates back to Mar 23, 2015. I think this applies to other headers under bsd/aarch64/machine as well. Finally, please note that the OSv aarch64 port did not have an aarch64 version of musl until I added it two years ago by upgrading musl (I think back then in 2014 there was simply no aarch64 version of musl).
So I think we need to look at the current version of atomic.h in FreeBSD (it looks quite different in terms of structure), compare, and manually fix whatever we find to be wrong. For now, I will just fix these two inline assembly functions.
