SCAN, SSCAN, HSCAN, and ZSCAN are not stable across failovers #4

Open
madolson opened this issue Mar 22, 2024 · 3 comments

Comments

@madolson
Member

SCAN is supposed to provide the following guarantees (as per https://redis.io/docs/manual/keyspace/):

A full iteration always retrieves all the elements that were present in the collection from the start to the end of a full iteration. This means that if a given element is inside the collection when an iteration is started, and is still there when an iteration terminates, then at some point SCAN returned it to the user.

A full iteration never returns any element that was NOT present in the collection from the start to the end of a full iteration. So if an element was removed before the start of an iteration, and is never added back to the collection for all the time an iteration lasts, SCAN ensures that this element will never be returned.

While playing around with a PoC for cluster-wide scan, I realized that the first guarantee only holds as long as you stay connected to the same node for the entire duration of the scan. If a failover occurs, the replica may have a different seed value for the siphash function, so a cursor previously used on the primary would no longer point to the same place. This is likely less of an issue for SCAN, since you normally indicate which node you're talking to, but many CME clients transparently handle redirects for SSCAN/HSCAN/ZSCAN during failovers.

I'm not sure we strictly need to do anything about SCAN, since I'm not sure how often a SCAN is resumed after a failover. The other commands might need it, though.
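To make the failure mode concrete, here is a minimal, self-contained C sketch. FNV-1a mixed with a seed stands in for siphash, and the seed values are made up: the point is only that the bucket an element hashes to depends on the node's seed, so a reverse-binary cursor saved against the primary's table layout does not identify the same position on a replica that hashed with a different seed.

```
#include <stdint.h>
#include <stdio.h>

/* Stand-in for a seeded hash such as siphash (FNV-1a mixed with a seed). */
static uint64_t seeded_hash(const char *key, uint64_t seed) {
    uint64_t h = 0xcbf29ce484222325ULL ^ seed;
    for (; *key; key++) {
        h ^= (unsigned char)*key;
        h *= 0x100000001b3ULL;
    }
    return h;
}

int main(void) {
    const char *key = "user:1001";
    const uint64_t primary_seed = 0x1234567890abcdefULL; /* made up */
    const uint64_t replica_seed = 0xfedcba0987654321ULL; /* made up */
    const uint64_t mask = 15; /* 16-bucket table, for illustration */

    printf("bucket on primary: %llu\n",
           (unsigned long long)(seeded_hash(key, primary_seed) & mask));
    printf("bucket on replica: %llu\n",
           (unsigned long long)(seeded_hash(key, replica_seed) & mask));
    /* The indices will almost certainly differ, so a cursor that has
     * already covered this bucket on the primary has not necessarily
     * covered the key's bucket on the replica: the element can be
     * missed, violating guarantee 1. */
    return 0;
}
```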

I think this is worth addressing, and it could be done in one of three ways:

1. Simply update the documentation to add the caveat. This doesn't feel right to me, because most users won't really be aware of it.
2. Add a configuration so that seeds can be set externally. This would let operators opt into the consistency, but has limited benefit for those who don't know about it.
3. Allow replicas to sync their hash seed from their primaries, which makes their cursors consistent. We could also persist it into the RDB so that it stays accurate; I'm most in favor of this (a rough sketch follows below).
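A rough sketch of option 3, with entirely hypothetical names (Valkey has no "hash-seed" aux field, `rdb_save_hash_seed()`, or `replica_adopt_seed()` today): the primary emits its 16-byte siphash seed into the RDB preamble, and a replica adopts it before rebuilding its tables, so cursors keep their meaning across a failover.

```
#include <stdint.h>
#include <stdio.h>

#define SEED_LEN 16 /* siphash uses a 128-bit key */

/* Primary side: emit the seed as an RDB aux field (hypothetical format). */
static void rdb_save_hash_seed(FILE *rdb, const uint8_t seed[SEED_LEN]) {
    fprintf(rdb, "aux hash-seed ");
    for (int i = 0; i < SEED_LEN; i++) fprintf(rdb, "%02x", seed[i]);
    fprintf(rdb, "\n");
}

/* Replica side: adopt the primary's seed before rebuilding its tables,
 * so reverse-binary cursors line up across a failover. */
static void replica_adopt_seed(uint8_t dst[SEED_LEN],
                               const uint8_t src[SEED_LEN]) {
    for (int i = 0; i < SEED_LEN; i++) dst[i] = src[i];
}

int main(void) {
    uint8_t primary_seed[SEED_LEN] = {0}; /* would be random in practice */
    uint8_t replica_seed[SEED_LEN];
    rdb_save_hash_seed(stdout, primary_seed);
    replica_adopt_seed(replica_seed, primary_seed);
    return 0;
}
```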

ref: redis/redis#12440

We never came up with a cohesive answer here

@madolson
Member Author

There is a second issue that I want to make sure we don't drop: the cursor format is changing for SCAN in Redis 8 with the new per-slot dictionaries, from <64 bits for DB cursor> to 00<14 bits for slot><48 bits for DB cursor>. (To be specific, we aren't actually showing the bits, just the integer representation of those bits.) We should be able to detect versioning issues, i.e. using a cursor from a 7.0 node on an 8.0 node, so I would like to propose we update it to one of the following two options:

1. <2 version bits><14 bits for slot><48 bits for DB cursor>; we bump the version from 00 to 01. This is likely the most backwards compatible, and it also allows us to introduce a third version in the future if we want to reorganize the cursor bits. Going beyond 2 bits introduces the risk that users storing the cursor as a long long will break. (See the packing sketch below.)
2. -<14 bits for slot><48 bits for DB cursor>. The version will initially be 1, and we'll use the - to detect the new format. This gives us the most freedom to change the version further in the future.
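For reference, here is what packing and unpacking for option 1's layout could look like; a sketch of the proposal, not committed code:

```
#include <assert.h>
#include <stdint.h>

#define CURSOR_VERSION_SHIFT 62
#define CURSOR_SLOT_SHIFT    48
#define CURSOR_SLOT_MASK     ((1ULL << 14) - 1)
#define CURSOR_DB_MASK       ((1ULL << 48) - 1)

/* <2 version bits><14 slot bits><48 DB cursor bits> in one uint64_t. */
static uint64_t cursor_pack(uint64_t version, uint64_t slot, uint64_t db_cursor) {
    assert(version < 4 && slot <= CURSOR_SLOT_MASK && db_cursor <= CURSOR_DB_MASK);
    return (version << CURSOR_VERSION_SHIFT) |
           (slot << CURSOR_SLOT_SHIFT) |
           db_cursor;
}

static uint64_t cursor_version(uint64_t c) { return c >> CURSOR_VERSION_SHIFT; }
static uint64_t cursor_slot(uint64_t c)    { return (c >> CURSOR_SLOT_SHIFT) & CURSOR_SLOT_MASK; }
static uint64_t cursor_db(uint64_t c)      { return c & CURSOR_DB_MASK; }

int main(void) {
    /* Version bumps 0 -> 1, so an 8.0 node can reject a 7.x cursor. */
    uint64_t c = cursor_pack(1, 9189, 123456789);
    assert(cursor_version(c) == 1);
    assert(cursor_slot(c) == 9189);
    assert(cursor_db(c) == 123456789);
    return 0;
}
```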
@yossigo, I wasn't able to find our decision from the previous meeting. I know you were concerned with the format, which we agreed we should look into fixing, but I don't recall if you also wanted to try to make the SCAN command stable across failovers. (I still want to try to make HSCAN, ZSCAN, and SSCAN stable, so that they might return duplicate items but won't omit items because of the cursor shift.)

@zuiderkwast
Contributor

This is a great topic to finish. There are actually three things mentioned here:

  1. Scan across failovers: syncing the hashtable seed. We can do that in the RDB, or in some other way when initiating replication or before it. The seed may be sensitive security information, but if anyone has permission to replicate, I think it's fine to let them fetch the seed too.
  2. Cluster-wide scan: per slot, either explicitly or implicitly. You were against exposing the slot to the user, while others liked it.
  3. Cursor versioning: I don't see how you can use these bits though. Are these two LSBs unused in an old cursor?

naglera pushed a commit to naglera/placeholderkv that referenced this issue Apr 8, 2024
…is missed cases to redis-server. (#12322)

Observed that the sanitizer reported a memory leak, because cleanup is not done before process termination in the following negative cases:

**- when we pass '--invalid' as an option to redis-server:**

```
vm:~/mem-leak-issue/redis$ ./src/redis-server --invalid

*** FATAL CONFIG FILE ERROR (Redis 255.255.255) ***
Reading the configuration file, at line 2
>>> 'invalid'
Bad directive or wrong number of arguments

=================================================================
==865778==ERROR: LeakSanitizer: detected memory leaks

Direct leak of 8 byte(s) in 1 object(s) allocated from:
    #0 0x7f0985f65867 in __interceptor_malloc ../../../../src/libsanitizer/asan/asan_malloc_linux.cpp:145
    #1 0x558ec86686ec in ztrymalloc_usable_internal /home/ubuntu/mem-leak-issue/redis/src/zmalloc.c:117
    #2 0x558ec86686ec in ztrymalloc_usable /home/ubuntu/mem-leak-issue/redis/src/zmalloc.c:135
    #3 0x558ec86686ec in ztryrealloc_usable_internal /home/ubuntu/mem-leak-issue/redis/src/zmalloc.c:276
    #4 0x558ec86686ec in zrealloc /home/ubuntu/mem-leak-issue/redis/src/zmalloc.c:327
    #5 0x558ec865dd7e in sdssplitargs /home/ubuntu/mem-leak-issue/redis/src/sds.c:1172
    #6 0x558ec87a1be7 in loadServerConfigFromString /home/ubuntu/mem-leak-issue/redis/src/config.c:472
    #7 0x558ec87a13b3 in loadServerConfig /home/ubuntu/mem-leak-issue/redis/src/config.c:718
    #8 0x558ec85e6f15 in main /home/ubuntu/mem-leak-issue/redis/src/server.c:7258
    #9 0x7f09856e5d8f in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58

SUMMARY: AddressSanitizer: 8 byte(s) leaked in 1 allocation(s).

```

**- when we pass '--port' as an option but omit the port number:**

```
vm:~/mem-leak-issue/redis$ ./src/redis-server --port

*** FATAL CONFIG FILE ERROR (Redis 255.255.255) ***
Reading the configuration file, at line 2
>>> 'port'
wrong number of arguments

=================================================================
==865846==ERROR: LeakSanitizer: detected memory leaks

Direct leak of 8 byte(s) in 1 object(s) allocated from:
    #0 0x7fdcdbb1f867 in __interceptor_malloc ../../../../src/libsanitizer/asan/asan_malloc_linux.cpp:145
    #1 0x557e8b04f6ec in ztrymalloc_usable_internal /home/ubuntu/mem-leak-issue/redis/src/zmalloc.c:117
    #2 0x557e8b04f6ec in ztrymalloc_usable /home/ubuntu/mem-leak-issue/redis/src/zmalloc.c:135
    #3 0x557e8b04f6ec in ztryrealloc_usable_internal /home/ubuntu/mem-leak-issue/redis/src/zmalloc.c:276
    #4 0x557e8b04f6ec in zrealloc /home/ubuntu/mem-leak-issue/redis/src/zmalloc.c:327
    #5 0x557e8b044d7e in sdssplitargs /home/ubuntu/mem-leak-issue/redis/src/sds.c:1172
    #6 0x557e8b188be7 in loadServerConfigFromString /home/ubuntu/mem-leak-issue/redis/src/config.c:472
    #7 0x557e8b1883b3 in loadServerConfig /home/ubuntu/mem-leak-issue/redis/src/config.c:718
    #8 0x557e8afcdf15 in main /home/ubuntu/mem-leak-issue/redis/src/server.c:7258
    #9 0x7fdcdb29fd8f in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58

Indirect leak of 10 byte(s) in 1 object(s) allocated from:
    #0 0x7fdcdbb1fc18 in __interceptor_realloc ../../../../src/libsanitizer/asan/asan_malloc_linux.cpp:164
    #1 0x557e8b04f9aa in ztryrealloc_usable_internal /home/ubuntu/mem-leak-issue/redis/src/zmalloc.c:287
    #2 0x557e8b04f9aa in ztryrealloc_usable /home/ubuntu/mem-leak-issue/redis/src/zmalloc.c:317
    #3 0x557e8b04f9aa in zrealloc_usable /home/ubuntu/mem-leak-issue/redis/src/zmalloc.c:342
    #4 0x557e8b033f90 in _sdsMakeRoomFor /home/ubuntu/mem-leak-issue/redis/src/sds.c:271
    #5 0x557e8b033f90 in sdsMakeRoomFor /home/ubuntu/mem-leak-issue/redis/src/sds.c:295
    #6 0x557e8b033f90 in sdscatlen /home/ubuntu/mem-leak-issue/redis/src/sds.c:486
    #7 0x557e8b044e1f in sdssplitargs /home/ubuntu/mem-leak-issue/redis/src/sds.c:1165
    #8 0x557e8b188be7 in loadServerConfigFromString /home/ubuntu/mem-leak-issue/redis/src/config.c:472
    #9 0x557e8b1883b3 in loadServerConfig /home/ubuntu/mem-leak-issue/redis/src/config.c:718
    #10 0x557e8afcdf15 in main /home/ubuntu/mem-leak-issue/redis/src/server.c:7258
    #11 0x7fdcdb29fd8f in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58

SUMMARY: AddressSanitizer: 18 byte(s) leaked in 2 allocation(s).

```

Analysis found that `sdsfreesplitres()` is not called when these condition checks are hit.
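The shape of the leak and the fix, in a simplified stand-in (plain malloc/strdup instead of sds, and `load_config_line()`/`free_split_res()` are invented for illustration): the error path must free the split result before bailing out, mirroring the added `sdsfreesplitres()` call.

```
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static void free_split_res(char **argv, int argc) { /* ~ sdsfreesplitres() */
    if (!argv) return;
    while (argc--) free(argv[argc]);
    free(argv);
}

static int load_config_line(const char *line) {
    int argc = 1;
    char **argv = malloc(sizeof(char *)); /* ~ sdssplitargs() */
    argv[0] = strdup(line);

    if (strcmp(argv[0], "port") == 0) { /* "wrong number of arguments" */
        fprintf(stderr, ">>> '%s'\nwrong number of arguments\n", argv[0]);
        free_split_res(argv, argc); /* the fix: free before bailing out;
                                     * previously we exited here, leaking
                                     * argv and its tokens */
        return -1;
    }
    /* ... apply the directive ... */
    free_split_res(argv, argc);
    return 0;
}

int main(void) { return load_config_line("port") ? 1 : 0; }
```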

Output after the fix:


```
vm:~/mem-leak-issue/redis$ ./src/redis-server --invalid

*** FATAL CONFIG FILE ERROR (Redis 255.255.255) ***
Reading the configuration file, at line 2
>>> 'invalid'
Bad directive or wrong number of arguments
vm:~/mem-leak-issue/redis$

===========================================
vm:~/mem-leak-issue/redis$ ./src/redis-server --jdhg

*** FATAL CONFIG FILE ERROR (Redis 255.255.255) ***
Reading the configuration file, at line 2
>>> 'jdhg'
Bad directive or wrong number of arguments

---------------------------------------------------------------------------
vm:~/mem-leak-issue/redis$ ./src/redis-server --port

*** FATAL CONFIG FILE ERROR (Redis 255.255.255) ***
Reading the configuration file, at line 2
>>> 'port'
wrong number of arguments
```

Co-authored-by: Oran Agra <[email protected]>
naglera pushed a commit to naglera/placeholderkv that referenced this issue Apr 8, 2024
## Issues and solutions from #12817
1. Touching ProcessingEventsWhileBlocked and calling moduleCount() without the GIL in afterSleep()
    - Introduced: 
       Version: 7.0.0
       PR: #9963

   - Harm Level: Very High
     If the module thread calls `RM_Yield()` before the main thread enters afterSleep() and modifies `ProcessingEventsWhileBlocked` (+1), it will cause the main thread not to wait for the GIL, which can lead to all kinds of unforeseen problems, including memory data corruption.

   - Initial / Abandoned Solution:
     * Added a `__thread` specifier for `ProcessingEventsWhileBlocked`. It is used to protect against nested event processing, but event processing in the main thread and module threads should be completely independent and unaffected, so it is safer to use TLS.
     * Added a cached module count to keep track of the current number of modules, to avoid having to use `dictSize()`.
    
    - Related Warnings:
```
WARNING: ThreadSanitizer: data race (pid=1136)
  Write of size 4 at 0x0001045990c0 by thread T4 (mutexes: write M0):
    #0 processEventsWhileBlocked networking.c:4135 (redis-server:arm64+0x10006d124)
    #1 RM_Yield module.c:2410 (redis-server:arm64+0x10018b66c)
    #2 bg_call_worker <null>:83232836 (blockedclient.so:arm64+0x16a8)

  Previous read of size 4 at 0x0001045990c0 by main thread:
    #0 afterSleep server.c:1861 (redis-server:arm64+0x100024f98)
    #1 aeProcessEvents ae.c:408 (redis-server:arm64+0x10000fd64)
    #2 aeMain ae.c:496 (redis-server:arm64+0x100010f0c)
    #3 main server.c:7220 (redis-server:arm64+0x10003f38c)
```

2. aeApiPoll() is not thread-safe
When using RM_Yield to handle events in a module thread, if the main thread has not yet entered `afterSleep()`, both the module thread and the main thread may touch `server.el` at the same time.

    - Introduced: 
       Version: 7.0.0
       PR: #9963

   - Old / Abandoned Solution:
     Add a new mutex to protect the window between beforeSleep() and afterSleep().
     Defect: if the main thread enters the ae loop without any IO events, it will wait until the next timeout or until there is an event again, and the module thread will hang until the main thread leaves the event loop.

    - Related Warnings:
```
SUMMARY: ThreadSanitizer: data race ae_kqueue.c:55 in addEventMask
==================
==================
WARNING: ThreadSanitizer: data race (pid=14682)
  Write of size 4 at 0x000100b54000 by thread T9 (mutexes: write M0):
    #0 aeApiPoll ae_kqueue.c:175 (redis-server:arm64+0x100010588)
    #1 aeProcessEvents ae.c:399 (redis-server:arm64+0x10000fb84)
    #2 processEventsWhileBlocked networking.c:4138 (redis-server:arm64+0x10006d3c4)
    #3 RM_Yield module.c:2410 (redis-server:arm64+0x10018b66c)
    #4 bg_call_worker <null>:16042052 (blockedclient.so:arm64+0x169c)

  Previous write of size 4 at 0x000100b54000 by main thread:
    #0 aeApiPoll ae_kqueue.c:175 (redis-server:arm64+0x100010588)
    #1 aeProcessEvents ae.c:399 (redis-server:arm64+0x10000fb84)
    #2 aeMain ae.c:496 (redis-server:arm64+0x100010da8)
    #3 main server.c:7238 (redis-server:arm64+0x10003f51c)
```

## The final fix, as described in the comments:
redis/redis#12817 (comment)
Optimized solution based on the above comment:

First, we add `module_gil_acquring` to indicate whether the main thread is currently in the GIL-acquiring state.

When the module thread starts to yield, there are two possibilities (we assume the caller keeps the GIL):
1. The main thread is between beforeSleep() and afterSleep(), that is, `module_gil_acquring` is not 1 yet. At this point, the module thread will wake up the main thread through the pipe and leave the yield, waiting for the next yield, when the main thread may already be in the GIL-acquiring state.

2. The main thread is in the GIL-acquiring state. The module thread releases the GIL, yielding the CPU to give the main thread an opportunity to start event processing, and then acquires the GIL again once the main thread releases it. This is the direction mentioned in redis/redis#12817 (comment).
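A conceptual pthread sketch of that yield logic; the flag and helper names below mirror the description above rather than the actual source:

```
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>

static pthread_mutex_t gil = PTHREAD_MUTEX_INITIALIZER;
/* Set by the main thread right before it blocks on the GIL. */
static atomic_int main_thread_acquiring_gil = 0;

static void wake_main_thread_via_pipe(void) { /* write to the event pipe */ }

/* Called by a module thread that holds the GIL and wants to let the
 * main thread run events (the RM_Yield case). */
static void module_yield(void) {
    if (!atomic_load(&main_thread_acquiring_gil)) {
        /* Case 1: main thread is between beforeSleep() and afterSleep().
         * Nudge it out of poll and return; it will start acquiring the
         * GIL, so a later yield hits case 2. */
        wake_main_thread_via_pipe();
        return;
    }
    /* Case 2: main thread is blocked acquiring the GIL. Release it,
     * give up the CPU so the main thread can grab it and process
     * events, then take it back. */
    pthread_mutex_unlock(&gil);
    sched_yield();
    pthread_mutex_lock(&gil);
}

int main(void) {
    pthread_mutex_lock(&gil);
    module_yield(); /* case 1: just signals the (absent) main loop */
    pthread_mutex_unlock(&gil);
    return 0;
}
```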

---------

Co-authored-by: Oran Agra <[email protected]>
@xingbowang

I talked with Madelyn offline. It seems the root cause of this issue is that a different hash seed is used on different nodes. I propose a boot-time config parameter, "hash-seed-override", to override the randomly generated seed in the engine. This keeps the default behavior unchanged; meanwhile, anyone who wants to fix this issue can add the config to their engine's bootstrap configuration. Since the bootstrap configuration is used in every deployment of Valkey, it should not be a big effort to add an extra config parameter at boot time. The format could be something like:

```
config set hash-seed-override $SEED_IN_UINT64_T
```

Sample:

```
config set hash-seed-override 1234567
```
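One way the override could be wired in at boot, as a hedged sketch: the 64-to-128-bit expansion below is just one possible choice, "hash-seed-override" is proposed rather than existing, and `apply_seed_override()` is invented for illustration.

```
#include <stdint.h>
#include <string.h>

#define SEED_LEN 16 /* siphash expects a 128-bit key */

static void apply_seed_override(uint64_t override, uint8_t seed[SEED_LEN]) {
    /* Duplicate the 64-bit value into both halves of the 128-bit key;
     * any fixed expansion would do, as long as every node agrees. */
    memcpy(seed, &override, sizeof(override));
    memcpy(seed + sizeof(override), &override, sizeof(override));
    /* seed would then be handed to the dict layer in place of the
     * randomly generated one, e.g. via dictSetHashFunctionSeed(). */
}

int main(void) {
    uint8_t seed[SEED_LEN];
    apply_seed_override(1234567ULL, seed); /* the sample value above */
    return 0;
}
```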

enjoy-binbin added a commit that referenced this issue Aug 14, 2024
We are updating this variable in the main thread, while child threads can be printing logs at the same time. This generates a warning under SANITIZER=thread:
```
WARNING: ThreadSanitizer: data race (pid=74208)
  Read of size 4 at 0x000102875c10 by thread T3:
    #0 serverLogRaw <null>:52173615 (valkey-server:x86_64+0x10003c556)
    #1 _serverLog <null>:52173615 (valkey-server:x86_64+0x10003ca89)
    #2 bioProcessBackgroundJobs <null>:52173615 (valkey-server:x86_64+0x1001402c9)

  Previous write of size 4 at 0x000102875c10 by main thread (mutexes: write M0):
    #0 afterSleep <null>:52173615 (valkey-server:x86_64+0x10004989b)
    #1 aeProcessEvents <null>:52173615 (valkey-server:x86_64+0x100031e52)
    #2 main <null>:52173615 (valkey-server:x86_64+0x100064a3c)
    #3 start <null>:52173615 (dyld:x86_64+0xfffffffffff5c365)
    #4 start <null>:52173615 (dyld:x86_64+0xfffffffffff5c365)
```

The refresh of daylight_active is not real-time; we update it in afterSleep(), so we don't need strong synchronization and can use memory_order_relaxed. Note also that we only do load/store operations on daylight_active, which is an aligned 32-bit integer, so using memory_order_relaxed will not provide more consistency than what we have today.

So this is just a cleanup to clear the warning.

Signed-off-by: Binbin <[email protected]>
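For illustration, the pattern described in the commit message, written with portable C11 atomics and POSIX localtime_r; this is the shape of the change, not the literal patch:

```
#include <stdatomic.h>
#include <time.h>

static _Atomic int daylight_active;

/* Main thread, in afterSleep(): refresh the cached DST flag. Relaxed
 * ordering is enough because the value is advisory and self-contained. */
static void update_daylight_active(void) {
    struct tm tm;
    time_t t = time(NULL);
    localtime_r(&t, &tm);
    atomic_store_explicit(&daylight_active, tm.tm_isdst, memory_order_relaxed);
}

/* Any thread, e.g. in serverLogRaw(): read the cached flag without a
 * data race being reported by ThreadSanitizer. */
static int get_daylight_active(void) {
    return atomic_load_explicit(&daylight_active, memory_order_relaxed);
}

int main(void) {
    update_daylight_active();
    (void)get_daylight_active();
    return 0;
}
```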