scaling test panicked at 'overwrites an existing key' #8199

Closed · Tracked by #6640
wangrunji0408 opened this issue Feb 27, 2023 · 17 comments
Labels: type/bug (Something isn't working)

@wangrunji0408 (Contributor)

Describe the bug

--- STDERR:              risingwave_simulation::nexmark_recovery nexmark_recovery_q103 ---
thread '<unnamed>' panicked at 'overwrites an existing key!
table_id: 1015, vnode: 137, key: OwnedRow([Some(Int64(2100))])
value in storage: OwnedRow([Some(Int64(2100))])
value to write: OwnedRow([Some(Int64(2100))])', /risingwave/src/stream/src/common/table/state_table.rs:875:13

Found in #7623.
https://buildkite.com/risingwavelabs/pull-request/builds/18318#018691bc-bbb4-40e2-88d4-f4c9e1f52db2/116-197
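For context, the panic comes from a consistency check in the state table write path: an insert is rejected if storage already holds a value for the same key, because that means the streaming operator's view of its state has diverged from storage. A minimal, self-contained sketch of that kind of check (the types and names below are simplified stand-ins, not RisingWave's actual `StateTable` API):

```rust
use std::collections::HashMap;

/// Simplified stand-in for a state table keyed by (vnode, primary key).
struct StateTableSketch {
    table_id: u32,
    storage: HashMap<(u16, Vec<i64>), Vec<i64>>,
}

impl StateTableSketch {
    /// The streaming operator believes `key` is new. If storage already holds
    /// a value for it, streaming state and storage have diverged, so panic
    /// with the same kind of message as in the report above.
    fn insert(&mut self, vnode: u16, key: Vec<i64>, value: Vec<i64>) {
        if let Some(old) = self.storage.get(&(vnode, key.clone())) {
            panic!(
                "overwrites an existing key!\ntable_id: {}, vnode: {}, key: {:?}\nvalue in storage: {:?}\nvalue to write: {:?}",
                self.table_id, vnode, key, old, value
            );
        }
        self.storage.insert((vnode, key), value);
    }
}

fn main() {
    let mut table = StateTableSketch { table_id: 1015, storage: HashMap::new() };
    table.insert(137, vec![2100], vec![2100]);
    // Calling `table.insert(137, vec![2100], vec![2100])` again would hit the panic.
}
```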

To Reproduce

No response

Expected behavior

No response

Additional context

No response

@wangrunji0408 wangrunji0408 added the type/bug Something isn't working label Feb 27, 2023
@github-actions github-actions bot added this to the release-0.1.18 milestone Feb 27, 2023
@yezizp2012 (Member)

Cc @yuhao-su, PTAL.

@yuhao-su (Contributor) commented Mar 2, 2023

I can't reproduce the "overwrites an existing key" bug on my MBP.

But I encountered a result-mismatch bug instead. The diff varies in each run:

thread '<unnamed>' panicked at 'assertion failed: `(left == right)`

Diff < left / right > :
...
...


', src/tests/simulation/tests/it/nexmark_recovery.rs:51:32
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
note: run with `MADSIM_TEST_SEED=1677741584723559058` environment variable to reproduce this error

@yuhao-su (Contributor) commented Mar 2, 2023

I also tried changing the following config to in-memory and the test passed, so is it possible there is a Hummock bug? Cc @hzxa21 @wenym1, PTAL.

https://github.com/singularity-data/risingwave/blob/f671f09be0bcaa045c2812c444486dec1323e44f/src/tests/simulation/src/cluster.rs#L247
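For reference, a rough sketch of the experiment described above, with purely hypothetical names (the real knob lives at the `cluster.rs` line linked above): the idea is to switch the simulated cluster's state store from Hummock to an in-memory backend, taking Hummock out of the picture.

```rust
/// Hypothetical stand-in for the state-store choice in the simulation
/// cluster config; not RisingWave's actual API.
#[derive(Debug, Clone, Copy)]
enum StateStoreBackend {
    /// Default: exercises the full Hummock storage path.
    Hummock,
    /// Experiment: a purely in-memory state store (the test passed with this).
    InMemory,
}

fn state_store_label(backend: StateStoreBackend) -> &'static str {
    // Illustrative labels only; the real config uses its own string format.
    match backend {
        StateStoreBackend::Hummock => "hummock",
        StateStoreBackend::InMemory => "in-memory",
    }
}

fn main() {
    // Swapping this one choice is the whole experiment described above.
    println!("state store: {}", state_store_label(StateStoreBackend::InMemory));
}
```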

@soundOfDestiny (Contributor)

But there is no such panic at /risingwave/src/stream/src/common/table/state_table.rs:875:13.

@yezizp2012 (Member)

> But there is no such panic at /risingwave/src/stream/src/common/table/state_table.rs:875:13.

It was found in #7623; the branch wrj/streaming-recovery-test hasn't merged the main branch since the last run.

@soundOfDestiny (Contributor)

> It was found in #7623; the branch wrj/streaming-recovery-test hasn't merged the main branch since the last run.

I switched to wrj/streaming-recovery-test but still see `let prefix_serializer = self.pk_serde.prefix(pk_prefix.len());` at /risingwave/src/stream/src/common/table/state_table.rs:875:13.

@yezizp2012 (Member)

> I switched to wrj/streaming-recovery-test but still see `let prefix_serializer = self.pk_serde.prefix(pk_prefix.len());` at /risingwave/src/stream/src/common/table/state_table.rs:875:13.

The log of the last failed run is as follows:

thread '<unnamed>' panicked at 'assertion failed: old_value == &old_value_in_inner', src/storage/src/mem_table.rs:279:33
stack backtrace:
   0: rust_begin_unwind
             at /rustc/5ce39f42bd2c8bca9c570f0560ebe1fce4eddb14/library/std/src/panicking.rs:575:5
   1: core::panicking::panic_fmt
             at /rustc/5ce39f42bd2c8bca9c570f0560ebe1fce4eddb14/library/core/src/panicking.rs:64:14
   2: core::panicking::panic
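For context, this assertion fires when an update's claimed old value does not match what the mem-table has already buffered for that key in the current epoch. A simplified, self-contained sketch of that kind of consistency check (illustrative names, not the actual `mem_table.rs` code):

```rust
use std::collections::BTreeMap;

/// Simplified stand-in for a mem-table buffering un-flushed key operations.
#[derive(Debug, Clone, PartialEq)]
enum KeyOp {
    Insert(Vec<u8>),
    Update { old: Vec<u8>, new: Vec<u8> },
}

struct MemTableSketch {
    buffer: BTreeMap<Vec<u8>, KeyOp>,
}

impl MemTableSketch {
    /// The streaming operator passes the old value it believes is current.
    /// If the key already has a buffered op in this epoch, the buffered value
    /// must agree with `old_value`; a mismatch means streaming state and
    /// storage have diverged.
    fn update(&mut self, key: Vec<u8>, old_value: Vec<u8>, new_value: Vec<u8>) {
        match self.buffer.get_mut(&key) {
            Some(KeyOp::Insert(old_value_in_inner))
            | Some(KeyOp::Update { new: old_value_in_inner, .. }) => {
                // Corresponds to the `old_value == &old_value_in_inner` assertion above.
                assert_eq!(old_value, *old_value_in_inner, "inconsistent old value");
                *old_value_in_inner = new_value;
            }
            None => {
                self.buffer.insert(key, KeyOp::Update { old: old_value, new: new_value });
            }
        }
    }
}

fn main() {
    let mut mem = MemTableSketch { buffer: BTreeMap::new() };
    mem.update(b"k".to_vec(), b"3".to_vec(), b"4".to_vec());
    // A later update still claiming the old value is "3" would hit the assert:
    // mem.update(b"k".to_vec(), b"3".to_vec(), b"5".to_vec());
}
```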

@soundOfDestiny (Contributor)

Perhaps watermarks are not handled correctly in the join executor and the old value is deleted by range tombstones.

@yuhao-su (Contributor) commented Mar 3, 2023

> Perhaps watermarks are not handled correctly in the join executor and the old value is deleted by range tombstones.

The cluster is started without watermarks: https://github.com/singularity-data/risingwave/blob/de37916fe87a1f1642f33060aa7c5add3c3c4d3a/src/tests/simulation/tests/it/nexmark_chaos.rs#L43

@soundOfDestiny (Contributor)

I added

                    self.side_l.ht.clear();
                    self.side_r.ht.clear();

after the barrier handling and the test passed:

                AlignedMessage::Barrier(barrier) => {
                    let barrier_start_time = minstant::Instant::now();
                    self.flush_data(barrier.epoch).await?;

                    self.side_l.ht.clear();
                    self.side_r.ht.clear();

where `clear` simply clears the inner cache:

    pub fn clear(&mut self) {
        self.inner.clear();
    }

cc @yuhao-su @hzxa21

@hzxa21 (Collaborator) commented Mar 3, 2023

> I added `self.side_l.ht.clear(); self.side_r.ht.clear();` after the barrier handling and the test passed. […]

Is it possible that a vnode is moved back and forth multiple times for an actor and the operator cache for the vnode is stale?

@soundOfDestiny (Contributor)

> Is it possible that a vnode is moved back and forth multiple times for an actor and the operator cache for the vnode is stale?

No, the operator cache is cleared upon vnode change.
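For context on the reply above, a minimal sketch (illustrative names, not the actual executor code) of what "cleared upon vnode change" means: when scaling hands an actor a different vnode set, the in-memory cache is dropped rather than reused, so stale entries should not survive a vnode move.

```rust
use std::collections::HashMap;

/// Simplified stand-in for an operator cache that is only valid for the
/// set of vnodes currently owned by the actor.
struct CachedJoinSide {
    owned_vnodes: Vec<u16>,
    cache: HashMap<Vec<u8>, Vec<u8>>, // join key -> cached row
}

impl CachedJoinSide {
    /// On (re)scaling, the actor receives a new vnode assignment. If it
    /// differs from the previous one, the cache may hold rows for vnodes the
    /// actor no longer owns (or miss data for newly gained vnodes), so it is
    /// cleared instead of being reused.
    fn update_vnodes(&mut self, new_vnodes: Vec<u16>) {
        if new_vnodes != self.owned_vnodes {
            self.cache.clear();
            self.owned_vnodes = new_vnodes;
        }
    }
}

fn main() {
    let mut side = CachedJoinSide { owned_vnodes: vec![0, 1], cache: HashMap::new() };
    side.cache.insert(b"k".to_vec(), b"v".to_vec());
    side.update_vnodes(vec![0, 2]); // vnode set changed -> cache dropped
    assert!(side.cache.is_empty());
}
```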

@yuhao-su (Contributor) commented Mar 3, 2023

I tried to add the cache clear code upon barrier arrival. The test is still failing.

MADSIM_TEST_NUM=100 ./risedev sscale-test --cargo-profile ci-sim nexmark_recovery::nexmark_recovery_q103

@soundOfDestiny (Contributor)

> I tried to add the cache clear code upon barrier arrival. The test is still failing.
>
> MADSIM_TEST_NUM=100 ./risedev sscale-test --cargo-profile ci-sim nexmark_recovery::nexmark_recovery_q103

You can set the config as below:

running 1 test
seed = 1677566688247109996
Configuration {
    config_path: "/var/folders/75/pr79ysh55cq8ks3_l9ghx8vw0000gn/T/.tmpUKM5Kt",
    frontend_nodes: 2,
    compute_nodes: 3,
    meta_nodes: 1,
    compactor_nodes: 2,
    compute_node_cores: 2,
    etcd_timeout_rate: 0.0,
    etcd_data_path: None,
}
test nexmark_recovery::nexmark_recovery_q5 ... FAILED

It fails nexmark_recovery_q5 with the current code, but does not fail if we add the cache-clear code upon barrier arrival.

@soundOfDestiny (Contributor)

I have found the root cause of this issue.

DELETE [Some(Int64(1914)), Some(Int64(3)), Some(NaiveDateTime(NaiveDateTimeWrapper(2015-07-14T23:59:54)))]
INSERT [Some(Int64(1914)), Some(Int64(4)), Some(NaiveDateTime(NaiveDateTimeWrapper(2015-07-14T23:59:54)))]

This happens in epoch 2099937901150208. Since that epoch, there has been only one checkpoint epoch, EpochPair { curr: 2099937950302208, prev: 2099937933918208 }. Epoch 2099937933918208 has been synced, but epoch 2099937950302208 has not, and then recovery starts.
After recovery, the streaming code expects 4 but storage holds 3, which leads to the panic.
Therefore I think the value in storage is correct.

cc @wangrunji0408 @yezizp2012 @yuhao-su
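A worked sketch of the failure mode described above (the epoch numbers are taken from the analysis; everything else is illustrative): the 3 → 4 update sits in an epoch whose checkpoint was never synced, so after recovery storage still holds 3 while the replayed streaming state expects 4.

```rust
use std::collections::HashMap;

fn main() {
    // The DELETE(…, 3, …) / INSERT(…, 4, …) pair for key 1914 was applied in
    // epoch 2099937901150208, but the only checkpoint since then
    // (curr: 2099937950302208) was never synced before recovery, so the
    // 3 -> 4 update never became durable. Storage reflects only the state as
    // of the synced epoch (prev: 2099937933918208).
    let storage: HashMap<i64, i64> = HashMap::from([(1914, 3)]);

    // After recovery, the streaming side expects the old value to be 4...
    let expected_old_value = 4;
    // ...but storage still says 3: the mismatch that trips the assertion.
    assert_ne!(expected_old_value, storage[&1914]); // 4 != 3 -> panic in the real code
}
```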

@yuhao-su (Contributor) commented Mar 6, 2023

Cc @zwang28, PTAL.

@soundOfDestiny (Contributor)

Fixed by #8468.
