scaling test panicked at 'overwrites an existing key' #8199

Closed · Tracked by #6640
wangrunji0408 opened this issue Feb 27, 2023 · 17 comments
Labels: type/bug (Something isn't working)

@wangrunji0408 (Contributor)

Describe the bug

--- STDERR:              risingwave_simulation::nexmark_recovery nexmark_recovery_q103 ---
thread '<unnamed>' panicked at 'overwrites an existing key!
table_id: 1015, vnode: 137, key: OwnedRow([Some(Int64(2100))])
value in storage: OwnedRow([Some(Int64(2100))])
value to write: OwnedRow([Some(Int64(2100))])', /risingwave/src/stream/src/common/table/state_table.rs:875:13

Found in #7623.
https://buildkite.com/risingwavelabs/pull-request/builds/18318#018691bc-bbb4-40e2-88d4-f4c9e1f52db2/116-197
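For context, the panic comes from a consistency check in the state table write path: an insert is rejected if storage already holds a value for the same key, because that means the streaming operator's view of its state has diverged from storage. A minimal, self-contained sketch of that kind of check (the types and names below are simplified stand-ins, not RisingWave's actual `StateTable` API):

```rust
use std::collections::HashMap;

/// Simplified stand-in for a state table keyed by (vnode, primary key).
struct StateTableSketch {
    table_id: u32,
    storage: HashMap<(u16, Vec<i64>), Vec<i64>>,
}

impl StateTableSketch {
    /// The streaming operator believes `key` is new. If storage already holds
    /// a value for it, streaming state and storage have diverged, so panic
    /// with the same kind of message as in the report above.
    fn insert(&mut self, vnode: u16, key: Vec<i64>, value: Vec<i64>) {
        if let Some(old) = self.storage.get(&(vnode, key.clone())) {
            panic!(
                "overwrites an existing key!\ntable_id: {}, vnode: {}, key: {:?}\nvalue in storage: {:?}\nvalue to write: {:?}",
                self.table_id, vnode, key, old, value
            );
        }
        self.storage.insert((vnode, key), value);
    }
}

fn main() {
    let mut table = StateTableSketch { table_id: 1015, storage: HashMap::new() };
    table.insert(137, vec![2100], vec![2100]);
    // Calling `table.insert(137, vec![2100], vec![2100])` again would hit the panic.
}
```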

To Reproduce

No response

Expected behavior

No response

Additional context

No response

@wangrunji0408 wangrunji0408 added the type/bug Something isn't working label Feb 27, 2023
@github-actions github-actions bot added this to the release-0.1.18 milestone Feb 27, 2023
@yezizp2012 (Member)

Cc @yuhao-su, PTAL.

@yuhao-su (Contributor) commented Mar 2, 2023

I can't reproduce the "overwrites an existing key" bug on my MBP.

But I encountered a result-mismatch bug instead. The diff varies in each run:

thread '<unnamed>' panicked at 'assertion failed: `(left == right)`

Diff < left / right > :
...
...


', src/tests/simulation/tests/it/nexmark_recovery.rs:51:32
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
note: run with `MADSIM_TEST_SEED=1677741584723559058` environment variable to reproduce this error

@yuhao-su (Contributor) commented Mar 2, 2023

I also tried changing the following config to in-memory and the test passed, so is it possible there is a Hummock bug? Cc @hzxa21 @wenym1, PTAL.

https://github.com/singularity-data/risingwave/blob/f671f09be0bcaa045c2812c444486dec1323e44f/src/tests/simulation/src/cluster.rs#L247
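For reference, a rough sketch of the experiment described above, with purely hypothetical names (the real knob lives at the `cluster.rs` line linked above): the idea is to switch the simulated cluster's state store from Hummock to an in-memory backend, taking Hummock out of the picture.

```rust
/// Hypothetical stand-in for the state-store choice in the simulation
/// cluster config; not RisingWave's actual API.
#[derive(Debug, Clone, Copy)]
enum StateStoreBackend {
    /// Default: exercises the full Hummock storage path.
    Hummock,
    /// Experiment: a purely in-memory state store (the test passed with this).
    InMemory,
}

fn state_store_label(backend: StateStoreBackend) -> &'static str {
    // Illustrative labels only; the real config uses its own string format.
    match backend {
        StateStoreBackend::Hummock => "hummock",
        StateStoreBackend::InMemory => "in-memory",
    }
}

fn main() {
    // Swapping this one choice is the whole experiment described above.
    println!("state store: {}", state_store_label(StateStoreBackend::InMemory));
}
```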

@soundOfDestiny (Contributor)

But there is no such panic at /risingwave/src/stream/src/common/table/state_table.rs:875:13.

@yezizp2012 (Member)

> But there is no such panic at /risingwave/src/stream/src/common/table/state_table.rs:875:13.

It was found in #7623; the branch wrj/streaming-recovery-test hasn't merged the main branch since the last run.

@soundOfDestiny (Contributor)

> It was found in #7623; the branch wrj/streaming-recovery-test hasn't merged the main branch since the last run.

I switched to wrj/streaming-recovery-test but still see `let prefix_serializer = self.pk_serde.prefix(pk_prefix.len());` at /risingwave/src/stream/src/common/table/state_table.rs:875:13.

@yezizp2012 (Member)

> I switched to wrj/streaming-recovery-test but still see `let prefix_serializer = self.pk_serde.prefix(pk_prefix.len());` at /risingwave/src/stream/src/common/table/state_table.rs:875:13.

The log of the last failed run is as follows:

thread '<unnamed>' panicked at 'assertion failed: old_value == &old_value_in_inner', src/storage/src/mem_table.rs:279:33
stack backtrace:
   0: rust_begin_unwind
             at /rustc/5ce39f42bd2c8bca9c570f0560ebe1fce4eddb14/library/std/src/panicking.rs:575:5
   1: core::panicking::panic_fmt
             at /rustc/5ce39f42bd2c8bca9c570f0560ebe1fce4eddb14/library/core/src/panicking.rs:64:14
   2: core::panicking::panic
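For context, this assertion fires when an update's claimed old value does not match what the mem-table has already buffered for that key in the current epoch. A simplified, self-contained sketch of that kind of consistency check (illustrative names, not the actual `mem_table.rs` code):

```rust
use std::collections::BTreeMap;

/// Simplified stand-in for a mem-table buffering un-flushed key operations.
#[derive(Debug, Clone, PartialEq)]
enum KeyOp {
    Insert(Vec<u8>),
    Update { old: Vec<u8>, new: Vec<u8> },
}

struct MemTableSketch {
    buffer: BTreeMap<Vec<u8>, KeyOp>,
}

impl MemTableSketch {
    /// The streaming operator passes the old value it believes is current.
    /// If the key already has a buffered op in this epoch, the buffered value
    /// must agree with `old_value`; a mismatch means streaming state and
    /// storage have diverged.
    fn update(&mut self, key: Vec<u8>, old_value: Vec<u8>, new_value: Vec<u8>) {
        match self.buffer.get_mut(&key) {
            Some(KeyOp::Insert(old_value_in_inner))
            | Some(KeyOp::Update { new: old_value_in_inner, .. }) => {
                // Corresponds to the `old_value == &old_value_in_inner` assertion above.
                assert_eq!(old_value, *old_value_in_inner, "inconsistent old value");
                *old_value_in_inner = new_value;
            }
            None => {
                self.buffer.insert(key, KeyOp::Update { old: old_value, new: new_value });
            }
        }
    }
}

fn main() {
    let mut mem = MemTableSketch { buffer: BTreeMap::new() };
    mem.update(b"k".to_vec(), b"3".to_vec(), b"4".to_vec());
    // A later update still claiming the old value is "3" would hit the assert:
    // mem.update(b"k".to_vec(), b"3".to_vec(), b"5".to_vec());
}
```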

@soundOfDestiny (Contributor)

Perhaps watermarks are not handled correctly in the join executor and the old value is deleted by range tombstones.

@yuhao-su (Contributor) commented Mar 3, 2023

> Perhaps watermarks are not handled correctly in the join executor and the old value is deleted by range tombstones.

The cluster is started without watermarks: https://github.com/singularity-data/risingwave/blob/de37916fe87a1f1642f33060aa7c5add3c3c4d3a/src/tests/simulation/tests/it/nexmark_chaos.rs#L43

@soundOfDestiny (Contributor)

I added

                    self.side_l.ht.clear();
                    self.side_r.ht.clear();

after the barrier handling and the test passed:

                AlignedMessage::Barrier(barrier) => {
                    let barrier_start_time = minstant::Instant::now();
                    self.flush_data(barrier.epoch).await?;

                    self.side_l.ht.clear();
                    self.side_r.ht.clear();

where `clear` simply clears the inner cache:

    pub fn clear(&mut self) {
        self.inner.clear();
    }

cc @yuhao-su @hzxa21

@hzxa21 (Collaborator) commented Mar 3, 2023

> I added `self.side_l.ht.clear(); self.side_r.ht.clear();` after the barrier handling and the test passed. […]

Is it possible that a vnode is moved back and forth multiple times for an actor and the operator cache for the vnode is stale?

@soundOfDestiny (Contributor)

> Is it possible that a vnode is moved back and forth multiple times for an actor and the operator cache for the vnode is stale?

No, the operator cache is cleared upon vnode change.
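For context on the reply above, a minimal sketch (illustrative names, not the actual executor code) of what "cleared upon vnode change" means: when scaling hands an actor a different vnode set, the in-memory cache is dropped rather than reused, so stale entries should not survive a vnode move.

```rust
use std::collections::HashMap;

/// Simplified stand-in for an operator cache that is only valid for the
/// set of vnodes currently owned by the actor.
struct CachedJoinSide {
    owned_vnodes: Vec<u16>,
    cache: HashMap<Vec<u8>, Vec<u8>>, // join key -> cached row
}

impl CachedJoinSide {
    /// On (re)scaling, the actor receives a new vnode assignment. If it
    /// differs from the previous one, the cache may hold rows for vnodes the
    /// actor no longer owns (or miss data for newly gained vnodes), so it is
    /// cleared instead of being reused.
    fn update_vnodes(&mut self, new_vnodes: Vec<u16>) {
        if new_vnodes != self.owned_vnodes {
            self.cache.clear();
            self.owned_vnodes = new_vnodes;
        }
    }
}

fn main() {
    let mut side = CachedJoinSide { owned_vnodes: vec![0, 1], cache: HashMap::new() };
    side.cache.insert(b"k".to_vec(), b"v".to_vec());
    side.update_vnodes(vec![0, 2]); // vnode set changed -> cache dropped
    assert!(side.cache.is_empty());
}
```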

@yuhao-su (Contributor) commented Mar 3, 2023

I tried to add the cache clear code upon barrier arrival. The test is still failing.

MADSIM_TEST_NUM=100 ./risedev sscale-test --cargo-profile ci-sim nexmark_recovery::nexmark_recovery_q103

@soundOfDestiny (Contributor)

> I tried to add the cache clear code upon barrier arrival. The test is still failing.
>
> MADSIM_TEST_NUM=100 ./risedev sscale-test --cargo-profile ci-sim nexmark_recovery::nexmark_recovery_q103

You can set the config as below:

running 1 test
seed = 1677566688247109996
Configuration {
    config_path: "/var/folders/75/pr79ysh55cq8ks3_l9ghx8vw0000gn/T/.tmpUKM5Kt",
    frontend_nodes: 2,
    compute_nodes: 3,
    meta_nodes: 1,
    compactor_nodes: 2,
    compute_node_cores: 2,
    etcd_timeout_rate: 0.0,
    etcd_data_path: None,
}
test nexmark_recovery::nexmark_recovery_q5 ... FAILED

It fails nexmark_recovery_q5 with the current code, but does not fail if we add the cache-clear code upon barrier arrival.

@soundOfDestiny (Contributor)

I have found the root cause of this issue.

DELETE [Some(Int64(1914)), Some(Int64(3)), Some(NaiveDateTime(NaiveDateTimeWrapper(2015-07-14T23:59:54)))]
INSERT [Some(Int64(1914)), Some(Int64(4)), Some(NaiveDateTime(NaiveDateTimeWrapper(2015-07-14T23:59:54)))]

This happens in epoch 2099937901150208. Since that epoch, there has been only one checkpoint epoch, EpochPair { curr: 2099937950302208, prev: 2099937933918208 }. Epoch 2099937933918208 has been synced, but epoch 2099937950302208 has not, and then recovery starts.
After recovery, the streaming code expects 4 but storage holds 3, which leads to the panic.
Therefore I think the value in storage is correct.

cc @wangrunji0408 @yezizp2012 @yuhao-su
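A worked sketch of the failure mode described above (the epoch numbers are taken from the analysis; everything else is illustrative): the 3 → 4 update sits in an epoch whose checkpoint was never synced, so after recovery storage still holds 3 while the replayed streaming state expects 4.

```rust
use std::collections::HashMap;

fn main() {
    // The DELETE(…, 3, …) / INSERT(…, 4, …) pair for key 1914 was applied in
    // epoch 2099937901150208, but the only checkpoint since then
    // (curr: 2099937950302208) was never synced before recovery, so the
    // 3 -> 4 update never became durable. Storage reflects only the state as
    // of the synced epoch (prev: 2099937933918208).
    let storage: HashMap<i64, i64> = HashMap::from([(1914, 3)]);

    // After recovery, the streaming side expects the old value to be 4...
    let expected_old_value = 4;
    // ...but storage still says 3: the mismatch that trips the assertion.
    assert_ne!(expected_old_value, storage[&1914]); // 4 != 3 -> panic in the real code
}
```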

@yuhao-su (Contributor) commented Mar 6, 2023

Cc @zwang28, PTAL.

@soundOfDestiny (Contributor)

Fixed by #8468.
