bug: hash agg state fetching panics at vnode should not be accessed in longevity test #6642

Closed
Tracked by #6640
BugenZhao opened this issue Nov 29, 2022 · 9 comments
Labels
component/streaming Stream processing related issue. found-by-longevity-test type/bug Something isn't working

Comments

@BugenZhao
Member

Full logs here: https://ap-southeast-1.console.aws.amazon.com/cloudwatch/home?region=ap-southeast-1#logsV2:log-groups/log-group/$252Frisingwave$252Frls-apse1-eks-a$252Frwc-3-longevity-20221128-044939/log-events/ip-10-0-6-144.ap-southeast-1.compute.internal-risingwave.var.log.containers.risingwave-compute-0_rwc-3-longevity-20221128-044939_compute-384e31312989248ef0f98db5e6e3e9e4178827011352cb7f3361e5b5d3092807.log

thread 'risingwave-streaming-actor' panicked at 'vnode 120 should not be accessed by this table', src/storage/src/table/mod.rs:153:5
stack backtrace:
thread 'risingwave-streaming-actor' panicked at 'vnode 19 should not be accessed by this table', src/storage/src/table/mod.rs:153:5
thread 'risingwave-streaming-actor' panicked at 'vnode 120 should not be accessed by this table', src/storage/src/table/mod.rs:153:5
thread 'risingwave-streaming-actor' panicked at 'vnode 114 should not be accessed by this table', src/storage/src/table/mod.rs:153:5
thread 'risingwave-streaming-actor' panicked at 'vnode 101 should not be accessed by this table', src/storage/src/table/mod.rs:153:5
thread 'risingwave-streaming-actor' panicked at 'vnode 143 should not be accessed by this table', src/storage/src/table/mod.rs:153:5
thread 'risingwave-streaming-actor' panicked at 'vnode 81 should not be accessed by this table', src/storage/src/table/mod.rs:153:5
thread 'risingwave-streaming-actor' panicked at 'vnode 96 should not be accessed by this table', src/storage/src/table/mod.rs:153:5
   0:     0x5573f4889f00 - std::backtrace_rs::backtrace::libunwind::trace::hf9eede5a9d6c67b2
                               at /rustc/b8c35ca26b191bb9a9ac669a4b3f4d3d52d97fb1/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
   1:     0x5573f4889f00 - std::backtrace_rs::backtrace::trace_unsynchronized::h0c91910fa04f4df9
                               at /rustc/b8c35ca26b191bb9a9ac669a4b3f4d3d52d97fb1/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
   2:     0x5573f4889f00 - std::sys_common::backtrace::_print_fmt::he2b25433ad3bf420
                               at /rustc/b8c35ca26b191bb9a9ac669a4b3f4d3d52d97fb1/library/std/src/sys_common/backtrace.rs:66:5
   3:     0x5573f4889f00 - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::hef18c387f5407d06
                               at /rustc/b8c35ca26b191bb9a9ac669a4b3f4d3d52d97fb1/library/std/src/sys_common/backtrace.rs:45:22
   4:     0x5573f48b631e - core::fmt::write::h0a37462e5afa8d15
                               at /rustc/b8c35ca26b191bb9a9ac669a4b3f4d3d52d97fb1/library/core/src/fmt/mod.rs:1209:17
   5:     0x5573f4881bc5 - std::io::Write::write_fmt::hc60ebc8a0cd9d3df
                               at /rustc/b8c35ca26b191bb9a9ac669a4b3f4d3d52d97fb1/library/std/src/io/mod.rs:1682:15
   6:     0x5573f488b583 - std::sys_common::backtrace::_print::he3d440518a780bd7
                               at /rustc/b8c35ca26b191bb9a9ac669a4b3f4d3d52d97fb1/library/std/src/sys_common/backtrace.rs:48:5
   7:     0x5573f488b583 - std::sys_common::backtrace::print::ha90d39b5839b2415
                               at /rustc/b8c35ca26b191bb9a9ac669a4b3f4d3d52d97fb1/library/std/src/sys_common/backtrace.rs:35:9
   8:     0x5573f488b583 - std::panicking::default_hook::{{closure}}::h3448dcb19bce1d74
                               at /rustc/b8c35ca26b191bb9a9ac669a4b3f4d3d52d97fb1/library/std/src/panicking.rs:267:22
   9:     0x5573f488b25a - std::panicking::default_hook::hd0a73ed3ba68d845
                               at /rustc/b8c35ca26b191bb9a9ac669a4b3f4d3d52d97fb1/library/std/src/panicking.rs:286:9
  10:     0x5573f03e7ad2 - <alloc::boxed::Box<F,A> as core::ops::function::Fn<Args>>::call::h6add681223999f58
                               at /rustc/b8c35ca26b191bb9a9ac669a4b3f4d3d52d97fb1/library/alloc/src/boxed.rs:1952:9
  11:     0x5573f03e7ad2 - risingwave_rt::set_panic_hook::{{closure}}::h332cbc02a12bb36d
                               at /risingwave/src/utils/runtime/src/lib.rs:81:9
  12:     0x5573f03e7ad2 - std::panicking::update_hook::{{closure}}::h37ad3eecf6248393
                               at /rustc/b8c35ca26b191bb9a9ac669a4b3f4d3d52d97fb1/library/std/src/panicking.rs:232:47
  13:     0x5573f488be09 - <alloc::boxed::Box<F,A> as core::ops::function::Fn<Args>>::call::h789be6fd9a4af450
                               at /rustc/b8c35ca26b191bb9a9ac669a4b3f4d3d52d97fb1/library/alloc/src/boxed.rs:1952:9
  14:     0x5573f488be09 - std::panicking::rust_panic_with_hook::h5e710c30b3aa952e
                               at /rustc/b8c35ca26b191bb9a9ac669a4b3f4d3d52d97fb1/library/std/src/panicking.rs:692:13
  15:     0x5573f488bb87 - std::panicking::begin_panic_handler::{{closure}}::h88a18397eefe7379
                               at /rustc/b8c35ca26b191bb9a9ac669a4b3f4d3d52d97fb1/library/std/src/panicking.rs:579:13
  16:     0x5573f488a3ac - std::sys_common::backtrace::__rust_end_short_backtrace::hfe085dbc16a7edcb
                               at /rustc/b8c35ca26b191bb9a9ac669a4b3f4d3d52d97fb1/library/std/src/sys_common/backtrace.rs:138:18
  17:     0x5573f488b8a2 - rust_begin_unwind
                               at /rustc/b8c35ca26b191bb9a9ac669a4b3f4d3d52d97fb1/library/std/src/panicking.rs:575:5
  18:     0x5573f48b3533 - core::panicking::panic_fmt::ha20927ebca9f6d14
                               at /rustc/b8c35ca26b191bb9a9ac669a4b3f4d3d52d97fb1/library/core/src/panicking.rs:65:14
  19:     0x5573f23cba6b - risingwave_storage::table::check_vnode_is_set::h4e52b90eca5f5ccd
                               at /risingwave/src/storage/src/table/mod.rs:153:5
  20:     0x5573f105a67b - risingwave_storage::table::compute_vnode::h02282d32362c46e8
                               at /risingwave/src/storage/src/table/mod.rs:116:9

This is really strange, as the dispatcher should already ensure that the group key has a correct vnode.
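For readers unfamiliar with the check that fires here, the following is a minimal, hypothetical sketch of the idea and not RisingWave's actual code. It assumes each table instance holds a set of vnodes it owns, derives a key's vnode from a hash of the key, and panics if that vnode is not owned. Only the names `compute_vnode` and `check_vnode_is_set` and the panic message are taken from the backtrace above; `VNODE_COUNT`, `OwnedVnodes`, `hash_to_vnode`, and the use of `DefaultHasher` are assumptions made for illustration.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Assumed number of virtual nodes; the real value is not stated in this issue.
const VNODE_COUNT: usize = 256;

/// Hypothetical stand-in for the set of vnodes owned by one table instance.
struct OwnedVnodes {
    bits: [bool; VNODE_COUNT],
}

impl OwnedVnodes {
    /// Build the ownership set from a list of vnode indices.
    fn new(owned: &[usize]) -> Self {
        let mut bits = [false; VNODE_COUNT];
        for &v in owned {
            bits[v] = true;
        }
        Self { bits }
    }

    /// Mirrors the intent of `check_vnode_is_set` seen in the backtrace:
    /// panic if this table instance does not own the vnode.
    fn check_vnode_is_set(&self, vnode: usize) {
        assert!(
            self.bits[vnode],
            "vnode {} should not be accessed by this table",
            vnode
        );
    }

    /// Hash the key to a vnode, then verify ownership before returning it.
    fn compute_vnode<K: Hash>(&self, key: &K) -> usize {
        let vnode = hash_to_vnode(key);
        self.check_vnode_is_set(vnode);
        vnode
    }
}

/// Illustrative key-to-vnode mapping (an assumption, not the real hash).
fn hash_to_vnode<K: Hash>(key: &K) -> usize {
    let mut hasher = DefaultHasher::new();
    key.hash(&mut hasher);
    (hasher.finish() % VNODE_COUNT as u64) as usize
}
```

Under this model the panic can only occur when a key reaches an instance whose ownership set does not contain the key's vnode, which is why a mismatch between the upstream routing and the table's vnode set is the natural suspect.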

@chenzl25
Contributor

Is there a concrete SQL query that triggers this?

@zwang28
Contributor

zwang28 commented Dec 21, 2022

Have we seen any other occurrences since #6833?

@BugenZhao
Member Author

Have we seen any other occurrences since #6833?

Seems not. But I guess they might be unrelated. 😰

@kwannoel
Contributor

I'm running longevity_test with q15, q16, and q17, since these contain aggregations. We'll see the outcome in 12h: https://risingwave-labs.slack.com/archives/C03L10EQAAG/p1672132900456469.

@kwannoel
Contributor

kwannoel commented Dec 27, 2022

Some additional info, thanks to @sushantgupta for confirming:

  • Nexmark longevity runs daily with default parameters, i.e. longevity_nexmark start.
  • See here for how it's run: https://risingwave-labs.slack.com/archives/C03L10EQAAG/p1672132900456469.
  • The source data is randomly generated and passed into the Nexmark query.
  • We do not tear down nodes in this setup, so this issue should be easier to reproduce (if it still exists).

Some ideas I will try for debugging:

  • Run longevity_nexmark with different queries, parameters, and intervals.
  • Try to reproduce it within a reasonably short time span by increasing the interval (the default currently runs for 12h).
  • Try to pinpoint the query causing this.
  • If that works, set it up locally to run repeatedly and trace.
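When comparing runs across queries, it may help to tally which vnodes appear in the panic lines of each log. The following is a small sketch of such a tally, assuming the panic lines keep the exact wording shown in the original report. The function names `parse_vnode` and `count_vnode_panics` are hypothetical helpers and not part of any RisingWave tooling.

```rust
use std::collections::BTreeMap;

/// Extract the vnode number from a panic line of the form
/// "... panicked at 'vnode N should not be accessed by this table', ...".
/// Returns `None` for any line that does not match that wording.
fn parse_vnode(line: &str) -> Option<u32> {
    let rest = line.split("panicked at 'vnode ").nth(1)?;
    let (num, tail) = rest.split_once(' ')?;
    if !tail.starts_with("should not be accessed by this table") {
        return None;
    }
    num.parse().ok()
}

/// Count how many times each vnode appears in the panic lines of a log.
fn count_vnode_panics(log: &str) -> BTreeMap<u32, usize> {
    let mut counts = BTreeMap::new();
    for vnode in log.lines().filter_map(parse_vnode) {
        *counts.entry(vnode).or_insert(0) += 1;
    }
    counts
}
```

Feeding the log text of a failed run to `count_vnode_panics` yields a map from vnode to panic count, which can then be compared against the vnode set each actor is expected to own.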

@kwannoel
Contributor

So far q15,q16,q17 pass: https://buildkite.com/risingwave-test/longevity-test/builds/244#018552e7-ed47-46c1-bc84-dd14911e8866/578.

Perhaps other queries can reproduce this? I'll try again once https://github.com/risingwavelabs/risingwave-test/issues/158 is resolved.

Next steps:

  • Test q15, q16, and q17 again with an increased data ingestion interval.
  • Test the rest of the queries.

@kwannoel
Contributor

kwannoel commented Jan 4, 2023

The Nexmark longevity test currently fails pretty often: https://buildkite.com/risingwave-test/longevity-test.

I'll continue investigating once that is resolved. For context, see: https://app.slack.com/client/T030LTU38S2/C0423G2NUF8/thread/C0423G2NUF8-1672652465.607889.

@kwannoel
Contributor

kwannoel commented Jan 11, 2023

A further update: the longevity test is still blocked. Progress will be tracked here: https://github.com/risingwavelabs/risingwave-test/issues/179.

We still need the longevity test in order to pinpoint which Nexmark query or queries cause this error.

@fuyufjh
Member

fuyufjh commented Jan 30, 2023

Cannot reproduce

@fuyufjh closed this as not planned on Jan 30, 2023