Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Data race found by miri test #279

Open
ParkMyCar opened this issue Jun 14, 2023 · 9 comments
Open

Data race found by miri test #279

ParkMyCar opened this issue Jun 14, 2023 · 9 comments
Assignees

Comments

@ParkMyCar
Copy link

Commit: d74d1e3c7cd922577767c8c2f9a8bdf7b7018e69
Rust Version: cargo 1.70.0-nightly (0e474cfd7 2023-03-31)
OS: macOS 13.4
CPU: Apple M2 Max

Running cargo +nightly miri test in the root of the crate, fails on the first test:

test cht::map::bucket::tests::get_insert_remove

with the error:

error: Undefined Behavior: Data race detected between (1) Write on thread `<unnamed>` and (2) Read on thread `<unnamed>` at alloc218679. (2) just happened here

I also disabled the race detector with MIRIFLAGS="-Zmiri-disable-data-race-detector", but when running with this flag the test never completes.

Note: I started looking into Moka because recently we tried using it in MaterializeInc/materialize#19614, but had to revert it because we were observing memory corruption.


Full Stack Trace:

error: Undefined Behavior: Data race detected between (1) Write on thread `<unnamed>` and (2) Read on thread `<unnamed>` at alloc218679. (2) just happened here
   --> /Users/parker.timmerman/.cargo/registry/src/index.crates.io-6f17d22bba15001f/crossbeam-epoch-0.9.15/src/atomic.rs:204:9
    |
204 |         &*(ptr as *const T)
    |         ^^^^^^^^^^^^^^^^^^^ Data race detected between (1) Write on thread `<unnamed>` and (2) Read on thread `<unnamed>` at alloc218679. (2) just happened here
    |
help: and (1) occurred earlier here
   --> src/cht/map/bucket_array_ref.rs:308:58
    |
308 |                 maybe_new_bucket_array.unwrap_or_else(|| Owned::new(BucketArray::default()));
    |                                                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    = help: this indicates a bug in the program: it performed an invalid operation, and caused Undefined Behavior
    = help: see https://doc.rust-lang.org/nightly/reference/behavior-considered-undefined.html for further information
    = note: BACKTRACE (of the first span):
    = note: inside `<cht::map::bucket::BucketArray<i32, i32> as crossbeam_epoch::Pointable>::deref::<'_>` at /Users/parker.timmerman/.cargo/registry/src/index.crates.io-6f17d22bba15001f/crossbeam-epoch-0.9.15/src/atomic.rs:204:9: 204:28
    = note: inside `crossbeam_epoch::Shared::<'_, cht::map::bucket::BucketArray<i32, i32>>::as_ref` at /Users/parker.timmerman/.cargo/registry/src/index.crates.io-6f17d22bba15001f/crossbeam-epoch-0.9.15/src/atomic.rs:1518:18: 1518:31
note: inside `cht::map::bucket_array_ref::BucketArrayRef::<'_, i32, i32, std::collections::hash_map::RandomState>::get`
   --> src/cht/map/bucket_array_ref.rs:303:54
    |
303 |             if let Some(bucket_array_ref) = unsafe { bucket_array_ptr.as_ref() } {
    |                                                      ^^^^^^^^^^^^^^^^^^^^^^^^^
note: inside `cht::map::bucket_array_ref::BucketArrayRef::<'_, i32, i32, std::collections::hash_map::RandomState>::insert_with_or_modify_entry_and::<i32, [closure@src/cht/segment.rs:310:17: 310:19], [closure@src/cht/segment.rs:311:17: 311:24], [closure@src/cht/segment.rs:829:77: 829:83]>`
   --> src/cht/map/bucket_array_ref.rs:207:27
    |
207 |         let current_ref = self.get(guard);
    |                           ^^^^^^^^^^^^^^^
note: inside `cht::segment::HashMap::<i32, i32>::insert_entry_and::<i32, [closure@src/cht/segment.rs:829:77: 829:83]>`
   --> src/cht/segment.rs:304:22
    |
304 |           let result = self
    |  ______________________^
305 | |             .bucket_array_ref(hash)
306 | |             // .insert_entry_and(key, hash, value, with_previous_entry);
307 | |             .insert_with_or_modify_entry_and(
...   |
312 | |                 with_previous_entry,
313 | |             );
    | |_____________^
note: inside closure
   --> src/cht/segment.rs:829:36
    |
829 |                         assert_eq!(map.insert_entry_and(j, map.hash(&j), j, |_, v| *v), None);
    |                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@tatsuya6502
Copy link
Member

tatsuya6502 commented Jun 14, 2023

Thank you for reporting! I will take a look.

Note: I started looking into Moka because recently we tried using it in MaterializeInc/materialize#19614, but had to revert it because we were observing memory corruption.

It seems an older version of Moka (v0.9.x) was used in Materialize.

https://github.com/MaterializeInc/materialize/pull/19614/files#diff-82905fedd5d5e1119c9fed8511b2daf2a48d85057148ba441e99aa324dfe2712R41

moka = { version = "0.9.6", default-features = false, features = ["sync"] }

Do your segfault issues remain in the latest version of Moka (v0.11.2)? It will not be related to the issues but we fixed one incorrect mut usage in unsafe code in Moka v0.11.0.

@tatsuya6502 tatsuya6502 self-assigned this Jun 14, 2023
@ParkMyCar
Copy link
Author

Thanks for the quick reply! Unfortunately we didn't have the chance to try v0.11 as the segfault was preventing a production release. If we ever do get the chance to try it, I'll be sure to update this issue!

@tatsuya6502
Copy link
Member

Thanks for the reply.

Unfortunately we didn't have the chance to try v0.11 as the segfault was preventing a production release.

Understood.

I ran cargo +nightly miri test on the tests under cht module and found Miri reports the error on the following tests:

cht::segment::tests::concurrent_growth
cht::segment::tests::concurrent_insert_if_not_present
cht::segment::tests::concurrent_insert_with_or_modify
cht::segment::tests::concurrent_overlapped_growth
cht::segment::tests::concurrent_overlapped_insertion
cht::segment::tests::concurrent_overlapped_removal

$ cargo +nightly miri test --lib cht::segment::tests::concurrent_overlapped_removal
...

error: Undefined Behavior: Data race detected between (1) Read on thread `<unnamed>` and (2) Write on thread `<unnamed>` at alloc123212. (2) just happened here
    --> /Users/tatsuya/.cargo/registry/src/index.crates.io-6f17d22bba15001f/crossbeam-epoch-0.9.15/src/atomic.rs:208:9
     |
208  |         &mut *(ptr as *mut T)
     |         ^^^^^^^^^^^^^^^^^^^^^ Data race detected between (1) Read on thread `<unnamed>` and (2) Write on thread `<unnamed>` at alloc123212. (2) just happened here

I lightly reviewed some of the errors and relevant codes in moka but it seems we are using crossbeam-epoch correctly. I will review again and maybe check with the maintainers of crossbeam-epoch if they have any opinions.

@tatsuya6502
Copy link
Member

tatsuya6502 commented Jun 17, 2023

I lightly reviewed some of the errors and relevant codes in moka but it seems we are using crossbeam-epoch correctly. I will review again and maybe check with the maintainers of crossbeam-epoch if they have any opinions.

Asked the question: crossbeam-rs/crossbeam#993

@tatsuya6502
Copy link
Member

tatsuya6502 commented Jun 19, 2023

I reviewed Miri's data race errors on the following tests, created a minimum code to reproduce, and asked some questions to a maintainer of crossbeam-epoch:

  • cht::segment::tests::concurrent_growth
  • cht::segment::tests::concurrent_insert_if_not_present
  • cht::segment::tests::concurrent_insert_with_or_modify
  • cht::segment::tests::concurrent_overlapped_growth
  • cht::segment::tests::concurrent_overlapped_insertion
  • cht::segment::tests::concurrent_overlapped_removal

And our conclusion is these Miri errors are most likely false positives. We found no issues in relevant code in crossbeam-epoch and (so far) in the cht::* (concurrent hash table) modules in moka.

Here are the details:

  1. Miri reported data race errors in crossbeam-epoch code.

    • Miri reports the error on AArch64 platforms (macOS and Linux), but not on x86_64.
    • The crossbeam-epoch maintainer thinks that Miri does not support weak memory model very well yet.
      • and the errors on AArch64 seem false positives.
    • They added a workaround to crossbeam-epoch to avoid the error.
      • The workaround is enabled only when testing with Miri.
  2. Miri reported Stacked Borrows violations in crossbeam-epoch code.

    • Miri reports Stacked Borrows violations on x86_64 platforms, and on AArch64 platforms after they added the above workaround.
    • When we enable Tree Borrows rules in Miri instead of Stacked Borrows, Miri reports no errors.
    • The crossbeam-epoch maintainer thinks crossbeam-epoch is not compatible with Stacked Borrows but compatible with Tree Borrows.
    • Tree Borrows is an alternative to Stacked Borrows, and the Miri team appears to be moving to Tree Borrows.
      • So we will use Tree Borrows for testing moka.
  3. When Miri detects no errors, moka's tests seem to never complete.

    • That is because Miri is very slow.
      • It is a single-threaded interpreter.
    • Test will eventually complete, but takes too long.
      • e.g. cht::segment::tests::insertion took about 4 hours on my Mac with a plain M1 chip.
      • I have not tried others, but I guess they will take similar time.
    • We will make moka tests smaller when testing with Miri, so that we can run them on CI.

Running Miri with the workaround in crossbeam-epoch and Tree Borrows:

Cargo.toml

crossbeam-epoch = { git = "https://github.com/crossbeam-rs/crossbeam.git", branch = "master", optional = true }

Terminal Log (macOS arm64)

$ cargo tree -i crossbeam-epoch
crossbeam-epoch v0.9.15 (https://github.com/crossbeam-rs/crossbeam.git?branch=master#ce31c186)
└── moka v0.11.2 (...)

$ MIRIFLAGS="-Zmiri-tree-borrows" cargo +nightly miri test --lib cht::segment::tests::concurrent_overlapped_removal
Preparing a sysroot for Miri (target: aarch64-apple-darwin)... done
WARNING: Ignoring `RUSTC_WRAPPER` environment variable, Miri does not support wrapping.
    Finished test [unoptimized + debuginfo] target(s) in 0.08s
     Running unittests src/lib.rs (target/miri/aarch64-apple-darwin/debug/deps/moka-451ea5dfd08c9c07)

running 1 test
test cht::segment::tests::concurrent_overlapped_removal ... ^C

## Pressed Ctrl + C after ~1 minute. Otherwise it can take few hours to complete.

@tatsuya6502
Copy link
Member

tatsuya6502 commented Jun 19, 2023

FYI, I am trying to reproduce the memory corruption (segmentation fault) using mokabench, a load generator I run for few hours before releasing moka. It helped to find and identify the root cause of #34.

I made some modifications, and ran it on macOS arm64 and Linux x86_64. But I have not had a chance to reproduce the issue.

Diff: https://gitlab.com/moka-labs/moka-gh279-materialize-memory-corruption/mokabench/-/compare/master...tokio-bytes

README: https://gitlab.com/moka-labs/moka-gh279-materialize-memory-corruption/mokabench/-/blob/10c5ba834145aaa60ef6ed4f3a3732ec61d8cd00/README.md

About This tokio-bytes Branch

This branch is for investigating the memory corruption issue in Materialize, which was mentioned in: #279

Run mokabench as the followings:

$ cargo run --release -- --invalidate --size-aware --eviction-listener immediate

Notable Changes:

  • Use moka v0.9.6 instead of the latest v0.11.x.
  • Use Materialize's SegmentedBytes instead of Arc<[u8]> as a part of the value type.
    • Copied bytes.rs and cast.rs from Materialize's source code (src/ore/src).
    • SegmentedBytes uses bytes crate from the Tokio project.
  • To detect memory corruption more reliably, read all bytes of the cached value after a read operation.
  • Disable all cache implementations except moka::sync::Cache.
    • However, use the sync cache in async context like Materialize does.
  • Disable the default features of moka, and enable sync and dash features.

NOTE: We would not recommend to use sync cache in async context if the cache has eviction listener enabled with the immediate notification mode. It may cause deadlocks if the eviction listener has .await points.

@taiki-e
Copy link

taiki-e commented Jun 21, 2023

The crossbeam-epoch maintainer thinks that Miri does not support weak memory model very well yet.

To be clear: As for consume memory ordering, I don't think it's a problem with Miri's weak memory model emulation. There is no API in the Rust standard library to express it, so the data race detector for Rust just doesn't support it.

@tatsuya6502
Copy link
Member

Ah, I misunderstood.

There is no API in the Rust standard library to express it, so the data race detector for Rust just doesn't support it.

Now I have better understanding. Thanks for clarifying!

@tatsuya6502
Copy link
Member

Note: I started looking into Moka because recently we tried using it in MaterializeInc/materialize#19614, but had to revert it because we were observing memory corruption.

It seems an older version of Moka (v0.9.x) was used in Materialize.
...

Do your segfault issues remain in the latest version of Moka (v0.11.2)? It will not be related to the issues but we fixed one incorrect mut usage in unsafe code in Moka v0.11.0.

I continued investigating the original memory corruption issue in Materialize with Moka v0.9.6, and I published v0.9.8 to crates.io with a potential fix.

  1. I was able to reproduce the segfault by running an integration test of Materialize on a 48-core machine (Amazon EC2).
  2. I analyzed the generated core dump.
  3. I found the root cause and it may have fixed for v0.11.0 by Fix the caches mutating a deque node through a NonNull pointer derived from a shared reference #259.
  4. I backported the fix to v0.9.8 and v0.10.3, and published them to creates.io.
  5. I ran the same integration test using v0.9.8, and it did not reproduce the issue.

For more details, see #281 (comment).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants