Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bug: variant-length hash key uses value encoding which does not respect Eq #9659

Closed
BugenZhao opened this issue May 8, 2023 · 2 comments · Fixed by #9671
Closed

bug: variant-length hash key uses value encoding which does not respect Eq #9659

BugenZhao opened this issue May 8, 2023 · 2 comments · Fixed by #9671
Labels
component/common Common components, such as array, data chunk, expression. type/bug Something isn't working
Milestone

Comments

@BugenZhao
Copy link
Member

BugenZhao commented May 8, 2023

Theoretically, we should use serialize and deserialize from trait HashKeySerde for the hash keys in the in-memory hash map. But for some reason, we're instead totally ignoring this and using value encoding if there's any column with a variant length (like VARCHAR) in the key.

This is incorrect since value encoding does not respect Eq defined on the datum (SQL types). For example, Decimal(1.0) vs Decimal(1.00) or Interval(24 hour) vs Interval(1 day) are considered equal under the GROUP BY context, while their value encoding values differ and represent their real value.

/// Trait for value types that can be serialized to or deserialized from hash keys.
///
/// Note that this trait is more like a marker suggesting that types that implement it can be
/// encoded into the hash key. The actual encoding/decoding method is not limited to
/// [`HashKeySerDe`]'s fixed-size implementation.
pub trait HashKeySerDe<'a>: ScalarRef<'a> {
type S: AsRef<[u8]>;
fn serialize(self) -> Self::S;
fn deserialize<R: Read>(source: &mut R) -> Self;
fn read_fixed_size_bytes<R: Read, const N: usize>(source: &mut R) -> [u8; N] {
let mut buffer: [u8; N] = [0u8; N];
source
.read_exact(&mut buffer)
.expect("Failed to read fixed size serialized key!");
buffer
}
}

To simply reproduce this bug...

dev=> create table t (v1 varchar, v2 interval);
dev=> insert into t values ('hello', interval '1 d');
dev=> insert into t values ('hello', interval '24 h');

dev=> select v2 from t group by v2;
  v2   
-------
 1 day
(1 row)

dev=> select v2 from t group by v1, v2;
    v2    
----------
 1 day
 24:00:00 <- Wrong
(2 rows) 

dev=> create materialized view mv as select v2 from t group by v1, v2;
ERROR:  QueryError: internal error: transport error 
thread 'risingwave-streaming-actor' panicked at 'mem-table operation inconsistent! table_id: 1003, vnode: 72, key: OwnedRow([Some(Utf8("hello")), Some(Interval(Interval { months: 0, days: 1, usecs: 0 }))]), prev: Insert(Ok([Some(Int64(1))])), new: Insert(Ok([Some(Int64(1))]))', /Users/bugenzhao/Developer/S/risingwave-opensource/src/stream/src/common/table/state_table.rs:625:17
stack backtrace:
   0: rust_begin_unwind
             at /rustc/28a29282f6dde2e4aba6e1e4cfea5c9430a00217/library/std/src/panicking.rs:577:5
   1: core::panicking::panic_fmt
             at /rustc/28a29282f6dde2e4aba6e1e4cfea5c9430a00217/library/core/src/panicking.rs:67:14
   2: risingwave_stream::common::table::state_table::StateTableInner<S,SD>::handle_mem_table_error
             at ./src/stream/src/common/table/state_table.rs:625:17
   3: risingwave_stream::common::table::state_table::StateTableInner<S,SD>::insert_inner::{{closure}}
             at ./src/stream/src/common/table/state_table.rs:650:33
   4: core::result::Result<T,E>::unwrap_or_else
             at /rustc/28a29282f6dde2e4aba6e1e4cfea5c9430a00217/library/core/src/result.rs:1467:23
   5: risingwave_stream::common::table::state_table::StateTableInner<S,SD>::insert_inner
             at ./src/stream/src/common/table/state_table.rs:648:9
   6: risingwave_stream::common::table::state_table::StateTableInner<S,SD>::insert
             at ./src/stream/src/common/table/state_table.rs:672:9
   7: risingwave_stream::common::table::state_table::StateTableInner<S,SD>::write_record
             at ./src/stream/src/common/table/state_table.rs:705:43
   8: risingwave_stream::executor::hash_agg::HashAggExecutor<K,S>::flush_data::{{closure}}
             at ./src/stream/src/executor/hash_agg.rs:471:21
@BugenZhao BugenZhao added type/bug Something isn't working component/common Common components, such as array, data chunk, expression. labels May 8, 2023
@github-actions github-actions bot added this to the release-0.20 milestone May 8, 2023
@BugenZhao
Copy link
Member Author

I'll introduce a big refactor for this old module. 😇

@BugenZhao
Copy link
Member Author

I'm afraid that requires key encoding aka. memcomparable. (I had wrote this in #5526)

Key encoding ensures the equality checking (a = b) is constant with encoded data (serialize(a) = serialize(a)). For example, if you run a Hash Join on A.d = B.d where d is a DECIMAL column, you have to ensure 1.0 can be joined with 1.00, while 1.0 and 1.00 have identical key encoding but different value encoding.

Originally posted by @fuyufjh in #5596 (comment)

Looks like we've already realized this long ago. Not sure why the original PR was merged without fixing this. 🤣

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
component/common Common components, such as array, data chunk, expression. type/bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant