feat(sink): introduce file sink in PARQUET format #17311
Conversation
Also, is the TODO specified in the PR description done?
let filters =
    chunk.visibility() & ops.iter().map(|op| *op == Op::Insert).collect::<Bitmap>();
chunk.set_visibility(filters);
Hmmm, this seems to implicitly convert a retractable stream chunk into an append-only stream chunk. Is it possible to receive a stream chunk with op != Op::Insert here? If yes, I feel that we should raise an error asking the user to set force_append_only, and only apply this filtering when that option is set. If not, we should add an assertion here instead of the filter.
Currently for the file sink, users need to explicitly set force_append_only=true.
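To make the two behaviors discussed above concrete, here is a minimal, hypothetical sketch (not the actual RisingWave code): non-insert ops are filtered out only when `force_append_only` is set, and retractable input is rejected otherwise. The `Op` enum, the visibility vector, and the function name are simplified stand-ins for the real types.

```rust
#[allow(dead_code)]
#[derive(PartialEq, Clone, Copy, Debug)]
enum Op {
    Insert,
    Delete,
    UpdateDelete,
    UpdateInsert,
}

fn filter_append_only(
    ops: &[Op],
    visibility: &mut [bool],
    force_append_only: bool,
) -> Result<(), String> {
    // Append-only input needs no filtering.
    if ops.iter().all(|op| *op == Op::Insert) {
        return Ok(());
    }
    // Retractable input without the option set is an error, not a silent drop.
    if !force_append_only {
        return Err("retractable input: please set force_append_only = true".to_string());
    }
    // Hide every row whose op is not Insert.
    for (vis, op) in visibility.iter_mut().zip(ops.iter()) {
        *vis = *vis && (*op == Op::Insert);
    }
    Ok(())
}

fn main() {
    let ops = [Op::Insert, Op::Delete, Op::UpdateInsert];
    let mut vis = vec![true, true, true];
    // Without force_append_only, retractable input is rejected.
    assert!(filter_append_only(&ops, &mut vis, false).is_err());
    // With it, non-insert rows are masked out of the visibility bitmap.
    assert!(filter_append_only(&ops, &mut vis, true).is_ok());
    assert_eq!(vis, vec![true, false, false]);
    println!("append-only filtering ok");
}
```

The error path mirrors what other append-only-capable sinks do: the user must opt in to silently dropping retractions.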
let mut schema_fields = HashMap::new();
rw_schema.fields.iter().for_each(|field| {
    let res = schema_fields.insert(&field.name, &field.data_type);
    // This assert is to make sure there is no duplicate field name in the schema.
Should we check this in validate?
I think validate is only used to verify whether a connection can be established with the downstream sink. For example, for a file sink, it verifies whether the file can be written normally.
The rw_schema here is defined by the user's create sink statement and should be verified on the frontend to some extent. I also notice that other sinks just get the schema directly from the params without doing any validation, so I'd like to keep the current implementation. Feel free to push back if you have stronger reasons.
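For context on the assert being discussed: a minimal sketch of how a duplicate-field check works, assuming simplified types (the function name and the `(name, data_type)` pairs are illustrative, not the real schema types). `HashMap::insert` returns `Some(previous_value)` when the key already exists, which is what the assert relies on.

```rust
use std::collections::HashMap;

// Returns an error instead of panicking on a duplicate field name, which is
// what a frontend/validate-time check would look like.
fn check_unique_fields(fields: &[(String, String)]) -> Result<(), String> {
    let mut seen: HashMap<&str, &str> = HashMap::new();
    for (name, data_type) in fields {
        // `insert` yields Some(old) if the key was already present.
        if seen.insert(name.as_str(), data_type.as_str()).is_some() {
            return Err(format!("duplicate field name in schema: {}", name));
        }
    }
    Ok(())
}

fn main() {
    let ok = vec![
        ("id".to_string(), "int".to_string()),
        ("v".to_string(), "varchar".to_string()),
    ];
    assert!(check_unique_fields(&ok).is_ok());

    let dup = vec![
        ("id".to_string(), "int".to_string()),
        ("id".to_string(), "bigint".to_string()),
    ];
    assert!(check_unique_fields(&dup).is_err());
    println!("duplicate field check ok");
}
```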
It seems that there are many conflicts. I suggest we resolve the conflicts first.
Not yet, I will add a CI/e2e test after the parquet source is merged.
Debug logging: I think the CN exiting with code 139 is related to the newly added code:
If any of these three sinks appears alone, there is no problem, but if two or more are present, exit code 139 appears. Maybe there is shared mutable state between multiple sinks, or maybe it's an initialization problem.
Thank you @wenym1 very much! We finally resolved the mysterious 139 (segmentation fault) issue; let me briefly summarize what we did and our suspicions. The issue mainly occurred with the newly added sinks. Let's take a look at the key changes.
As you can see, the difference essentially lies between the two versions of the code. Note that this is just a suspicion; despite it, I still don't fully understand the root cause.
Anyway, we found a solution and developed a methodology. The common causes of segmentation faults typically include invalid pointer dereferences, buffer overruns, and stack overflow.
In Rust, an exit with code 139 (segmentation fault) is often caused by allocations on the stack being too large. In such cases, wrapping potentially problematic fields (e.g. in Box) to move them to the heap can resolve the issue.
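A minimal sketch of the boxing workaround described above, with illustrative type names: a large array held inline makes the containing struct itself large, and every move copies it on the stack; boxing the field moves the data to the heap, so the struct on the stack is just a pointer.

```rust
const BUF_LEN: usize = 1024 * 1024; // 1 MiB

#[allow(dead_code)]
struct InlineState {
    buffer: [u8; BUF_LEN], // lives inline; moving the struct copies 1 MiB on the stack
}

struct BoxedState {
    buffer: Box<[u8; BUF_LEN]>, // heap-allocated; the struct only holds a pointer
}

fn main() {
    let state = BoxedState {
        buffer: Box::new([0u8; BUF_LEN]),
    };
    // The inline version is at least 1 MiB; the boxed one is pointer-sized.
    assert!(std::mem::size_of::<InlineState>() >= BUF_LEN);
    assert_eq!(std::mem::size_of::<BoxedState>(), std::mem::size_of::<usize>());
    assert_eq!(state.buffer.len(), BUF_LEN);
    println!(
        "inline: {} bytes, boxed: {} bytes",
        std::mem::size_of::<InlineState>(),
        std::mem::size_of::<BoxedState>()
    );
}
```

This is why wrapping a few large fields in Box can turn a segfaulting binary into a working one without changing any logic: the stack frames that construct and move the struct shrink from megabytes to a handful of bytes.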
Co-authored-by: William Wen <[email protected]>
Co-authored-by: congyi wang <[email protected]>
You can attach a debugger to the process and it will show the reason for any exceptions (like a segmentation fault). Typically it's a stack overflow, and by checking the backtrace we can find where it happens.
Thank you for your explanation! Can we attach a debugger in our CI? This issue cannot be reproduced locally; it only happens in CI.
I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.
What's changed and what's your intention?
close #15431
This PR introduces a basic file sink, which allows RisingWave to sink data into S3 or other file systems (currently S3 and GCS) in Parquet format.
As sink decoupling is not enabled yet, we force-write files every time a checkpoint barrier arrives; that is, between two checkpoint barriers, one file per parallelism will be written. To distinguish the written files, the current naming convention is epoch_executor_id.suffix. After this PR is merged, we will introduce a file sink batching strategy according to specific requests, that is, enable sink decoupling. In addition, more sink file types will be introduced.
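A minimal sketch of the epoch_executor_id.suffix naming convention described above (the function name is hypothetical, not the actual PR code):

```rust
// Builds the object name for one sink file: the checkpoint epoch, the id of
// the parallel executor that wrote it, and a format suffix such as "parquet".
fn sink_file_name(epoch: u64, executor_id: u64, suffix: &str) -> String {
    format!("{}_{}.{}", epoch, executor_id, suffix)
}

fn main() {
    // e.g. the file written by executor 3 at checkpoint epoch 100
    assert_eq!(sink_file_name(100, 3, "parquet"), "100_3.parquet");
    println!("{}", sink_file_name(100, 3, "parquet"));
}
```

Because the epoch comes first, files written by different parallel executors at the same checkpoint sort together, and names never collide across checkpoints.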
Checklist
- ./risedev check (or alias, ./risedev c)
Documentation
To be edited (after all comments are resolved).
Release note
If this PR includes changes that directly affect users or other significant modifications relevant to the community, kindly draft a release note to provide a concise summary of these changes. Please prioritize highlighting the impact these changes will have on users.