feat(sink): support async for mongodb dynamodb #17645

Merged
merged 12 commits into from
Sep 27, 2024

Conversation

Contributor

@xxhZs xxhZs commented Jul 10, 2024

I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.

What's changed and what's your intention?

As the title says. This also fixes #17572.

MongoDB benchmark
before async

avg: 93830 rows/s
p90: 98304 rows/s
p95: 98304 rows/s
p99: 100352 rows/s


after async

avg: 146075 rows/s
p90: 163840 rows/s
p95: 167936 rows/s
p99: 169984 rows/s


Checklist

  • I have written necessary rustdoc comments
  • I have added necessary unit tests and integration tests
  • I have added test labels as necessary. See details.
  • I have added fuzzing tests or opened an issue to track them. (Optional, recommended for new SQL features Sqlsmith: Sql feature generation #7934).
  • My PR contains breaking changes. (If it deprecates some features, please create a tracking issue to remove them in the future).
  • All checks passed in ./risedev check (or alias, ./risedev c)
  • My PR changes performance-critical code. (Please run macro/micro-benchmarks and show the results.)
  • My PR contains critical fixes that are necessary to be merged into the latest release. (Please check out the details)

Documentation

  • My PR needs documentation updates. (Please use the Release note section below to summarize the impact on users)

Release note

We remove MongoDB's option bulk_write_max_entries and DynamoDB's option default_max_batch_rows.
We add DynamoDB's options max_batch_item_nums and max_future_send_nums.
max_batch_item_nums is the maximum number of items in one batch_write_item request. It should be >1 and <=25, and the default value is 25.
https://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_BatchWriteItem.html
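
As a minimal sketch of what the max_batch_item_nums limit implies, the snippet below checks the option against the range stated above and splits a chunk of rows into groups that each fit into one batch write request. The function names, the constant, and the String row type are hypothetical and are not taken from this PR; only the standard library is used.

```rust
/// Upper bound on items per request, taken from the range stated above.
const DYNAMODB_BATCH_LIMIT: usize = 25;

/// Check the option against the range given in the release note (>1 and <=25).
fn validate_max_batch_item_nums(n: usize) -> Result<usize, String> {
    if n > 1 && n <= DYNAMODB_BATCH_LIMIT {
        Ok(n)
    } else {
        Err(format!("max_batch_item_nums must be >1 and <=25, got {n}"))
    }
}

/// Split rows into groups that each fit into a single batch write request.
/// The caller is expected to pass a validated, non-zero batch size.
fn split_into_batches(rows: &[String], max_batch_item_nums: usize) -> Vec<Vec<String>> {
    rows.chunks(max_batch_item_nums)
        .map(|chunk| chunk.to_vec())
        .collect()
}

fn main() {
    let rows: Vec<String> = (0..60).map(|i| format!("row-{i}")).collect();
    let batch_size = validate_max_batch_item_nums(25).expect("valid option");
    let batches = split_into_batches(&rows, batch_size);
    println!("{} rows -> {} requests", rows.len(), batches.len());
}
```

Each group would then back one write future, so the number of groups per chunk is what max_future_send_nums has to be weighed against.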
max_future_send_nums is the number of write futures that may exist at the same time. It is related to the max parallelism units the user has set in DynamoDB: the value is theoretically equal to max_parallelism_units / (stream_chunk_size / max_batch_item_nums). The default value of max parallelism units is 40000, so the default for this option should be <360; we set the default to 256.

We also need to prompt the user to select appropriate max parallelism units for DynamoDB. When the throughput of RisingWave writes exceeds the max parallelism units set in DynamoDB, an error will be reported.

After this PR, sink_decouple is the default for MongoDB, DynamoDB, and Redis.

@xxhZs xxhZs requested a review from wenym1 July 10, 2024 08:52
@xxhZs xxhZs added the user-facing-changes Contains changes that are visible to users label Jul 10, 2024
@xxhZs xxhZs requested a review from wenym1 September 18, 2024 03:30
Contributor

@wenym1 wenym1 left a comment


Rest LGTM

} else {
CommandBuilder::Upsert(HashMap::new())
};
// let command_builder = if is_append_only {

Remove the dead code.

}
Ok(())
self.payload_writer.flush_insert(&mut insert_builder)

flush_insert can be changed to take ownership of insert_builder. The same goes for flush_upsert.

upsert_builder: &mut HashMap<MongodbNamespace, UpsertCommandBuilder>,
) -> Result<()> {
) -> Result<Vec<impl futures::Future<Output = std::result::Result<(), SinkError>>>> {

We can return try_join_all(...).boxed() here, which avoids allocating an unnecessary intermediate vec.

We can collect the futures by turning the builder into an iterator of futures, with code like the following:

try_join_all(upsert_builder.into_iter().flat_map(|(ns, builder)| {
    let (upsert, delete) = builder.build();
    let db = self.client.database(&ns.0);
    upsert
        .map(|upsert| Self::send_bulk_write_command(db.clone(), upsert))
        .into_iter()
        .chain(delete.map(|delete| Self::send_bulk_write_command(db, delete)))
}))

The same applies to flush_insert.
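
To make the iterator shape in this suggestion easier to follow, here is a std-only sketch under simplifying assumptions: plain strings stand in for the futures, and the Builder struct and send function are hypothetical stand-ins for UpsertCommandBuilder and send_bulk_write_command. It only shows how consuming the map and chaining the two Option values yields one item per command without an intermediate vec; it is not the code from this PR.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for `UpsertCommandBuilder`: it may yield an upsert
/// command, a delete command, both, or neither.
struct Builder {
    upsert: Option<String>,
    delete: Option<String>,
}

impl Builder {
    fn build(self) -> (Option<String>, Option<String>) {
        (self.upsert, self.delete)
    }
}

/// Stand-in for `send_bulk_write_command`; the real one would return a future.
fn send(db: String, command: String) -> String {
    format!("{db}: {command}")
}

/// Consumes the map and lazily yields one "send" per command that is present.
fn flush_upsert(builders: HashMap<String, Builder>) -> impl Iterator<Item = String> {
    builders.into_iter().flat_map(|(db, builder)| {
        let (upsert, delete) = builder.build();
        upsert
            .map(|u| send(db.clone(), u))
            .into_iter()
            // `Option` is `IntoIterator`, so it can be chained directly.
            .chain(delete.map(|d| send(db, d)))
    })
}

fn main() {
    let mut builders = HashMap::new();
    builders.insert(
        "db_a".to_string(),
        Builder { upsert: Some("upsert".into()), delete: Some("delete".into()) },
    );
    builders.insert(
        "db_b".to_string(),
        Builder { upsert: None, delete: Some("delete".into()) },
    );
    // HashMap iteration order is unspecified, so sort before printing.
    let mut sent: Vec<String> = flush_upsert(builders).collect();
    sent.sort();
    for line in &sent {
        println!("{line}");
    }
}
```

In the suggested version the iterator items would be futures, and the iterator would be handed to try_join_all instead of being collected.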

async fn send_bulk_write_command(&self, database: &str, command: Document) -> Result<()> {
let db = self.client.database(database);

async fn send_bulk_write_command(db: Database, command: Document) -> Result<()> {

I just realized that both flush_upsert and flush_insert return a vector of the same future generated by this method.

If so, we can give flush_upsert and flush_insert the same return type, and then we don't need to add a boxed to the try_join_all future.

This method can be defined this way:

type SendBulkWriteCommandFuture = impl Future<Output = Result<()>> + 'static;

fn send_bulk_write_command(db: Database, command: Document) -> SendBulkWriteCommandFuture {
    async move {
        ...
    }
}

Then both flush_upsert and flush_insert can be changed to return TryJoinAll<SendBulkWriteCommandFuture>, and in write_chunk we don't need an extra boxed to unify the type.

@xxhZs xxhZs force-pushed the xxh/add-async-for-mongodb branch from b8af1c8 to e8333dd Compare September 26, 2024 10:51
Contributor

@wenym1 wenym1 left a comment


Rest LGTM. Thanks for the PR.

@xxhZs xxhZs enabled auto-merge September 27, 2024 11:04
@xxhZs xxhZs added this pull request to the merge queue Sep 27, 2024
Merged via the queue into main with commit d877481 Sep 27, 2024
30 of 31 checks passed
@xxhZs xxhZs deleted the xxh/add-async-for-mongodb branch September 27, 2024 11:22

Successfully merging this pull request may close these issues.

Bug: Dynamodb sink error
2 participants