
feat: use opendal as the s3 sdk by default #18011

Merged: 5 commits merged into main on Aug 22, 2024
Conversation

@hzxa21 (Collaborator) commented Aug 13, 2024

I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.

What's changed and what's your intention?

Non-S3 object store backends have been using opendal for a while. This PR switches the S3 SDK to opendal by default as well.

related: #14321

There is a behavior change for S3 retry. (The PR description compared the retry behavior before and after this PR in screenshots that are omitted here.)

The reason for this behavior change is that we previously did not disable the internal retry in aws-sdk-s3, which unexpectedly added more retries than specified in the RW config. Given that all object stores other than S3 already honor the retry attempts specified in the [storage.object_store.retry] config section, I think it is okay to correct this unexpected behavior for S3 as well when switching to opendal, and let it fully honor our config.
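
For context, here is a minimal, hypothetical sketch of how opendal's RetryLayer makes the retry budget explicit and config-driven. This is not RisingWave's actual code: the bucket/region values, constant name, and retry numbers are illustrative, and it assumes a recent opendal release where builder setters consume self.

```rust
use std::time::Duration;

use opendal::layers::RetryLayer;
use opendal::services::S3;
use opendal::Operator;

// Hypothetical stand-in for the value read from [storage.object_store.retry].
const MAX_RETRY_ATTEMPTS: usize = 3;

fn build_s3_operator() -> opendal::Result<Operator> {
    let builder = S3::default()
        .bucket("my-bucket") // illustrative bucket name
        .region("us-east-1"); // illustrative region

    let op = Operator::new(builder)?
        .layer(
            // All retries come from this one layer, so the attempt count in
            // the config is honored exactly; with aws-sdk-s3, the SDK's
            // internal retries used to stack on top of the app-level ones.
            RetryLayer::new()
                .with_max_times(MAX_RETRY_ATTEMPTS)
                .with_min_delay(Duration::from_millis(100))
                .with_max_delay(Duration::from_secs(10))
                .with_jitter(),
        )
        .finish();
    Ok(op)
}
```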

Checklist

  • I have written necessary rustdoc comments
  • I have added necessary unit tests and integration tests
  • I have added test labels as necessary. See details.
  • I have added fuzzing tests or opened an issue to track them. (Optional, recommended for new SQL features. See Sqlsmith: SQL feature generation #7934.)
  • My PR contains breaking changes. (If it deprecates some features, please create a tracking issue to remove them in the future).
  • All checks passed in ./risedev check (or alias, ./risedev c)
  • My PR changes performance-critical code. (Please run macro/micro-benchmarks and show the results.)
  • My PR contains critical fixes that are necessary to be merged into the latest release. (Please check out the details)

Documentation

  • My PR needs documentation updates. (Please use the Release note section below to summarize the impact on users)

Release note

The SDK that connects to the S3 object store state backend is switched from aws-sdk-s3 to opendal. This is an internal third-party dependency change, intended to unify the SDK RisingWave uses to connect to different object store backends and to reduce the maintenance burden. This change should be seamless, but if you experience instability or unexpected errors when running RisingWave on S3, please file an issue in our GitHub repo.

We will keep aws-sdk-s3 as a fallback option until we fully deprecate it in 1-2 releases. You can use the following config to switch back to aws-sdk-s3:

[storage.object_store.s3.developer]
use_opendal = false

@hzxa21 requested review from Li0k and wcy-fdu on August 13, 2024 02:47
@hzxa21 (Collaborator, Author) commented Aug 13, 2024

The longevity test passed with similar resource usage and source throughput, and with no compaction lag, using the same nightly image before and after switching to opendal. (Dashboards omitted here.)

@hzxa21 (Collaborator, Author) commented Aug 13, 2024

Nexmark perf results:

| Nexmark Name | Category | A | B | percentage_change |
|---|---|---:|---:|---:|
| nexmark-q3-no-condition-blackhole | nightly-20240801-A | 737488.98 | 780340.81 | -5.49 |
| nexmark-q5-rewrite-blackhole | nightly-20240801-A | 178604.56 | 185153.18 | -3.54 |
| nexmark-q105-blackhole | nightly-20240801-A | 248458.21 | 255995.12 | -2.94 |
| nexmark-q104-blackhole | nightly-20240801-A | 385924.43 | 397419.77 | -2.89 |
| nexmark-q103-blackhole | nightly-20240801-A | 538433.35 | 552686.84 | -2.58 |
| nexmark-q6-group-top1-blackhole-watermark | nightly-20240801-A | 218919.25 | 223105.68 | -1.88 |
| nexmark-q17-blackhole | nightly-20240801-A | 557997.57 | 565969.57 | -1.41 |
| nexmark-q13-blackhole | nightly-20240801-A | 902050.96 | 912347.09 | -1.13 |
| nexmark-q0-blackhole | nightly-20240801-A | 961466.26 | 971879.04 | -1.07 |
| nexmark-q5-many-windows-blackhole-watermark | nightly-20240801-A | 18659.28 | 18801.35 | -0.76 |
| nexmark-q18-blackhole-watermark | nightly-20240801-A | 274057.19 | 274919.90 | -0.31 |
| nexmark-q7-rewrite-blackhole | nightly-20240801-A | 944367.48 | 943635.39 | 0.08 |
| nexmark-q19-blackhole | nightly-20240801-A | 215850.20 | 215533.11 | 0.15 |
| nexmark-q18-blackhole | nightly-20240801-A | 302847.91 | 301685.38 | 0.39 |
| nexmark-q101-blackhole | nightly-20240801-A | 372370.46 | 369920.53 | 0.66 |
| nexmark-q12-blackhole | nightly-20240801-A | 860672.42 | 851876.69 | 1.03 |
| nexmark-q8-blackhole | nightly-20240801-A | 650701.28 | 642525.48 | 1.27 |
| nexmark-q16-blackhole | nightly-20240801-A | 108489.16 | 106618.43 | 1.75 |
| nexmark-q9-blackhole-watermark | nightly-20240801-A | 221857.52 | 216993.08 | 2.24 |
| nexmark-q20-blackhole | nightly-20240801-A | 394281.68 | 385083.45 | 2.39 |
| nexmark-q102-blackhole | nightly-20240801-A | 233524.33 | 227996.14 | 2.42 |
| nexmark-q5-blackhole-watermark | nightly-20240801-A | 359371.41 | 348187.12 | 3.21 |
| nexmark-q8-blackhole-watermark | nightly-20240801-A | 628568.36 | 605702.11 | 3.78 |
| nexmark-q5-blackhole | nightly-20240801-A | 141079.84 | 135382.16 | 4.21 |
| nexmark-q15-blackhole | nightly-20240801-A | 481344.28 | 460756.91 | 4.47 |
| nexmark-q7-blackhole | nightly-20240801-A | 671364.28 | 642561.12 | 4.48 |
| nexmark-q7-blackhole-watermark | nightly-20240801-A | 630618.80 | 599397.81 | 5.21 |
| nexmark-q9-blackhole | nightly-20240801-A | 232522.26 | 214654.96 | 8.32 |
| nexmark-q4-blackhole | nightly-20240801-A | 307474.27 | 266900.33 | 15.20 |
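
A note on reading the table (my inference from the numbers themselves; the PR does not state the formula): percentage_change is consistent with

$$\text{percentage\_change} = \frac{A - B}{B} \times 100,$$

e.g. for nexmark-q4-blackhole, $(307474.27 - 266900.33) / 266900.33 \times 100 \approx 15.20$.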

@wcy-fdu (Contributor) left a comment


Thanks for the effort in testing!

BTW, should we make your testing workload part of standard release testing and run these tests every time we bump the OpenDAL version or release RisingWave? Can we configure automatic jobs to enable one-click testing?

@hzxa21 (Collaborator, Author) commented Aug 13, 2024

BTW, should we make your testing workload part of standard release testing and run these tests every time we bump the OpenDAL version or release RisingWave?

All the testing I have done is from the regular testing pipeline:

  • The 10K longevity test runs daily and the 300K longevity test runs weekly.
  • The perf test is the standard nexmark perf test and runs daily.

Can we configure automatic jobs to enable one-click testing?

Can we trigger perf/longevity tests from a PR? (not an urgent request) cc @huangjw806

@Li0k (Contributor) left a comment


LGTM and thanks for your great contribution.

I have some questions about this PR and would like short descriptions so that we can easily trace the PR in the future!

  1. When we officially switch to opendal, some behaviors may change, such as whether opendal retries streaming requests, e.g. streaming_read / streaming_upload. (I'm not entirely sure about this detail; if there is a change, please add a short description.)
  2. After switching to opendal, the number of retries derived from the configuration may change from before.
  • For example, when we use aws-sdk, by default we use the 2 retries provided by the SDK, which is tricky, but after switching to opendal, the number of retries will be reduced.
  3. opendal does not support retry_unknown_service_error. We have to find out whether opendal's retry implementation covers all the error types/error codes we need.

@Li0k (Contributor) commented Aug 13, 2024

Test error: it seems that we are printing a lot of duplicate logs. (Log screenshot omitted.)

@graphite-app bot requested a review from a team on August 21, 2024 16:52
@hzxa21 added the user-facing-changes label on Aug 21, 2024
@hzxa21 force-pushed the patrick/s3-default-opendal branch from 1bb9fa3 to 49be510 on August 21, 2024 17:15
@hzxa21 (Collaborator, Author) commented Aug 22, 2024

  1. When we officially switch to opendal, some behaviors may change, such as whether opendal retries streaming requests, e.g. streaming_read / streaming_upload. (I'm not entirely sure about this detail; if there is a change, please add a short description.)

Retries for both streaming_read and streaming_upload are implemented for the opendal object store, so there is no change in this part; a hedged sketch of these streaming paths follows below.
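
For illustration only, under the same assumptions as the earlier RetryLayer sketch (not RisingWave's actual code paths; `op` is an opendal Operator and "dir/obj" is a hypothetical key):

```rust
use opendal::Operator;

async fn streaming_roundtrip(op: &Operator) -> opendal::Result<()> {
    // Streaming upload: write chunk by chunk, then close to commit the upload.
    let mut w = op.writer("dir/obj").await?;
    w.write(vec![0u8; 4096]).await?;
    w.write(vec![1u8; 4096]).await?;
    w.close().await?;

    // Streaming read: fetch only a byte range instead of the whole object.
    let r = op.reader("dir/obj").await?;
    let _chunk = r.read(0..4096).await?;
    Ok(())
}
```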

  2. After switching to opendal, the number of retries derived from the configuration may change from before.
  • For example, when we use aws-sdk, by default we use the 2 retries provided by the SDK, which is tricky, but after switching to opendal, the number of retries will be reduced.

The retry attempts for S3 do change after this PR. I documented the behavior change in the PR description.

  3. opendal does not support retry_unknown_service_error. We have to find out whether opendal's retry implementation covers all the error types/error codes we need.

OpenDAL doesn't expose the exact error codes, so there is no way for us to check error codes in our code. However, OpenDAL uses ErrorKind::Unexpected to represent errors it cannot recognize, so we can use that to implement retry_unknown_service_error. I have pushed a commit to support that: 49be510
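
As a sketch of the idea only (not the actual code in 49be510; the helper name and flag are hypothetical):

```rust
use opendal::ErrorKind;

// Hypothetical helper: decide whether an opendal error should be retried,
// with ErrorKind::Unexpected standing in for "unknown service error".
fn should_retry(err: &opendal::Error, retry_unknown_service_error: bool) -> bool {
    match err.kind() {
        // Throttling is always worth retrying.
        ErrorKind::RateLimited => true,
        // opendal folds unrecognized service errors into Unexpected, so this
        // arm plays the role of the old retry_unknown_service_error option.
        ErrorKind::Unexpected => retry_unknown_service_error,
        _ => false,
    }
}
```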

These are all valid points. Thanks for the comments. @Li0k PTAL.

@Li0k (Contributor) left a comment


Rest LGTM. Thanks for all the testing and the contribution.

@@ -567,6 +567,7 @@ impl<OS: ObjectStore> MonitoredObjectStore<OS> {
pub async fn upload(&self, path: &str, obj: Bytes) -> ObjectResult<()> {
let operation_type = OperationType::Upload;
let operation_type_str = operation_type.as_str();
let media_type = self.media_type();
A Contributor commented on this diff:

nits: How about using the same variable name? media_type and engine_type are too similar.

@hzxa21 added this pull request to the merge queue on Aug 22, 2024
Merged via the queue into main with commit d271697 on Aug 22, 2024
35 of 36 checks passed
@hzxa21 deleted the patrick/s3-default-opendal branch on August 22, 2024 15:22
github-merge-queue bot pushed a commit that referenced this pull request Aug 26, 2024