
Robust handling and management of S3 streams for MSQ shuffle storage #13741

Merged
4 commits merged into apache:master on Feb 7, 2023

Conversation

rohangarg (Member):

Many MSQ jobs start failing with Premature EOF / SDKException errors when the faultTolerance and durableShuffleStorage context parameters are enabled.
After adding some logging, it was found that these exceptions are preceded by ConnectionReset exceptions, which can happen on the S3 server side when an S3 request is kept open for a very long time. Trying out a change where the file is first downloaded from S3 to local disk and then read fixed the issue for these jobs - but the caveat is that multiple threads downloading files concurrently might put pressure on the disks.
So, the current change downloads the file chunk by chunk locally at runtime and then stitches the chunks together under a SequenceInputStream for the upstream MSQ processing engine. Each chunk is currently 100MB in size and is downloaded eagerly to a local file. When a chunk is consumed, its local file is deleted and the next chunk in the queue is processed.
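
As a rough illustration of the chunked approach described above (a hypothetical, simplified sketch, not the PR's actual S3StorageConnector code; the helper and ChunkFetcher names are assumptions, and the real connector downloads the next chunk eagerly while this version fetches lazily): each chunk is downloaded to a local temp file, wrapped in a stream that deletes the file once consumed, and the per-chunk streams are chained with a SequenceInputStream.

    import java.io.*;
    import java.nio.file.*;
    import java.util.*;

    class ChunkedObjectReader
    {
      // 100MB chunk size, matching the description above.
      static final long CHUNK_SIZE = 100L * 1024 * 1024;

      // Hypothetical stand-in for the ranged S3 GET the connector performs.
      interface ChunkFetcher
      {
        void fetchRange(long start, long endInclusive, Path target) throws IOException;
      }

      static InputStream openChunked(long totalSize, ChunkFetcher fetcher)
      {
        List<Long> offsets = new ArrayList<>();
        for (long off = 0; off < totalSize; off += CHUNK_SIZE) {
          offsets.add(off);
        }
        Iterator<Long> it = offsets.iterator();

        Enumeration<InputStream> chunkStreams = new Enumeration<InputStream>()
        {
          @Override
          public boolean hasMoreElements()
          {
            return it.hasNext();
          }

          @Override
          public InputStream nextElement()
          {
            long start = it.next();
            long end = Math.min(start + CHUNK_SIZE, totalSize) - 1;
            try {
              // Download the chunk to a local temp file.
              Path tmp = Files.createTempFile("chunk-", ".bin");
              fetcher.fetchRange(start, end, tmp);
              return new FilterInputStream(Files.newInputStream(tmp))
              {
                @Override
                public void close() throws IOException
                {
                  // SequenceInputStream closes each chunk stream once it is exhausted,
                  // at which point the local file is deleted to free disk space.
                  super.close();
                  Files.deleteIfExists(tmp);
                }
              };
            }
            catch (IOException e) {
              throw new UncheckedIOException(e);
            }
          }
        };

        // Stitch the per-chunk streams into one logical stream for the consumer.
        return new SequenceInputStream(chunkStreams);
      }
    }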

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever it would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@imply-cheddar (Contributor) left a comment:

A few suggestions for fixes. It's probably "okay" to merge as-is, but I think we should be a bit more allergic to retries (used willy-nilly, they often cause more pain than they resolve).

I'm going to approve because the current code and usage of retries isn't really a regression from the old code; rather, all of the suggestions are aimed at improving the general state of the code overall.


public S3StorageConnector(S3OutputConfig config, ServerSideEncryptingAmazonS3 serverSideEncryptingAmazonS3)
{
  this.config = config;
  this.s3Client = serverSideEncryptingAmazonS3;
  if (config.getTempDir() != null) {
Contributor:

If getTempDir is null, this code is still gonna fail, but a lot later on. You can test and validate here instead.

rohangarg (Member Author):

Thanks - added more validation for the config vars
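
For illustration, a minimal sketch of the kind of fail-fast validation being suggested (the helper name and exact checks are assumptions, not the PR's actual code; it assumes getTempDir() returns a java.io.File):

    import java.io.File;
    import java.util.Objects;

    // Hypothetical fail-fast check for the configured temp directory.
    final class TempDirValidator
    {
      static File validate(File tempDir)
      {
        Objects.requireNonNull(tempDir, "tempDir must be configured for durable shuffle storage");
        if (!tempDir.exists() && !tempDir.mkdirs()) {
          throw new IllegalStateException("Could not create tempDir [" + tempDir.getAbsolutePath() + "]");
        }
        return tempDir;
      }

      private TempDirValidator() {}
    }

Calling something like this once from the constructor surfaces a missing tempDir at startup instead of much later during a read.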

config.getMaxRetry()
);
FileUtils.copyLarge(
() -> new RetryingInputStream<>(
Contributor:

Do you really need to do this inside of a retrying input stream? We've had a rash of issues where code was trying to retry a thing, and then a layer above it was trying to retry a thing, and then a layer above that, leading to multiplicative growth of retries and really long delays in processing. Given that the whole fault-tolerance stuff exists and will retry failed tasks anyway, perhaps this isn't the right layer to add yet another set of retries? That, or limit it to only 2 retries at a max if we really do want to have a retry at this layer.

rohangarg (Member Author):

I think we can have the retry since all other parts of the code also do the same (probably derived from production systems previously). We could reconsider our retrying in the future, but that can be done separately.
Also, agreed on using a fixed small number of retries since the SDK also does some retries. Changed the maximum number of tries to 3, which means 2 retries.
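
Not Druid's actual RetryingInputStream API, but a minimal generic sketch of the bounded retry agreed on above: a hard cap of 3 tries (2 retries) at this single layer, with no nested retry wrappers.

    import java.io.IOException;
    import java.io.InputStream;

    // Illustrative bounded retry: at most MAX_TRIES attempts in total.
    final class BoundedRetry
    {
      private static final int MAX_TRIES = 3; // 3 tries == 2 retries, mirroring the discussion above

      interface IOSupplier<T>
      {
        T get() throws IOException;
      }

      static InputStream openWithRetry(IOSupplier<InputStream> opener) throws IOException
      {
        IOException last = null;
        for (int attempt = 1; attempt <= MAX_TRIES; attempt++) {
          try {
            return opener.get();
          }
          catch (IOException e) {
            last = e;
            // A real implementation would back off between attempts; omitted to keep the sketch short.
          }
        }
        throw last;
      }

      private BoundedRetry() {}
    }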

);
}
catch (IOException e) {
throw new UncheckedIOException(e);
Contributor:

What's an UncheckedIOException other than just a RuntimeException?

rohangarg (Member Author):

Yes, they are practically the same - I used UncheckedIOException since it seemed natural. Changed it to a custom RE with more info.
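
For illustration only (the PR's actual exception type and message will differ), the practical difference is that a custom RuntimeException can carry context about what failed, while UncheckedIOException only wraps the cause:

    // Hypothetical exception type carrying extra context about the failed S3 read.
    class S3ReadFailedException extends RuntimeException
    {
      S3ReadFailedException(String bucket, String key, Throwable cause)
      {
        super("Failed to read object s3://" + bucket + "/" + key, cause);
      }
    }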

Comment on lines +193 to +196
// close should be idempotent
if (isClosed.get()) {
  return;
}
Contributor:

I'm very much scared of places where the contract is that close can be called multiple times. It's indicative of a lack of understanding of when the lifecycle of an object is truly over, and it often highlights other problems.

rohangarg (Member Author):

I agree that if close is being called multiple times, it could indicate a loose contract in the execution layer. Will check that independently - trying to push this change through since it unblocks the usage of the durable storage feature in MSQ.
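
For reference, a common pattern for making close() idempotent with an atomic guard (a generic sketch, not the PR's exact code):

    import java.io.Closeable;
    import java.io.IOException;
    import java.util.concurrent.atomic.AtomicBoolean;

    // compareAndSet ensures the cleanup runs exactly once, even if close() is
    // called multiple times or from multiple threads.
    abstract class IdempotentCloseable implements Closeable
    {
      private final AtomicBoolean isClosed = new AtomicBoolean(false);

      @Override
      public final void close() throws IOException
      {
        if (isClosed.compareAndSet(false, true)) {
          doClose();
        }
      }

      protected abstract void doClose() throws IOException;
    }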

Comment on lines 105 to 117
public void pathRead() throws IOException
{
  EasyMock.reset(S3_CLIENT);
  ObjectMetadata objectMetadata = new ObjectMetadata();
  long contentLength = "test".getBytes(StandardCharsets.UTF_8).length;
  objectMetadata.setContentLength(contentLength);
  S3Object s3Object = new S3Object();
  s3Object.setObjectContent(new ByteArrayInputStream("test".getBytes(StandardCharsets.UTF_8)));
  EasyMock.expect(S3_CLIENT.getObject(new GetObjectRequest(BUCKET, PREFIX + "/" + TEST_FILE))).andReturn(s3Object);
  EasyMock.expect(S3_CLIENT.getObjectMetadata(EasyMock.anyObject())).andReturn(objectMetadata);
  EasyMock.expect(S3_CLIENT.getObject(
      new GetObjectRequest(BUCKET, PREFIX + "/" + TEST_FILE).withRange(0, contentLength - 1))
  ).andReturn(s3Object);
  EasyMock.replay(S3_CLIENT);
Contributor:

You don't have any testing for the retry behavior. If retries exist with the configs, please test all of the different ways that retries can happen and whether we expect multiplicativity or not.

rohangarg (Member Author):

The configs for retries are removed and they are hard-coded now. I think for S3, in the future, we can create a custom S3RetryInputStream which would be easier to test for retries and could be used in all S3 read paths. I'm not sure we can test the multiplicative retries since we currently mock the AWS client.
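
As a sketch of the kind of retry test being asked for, written against the hypothetical BoundedRetry helper shown earlier rather than the PR's actual classes: make the opener fail a fixed number of times and assert exactly how many attempts were made, which would also catch multiplicative retries.

    import static org.junit.Assert.assertEquals;

    import java.io.ByteArrayInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.nio.charset.StandardCharsets;
    import java.util.concurrent.atomic.AtomicInteger;
    import org.junit.Test;

    public class BoundedRetryTest
    {
      @Test
      public void testRetriesAreBoundedAndCounted() throws IOException
      {
        AtomicInteger attempts = new AtomicInteger();
        // Fails on the first two attempts, succeeds on the third.
        InputStream in = BoundedRetry.openWithRetry(() -> {
          if (attempts.incrementAndGet() < 3) {
            throw new IOException("simulated connection reset");
          }
          return new ByteArrayInputStream("test".getBytes(StandardCharsets.UTF_8));
        });

        assertEquals(3, attempts.get());
        assertEquals('t', in.read());
        in.close();
      }
    }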

rohangarg added the "Area - MSQ" label (for multi stage queries - https://github.com/apache/druid/issues/12262) on Feb 3, 2023
rohangarg merged commit a0f8889 into apache:master on Feb 7, 2023
clintropolis added this to the 26.0 milestone on Apr 10, 2023