Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dumpling will not retry upload to GCS if the server returned 503 #56127

Closed
kennytm opened this issue Sep 18, 2024 · 6 comments · Fixed by #57513
Closed

Dumpling will not retry upload to GCS if the server returned 503 #56127

kennytm opened this issue Sep 18, 2024 · 6 comments · Fixed by #57513
Assignees
Labels
affects-6.1 This bug affects the 6.1.x(LTS) versions. affects-6.5 This bug affects the 6.5.x(LTS) versions. affects-7.1 This bug affects the 7.1.x(LTS) versions. affects-7.5 This bug affects the 7.5.x(LTS) versions. affects-8.1 This bug affects the 8.1.x(LTS) versions. affects-8.3 affects-8.5 This bug affects the 8.5.x(LTS) versions. component/dumpling This is related to Dumpling of TiDB. found/gs found by gs good first issue Denotes an issue ready for a new contributor, according to the "help wanted" guidelines. report/customer Customers have encountered this bug. severity/major type/bug The issue is confirmed as a bug.

Comments

@kennytm
Copy link
Contributor

kennytm commented Sep 18, 2024

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

(I used mitmproxy to inject the 503 error. For testing there should be some zero-external-dependency means to do so 🤔 Also I used a local fake-gcs-server serving HTTP to avoid distractions of installing the self-signed TLS CA.)

  1. Get mitmproxy

  2. Prepare the following script, which will inject 503 for the first two requests to */o (the URL for uploading objects to GCS)

    # inject-503.py
    from mitmproxy import http
    
    class InjectErrorAddon:
        def __init__(self):
            self._counter = 0
    
        def request(self, flow: http.HTTPFlow) -> None:
            path = flow.request.path_components
            if path and path[-1] == 'o':
                self._counter += 1
                if self._counter <= 2:
                    flow.response = http.Response.make(503, '{}')
    
    addons = [InjectErrorAddon()]
  3. Run mitmproxy loaded with this script

    mitmproxy -s inject-503.py
  4. Patch dumpling to use this proxy:

    diff --git a/dumpling/export/config.go b/dumpling/export/config.go
    index 52337ec732..e960d13c7a 100644
    --- a/dumpling/export/config.go
    +++ b/dumpling/export/config.go
    @@ -12,6 +12,8 @@ import (
    	"strings"
    	"text/template"
    	"time"
    +	"net/http"
    +	"net/url"
    
    	"github.com/coreos/go-semver/semver"
    	"github.com/docker/go-units"
    @@ -711,8 +713,19 @@ func (conf *Config) createExternalStorage(ctx context.Context) (storage.External
    		return nil, errors.Trace(err)
    	}
    
    +	tx := http.DefaultTransport.(*http.Transport)
    +	tx.Proxy = func(req *http.Request) (*url.URL, error) {
    +		return url.Parse("http://127.0.0.1:8080")
    +	}
    +
    	// TODO: support setting httpClient with certification later
    -	return storage.New(ctx, b, &storage.ExternalStorageOptions{})
    +	return storage.New(ctx, b, &storage.ExternalStorageOptions{
    +		HTTPClient:    &http.Client{
    +			Transport: tx,
    +		},
    +	})
    }
    
    const (
  5. Run dumpling.

2. What did you expect to see? (Required)

Given that we only inject 503 twice, Dumpling should be able to successfully upload the file on its 3rd try and the whole process succeed.

Inside the mitmproxy console, we should be able to see two 503 responses like

>>17:59:19 HTTP  POST  127.0.0.1  /upload/storage/v1/b/db-transfer/o?alt=json&name=dumpling%2Faaa.tab2_a.000000000.sql&prettyPrint=false&projection=full&uploadType=multipart    503                        2b  10ms 
  17:59:19 HTTP  POST  127.0.0.1  /upload/storage/v1/b/db-transfer/o?alt=json&name=dumpling%2Faaa.tab2_b.000000000.sql&prettyPrint=false&projection=full&uploadType=multipart    503                        2b   9ms 
  17:59:19 HTTP  POST  127.0.0.1  /upload/storage/v1/b/db-transfer/o?alt=json&name=dumpling%2Faaa.tab2_a.000000000.sql&prettyPrint=false&projection=full&uploadType=multipart    200    application/json  924b  15ms 
  17:59:19 HTTP  POST  127.0.0.1  /upload/storage/v1/b/db-transfer/o?alt=json&name=dumpling%2Faaa.tab2_b.000000000.sql&prettyPrint=false&projection=full&uploadType=multipart    200    application/json  924b  13ms 
  17:59:19 HTTP  POST  127.0.0.1  /upload/storage/v1/b/db-transfer/o?alt=json&name=dumpling%2Fmetadata&prettyPrint=false&projection=full&uploadType=multipart                    200    application/json  844b  13ms 

3. What did you see instead (Required)

Dumpling failed without any retry, with logs like

[2024/09/18 17:57:19.643 +08:00] [WARN] [writer_util.go:514] ["fail to close file"] [path=gcs://db-transfer/dumpling/aaa.tab2_a.000000000.sql] [error="googleapi: got HTTP response code 503 with body: {}"] [errorVerbose="«snip»"]
[2024/09/18 17:57:19.644 +08:00] [WARN] [writer_util.go:514] ["fail to close file"] [path=gcs://db-transfer/dumpling/aaa.tab2_b.000000000.sql] [error="context canceled"] [errorVerbose="«snip»"]
...
[2024/09/18 17:57:19.644 +08:00] [ERROR] [main.go:78] ["dump failed error stack info"] [error="googleapi: got HTTP response code 503 with body: {}"] [errorVerbose="«snip»"]

4. What is your TiDB version? (Required)

Dumpling v8.3.0

@kennytm kennytm added type/bug The issue is confirmed as a bug. component/br This issue is related to BR of TiDB. component/dumpling This is related to Dumpling of TiDB. affects-8.3 labels Sep 18, 2024
@kennytm
Copy link
Contributor Author

kennytm commented Sep 18, 2024

The reason GCS did not retry is because the upload is not considered an "idempotent operation". The default retry policy is RetryIdempotent, which said:

RetryIdempotent causes only idempotent operations to be retried when the service returns a transient error. Using this policy, fully idempotent operations (such as ObjectHandle.Attrs()) will always be retried. Conditionally idempotent operations (for example ObjectHandle.Update()) will be retried only if the necessary conditions have been supplied (in the case of ObjectHandle.Update() this would mean supplying a Conditions.MetagenerationMatch condition is required).

Patching the GCS to use RetryAlways will make Dumpling perform like the expected behavior

diff --git a/br/pkg/storage/gcs.go b/br/pkg/storage/gcs.go
index 0f1d8a2418..b4893657ba 100644
--- a/br/pkg/storage/gcs.go
+++ b/br/pkg/storage/gcs.go
@@ -432,7 +432,7 @@ func (s *GCSStorage) Reset(ctx context.Context) error {
 			if err != nil {
 				return errors.Trace(err)
 			}
-			client.SetRetry(storage.WithErrorFunc(shouldRetry))
+			client.SetRetry(storage.WithErrorFunc(shouldRetry), storage.WithPolicy(storage.RetryAlways))
 			s.clients[i] = client
 			return nil
 		})

but I'm not sure if we should use this easy solution which "can lead to race conditions and other conflicts", or properly making it idempotent by supplying the ifMetagenerationMatch/ifGenerationMatch preconditions.

@lance6716 lance6716 added the good first issue Denotes an issue ready for a new contributor, according to the "help wanted" guidelines. label Sep 18, 2024
@kennytm kennytm added found/gs found by gs report/customer Customers have encountered this bug. labels Sep 20, 2024
@BornChanger
Copy link
Contributor

@Benjamin2037 will your team fix the problem?

@Benjamin2037
Copy link
Collaborator

@BornChanger OK

@OliverS929
Copy link
Contributor

I can take a look at this, but please let me know if it is urgent.

@kennytm
Copy link
Contributor Author

kennytm commented Sep 20, 2024

@OliverS929 We have asked customer tried using the XML API (s3://) to see if it can workaround the issue. It is currently 23:50 in SF, I don't have the urgency assessment right now.

@Yui-Song
Copy link
Contributor

The customer used local disks as a workaround. But as it's such a common scenario on GCP, it would also be helpful if this could be fixed.

@kennytm kennytm added affects-6.1 This bug affects the 6.1.x(LTS) versions. affects-6.5 This bug affects the 6.5.x(LTS) versions. affects-7.1 This bug affects the 7.1.x(LTS) versions. affects-7.5 This bug affects the 7.5.x(LTS) versions. affects-8.1 This bug affects the 8.1.x(LTS) versions. labels Sep 21, 2024
@ti-chi-bot ti-chi-bot bot added the may-affects-5.4 This bug maybe affects 5.4.x versions. label Sep 23, 2024
@BornChanger BornChanger removed the component/br This issue is related to BR of TiDB. label Oct 11, 2024
@ti-chi-bot ti-chi-bot bot added the affects-8.5 This bug affects the 8.5.x(LTS) versions. label Nov 1, 2024
@Benjamin2037 Benjamin2037 removed the may-affects-5.4 This bug maybe affects 5.4.x versions. label Nov 20, 2024
@ti-chi-bot ti-chi-bot bot closed this as completed in 84fe10a Nov 20, 2024
ti-chi-bot bot pushed a commit that referenced this issue Nov 21, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
affects-6.1 This bug affects the 6.1.x(LTS) versions. affects-6.5 This bug affects the 6.5.x(LTS) versions. affects-7.1 This bug affects the 7.1.x(LTS) versions. affects-7.5 This bug affects the 7.5.x(LTS) versions. affects-8.1 This bug affects the 8.1.x(LTS) versions. affects-8.3 affects-8.5 This bug affects the 8.5.x(LTS) versions. component/dumpling This is related to Dumpling of TiDB. found/gs found by gs good first issue Denotes an issue ready for a new contributor, according to the "help wanted" guidelines. report/customer Customers have encountered this bug. severity/major type/bug The issue is confirmed as a bug.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants