Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

destination-s3: don't reuse names of existing objects #45143

Conversation

stephane-airbyte
Copy link
Contributor

@stephane-airbyte stephane-airbyte commented Sep 4, 2024

Instead of counting the number of files and starting creating file based on that counter, we create files starting at 0 and avoid overriding files that were already present.

The problem was that in case of an overwrite sync, the 1st sync would create files 1, 2, 3. Sync 2 would notice there's 3 files and would create files 4, 5 , 6 and delete 1, 2, 3 at the end of the sync. Sync3 and after would see there's 3 files, would overwrite 4, 5, 6 and delete them because they were here before the sync started, leaving us with no files.

fixes #6417

Copy link

vercel bot commented Sep 4, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
airbyte-docs ✅ Ready (Inspect) Visit Preview 💬 Add feedback Sep 5, 2024 9:25pm

@octavia-squidington-iii octavia-squidington-iii added the CDK Connector Development Kit label Sep 4, 2024
Copy link
Contributor Author

This stack of pull requests is managed by Graphite. Learn more about stacking.

Join @stephane-airbyte and the rest of your teammates on Graphite Graphite

@stephane-airbyte stephane-airbyte force-pushed the stephane/09-04-destination-s3_don_t_reuse_names_of_existing_objects branch 5 times, most recently from 97d6945 to 1d1843c Compare September 5, 2024 00:33
@octavia-squidington-iii octavia-squidington-iii added area/connectors Connector related issues area/documentation Improvements or additions to documentation connectors/destination/s3 labels Sep 5, 2024
@stephane-airbyte stephane-airbyte force-pushed the stephane/09-04-destination-s3_don_t_reuse_names_of_existing_objects branch 2 times, most recently from 640fbf4 to 0e56f8d Compare September 5, 2024 00:53
@stephane-airbyte stephane-airbyte marked this pull request as ready for review September 5, 2024 01:24
@stephane-airbyte stephane-airbyte requested review from a team as code owners September 5, 2024 01:24
objectNameByPrefix.computeIfAbsent(
objectPath,
) {
var objectList: Set<String> = setOf()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can probably use this method here right ?

@stephane-airbyte stephane-airbyte force-pushed the stephane/09-04-destination-s3_don_t_reuse_names_of_existing_objects branch from 0e56f8d to f812f4f Compare September 5, 2024 18:10
Copy link
Contributor

@gisripa gisripa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@stephane-airbyte
Copy link
Contributor Author

stephane-airbyte commented Sep 5, 2024

/publish-java-cdk

🕑 https://github.com/airbytehq/airbyte/actions/runs/10727098973
✅ Successfully published Java CDK version=0.44.21!

@stephane-airbyte stephane-airbyte force-pushed the stephane/09-04-destination-s3_don_t_reuse_names_of_existing_objects branch from f812f4f to 00c93c1 Compare September 5, 2024 21:20
@stephane-airbyte stephane-airbyte merged commit 6730a3b into master Sep 5, 2024
34 checks passed
@stephane-airbyte stephane-airbyte deleted the stephane/09-04-destination-s3_don_t_reuse_names_of_existing_objects branch September 5, 2024 21:38
Copy link
Contributor

@johnny-schmidt johnny-schmidt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense:

  • when writing a new file
    • try to get a partId for the prefix
      • start at 0
      • first time for the prefix: build a map of prefix -> { all existing full names with prefix }
      • while there's a conflict: increment and try again
    • (no need to store the object you just made, because all subsequent calls should just increment any time there's a confict)

So:

Sync 1: write(0, 1, 2)
Sync 2: write(3, 4, 5, 6) delete(0, 1, 2)
Sync 3: write(0, 1, 2, 7, 8)

etc

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/connectors Connector related issues area/documentation Improvements or additions to documentation CDK Connector Development Kit connectors/destination/s3
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Source Stripe: sync freeze with 80k records
4 participants