repro: S3 ETag Mismatch #5507
Comments
Hi @mikeaadd! I see that you are using the 2.0 pre-release. Was your pipeline originally created with 1.x dvc? If so, does …
@mikeaadd Also, I assume you have an output that is directly pointing to s3://..., right? Just a sanity check from my side. I can see that you are using an external S3 cache, so I suppose you are using an external output too.
I've tried with 1.11.16 and I get the same errors. Yes, my output is pointing to an S3 bucket. Is this not a common workflow? I am still in the evaluation phase of using dvc.
I don't really understand how I would be able to use import-url for my use case. I have a long pipeline with many stages that uses a few AWS services. Because I'm using these services, it requires an S3 input and an S3 output. It seems wasteful to download and upload the outputs of each stage just so they're local. The data isn't even so large that it couldn't fit on my local machine; it's just set up in a way where external dependencies and outputs make sense. I guess what's the recommended workflow for a pipeline that goes like this: s3 data --AWSTransform--> s3 data --AWSTransform--> s3 data --> sync to local
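A pipeline shaped like the one above can be sketched in dvc.yaml, since DVC accepts s3:// URLs as external dependencies and outputs (this is only a sketch: the bucket, prefixes, and script name are hypothetical, and external outputs additionally require an external S3 cache to be configured, as in this report):

```yaml
stages:
  transcribe:
    cmd: python kick_off_transcribe.py   # hypothetical script that starts the managed AWS job
    deps:
      - s3://my-bucket/raw-audio/        # external dependency, stays in S3
    outs:
      - s3://my-bucket/transcripts/      # external output, written by the AWS service
  sync:
    cmd: aws s3 sync s3://my-bucket/transcripts/ data/transcripts/
    deps:
      - s3://my-bucket/transcripts/
    outs:
      - data/transcripts/                # final local copy for downstream stages
```

With external deps/outs, DVC tracks the S3 objects by ETag rather than by a local checksum, which is what makes the ETag-mismatch error below possible.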
Is dvc running locally and kicking off jobs that all execute on AWS resources? How are those resources accessing the data in S3?
Correct, that is the current workflow for a project. Local data is uploaded to S3. An AWS service transforms that data to a different S3 location, and then again. Finally, it's used locally for more transformations. All access is managed with IAM policies.
What kind of AWS services? Are they EC2 instances on which you could run dvc? Do they download the data from S3 to do the transformations?
For example, AWS Transcribe. This is a speech-to-text ASR service that takes audio data from S3 and then writes to S3. Edit: To clarify, it's a managed service, so EC2 instances are not needed. You can kick off those jobs from your local machine, but all processing is managed by AWS.
Got it, thanks! Unfortunately, I don't have a fix for you right now. Not sure if @efiop has any ideas. Better support for external data, for workflows like yours, has been coming up frequently in discussions lately. It's definitely a need that I'm eager to prioritize, but we don't have a timeline for it yet.
@mikeaadd Btw, how big are …
So yes, for the PoC everything has been kept in the same bucket. train.csv and test.csv are quite small (< 1 MB).
@mikeaadd Thanks! We do a few tricks to try to preserve the ETag, and it seems like something went very wrong here; I'm not quite sure why. Is there a chance some process kept writing to that file after the command finished? That would explain it.
It says it was the …
@mikeaadd I mean that maybe a worker that was writing to that file kept writing to it even after the command exited. E.g. the local process exited, but an EC2 worker kept writing to it, so when dvc started saving the file, it changed between the initial ETag read and the actual save.
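To illustrate why a late write shows up as an ETag mismatch: for a simple PUT, S3's ETag is in practice the MD5 of the object body, and for multipart uploads it is the MD5 of the concatenated part MD5s suffixed with the part count (this behavior is well known but not contractually guaranteed by S3, and the part-size threshold depends on the uploader's configuration, not the fixed 8 MiB assumed below). A minimal sketch of that computation, with hypothetical contents:

```python
import hashlib


def s3_etag(data: bytes, part_size: int = 8 * 1024 * 1024) -> str:
    """Compute the ETag S3 would typically report for this object.

    Single-part uploads get the plain MD5 of the body; multipart
    uploads get the MD5 of the concatenated binary part MD5s,
    suffixed with "-<part count>".
    """
    if len(data) <= part_size:
        return hashlib.md5(data).hexdigest()
    part_md5s = [
        hashlib.md5(data[i:i + part_size]).digest()
        for i in range(0, len(data), part_size)
    ]
    return f"{hashlib.md5(b''.join(part_md5s)).hexdigest()}-{len(part_md5s)}"


# Any bytes that land after the first ETag read change the ETag,
# which is exactly the mismatch dvc reports while saving.
before = s3_etag(b"hello world")
after = s3_etag(b"hello world\nlate write from a lingering worker")
assert before != after
```

Because dvc reads the ETag once up front and compares it again when saving, any content change in between (e.g. a worker still flushing output) trips the mismatch.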
I think ^ is still the likely cause, as we didn't receive any similar reports, nor were we able to reproduce this. Closing for now. Please let us know if you are still running into this issue.
Bug Report
repro: S3 ETag Mismatch
Description
On `dvc repro`, I get the following error.
I expect to be able to reproduce my pipeline.
Reproduce
I am unsure how to reproduce this example without giving you my entire pipeline. Whenever I try, I get an ETag issue similar to #2701. That issue was for MinIO, but this is for AWS S3. The conclusion of that ticket was that MinIO was not supported at the time, but AWS should be, right?
Environment information
Additional Information (if any):