repro: S3 ETag Mismatch #5507

Closed
mikeaadd opened this issue Feb 22, 2021 · 16 comments
Labels
awaiting response we are waiting for your reply, please respond! :)

Comments

@mikeaadd

Bug Report

repro: S3 ETag Mismatch

Description

On dvc repro, I get the following error:

ERROR: failed to reproduce 'dvc.yaml': ETag mismatch detected when copying file to cache! (expected: '529da0df19c85897bcca4cca7412ba27', actual: '481259308337a48e8ee23ebb5cbc7df1')

I expect to be able to reproduce my pipeline.

Reproduce

dvc repro

I am unsure how to reproduce this example without giving you my entire pipeline. Whenever I try, I get an ETag issue similar to #2701. That issue was for MinIO, but this is for AWS S3. The conclusion of that ticket was that MinIO was not supported at the time, but AWS should be, right?
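For context on the mismatched hashes in the error: the value DVC compares here is the S3 ETag, which for a single-part upload is just the MD5 of the object's bytes, but for a multipart upload is the MD5 of the concatenated per-part MD5 digests plus a "-<part count>" suffix. A minimal sketch of that computation, assuming a hypothetical helper name (this is not DVC's internal code):

```python
import hashlib


def s3_style_etag(data: bytes, part_size: int = 8 * 1024 * 1024) -> str:
    """Compute the ETag S3 would typically report for this content.

    Single-part upload: plain MD5 of the bytes.
    Multipart upload: MD5 of the concatenated per-part MD5 digests,
    suffixed with "-<number of parts>".
    """
    if len(data) <= part_size:
        return hashlib.md5(data).hexdigest()
    # Hash each part, then hash the concatenation of the raw digests.
    part_md5s = [
        hashlib.md5(data[i : i + part_size]).digest()
        for i in range(0, len(data), part_size)
    ]
    return hashlib.md5(b"".join(part_md5s)).hexdigest() + f"-{len(part_md5s)}"
```

Comparing `s3_style_etag(...)` of a locally downloaded copy against the "expected" and "actual" values in the error can help tell whether the object's content actually changed between DVC's initial ETag read and the copy to cache.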

Environment information

$ dvc version
DVC version: 2.0.0a0+9ae8bc 
---------------------------------
Platform: Python 3.7.9 on Darwin-19.6.0-x86_64-i386-64bit
Supports: azure, gdrive, gs, hdfs, http, https, s3, ssh, oss
Cache types: reflink, hardlink, symlink
Cache directory: apfs on /dev/disk1s5
Caches: local, s3
Remotes: s3, s3
Workspace directory: apfs on /dev/disk1s5
Repo: dvc, git

Additional Information (if any):

$ dvc repro data --verbose
2021-02-22 15:28:05,792 DEBUG: Check for update is enabled.
2021-02-22 15:28:05,837 DEBUG: Assuming 'data' to be a stage inside 'dvc.yaml'
2021-02-22 15:28:05,874 DEBUG: Lockfile 'dvc.lock' needs to be updated.
2021-02-22 15:28:06,627 DEBUG: Dependency 'src/data/create_dataset.py' of stage: 'reviews' changed because it is 'modified'.
2021-02-22 15:28:06,627 DEBUG: stage: 'reviews' changed.
2021-02-22 15:28:06,629 DEBUG: Removing output 'data/interim/reviewer_labels.csv' of stage: 'reviews'.
2021-02-22 15:28:06,629 DEBUG: Removing 'data/interim/reviewer_labels.csv'
2021-02-22 15:28:06,636 DEBUG: {}
2021-02-22 15:28:06,637 DEBUG: {'src/data/create_dataset.py': 'modified'}
Running stage 'reviews':
> python src/data/create_dataset.py reviews data/external/reviewer_labels --destination data/interim/reviewer_labels.csv
2021-02-22 15:28:08,053 DEBUG: {}                                     
2021-02-22 15:28:08,053 DEBUG: Output 'data/interim/reviewer_labels.csv' didn't change. Skipping saving.
2021-02-22 15:28:08,054 DEBUG: Computed stage: 'reviews' md5: '84162bb59c0eb31ed869cf0ead0b04fd'
2021-02-22 15:28:08,074 DEBUG: Checking out 'data/interim/reviewer_labels.csv' with cache 'object md5: f5a2578e7fc9afce789b81635ee9d812'.
2021-02-22 15:28:08,094 DEBUG: Created 'reflink': .dvc/cache/.cache_type_test_file -> data/interim/.S6tnrhYj2VuXNhMwgjaBXH
2021-02-22 15:28:08,095 DEBUG: Removing 'data/interim/.S6tnrhYj2VuXNhMwgjaBXH'
2021-02-22 15:28:08,095 DEBUG: Removing '.dvc/cache/.cache_type_test_file'
2021-02-22 15:28:08,097 DEBUG: Removing 'data/interim/reviewer_labels.csv'
2021-02-22 15:28:08,100 DEBUG: Created 'reflink': .dvc/cache/f5/a2578e7fc9afce789b81635ee9d812 -> data/interim/reviewer_labels.csv
2021-02-22 15:28:08,105 DEBUG: stage: 'reviews' was reproduced
Updating lock file 'dvc.lock'

Stage 'sync-transcripts' didn't change, skipping                                                                                                                                                      
2021-02-22 15:28:14,051 DEBUG: Dependency 'data/interim/reviewer_labels.csv' of stage: 'data' changed because it is 'modified'.
2021-02-22 15:28:14,051 DEBUG: stage: 'data' changed.
2021-02-22 15:28:14,056 DEBUG: Removing output 's3://pi-global-ext-stg-feeds/data/stg_ext_cct_tm/doj/comprehend-multiclass-baseline/train.csv' of stage: 'data'.
2021-02-22 15:28:14,056 DEBUG: Removing s3://pi-global-ext-stg-feeds/data/stg_ext_cct_tm/doj/comprehend-multiclass-baseline/train.csv
2021-02-22 15:28:14,694 DEBUG: Removing output 's3://pi-global-ext-stg-feeds/data/stg_ext_cct_tm/doj/comprehend-multiclass-baseline/test.csv' of stage: 'data'.
2021-02-22 15:28:14,694 DEBUG: Removing s3://pi-global-ext-stg-feeds/data/stg_ext_cct_tm/doj/comprehend-multiclass-baseline/test.csv
Running stage 'data':
> python src/data/create_dataset.py training s3://pi-global-ext-stg-feeds/data/stg_ext_cct_tm/doj/comprehend-multiclass-baseline
2021-02-22 15:28:19,312 - INFO - NumExpr defaulting to 8 threads.
2021-02-22 15:28:19,684 - INFO - Found credentials in shared credentials file: ~/.aws/credentials
2021-02-22 15:28:29,925 DEBUG: {'s3://pi-global-ext-stg-feeds/data/stg_ext_cct_tm/doj/comprehend-multiclass-baseline/train.csv': 'modified'}
2021-02-22 15:28:31,307 DEBUG: {'s3://pi-global-ext-stg-feeds/data/stg_ext_cct_tm/doj/comprehend-multiclass-baseline/test.csv': 'modified'}
2021-02-22 15:28:31,825 DEBUG: Computed stage: 'data' md5: '49c480a1a76ef66163f70c714ef77958'
2021-02-22 15:28:33,682 ERROR: failed to reproduce 'dvc.yaml': ETag mismatch detected when copying file to cache! (expected: '7a9ff784ff0eda4b170dc5a3cd6a9203', actual: 'b8631b312a64b2e7ddcf6c85f1bd23f0')
------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/Michael/anaconda3/envs/doj/lib/python3.7/site-packages/dvc/repo/reproduce.py", line 192, in _reproduce_stages
    ret = _reproduce_stage(stage, **kwargs)
  File "/Users/Michael/anaconda3/envs/doj/lib/python3.7/site-packages/dvc/repo/reproduce.py", line 39, in _reproduce_stage
    stage = stage.reproduce(**kwargs)
  File "/Users/Michael/anaconda3/envs/doj/lib/python3.7/site-packages/funcy/decorators.py", line 39, in wrapper
    return deco(call, *dargs, **dkwargs)
  File "/Users/Michael/anaconda3/envs/doj/lib/python3.7/site-packages/dvc/stage/decorators.py", line 36, in rwlocked
    return call()
  File "/Users/Michael/anaconda3/envs/doj/lib/python3.7/site-packages/funcy/decorators.py", line 60, in __call__
    return self._func(*self._args, **self._kwargs)
  File "/Users/Michael/anaconda3/envs/doj/lib/python3.7/site-packages/dvc/stage/__init__.py", line 407, in reproduce
    self.run(**kwargs)
  File "/Users/Michael/anaconda3/envs/doj/lib/python3.7/site-packages/funcy/decorators.py", line 39, in wrapper
    return deco(call, *dargs, **dkwargs)
  File "/Users/Michael/anaconda3/envs/doj/lib/python3.7/site-packages/dvc/stage/decorators.py", line 36, in rwlocked
    return call()
  File "/Users/Michael/anaconda3/envs/doj/lib/python3.7/site-packages/funcy/decorators.py", line 60, in __call__
    return self._func(*self._args, **self._kwargs)
  File "/Users/Michael/anaconda3/envs/doj/lib/python3.7/site-packages/dvc/stage/__init__.py", line 522, in run
    self.commit(allow_missing=allow_missing)
  File "/Users/Michael/anaconda3/envs/doj/lib/python3.7/site-packages/funcy/decorators.py", line 39, in wrapper
    return deco(call, *dargs, **dkwargs)
  File "/Users/Michael/anaconda3/envs/doj/lib/python3.7/site-packages/dvc/stage/decorators.py", line 36, in rwlocked
    return call()
  File "/Users/Michael/anaconda3/envs/doj/lib/python3.7/site-packages/funcy/decorators.py", line 60, in __call__
    return self._func(*self._args, **self._kwargs)
  File "/Users/Michael/anaconda3/envs/doj/lib/python3.7/site-packages/dvc/stage/__init__.py", line 483, in commit
    out.commit(filter_info=filter_info)
  File "/Users/Michael/anaconda3/envs/doj/lib/python3.7/site-packages/dvc/output/base.py", line 321, in commit
    objects.save(self.odb, obj)
  File "/Users/Michael/anaconda3/envs/doj/lib/python3.7/site-packages/dvc/objects/__init__.py", line 193, in save
    obj.save(odb, **kwargs)
  File "/Users/Michael/anaconda3/envs/doj/lib/python3.7/site-packages/dvc/objects/__init__.py", line 74, in save
    self.src.save(odb, **kwargs)
  File "/Users/Michael/anaconda3/envs/doj/lib/python3.7/site-packages/dvc/objects/__init__.py", line 57, in save
    odb.add(self.path_info, self.fs, self.hash_info, **kwargs)
  File "/Users/Michael/anaconda3/envs/doj/lib/python3.7/site-packages/dvc/objects/db/base.py", line 56, in add
    self.fs.move(path_info, cache_info)
  File "/Users/Michael/anaconda3/envs/doj/lib/python3.7/site-packages/dvc/fs/base.py", line 203, in move
    self.copy(from_info, to_info)
  File "/Users/Michael/anaconda3/envs/doj/lib/python3.7/site-packages/dvc/fs/s3.py", line 329, in copy
    self._copy(s3.meta.client, from_info, to_info, self.extra_args)
  File "/Users/Michael/anaconda3/envs/doj/lib/python3.7/site-packages/dvc/fs/s3.py", line 422, in _copy
    raise ETagMismatchError(etag, cached_etag)
dvc.exceptions.ETagMismatchError: ETag mismatch detected when copying file to cache! (expected: '7a9ff784ff0eda4b170dc5a3cd6a9203', actual: 'b8631b312a64b2e7ddcf6c85f1bd23f0')

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/Michael/anaconda3/envs/doj/lib/python3.7/site-packages/dvc/main.py", line 50, in main
    ret = cmd.run()
  File "/Users/Michael/anaconda3/envs/doj/lib/python3.7/site-packages/dvc/command/repro.py", line 14, in run
    stages = self.repo.reproduce(**self._repro_kwargs)
  File "/Users/Michael/anaconda3/envs/doj/lib/python3.7/site-packages/dvc/repo/__init__.py", line 49, in wrapper
    return f(repo, *args, **kwargs)
  File "/Users/Michael/anaconda3/envs/doj/lib/python3.7/site-packages/dvc/repo/scm_context.py", line 14, in run
    return method(repo, *args, **kw)
  File "/Users/Michael/anaconda3/envs/doj/lib/python3.7/site-packages/dvc/repo/reproduce.py", line 131, in reproduce
    return _reproduce_stages(self.graph, list(stages), **kwargs)
  File "/Users/Michael/anaconda3/envs/doj/lib/python3.7/site-packages/dvc/repo/reproduce.py", line 209, in _reproduce_stages
    raise ReproductionError(stage.relpath) from exc
dvc.exceptions.ReproductionError: failed to reproduce 'dvc.yaml'
------------------------------------------------------------
2021-02-22 15:28:33,693 DEBUG: Analytics is enabled.
2021-02-22 15:28:33,854 DEBUG: Trying to spawn '['daemon', '-q', 'analytics', '/var/folders/6v/0_3h20ps0ylgrtn2vw233yc40000gp/T/tmp4un9iw_g']'
2021-02-22 15:28:33,857 DEBUG: Spawned '['daemon', '-q', 'analytics', '/var/folders/6v/0_3h20ps0ylgrtn2vw233yc40000gp/T/tmp4un9iw_g']'

@efiop
Contributor

efiop commented Feb 23, 2021

Hi @mikeaadd !

I see that you are using a 2.0 pre-release. Was your pipeline originally created with 1.x dvc? If so, does dvc repro work with 1.x?

@efiop
Contributor

efiop commented Feb 23, 2021

@mikeaadd Also, I assume you have an output that is directly pointing to s3://..., right? Just a sanity check from my side. I can see that you are using external s3 cache, so I suppose you are using an external output too.

@mikeaadd
Author

I've tried with 1.11.16 and I get the same errors. Yes, my output is pointing to an S3 bucket. Is this not a common workflow? I am still in the evaluation phase of using dvc.

@efiop
Contributor

efiop commented Feb 23, 2021

@mikeaadd Ah, I see. Yeah, it is a rather advanced workflow that we don't recommend. Maybe you are looking for #4520 or dvc import-url --to-remote?

@mikeaadd
Author

mikeaadd commented Mar 3, 2021

I don't really understand how import-url would work for my use case. I have a long pipeline with many stages that uses a few AWS services. Because I'm using these services, each stage requires an S3 input and an S3 output. It seems wasteful to download and upload the outputs of each stage just to have them local. The data isn't so large that it couldn't fit on my local machine; it's just set up in a way where it makes sense to have external dependencies and outputs. I guess what's the recommended workflow for a pipeline that goes like this...

S3 data --AWS transform--> S3 data --AWS transform--> S3 data --> sync to local

@dberenbaum
Collaborator

Is dvc running locally and kicking off jobs that all execute on AWS resources? How are those resources accessing the data in S3?

@mikeaadd
Author

mikeaadd commented Mar 5, 2021

Correct, that is the current workflow for a project. Local data is uploaded to S3. An AWS service transforms that data to a different S3 location, and then again. Finally, it's used locally for more transformations. All access is managed with IAM policies.

@dberenbaum
Collaborator

What kind of AWS services? Are they EC2 instances on which you could run dvc? Do they download the data from S3 to do the transformations?

@mikeaadd
Author

mikeaadd commented Mar 5, 2021

For example, AWS Transcribe. It's a speech-to-text (ASR) service that takes audio data from S3 and writes its output back to S3.

Edit: To clarify, it's a managed service, so EC2 instances are not needed. You can kick off those jobs from your local machine, but all processing is managed by AWS.
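To illustrate the S3-to-S3 pattern being described: a managed job is started from the local machine with a single API call, and the service itself reads from and writes to S3. A minimal sketch using boto3, with hypothetical job/bucket names:

```python
def start_s3_to_s3_transcription(transcribe_client, job_name, input_uri, output_bucket):
    """Kick off a managed S3-to-S3 transcription job.

    The service reads audio from `input_uri` and writes the transcript
    JSON into `output_bucket`; no EC2 instance is involved on our side.
    """
    return transcribe_client.start_transcription_job(
        TranscriptionJobName=job_name,
        Media={"MediaFileUri": input_uri},
        MediaFormat="wav",
        LanguageCode="en-US",
        OutputBucketName=output_bucket,
    )
```

Usage might look like `start_s3_to_s3_transcription(boto3.client("transcribe"), "reviews-batch-1", "s3://my-bucket/audio/call.wav", "my-bucket")` (all names hypothetical). The key point for this issue is that the local process returns immediately while the service keeps writing to S3 in the background.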

@dberenbaum
Collaborator

Got it, thanks! Unfortunately, I don't have a fix for you right now. Not sure if @efiop has any ideas.

Better support for external data has been coming up frequently in discussions lately, for workflows like yours. It's definitely a need that I'm eager to prioritize, but we don't have a timeline for it yet.

@efiop
Contributor

efiop commented Mar 5, 2021

@mikeaadd Btw, how big are s3://pi-global-ext-stg-feeds/data/stg_ext_cct_tm/doj/comprehend-multiclass-baseline/train.csv and s3://pi-global-ext-stg-feeds/data/stg_ext_cct_tm/doj/comprehend-multiclass-baseline/test.csv? Also, could you show dvc config --list output? I'm specifically interested in whether the S3 cache is in the same bucket as the outputs (pi-global-ext-stg-feeds).

@efiop efiop added the awaiting response we are waiting for your reply, please respond! :) label Mar 5, 2021
@mikeaadd
Author

mikeaadd commented Mar 5, 2021

[core]
    remote = storage
[cache]
    s3 = s3cache
['remote "storage"']
    url = s3://pi-global-ext-stg-feeds/data/dvcstore
['remote "s3cache"']
    url = s3://pi-global-ext-stg-feeds/data/dvclocalcache

So yes, for the PoC everything has been kept in the same bucket. train.csv and test.csv are quite small (< 1 MB).

@efiop
Contributor

efiop commented Mar 6, 2021

@mikeaadd Thanks! We do a few tricks to try to preserve the ETag, and it seems like something went very wrong here; I'm not quite sure why. Is there a chance some process kept writing to that file after the command finished? That would explain it.

@mikeaadd
Author

mikeaadd commented Mar 7, 2021

It says it was dvc.yaml that failed to reproduce. I wasn't running multiple dvc processes at the same time, so that seems unlikely to me.

@efiop
Contributor

efiop commented Mar 7, 2021

@mikeaadd I mean that maybe a worker that was writing to that file kept writing to it even after the command exited. E.g. the local process exited, but an AWS worker kept writing to the file, so when dvc started saving it, the file changed between the initial ETag read and the actual save.
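One way to check this hypothesis would be to confirm the object's ETag has stopped changing before running dvc repro. A minimal sketch, assuming a hypothetical helper name and a boto3 S3 client:

```python
import time


def wait_for_stable_etag(s3_client, bucket, key, checks=3, interval=5.0):
    """Poll an object's ETag until it stops changing.

    Returns the final ETag once `checks` consecutive reads agree,
    guarding against a background worker that is still writing to it.
    """
    last = s3_client.head_object(Bucket=bucket, Key=key)["ETag"]
    stable = 1
    while stable < checks:
        time.sleep(interval)
        current = s3_client.head_object(Bucket=bucket, Key=key)["ETag"]
        if current == last:
            stable += 1
        else:
            # The object changed under us; restart the stability count.
            last, stable = current, 1
    return last
```

If the ETag keeps flipping between reads, the managed service (or one of its workers) is still writing, and the mismatch dvc reports would be expected.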

@efiop
Contributor

efiop commented Apr 14, 2021

I think ^ is still a likely cause, as we didn't receive any similar reports, nor were we able to reproduce this. Closing for now. Please let us know if you are still running into this issue.

@efiop efiop closed this as completed Apr 14, 2021