Directory downloads and uploads are slow #6222
DVC push/pull performance for many small files is a known limitation. But part of the issue is probably also specific to the HTTP remote type; it's possible you would get better performance pulling from the registry using S3 rather than HTTP (due to remote fs implementation differences). For the purpose of the example projects, it may be better to just handle compressing/extracting images as an archive within the example pipelines.
This does not really work as a general solution for DVC, as we would lose the ability to de-duplicate files between directory versions.
This is also an issue that can be addressed by future cache structure changes (#829). Ideally, with those kinds of cache changes we will be pushing and pulling a small number of large packed objects rather than a large number of small loose objects (so the end result is similar to the difference you see now between downloading a single tar vs 70k individual images).
Thank you @pmrowla
Sure. I researched a bit but couldn't find a definite answer: does S3 support HTTP pipelining?
Having an option to track immutable data directories as single tar files might have some benefit, though I agree this benefit may not cover the required architecture changes. This many-small-objects-in-a-directory issue needs to be resolved for #829 as well; otherwise the cache itself will be blobs of large objects, and that will probably bring some other problems.
@pmrowla what about @iesahin's suggestions:
WDYT? From my experience with S3, small objects, and DVC: it seems it can be fast (s5cmd) even on small objects. Not saying that we should not be migrating to the new cache structure, but it feels that we could significantly improve performance even before that, and that migration might take a long time.
My take on this: by doing this, we are pretty much saying in the Get Started that we don't handle even small datasets. Compressing all the images looks quite artificial, and people will be asking why we are doing it. It also complicates the document.
When we used the old "# of cores x2" method (without a hard cap on the default) for setting the number of threads, it caused problems for users running on clusters that have very high #'s of cores.
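For illustration, that old heuristic with a hard cap bolted on would look something like this (the cap value here is made up, not DVC's actual default):

```python
import os

def default_jobs(cap: int = 8) -> int:
    """Derive a worker count from the CPU count, but cap it so machines
    with very high core counts don't open hundreds of connections."""
    cores = os.cpu_count() or 1
    return min(cores * 2, cap)

print(default_jobs())  # e.g. 8 on a 16-core cluster node instead of 32
```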
We only use our own connection pooling for specific remote types (HDFS and SSH), since pyarrow and paramiko don't provide their own.
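The generic shape of that kind of pooling, as a sketch (hypothetical class, not DVC's actual code):

```python
import queue

class ConnectionPool:
    """Reuse up to `size` connections built by `factory` (e.g. a paramiko
    SSHClient factory); acquire() blocks when the pool is exhausted."""

    def __init__(self, factory, size=4):
        self._pool = queue.Queue(maxsize=size)
        for _ in range(size):
            self._pool.put(factory())

    def acquire(self):
        return self._pool.get()  # blocks until a connection is free

    def release(self, conn):
        self._pool.put(conn)

# Stand-in factory just to show the flow:
pool = ConnectionPool(lambda: object(), size=2)
conn = pool.acquire()
pool.release(conn)
```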
DVC does not create new session objects for each file being downloaded (see line 72 in 4e792ae):
`_session` is a cached property and is only created once per filesystem instance.
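i.e. something along these lines (a sketch, not the actual DVC class):

```python
from functools import cached_property

import requests

class HTTPFileSystem:
    @cached_property
    def _session(self) -> requests.Session:
        # Created on first access, then reused for every request made
        # through this filesystem instance; the mounted adapter gives
        # keep-alive connection pooling across downloads.
        session = requests.Session()
        adapter = requests.adapters.HTTPAdapter(pool_connections=10, pool_maxsize=10)
        session.mount("https://", adapter)
        session.mount("http://", adapter)
        return session

    def download(self, url: str, path: str) -> None:
        with self._session.get(url, stream=True) as resp:
            resp.raise_for_status()
            with open(path, "wb") as fobj:
                for chunk in resp.iter_content(chunk_size=1 << 20):
                    fobj.write(chunk)
```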
Regarding HTTP pipelining, do the S3 webservers even support it properly? HTTP/1.1 pipelining isn't supported in most modern browsers due to the general lack of adoption and/or incorrect implementations in servers. This is very old: https://stackoverflow.com/questions/7752802/does-s3-support-http-pipelining but I can't find anything newer, which seems to indicate that the answer is probably still "no". In general, my understanding is that with the move to using async instead of Python thread pools (and aiobotocore) we should be able to start getting throughputs that are much faster than native aws-cli/boto; @isidentical should have some more insight on that.
For the other async operations, yes, that is true. But the problem with a lot of small files vs a single big file is that the relative significance of the processing stage (e.g. parsing the API response for each object) compared to the actual data transport is much higher for small files, and those steps are blocking for async. One possible (and kind of tricky) route would be creating a process pool and attaching to each process its own event loop (I believe there are still some places in DVC and fsspec itself that are not compatible with this), which would scale these blocking operations to the number of cores and, when used with async, should maximize the throughput. (BTW, we still use regular boto for the push/pull.)
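A sketch of that route, with a placeholder standing in for a real async fetch coroutine: shard the file list across processes, each running its own event loop, so the blocking per-object steps scale with cores:

```python
import asyncio
from concurrent.futures import ProcessPoolExecutor

async def fetch(url: str) -> bytes:
    # Placeholder for a real async HTTP call (e.g. via aiohttp).
    await asyncio.sleep(0)
    return b""

async def _fetch_all(urls):
    return await asyncio.gather(*(fetch(u) for u in urls))

def worker(urls):
    # Each process gets its own event loop, so CPU-bound steps
    # (response parsing, hashing) don't block the other shards.
    return asyncio.run(_fetch_all(urls))

def parallel_fetch(urls, processes=4):
    shards = [urls[i::processes] for i in range(processes)]
    with ProcessPoolExecutor(max_workers=processes) as pool:
        return list(pool.map(worker, shards))

if __name__ == "__main__":
    parallel_fetch([f"https://example.com/{i}" for i in range(100)])
```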
Regarding pipelining, as @pmrowla said, it's not very well supported, and users will run into issues with proxies as well: buggy proxies, marginal improvements, head-of-line blocking... It's considered a design mistake and has been superseded by multiplexing in HTTP/2. Though we won't find much benefit from this unless we go async + HTTP/2, which we should.
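For concreteness, a sketch of what async + HTTP/2 multiplexing looks like with httpx (an illustration, not something DVC uses today; the server has to support HTTP/2, and httpx falls back to HTTP/1.1 otherwise):

```python
import asyncio

import httpx  # pip install "httpx[http2]"

async def fetch_all(urls):
    # A single HTTP/2 connection multiplexes many in-flight requests,
    # avoiding pipelining's head-of-line blocking and the cost of a
    # new TCP/TLS handshake per file.
    async with httpx.AsyncClient(http2=True) as client:
        responses = await asyncio.gather(*(client.get(u) for u in urls))
        return [r.content for r in responses]

asyncio.run(fetch_all(["https://example.com/a", "https://example.com/b"]))
```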
Is there any data regarding the adoption rate of HTTP/2 vs HTTP/1.1 pipelining?
S3 does not directly support HTTP/2 unless you are serving it behind CloudFront (or your own proxy server). Also, just to be clear, HTTP/1.1 pipelining is not the same thing as re-using HTTP connections via the Keep-Alive header.
Are we planning to migrate `http` to fsspec and/or optimize it?
Do you mean fsspec? It seems like we should be able to fairly easily migrate to fsspec's HTTPFileSystem.
I can test HTTP pipelining with S3 with a script, if you'd like. If I can come up with a faster download with a single thread, we can talk about implementing it in the core.
I think moving to fsspec could help here.
I don't think it'll make much difference unless we go async.
We discussed this in planning, and there are a few different items to address in this issue:
I did a dirty performance comparison:
So my rough estimate is that
Excellent summary, thanks guys! Thanks, Dave!
One question though: why don't we increase the number of workers by default? I think we should be a bit more aggressive. Also, one more experiment: what happens with an even higher `-j`?
It is too dangerous to just set it to something bigger without having a proper throttling mechanism, or we will start getting timeout errors or "too many files" errors. Throttling should be handled by each filesystem's implementation.
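A minimal sketch of that kind of throttling: a semaphore caps the number of in-flight transfers no matter how high `--jobs` is set (illustrative only, not DVC's code):

```python
import asyncio

async def download(url: str) -> None:
    await asyncio.sleep(0.01)  # stand-in for the real transfer

async def pull(urls, max_in_flight: int = 16):
    sem = asyncio.Semaphore(max_in_flight)

    async def throttled(url):
        async with sem:  # at most max_in_flight transfers at once
            await download(url)

    await asyncio.gather(*(throttled(u) for u in urls))

asyncio.run(pull([f"file{i}" for i in range(1000)]))
```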
I don't know the case for S3 (probably the limit is much higher, or it's only limited by the bandwidth), but the HTTP/1.1 protocol doesn't allow more than 2 connections per client to a server. Most servers set this to 4 or a bit higher, but I think the limitation is not the number of jobs or workers, but the server limits. IMO DVC should be conservative by default; a first experience with connection errors or bans would put users off.
I think that's obsoleted by RFC 7230.
Yep, it would be really great to have some dynamic workload balancing/throttling/etc. But even w/o it we could use a higher default.
TLDR: It looks like the multiple processes launched by s5cmd do improve performance, whereas the additional threads in dvc don't seem to be helping much.
If there is an easy way to measure the total throughput in DVC, I can test that as well.
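Something crude like the following could work from the outside, timing the pull and summing file sizes afterwards (the `data` directory name is an assumption for this sketch):

```python
import pathlib
import subprocess
import time

start = time.monotonic()
subprocess.run(["dvc", "pull", "-j", "10"], check=True)
elapsed = time.monotonic() - start

# NOTE: "data" is the pulled directory, an assumed path for this sketch.
total = sum(p.stat().st_size for p in pathlib.Path("data").rglob("*") if p.is_file())
print(f"{total / elapsed / 1e6:.1f} MB/s over {elapsed:.0f}s")
```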
It is environment-dependent, but I doubt the number of jobs increases above 10 in most cases.
Ah, thank you @skshetry. It's again a low number, though.
My comparison between s5cmd and DVC:
These are with 70,000 identical files. The default settings were used otherwise.
Removing the p1 label here because this is clearly not going to be closed in the next sprint or so, which should be the goal for a p1. However, it's still a high priority that we will continue to be working to improve.
Are there plans to address this soon? We have run into scalability issues with DVC due to the poor performance on data sets with a large number of small files.
@series8217, what remote do you use? What is the size of the datasets (no. of files, total size, etc.)? Thanks.
S3 remote with a local cache on a reflink-supported filesystem; ~3 million tracked files, 3TB total. When tracked across 1000 .dvc files, I suspect the checksumming step always takes the same amount of time, but the S3 download of the .dvc files is where the extra hour and a half comes from. Given that the .dvc files are only a few hundred bytes each, this should take only a few minutes.
Could it be given as an option? In the same way that you can specify "cache"/"push"/... on output entries, could there be an option "granular: false" (the default would be "granular: true", which corresponds to the current behavior)? With the non-granular option, the directory would be considered an implicit tar: you invalidate/download/upload all of it at once.

Using DVC, there are some cases where we do like the granularity DVC gives for directories. But there are other cases where, normally, either all files have changed or none have (e.g. the output of our segmentation networks on our benchmark dataset). We handle this by asking DVC to track archives of our non-granular directories, and the directory itself for our granular directories. We did get a big performance boost, but it adds a small friction when navigating the directory. Such an option would be great for our use case.
Bug Report
This affects `get`/`fetch`/`pull`/`push`/... so I didn't put a tag.

Description
We have added a directory containing 70,000 small images to the Dataset Registry. There is also a `tar.gz` version of the dataset, which is downloaded quickly:

```console
$ time dvc get https://github.com/iterative/dataset-registry mnist/images.tar.gz
dvc get https://github.com/iterative/dataset-registry mnist/images.tar.gz  3.41s user 1.36s system 45% cpu 10.411 total
```
When I issue:

```console
$ dvc get https://github.com/iterative/dataset-registry mnist/images
```
I get an ETA of ~16 hours for 70,000 downloads on my VPS. This is reduced to ~3 hours on my faster local machine. I didn't wait for these to finish, so the real times may differ, but you get the idea.
For `-j 10` it doesn't differ much.

`dvc pull` is better; it takes about 20-25 minutes. (At this point, while writing, a new version was released, and the rest of the report is with `2.4.1` 😄)

`dvc pull -j 100` seems to reduce the ETA to 10 minutes. (I waited for `dvc pull -j 100` to finish and it took ~15 minutes.)

I also had this issue while uploading the data in iterative/dataset-registry#18, and we have a discussion there.
Reproduce
or
Expected
We will use this dataset (and `fashion-mnist`, similar to this) in example repositories, and we would like the whole directory to download in some acceptable time (<2 minutes).

Environment information
Output of `dvc doctor`:

Some of this report is with `2.3.0`, but currently:

Discussion
DVC uses new `requests.Session` objects for each connection, and this requires a new HTTP(S) connection for each file. Although the files are small, establishing a new connection for each file takes time. There is a mechanism in HTTP/1.1 to reuse the same connection, but `requests` doesn't support it.

Note that increasing the number of jobs doesn't make much difference, because servers usually limit the number of connections per IP. Even if you have 100 threads/processes to download, probably only a small number (~4-8) of these can be connected at a time. (I was banned from AWS once while testing the commands with a large `-j`.)

There may be 2 solutions for this:

1. DVC can consider directories as implicit `tar` archives. Instead of a directory containing many files, it would work with a single tar file per directory in the cache and expand it in `checkout`. `tar` and `gzip` are supported in the Python standard library. This probably requires the whole `Repo` class to be updated, though. (A rough sketch follows below.)
2. Instead of `requests`, DVC can use a custom solution or another library like `dugong` that supports HTTP pipelining. I didn't test any HTTP pipelining solution in Python, so I can't vouch for any of them, but this may be better for all asynchronous operations using HTTP(S).
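A rough sketch of solution 1 using only the standard library (the helper names and paths are hypothetical; real support would also need cache and checkout integration):

```python
import tarfile
from pathlib import Path

def pack_directory(directory: str, cache_path: str) -> None:
    # Store the whole directory as one gzipped tar in the cache, so
    # push/pull moves a single object instead of thousands of files.
    with tarfile.open(cache_path, "w:gz") as tar:
        tar.add(directory, arcname=".")

def checkout_directory(cache_path: str, directory: str) -> None:
    # Expand the archive back into the workspace on checkout.
    Path(directory).mkdir(parents=True, exist_ok=True)
    with tarfile.open(cache_path, "r:gz") as tar:
        tar.extractall(directory)

# Example usage (paths are illustrative):
Path("cache").mkdir(exist_ok=True)
pack_directory("mnist/images", "cache/images.tar.gz")
checkout_directory("cache/images.tar.gz", "workspace/images")
```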