Split data into blocks #829
For a private data sharing/deduplication tool I've used a buzhash/rolling hash for splitting big files into small chunks pretty successfully. The borg backup tool has a nice implementation of a buzhash chunker. Maybe it's worth checking out.
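For illustration, here is a minimal content-defined chunking sketch in Python. It is not borg's buzhash chunker; it uses a Gear-style rolling hash (the scheme FastCDC builds on) because it is the shortest correct thing to write down, and all constants here are arbitrary choices:

```python
import os

MASK = (1 << 13) - 1               # ~8 KiB average chunk size
MIN_SIZE, MAX_SIZE = 2048, 65536   # guard rails against tiny/huge chunks

# Random byte -> 32-bit "gear" table; a real implementation fixes this table
# (or derives it from a seed, as borg does for its buzhash table).
_GEAR = [int.from_bytes(os.urandom(4), "big") for _ in range(256)]


def chunks(data: bytes):
    """Yield content-defined chunks of `data`."""
    start, h = 0, 0
    for i, byte in enumerate(data):
        # Gear rolling hash: old bytes age out as they are shifted left, so the
        # low bits checked against MASK depend only on the most recent bytes.
        h = ((h << 1) + _GEAR[byte]) & 0xFFFFFFFF
        size = i + 1 - start
        if (size >= MIN_SIZE and (h & MASK) == 0) or size >= MAX_SIZE:
            yield data[start:i + 1]
            start, h = i + 1, 0
    if start < len(data):
        yield data[start:]
```

The key property is that a cut decision depends only on the last few bytes, so chunk boundaries follow the content rather than absolute file offsets.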
@Witiko noted that there is ficlonerange, which we could use to create a file by reflinking smaller files into it. #2073 (comment)
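For reference, a rough sketch of what driving FICLONERANGE from Python could look like. This assumes Linux with a reflink-capable filesystem (e.g. XFS or Btrfs); the ioctl constant and struct layout are copied from linux/fs.h, and the helper name and paths are made up:

```python
import fcntl
import os
import struct

# struct file_clone_range { __s64 src_fd; __u64 src_offset;
#                           __u64 src_length; __u64 dest_offset; };
FICLONERANGE = 0x4020940D  # _IOW(0x94, 13, struct file_clone_range)


def clone_into(dest_fd: int, src_path: str, dest_offset: int) -> int:
    """Reflink the whole of `src_path` into `dest_fd` at `dest_offset`.

    Returns the number of bytes cloned. Note that offsets and lengths must
    respect the filesystem's block-size alignment rules, except for a final
    partial block at the end of the source file.
    """
    src_fd = os.open(src_path, os.O_RDONLY)
    try:
        length = os.fstat(src_fd).st_size
        arg = struct.pack("qQQQ", src_fd, 0, length, dest_offset)
        fcntl.ioctl(dest_fd, FICLONERANGE, arg)
        return length
    finally:
        os.close(src_fd)


# Hypothetical usage: stitch chunk files back into one workspace file without
# copying the underlying data blocks.
# with open("workspace/data.bin", "wb") as dest:
#     offset = 0
#     for chunk_path in ["cache/chunk0", "cache/chunk1"]:
#         offset += clone_into(dest.fileno(), chunk_path, offset)
```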
This is a pretty good solution to the chunking deduplication problem. I've used it with success for backing things up: https://duplicacy.com/.
The way Git handles this is delta compression when packing objects: similar blobs get stored as byte-level deltas against a base. So that would be another possible approach — equivalent to chunking down to the byte. Mentioned in #1487 BTW.
Either approach complicates the file linking from cache to workspace anyway.
@efiop There is a problem with it: what if the new line was inserted at the top of the file, or in the middle?
There are methods that can deal with inserts at various positions, as I mentioned earlier: most of the chunks/blocks of a file stay the same, and only the chunks/blocks that change are actually added. Many backup tools use similar methods.
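Reusing the `chunks()` helper sketched a few comments up, a quick (hypothetical) check of that claim: insert some bytes near the front of a blob and see how many chunk digests survive.

```python
import hashlib
import os

# Edit near the front of a blob and compare chunk digests. Only the chunks
# around the edit change; the rest re-synchronise because boundaries are
# content-defined rather than offset-based.
original = os.urandom(1 << 20)                        # 1 MiB of stand-in data
edited = original[:100] + b"one new line\n" + original[100:]

digests = lambda blob: [hashlib.sha256(c).hexdigest() for c in chunks(blob)]
old, new = set(digests(original)), set(digests(edited))
print(f"chunks reused: {len(old & new)} / {len(new)}")  # most are reused
```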
For the record: another thing to consider is to revisit […] so each […]

CC @pmrowla, since we've talked about it previously.
I just want to +1 this method. I have been doing some tests on a personal project using the restic chunker implementation, but the specific algorithm is not important. Testing some different datasets and large binary files, I had great results with this method.

Other benefits are that these blocks are ~1-2 MB and work very nicely with protocols like HTTP, and the resulting data structure is still flat chunks that can be consumed without any knowledge of the generation process. I am a little nervous about the other diff-based approaches mentioned and their implications for garbage collection and the inter-file dependencies they create.

Is there any current work on this, and/or are you leaning towards any specific approach? I am doing personal research into external tooling that could be used on top of DVC, but it would be great if generating this type of data structure was included in DVC itself.
@bobertlo Great info! We haven't really picked a specific approach yet, as we've been mostly busy with rebuilding our data management architecture to suit new features like this. Right now the internals are somewhat ready for working on chunking specifically (e.g. objects/checkout/staging/db/etc in dvc/objects), but we need to finish up a few other things, so we are planning to start implementing chunking in Q1 2022.

Regarding 1-2 MB chunks: that might get too slow to transfer unless we pack them into some kind of packs. This is exactly the problem we are seeing with large image datasets, where you have something like a million 1-2 MB images, each of which currently requires at least one API call to upload/download to/from the cloud, which is slow. This is why we plan on tackling both problems at the same time: big-file chunking and big-dataset packing. Overall there is an (unconfirmed) feeling that for our typical data sizes we probably need something bigger than a 1-2 MB chunk or pack size. We will research it more closely a bit later.

Btw, happy to jump on a call again to chat about it ;)
@efiop Great! I'm excited to follow this development :) RE: the chunk size, it is fairly arbitrary. Most existing implementations have interfaces for adjusting the match threshold on the hash and/or concatenating/truncating adjacent chunks. I just want to share some proof-of-concept tests I ran on simulated datasets. To synthesize these, I pulled some (related) Docker images, installed packages to create a couple more, and then dumped them all to tarballs.
In this test the files are chunked and a list of chunks is stored as […]
I realize it is a BIG deal (especially with garbage collection) to actually implement this, but the results here look very promising and have really great implications for possible workflows.
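To make the "flat chunks plus a list" idea concrete, a hypothetical shape for such a per-file chunk manifest could be the following (field names are invented here, not DVC's actual object format):

```python
# Hypothetical per-file chunk manifest: the workspace file maps to an ordered
# list of chunk hashes, and each chunk is a flat object in the cache/remote
# that can be fetched and deduplicated on its own.
manifest = {
    "path": "data/train.tar",
    "size": 1536870912,
    "chunks": [
        {"md5": "3d0f0d4b8f1c...", "size": 1048576},
        {"md5": "9a51c7e2b0aa...", "size": 2097152},
        # ... only entries whose chunk changed need a new upload
    ],
}
```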
Hi, is there a more formal definition of this feature? I have some questions, especially about what is not included in this feature. I suppose "chunking" is just splitting the files into smaller pieces so that you do not have to store/download/upload duplicate pieces. And I suppose this does not have to do with downloading/uploading chunks in parallel from different remote storages, right?
I assume the answer is NO for all of them. And I suppose that adding this feature will make it impossible to implement this one: Cloud versioning.
Can you give a short update on how high or low this issue is prioritized right now, please? Even a rough guess at the time schedule would be highly appreciated. Or what are the results of your research in the current milestone?

At the moment, chunking is crucial for our further usage of DVC. We are going to automate our MLOps pipeline, which will only slightly update a dataset of approx. 20 GB based on customer feedback every week. Without chunking, re-uploading every file entirely would produce an overhead of 1 TB per year!
Correct about cloud versioning. It could still be used as an optimization (we can use a version ID for a particular chunk file), but that complicates it. So this could be an option, but we'll need to take a closer look.
@Chickenmarkus We've been working on some prerequisites like https://github.com/iterative/dvc-data and https://github.com/iterative/dvc-objects, which will host the aforementioned functionality. We might get to this in Q4, but I'm not 100% sure, as we have other more product-oriented priorities.
@Chickenmarkus I'd appreciate it if you could share more details about your use case:
@dmpetrov Yes, sure. Any feedback you have will ultimately serve us users. 😄

In the end, it is very challenging to efficiently chunk this randomized data with a generic approach. 🙈
@Chickenmarkus Some feedback on your setup is below.

First, we are working on a DVC-sister project, LDB/LabelDB, to solve the problem of managing labels. It is at an early stage, but happy to sync up and give you early access if there is interest (please share your email). In a simple case, you can just store data CSVs and label CSVs.

Second, Delta Lake looks like a great way of versioning on top of Parquet. Luckily there is a Rust implementation with Python bindings https://github.com/delta-io/delta-rs#python which makes it possible to integrate Delta into DVC (without dependencies on Spark and the JVM 😅) and manage tabular data efficiently.

I'd love to hear your feedback.
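For anyone curious, a rough sketch of what the delta-rs Python bindings look like in use; the function and parameter names come from the `deltalake` package and may differ between versions, so treat this as an illustration rather than a recipe:

```python
import pandas as pd
from deltalake import DeltaTable, write_deltalake

# Append a new batch of rows; Delta Lake records this as a new table version.
batch = pd.DataFrame({"id": [1, 2], "label": ["cat", "dog"]})
write_deltalake("data/labels", batch, mode="append")

# Read the latest version of the table...
latest = DeltaTable("data/labels").to_pandas()

# ...or time-travel to an earlier version of the same table.
v0 = DeltaTable("data/labels", version=0).to_pandas()
```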
Sorry for the late reply; first, the notification did not work, and second, I was on vacation. 🙈 Also, I would really like to thank you for your feedback! It is very helpful. I absolutely agree with you.
Another motivation for implementing data chunking: when we manage image data for, e.g., image classification, we have a folder for each class which contains the images associated with that class. Thus, we have as many files as we have samples in the dataset, which can be anywhere between O(10,000) and millions.

In our case, we use an HTTP storage backend for DVC which has a rate limit of 1000 requests per 15 seconds. With highly parallel download, we are bounded by this rate limit, which means for, e.g., 1,000,000 samples it would take (1,000,000 ÷ 1000) × 15 s ≈ 4 h. For comparison, let's assume the image size on average is 100 KB. Then 1,000,000 images would amount to 100 GB. With a conservative bandwidth of 100 Mbit/s, downloading 100 GB would take 100 GB ÷ 0.0125 GB/s ≈ 2 h. And if we assumed a bandwidth of 1 Gbit/s, the download time would drop to about 13 min. So, given our rate limit, our effective bandwidth is severely limited.

But if DVC supported data chunking, we could significantly improve our throughput with probably some, but not a total, loss of deduplication.
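For reference, the numbers above spelled out as a quick back-of-the-envelope script:

```python
# Back-of-the-envelope check of the figures quoted above.
n_images = 1_000_000
rate_limit_s = n_images / 1000 * 15        # 1000 requests per 15 s
size_gb = n_images * 100e3 / 1e9           # 100 KB per image -> 100 GB total
t_100mbit_s = size_gb / 0.0125             # 100 Mbit/s = 0.0125 GB/s
t_1gbit_s = size_gb / 0.125                # 1 Gbit/s = 0.125 GB/s

print(rate_limit_s / 3600)                 # ~4.2 h imposed by the rate limit
print(t_100mbit_s / 3600)                  # ~2.2 h at 100 Mbit/s
print(t_1gbit_s / 60)                      # ~13 min at 1 Gbit/s
```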
@sisp thank you for the feedback! This problem can be solved at different levels. I have some concerns about using data chunking in this particular case because it requires splitting data into smaller chunks. For example, for 1M images we would need to split them into X (1M <= X < 100M) smaller chunks depending on image sizes, and we would hit the throughput limits even harder.

Instead, I believe a better solution could be the exact opposite approach, where we use a single archive file (per class, or just per subset of images) that can be fetched partially. This way, we could "batch" requests. For example, we could obtain all the images with just one request, or retrieve a subset of them if we know the archive structure.

I would love to discuss this further with you in our next meeting. Thanks again for your suggestion!
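A minimal sketch of the "one archive, partial fetch" idea, assuming a pre-built index of member offsets and a storage backend that honours HTTP Range headers (the URL and index are hypothetical):

```python
import requests

# Given an index mapping member names to (offset, size) of their raw bytes
# inside the archive, a single ranged GET fetches just the members we want.
ARCHIVE_URL = "https://storage.example.com/dvc-cache/class0.tar"


def fetch_member(index: dict, name: str) -> bytes:
    offset, size = index[name]
    resp = requests.get(
        ARCHIVE_URL,
        headers={"Range": f"bytes={offset}-{offset + size - 1}"},
        timeout=30,
    )
    resp.raise_for_status()  # expect 206 Partial Content
    return resp.content

# Requests for adjacent members can also be coalesced into one larger range,
# which is where the "batching" effect against a rate limit comes from.
```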
@dmpetrov Thanks for your feedback! 🙏 Indeed, splitting/chunking image files even further would make the problem worse. TBH, I haven't read into the algorithmic details of some of the above-mentioned chunkers (like the one BorgBackup is using), but I imagined the chunking wouldn't necessarily mean splitting files but, e.g. for small files, could also mean combining files. I had a picture of HDFS' 64 MB blocks in my mind somehow. But this might not make sense at all; I can't actually say without diving deeper into the algorithmic details.

Could you elaborate just a little on your work towards partial retrieval of data from an archive? I'm absolutely open to managing data at a less granular level, e.g. all samples per class gathered in a single file, but currently I'd (a) lose deduplication when only some samples change because the whole file gets reuploaded, (b) need to retrieve the whole file even when I only want a subset (which might be addressed by the work you've mentioned?), and (c) need to decide on the granularity and content slice of the file/archive manually (which might be unavoidable because the optimal strategy might depend on data access and mutation patterns, and I don't know whether there is a generic near-optimal solution to this problem; it feels a bit related to RDBMS index creation).

I'm afraid I won't be able to join our meeting today because I'm already on Easter vacation. But I think the main topic is a different one anyway. I'd be happy to continue discussing here, or if you think it makes sense to also chat about it synchronously, we could schedule another meeting. 🙂
Sure! First, the webdataset-based approach works well only for "append-only" use cases. For example, you would add each new batch of images as a new tar file to a class directory but not delete images. Physical file deletion has a huge overhead; logical deletion (without touching the archives) is not a problem, but it falls on the shoulders of the user.
Regarding (a): it depends on how you "package" the archive. It is possible not to lose deduplication and even to get local file caching (per image file).
Regarding (b): you can retrieve only the files you need.
Regarding (c): right, it is doable, and you can address files by the archive name and the image file name inside it.
Yes. "know the archive structure" from the above means "index".
Let me chat with your folks; that should clarify what the best way for us to communicate is 🙂 Happy Easter 🐰
Thanks for your detailed reply, this is very helpful. 🙏

Managing files in an archive, including batched download of adjacent files etc., is something you're working on adding to DVC, but it's not yet possible. Right?

The append-only property sounds like a reasonable requirement to me. Typically, new data becomes available, gets appended to a dataset, and becomes available in a new release. For structured data, when the data structure changes (new field/column, changed representation of a field/column, etc.), all samples might need to be updated and uploaded without an opportunity for deduplication anyway. The only case I'm still a bit concerned about is fixing some samples in a dataset, e.g. when NaN values remained in v1.0.0 and were fixed (by whatever method) in v1.0.1. In this case, only some samples would be updated. With one sample per file, only the updated files would get uploaded again, but with archives I imagine the whole archive would have to be uploaded although most samples might not have changed. Right?
Sounds good. 👌 Happy Easter to you, too. 🐰
Well, it is actually a DVC-sister project that we are working on. The new product is designed for genAI use cases.
Great! And you are right about the data fixes. However, the majority of fixes happen at the meta-data/label level rather than at the file/image level. Meta-data fixes are not an issue at all if you manage them separately from the data (which is a fairly common pattern).
It depends on the use case. In some cases, virtually removing a subset of images from a dataset is enough, without any changes to the files or archives. All of this can be done at the meta-data level.

PS: I had a great chat with your teammate recently. I hope to see you both in the next meeting 🙂
Hello, is there any news about this feature?
As @shcheklein suggested, we should consider splitting data into small blocks to track data changes more efficiently. Example: a giant file that has one line appended to it.
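A naive illustration of that example: split the file into fixed-size blocks and hash each block, so appending one line only changes (or adds) the final block while everything before it deduplicates. The block size is arbitrary; content-defined chunkers like those discussed above additionally handle inserts in the middle of the file.

```python
import hashlib

BLOCK = 4 * 1024 * 1024  # 4 MiB, arbitrary


def block_hashes(path: str):
    """Yield the md5 of each fixed-size block of the file at `path`."""
    with open(path, "rb") as f:
        while block := f.read(BLOCK):
            yield hashlib.md5(block).hexdigest()

# Appending one line to a big file leaves all earlier block hashes unchanged,
# so only the last block would need to be stored/uploaded again.
```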