Split data into blocks #829
For a private data sharing/deduplication tool I've used a buzhash/rolling hash for splitting big files into small chunks pretty successfully. The borg backup tool has a nice implementation of a buzhash chunker. Maybe it's worth checking out.
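For illustration, here is a minimal content-defined chunking sketch in Python. It is not borg's buzhash chunker; it uses a Gear-style rolling hash (the scheme FastCDC builds on) because it is the shortest correct thing to write down, and all constants here are arbitrary choices:

```python
import os

MASK = (1 << 13) - 1               # ~8 KiB average chunk size
MIN_SIZE, MAX_SIZE = 2048, 65536   # guard rails against tiny/huge chunks

# Random byte -> 32-bit "gear" table; a real implementation fixes this table
# (or derives it from a seed, as borg does for its buzhash table).
_GEAR = [int.from_bytes(os.urandom(4), "big") for _ in range(256)]


def chunks(data: bytes):
    """Yield content-defined chunks of `data`."""
    start, h = 0, 0
    for i, byte in enumerate(data):
        # Gear rolling hash: old bytes age out as they are shifted left, so the
        # low bits checked against MASK depend only on the most recent bytes.
        h = ((h << 1) + _GEAR[byte]) & 0xFFFFFFFF
        size = i + 1 - start
        if (size >= MIN_SIZE and (h & MASK) == 0) or size >= MAX_SIZE:
            yield data[start:i + 1]
            start, h = i + 1, 0
    if start < len(data):
        yield data[start:]
```

The key property is that a cut decision depends only on the last few bytes, so chunk boundaries follow the content rather than absolute file offsets.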
@Witiko noted that there is ficlonerange, which we could use to create a file by reflinking smaller files into it. #2073 (comment)
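For reference, a rough sketch of what driving FICLONERANGE from Python could look like. This assumes Linux with a reflink-capable filesystem (e.g. XFS or Btrfs); the ioctl constant and struct layout are copied from linux/fs.h, and the helper name and paths are made up:

```python
import fcntl
import os
import struct

# struct file_clone_range { __s64 src_fd; __u64 src_offset;
#                           __u64 src_length; __u64 dest_offset; };
FICLONERANGE = 0x4020940D  # _IOW(0x94, 13, struct file_clone_range)


def clone_into(dest_fd: int, src_path: str, dest_offset: int) -> int:
    """Reflink the whole of `src_path` into `dest_fd` at `dest_offset`.

    Returns the number of bytes cloned. Note that offsets and lengths must
    respect the filesystem's block-size alignment rules, except for a final
    partial block at the end of the source file.
    """
    src_fd = os.open(src_path, os.O_RDONLY)
    try:
        length = os.fstat(src_fd).st_size
        arg = struct.pack("qQQQ", src_fd, 0, length, dest_offset)
        fcntl.ioctl(dest_fd, FICLONERANGE, arg)
        return length
    finally:
        os.close(src_fd)


# Hypothetical usage: stitch chunk files back into one workspace file without
# copying the underlying data blocks.
# with open("workspace/data.bin", "wb") as dest:
#     offset = 0
#     for chunk_path in ["cache/chunk0", "cache/chunk1"]:
#         offset += clone_into(dest.fileno(), chunk_path, offset)
```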
This is a pretty good solution to the chunking deduplication problem. I've used it with success for backing things up: https://duplicacy.com/.
The way Git handles this is delta compression when packing objects: similar blobs get stored as byte-level deltas against a base. So that would be another possible approach — equivalent to chunking down to the byte. Mentioned in #1487 BTW.
Either approach complicates the file linking from cache to workspace anyway.
@efiop There is a problem with it: what if the new line was inserted at the top of the file, or in the middle?
There are methods that can deal with inserts at various positions, as I mentioned earlier: most of the chunks/blocks of a file stay the same, and only the chunks/blocks that change are actually added. Many backup tools use similar methods.
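Reusing the `chunks()` helper sketched a few comments up, a quick (hypothetical) check of that claim: insert some bytes near the front of a blob and see how many chunk digests survive.

```python
import hashlib
import os

# Edit near the front of a blob and compare chunk digests. Only the chunks
# around the edit change; the rest re-synchronise because boundaries are
# content-defined rather than offset-based.
original = os.urandom(1 << 20)                        # 1 MiB of stand-in data
edited = original[:100] + b"one new line\n" + original[100:]

digests = lambda blob: [hashlib.sha256(c).hexdigest() for c in chunks(blob)]
old, new = set(digests(original)), set(digests(edited))
print(f"chunks reused: {len(old & new)} / {len(new)}")  # most are reused
```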
For the record: another thing to consider is to revisit […] so each […]

CC @pmrowla, since we've talked about it previously.
I just want to +1 this method. I have been doing some tests on a personal project using the restic chunker implementation, but the specific algorithm is not important. Testing some different datasets and large binary files, I had great results with this method.

Other benefits are that these blocks are ~1-2 MB and work very nicely with protocols like HTTP, and the resulting data structure is still flat chunks that can be consumed without any knowledge of the generation process. I am a little nervous about the other diff-based approaches mentioned and their implications for garbage collection and the inter-file dependencies they create.

Is there any current work on this, and/or are you leaning towards any specific approach? I am doing personal research into external tooling that could be used on top of DVC, but it would be great if generating this type of data structure was included in DVC itself.
@bobertlo Great info! We haven't really picked a specific approach yet, as we've been mostly busy with rebuilding our data management architecture to suit new features like this. Right now the internals are somewhat ready for working on chunking specifically (e.g. objects/checkout/staging/db/etc in dvc/objects), but we need to finish up a few other things, so we are planning to start implementing chunking in Q1 2022.

Regarding 1-2 MB chunks: that might get too slow to transfer unless we pack them into some kind of packs. This is exactly the problem we are seeing with large image datasets, where you have something like a million 1-2 MB images, each of which currently requires at least one API call to upload/download to/from the cloud, which is slow. This is why we plan on tackling both problems at the same time: big-file chunking and big-dataset packing. Overall there is an (unconfirmed) feeling that for our typical data sizes we probably need something bigger than a 1-2 MB chunk or pack size. We will research it more closely a bit later.

Btw, happy to jump on a call again to chat about it ;)
@efiop Great! I'm excited to follow this development :) RE: the chunk size, it is fairly arbitrary. Most existing implementations have interfaces for adjusting the match threshold on the hash and/or concatenating/truncating adjacent chunks. I just want to share some proof-of-concept tests I ran on simulated datasets. To synthesize these, I pulled some (related) Docker images, installed packages to create a couple more, and then dumped them all to tarballs.
In this test the files are chunked and a list of chunks is stored as […]
I realize it is a BIG deal (especially with garbage collection) to actually implement this, but the results here look very promising and have really great implications for possible workflows.
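To make the "flat chunks plus a list" idea concrete, a hypothetical shape for such a per-file chunk manifest could be the following (field names are invented here, not DVC's actual object format):

```python
# Hypothetical per-file chunk manifest: the workspace file maps to an ordered
# list of chunk hashes, and each chunk is a flat object in the cache/remote
# that can be fetched and deduplicated on its own.
manifest = {
    "path": "data/train.tar",
    "size": 1536870912,
    "chunks": [
        {"md5": "3d0f0d4b8f1c...", "size": 1048576},
        {"md5": "9a51c7e2b0aa...", "size": 2097152},
        # ... only entries whose chunk changed need a new upload
    ],
}
```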
Hi, is there a more formal definition of this feature? I have some questions, especially about what is not included in this feature. I suppose "chunking" is just splitting the files into smaller pieces so that you do not have to store/download/upload duplicate pieces. And I suppose this does not have to do with downloading/uploading chunks in parallel from different remote storages, right?
I assume the answer is NO for all of them. And I suppose that adding this feature will make it impossible to implement this one: Cloud versioning.
Can you give a short update on how high or low this issue is prioritized right now, please? Even a rough guess at the time schedule would be highly appreciated. Or what are the results of your research in the current milestone?

At the moment, chunking is crucial for our further usage of DVC. We are going to automate our MLOps pipeline, which will only slightly update a dataset of approx. 20 GB based on customer feedback every week. Without chunking, re-uploading every file entirely would produce an overhead of 1 TB per year!
Correct about cloud versioning. It could still be used as an optimization (we can use a version ID for a particular chunk file), but that complicates it. So this could be an option, but we'll need to take a closer look.
@Chickenmarkus We've been working on some prerequisites like https://github.com/iterative/dvc-data and https://github.com/iterative/dvc-objects, which will host the aforementioned functionality. We might get to this in Q4, but I'm not 100% sure, as we have other more product-oriented priorities.
@Chickenmarkus I'd appreciate it if you could share more details about your use case:
@dmpetrov Yes, sure. Any feedback you have will ultimately serve us users. 😄

In the end, it is very challenging to efficiently chunk this randomized data with a generic approach. 🙈
@Chickenmarkus Some feedback on your setup is below.

First, we are working on a DVC-sister project, LDB/LabelDB, to solve the problem of managing labels. It is at an early stage, but happy to sync up and give you early access if there is interest (please share your email). In a simple case, you can just store data CSVs and label CSVs.

Second, Delta Lake looks like a great way of versioning on top of Parquet. Luckily there is a Rust implementation with Python bindings https://github.com/delta-io/delta-rs#python which makes it possible to integrate Delta into DVC (without dependencies on Spark and the JVM 😅) and manage tabular data efficiently.

I'd love to hear your feedback.
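For anyone curious, a rough sketch of what the delta-rs Python bindings look like in use; the function and parameter names come from the `deltalake` package and may differ between versions, so treat this as an illustration rather than a recipe:

```python
import pandas as pd
from deltalake import DeltaTable, write_deltalake

# Append a new batch of rows; Delta Lake records this as a new table version.
batch = pd.DataFrame({"id": [1, 2], "label": ["cat", "dog"]})
write_deltalake("data/labels", batch, mode="append")

# Read the latest version of the table...
latest = DeltaTable("data/labels").to_pandas()

# ...or time-travel to an earlier version of the same table.
v0 = DeltaTable("data/labels", version=0).to_pandas()
```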
Sorry for the late reply; first, the notification did not work, and second, I was on vacation. 🙈 Also, I would really like to thank you for your feedback! It is very helpful. I absolutely agree with you.
Another motivation for implementing data chunking: when we manage image data for, e.g., image classification, we have a folder for each class which contains the images associated with that class. Thus, we have as many files as we have samples in the dataset, which can be anywhere between O(10,000) and millions.

In our case, we use an HTTP storage backend for DVC which has a rate limit of 1000 requests per 15 seconds. With highly parallel download, we are bounded by this rate limit, which means for, e.g., 1,000,000 samples it would take (1,000,000 ÷ 1000) × 15 s ≈ 4 h. For comparison, let's assume the image size on average is 100 KB. Then 1,000,000 images would amount to 100 GB. With a conservative bandwidth of 100 Mbit/s, downloading 100 GB would take 100 GB ÷ 0.0125 GB/s ≈ 2 h. And if we assumed a bandwidth of 1 Gbit/s, the download time would drop to about 13 min. So, given our rate limit, our effective bandwidth is severely limited.

But if DVC supported data chunking, we could significantly improve our throughput with probably some, but not a total, loss of deduplication.
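For reference, the numbers above spelled out as a quick back-of-the-envelope script:

```python
# Back-of-the-envelope check of the figures quoted above.
n_images = 1_000_000
rate_limit_s = n_images / 1000 * 15        # 1000 requests per 15 s
size_gb = n_images * 100e3 / 1e9           # 100 KB per image -> 100 GB total
t_100mbit_s = size_gb / 0.0125             # 100 Mbit/s = 0.0125 GB/s
t_1gbit_s = size_gb / 0.125                # 1 Gbit/s = 0.125 GB/s

print(rate_limit_s / 3600)                 # ~4.2 h imposed by the rate limit
print(t_100mbit_s / 3600)                  # ~2.2 h at 100 Mbit/s
print(t_1gbit_s / 60)                      # ~13 min at 1 Gbit/s
```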
@sisp thank you for the feedback! This problem can be solved at different levels. I have some concerns about using data chunking in this particular case because it requires splitting data into smaller chunks. For example, for 1M images we would need to split them into X (1M <= X < 100M) smaller chunks depending on image sizes, and we would hit the throughput limits even harder.

Instead, I believe a better solution could be the exact opposite approach, where we use a single archive file (per class, or just per subset of images) that can be fetched partially. This way, we could "batch" requests. For example, we could obtain all the images with just one request, or retrieve a subset of them if we know the archive structure.

I would love to discuss this further with you in our next meeting. Thanks again for your suggestion!
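A minimal sketch of the "one archive, partial fetch" idea, assuming a pre-built index of member offsets and a storage backend that honours HTTP Range headers (the URL and index are hypothetical):

```python
import requests

# Given an index mapping member names to (offset, size) of their raw bytes
# inside the archive, a single ranged GET fetches just the members we want.
ARCHIVE_URL = "https://storage.example.com/dvc-cache/class0.tar"


def fetch_member(index: dict, name: str) -> bytes:
    offset, size = index[name]
    resp = requests.get(
        ARCHIVE_URL,
        headers={"Range": f"bytes={offset}-{offset + size - 1}"},
        timeout=30,
    )
    resp.raise_for_status()  # expect 206 Partial Content
    return resp.content

# Requests for adjacent members can also be coalesced into one larger range,
# which is where the "batching" effect against a rate limit comes from.
```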
@dmpetrov Thanks for your feedback! 🙏 Indeed, splitting/chunking image files even further would make the problem worse. TBH, I haven't read into the algorithmic details of some of the above-mentioned chunkers (like the one BorgBackup is using), but I imagined the chunking wouldn't necessarily mean splitting files but, e.g. for small files, could also mean combining files. I had a picture of HDFS' 64 MB blocks in my mind somehow. But this might not make sense at all; I can't actually say without diving deeper into the algorithmic details.

Could you elaborate just a little on your work towards partial retrieval of data from an archive? I'm absolutely open to managing data at a less granular level, e.g. all samples per class gathered in a single file, but currently I'd (a) lose deduplication when only some samples change because the whole file gets reuploaded, (b) need to retrieve the whole file even when I only want a subset (which might be addressed by the work you've mentioned?), and (c) need to decide on the granularity and content slice of the file/archive manually (which might be unavoidable because the optimal strategy might depend on data access and mutation patterns, and I don't know whether there is a generic near-optimal solution to this problem; it feels a bit related to RDBMS index creation).

I'm afraid I won't be able to join our meeting today because I'm already on Easter vacation. But I think the main topic is a different one anyway. I'd be happy to continue discussing here, or if you think it makes sense to also chat about it synchronously, we could schedule another meeting. 🙂
Sure! First, the webdataset-based approach works well only for "append-only" use cases. For example, you would add each new batch of images as a new tar file to a class directory but not delete images. Physical file deletion has a huge overhead; logical deletion (without touching the archives) is not a problem, but it falls on the shoulders of the user.
Regarding (a): it depends on how you "package" the archive. It is possible not to lose deduplication and even to get local file caching (per image file).
Regarding (b): you can retrieve only the files you need.
Regarding (c): right, it is doable, and you can address files by the archive name and the image file name inside it.
Yes. "know the archive structure" from the above means "index".
Let me chat with your folks; that should clarify what the best way for us to communicate is 🙂 Happy Easter 🐰
Thanks for your detailed reply, this is very helpful. 🙏

Managing files in an archive, including batched download of adjacent files etc., is something you're working on adding to DVC, but it's not yet possible. Right?

The append-only property sounds like a reasonable requirement to me. Typically, new data becomes available, gets appended to a dataset, and becomes available in a new release. For structured data, when the data structure changes (new field/column, changed representation of a field/column, etc.), all samples might need to be updated and uploaded without an opportunity for deduplication anyway. The only case I'm still a bit concerned about is fixing some samples in a dataset, e.g. when NaN values remained in v1.0.0 and were fixed (by whatever method) in v1.0.1. In this case, only some samples would be updated. With one sample per file, only the updated files would get uploaded again, but with archives I imagine the whole archive would have to be uploaded although most samples might not have changed. Right?
Sounds good. 👌 Happy Easter to you, too. 🐰
Well, it is actually a DVC-sister project that we are working on. The new product is designed for genAI use cases.
Great! And you are right about the data fixes. However, the majority of fixes happen at the meta-data/label level rather than at the file/image level. Meta-data fixes are not an issue at all if you manage them separately from the data (which is a fairly common pattern).
It depends on the use case. In some cases, virtually removing a subset of images from a dataset is enough, without any changes to the files or archives. All of this can be done at the meta-data level.

PS: I had a great chat with your teammate recently. I hope to see you both in the next meeting 🙂
Hello, is there any news about this feature?
As @shcheklein suggested, we should consider splitting data into small blocks to track data changes more efficiently. Example: a giant file that has one line appended to it.
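A naive illustration of that example: split the file into fixed-size blocks and hash each block, so appending one line only changes (or adds) the final block while everything before it deduplicates. The block size is arbitrary; content-defined chunkers like those discussed above additionally handle inserts in the middle of the file.

```python
import hashlib

BLOCK = 4 * 1024 * 1024  # 4 MiB, arbitrary


def block_hashes(path: str):
    """Yield the md5 of each fixed-size block of the file at `path`."""
    with open(path, "rb") as f:
        while block := f.read(BLOCK):
            yield hashlib.md5(block).hexdigest()

# Appending one line to a big file leaves all earlier block hashes unchanged,
# so only the last block would need to be stored/uploaded again.
```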