sqlite format
Actually I'd like to move to sqlite for storing hashes. It supports caching and random I/O (obviously), is extensible, and is a well-known project. I don't want to implement most of these features on my own when I can have someone else do it for me :) I will have to dust off my SQL skills. The random access in particular will be important for our find-new feature, as duperemove will want to update hashes out of any particular order.
Interesting stuff about sqlite programming : http://stackoverflow.com/questions/1711631/improve-insert-per-second-performance-of-sqlite
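The main takeaway from that thread is to batch inserts into a single transaction (and optionally relax some pragmas). A minimal sketch of what that would look like for our hash inserts, using Python's stdlib sqlite3 just for illustration (the table and column names here are hypothetical, not a final schema):

```python
import sqlite3

db = sqlite3.connect(":memory:")
# Optional speed pragmas from the thread above -- trade durability for speed,
# which is fine for a regenerable hash cache.
db.execute("PRAGMA synchronous = OFF")
db.execute("PRAGMA journal_mode = MEMORY")
db.execute("CREATE TABLE hashes (digest BLOB, loff INTEGER)")

# Fake block hashes: (digest, logical file offset) pairs.
rows = [(b"\x00" * 16, i * 4096) for i in range(10000)]

# One transaction around all the inserts -- the key advice from the thread.
# Without this, sqlite commits (and fsyncs) once per INSERT.
with db:
    db.executemany("INSERT INTO hashes VALUES (?, ?)", rows)

count = db.execute("SELECT COUNT(*) FROM hashes").fetchone()[0]
```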
So we're designing tables here then. I'm thinking the following tables:
config table: basic key-value for things like
- format version
- which hash is used
- blocksize
- num files (maybe not needed? we might be able to query a table size)
- num hashes (maybe not needed? we might be able to query a table size)
subvolumes table: info on btrfs subvolumes
- subvol id
- subvol uuid
- subvol path (from btrfs root)
file info table: basically what we store in struct file_info
- ino
- subvolid
- num_blocks
- name
hashes table: basically what we store in struct block_hash with additional info to get back to a file
- file id (ino / subvol): Should allow us to do a lookup into the file info table (why do we need this?)
- digest: Index by this field? I think we probably want identical digests as close to each other as possible. However, indexing by file id will make the db most "streamable" (sequential read/write).
- block flags
- loff
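Putting the four tables together, here's one possible shape for the schema. This is just a sketch of the design above; all names and column types are assumptions, not duperemove's final on-disk format. The digest index resolves the question above in favor of clustering equal digests for dedupe lookups, while rows are still inserted in file order so writes stay mostly sequential:

```python
import sqlite3

# Hypothetical schema for the tables sketched above.
schema = """
CREATE TABLE config (
    key   TEXT PRIMARY KEY,   -- e.g. 'format_version', 'hash_type', 'blocksize'
    value TEXT
);
CREATE TABLE subvolumes (
    id   INTEGER PRIMARY KEY, -- subvol id
    uuid BLOB,                -- subvol uuid
    path TEXT                 -- subvol path from btrfs root
);
CREATE TABLE files (
    ino        INTEGER,
    subvolid   INTEGER,
    num_blocks INTEGER,
    name       TEXT,
    PRIMARY KEY (ino, subvolid)
);
CREATE TABLE hashes (
    ino      INTEGER,         -- (ino, subvolid) keys back into files
    subvolid INTEGER,
    digest   BLOB,
    flags    INTEGER,         -- block flags
    loff     INTEGER          -- logical offset in the file
);
-- Cluster equal digests together for duplicate lookups.
CREATE INDEX idx_hashes_digest ON hashes(digest);
"""

db = sqlite3.connect(":memory:")
db.executescript(schema)
tables = {r[0] for r in db.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'")}
```

Note that with this layout the file counts from the config table really could just be `SELECT COUNT(*) FROM files`, so those config keys may indeed be unnecessary.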