sqlite format

clobrother edited this page Apr 27, 2015 · 6 revisions

New file format design

Actually, I'd like to move to SQLite for storing hashes. It supports caching and random I/O (obviously), is extensible, and is a well-known project. I don't want to implement most of these features on my own when I can have someone else do it for me :) I will have to dust off my SQL skills. The random access in particular will be important for our find-new feature, as duperemove will want to update hashes in no particular order.

Interesting stuff about SQLite programming (mostly on insert performance): http://stackoverflow.com/questions/1711631/improve-insert-per-second-performance-of-sqlite
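One piece of advice commonly given for SQLite insert speed is to group many inserts into a single transaction instead of committing after each row. Here is a minimal sketch of that idea using Python's built-in sqlite3 module. The `hashes` table, its columns, and the sample values are placeholders made up for illustration, not a settled design, and `bulk_insert_hashes` is a hypothetical helper.

```python
import sqlite3

def bulk_insert_hashes(conn, rows):
    """Insert (digest, ino, subvolid, loff) rows inside one transaction."""
    # "with conn" commits once when the block succeeds and rolls back
    # on an exception, so all rows share a single commit.
    with conn:
        conn.executemany(
            "INSERT INTO hashes(digest, ino, subvolid, loff) VALUES (?, ?, ?, ?)",
            rows,
        )

# Hypothetical table, only here to make the sketch runnable.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE hashes(digest BLOB, ino INTEGER, subvolid INTEGER, loff INTEGER)"
)

# Made-up sample data: 1000 fake 32-byte digests for one file.
rows = [(bytes([i % 256]) * 32, 257, 5, i * 131072) for i in range(1000)]
bulk_insert_hashes(conn, rows)
```

Whether this matters for duperemove depends on how many hashes get written per run, so it would need measuring on real data.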

So we're designing tables here then. I'm thinking the following tables:

  • config table: basic key-value for things like

    • format version
    • which hash is used
    • blocksize
    • num files (maybe not needed? we might be able to query a table size)
    • num hashes (maybe not needed? we might be able to query a table size)
  • subvolumes table: info on btrfs subvolumes

    • subvol id
    • subvol uuid
    • subvol path (from btrfs root)
  • file info table: basically what we store in struct file_info

    • ino
    • subvolid
    • num_blocks
    • name
  • hashes table: basically what we store in struct block_hash with additional info to get back to a file

    • file id (ino / subvol): Should allow us to do a lookup to the file info table (why do we need this?)
    • digest: Index by this field? I think we probably want identical digests stored as close to each other as possible. However, indexing by file id will make the db most "streamable" (sequential read/write).
    • block flags
    • loff
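
The tables listed above could be sketched roughly as follows, again with Python's built-in sqlite3 module to keep it short. All table, column, and index names here are guesses for illustration, as are the config keys and sample values. The sketch creates an index on digest and one on file id, so both options from the digest bullet can be tried. It also shows that the file and hash counts can come from `COUNT(*)` rather than being stored in config.

```python
import sqlite3

# Hypothetical schema: names and column types are assumptions, not a final format.
SCHEMA = """
CREATE TABLE config(
    keyname TEXT PRIMARY KEY,
    keyval  TEXT
);
CREATE TABLE subvolumes(
    subvolid INTEGER PRIMARY KEY,
    uuid     BLOB,
    path     TEXT              -- path from the btrfs root
);
CREATE TABLE files(
    ino        INTEGER NOT NULL,
    subvolid   INTEGER NOT NULL,
    num_blocks INTEGER,
    filename   TEXT,
    PRIMARY KEY (ino, subvolid)
);
CREATE TABLE hashes(
    digest   BLOB NOT NULL,
    ino      INTEGER NOT NULL,
    subvolid INTEGER NOT NULL,
    loff     INTEGER NOT NULL,
    flags    INTEGER
);
CREATE INDEX idx_hashes_digest ON hashes(digest);
CREATE INDEX idx_hashes_file ON hashes(ino, subvolid);
"""

def open_db(path=":memory:"):
    """Open a database and create the sketch schema."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn

def duplicate_digests(conn):
    """Return digests that appear in more than one block."""
    cur = conn.execute(
        "SELECT digest FROM hashes GROUP BY digest HAVING COUNT(*) > 1 ORDER BY digest"
    )
    return [row[0] for row in cur]

def table_count(conn, table):
    """Row count for one of the known tables (replaces stored num files / num hashes)."""
    assert table in ("files", "hashes")
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]

# Made-up sample data: two files in one subvolume sharing one block digest.
d1, d2 = b"\x01" * 32, b"\x02" * 32
conn = open_db()
with conn:
    conn.execute("INSERT INTO config VALUES ('blocksize', '131072')")
    conn.execute("INSERT INTO subvolumes VALUES (5, NULL, '/')")
    conn.executemany(
        "INSERT INTO files VALUES (?, ?, ?, ?)",
        [(257, 5, 2, "a"), (258, 5, 1, "b")],
    )
    conn.executemany(
        "INSERT INTO hashes VALUES (?, ?, ?, ?, ?)",
        [(d1, 257, 5, 0, 0), (d2, 257, 5, 131072, 0), (d1, 258, 5, 0, 0)],
    )

dupes = duplicate_digests(conn)
num_files = table_count(conn, "files")
num_hashes = table_count(conn, "hashes")
```

Keeping ino and subvolid in the hashes table is what lets a duplicate digest be traced back to a row in the file info table, which may be one answer to the "why do we need this?" question above.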