Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG] add sqlite3 implementations for Index, CollectionManifest, and LCA_Database #1808

Merged
merged 281 commits into from
Apr 26, 2022

Conversation

ctb
Copy link
Contributor

@ctb ctb commented Jan 24, 2022

This PR adds three new SQLite-based features to sourmash, SqliteIndex, SqliteCollectionManifest, and LCA_SqliteDatabase. These new features provide the following features (in order of their estimated importance to the future world domination of sourmash):

  • fast SQLite-based manifest support for extremely large collections
  • low-memory, low-latency, and multiprocess LCA databases for better LCAing,
  • a more unified approach to SQLite database handling in sourmash,
  • a reasonably nice alternative to other indexed collections;

Fixes #1807
Fixes #1930
Fixes #948 by providing a faster LCA_Database
Fixes #821
Fixes #1111 by using SQLite rather than JSON.

Provides one solution for #909, a shared LCA index for many processes.

Highlights

SqliteIndex is a SQLite based Index class that:

  • stores sketches and hashes in a sqlite db;
  • is restricted to storing scaled sketches with a single scaled, albeit with multiple ksizes and moltypes;
  • does efficient, on-the-fly downsampling of sketches;

SqliteCollectionManifest is a SQLite based Manifest class that:

  • stores manifest entries in sqlite;
  • handles selection and picklists appropriately, in SQL (so, fast!);
  • joins CollectionManifest in being loadable via the CLI in a StandaloneManifestIndex as a first-class Index.

LCA_SqliteDatabase is a SQLite-based LCA class built on top of SqliteIndex and LineageDB_Sqlite that supports all of the LCA commands.

This PR introduces much of the remaining functionality from "manifests of manifests" #1652 #1685 into sourmash, after merge of StandaloneManifestIndex in #1891.

Reading and writing

All classes are fully integrated into the sourmash command-line functionality.

Collections of all three types can be read as regular ol' Index classes using sourmash.load_file_as_index(...).

SQLite databases can be output with -o <name>.sqldb, and are read natively on the command line.

SQLite manifests can be output by passing -F sql to sourmash sig manifest, and will be read natively by sourmash as StandaloneManifestIndex.

SQLite LCA databases can be output by passing -F sql to sourmash lca index.

Other additions

This PR also:

  • fixes GitHub workflow caching to track setup.cfg, and provides a version number to manually increment when new dependencies are added;
  • adds the -d/--debug flag to sourmash search
  • fixes a bug where LCA_Database.signatures(...) didn't pay attention to picklists.

Implementation notes

SqliteIndex is built on SqliteCollectionManifest, and SqliteCollectionManifest is usable without SqliteIndex.

SqliteIndex and SqliteCollectionManifest share the sketches table.

SqliteIndex supports a scaled of 1 - that is, direct storage of unsigned 64-bit numbers! - via the bitstring library, which converts unsigned long longs into signed long longs that sqlite can store. This is a new dependency.

SQLite databases, and backwards compatibility

This PR also adds a required table to all sourmash SQLite databases: the table sourmash_internal contains key, value pairs that document what version of SqliteIndex, SqliteManifest, and SqliteLineage (taxonomy) are supported by the database. See the module sourmash.sqlite_utils for supporting code.

The only previous use of SQLite in sourmash was in taxonomy databases, for the sourmash.tax module, output by sourmash tax prepare. This code used the table taxonomy. Moving forward this table has been renamed to sourmash_taxonomy and taxonomy SQLite databases also contain the sourmash_internal table with a SqliteLineage entry. The previous use is fully supported by backwards compatibility code.

Technical notes

(the below is also enshrined in the docstring in index/sqlite_index.py) -

sqlite3 based Index, CollectionManifest, and LCA_Database
implementations.

These classes support a variety of flexible and fast on-disk storage,
search, and retrieval functions.

SqliteIndex stores full scaled signatures; sketches are stored as
reverse-indexed collections of hashes. Search is optimized via the
reverse index. Num and abund sketches are not supported. All scaled
values must be the same upon insertion. Multiple moltypes _are_
supported.

SqliteCollectionManifest provides a full implementation of the
manifest API. It can store details for all signature types. When used
as part of a SqliteIndex database, it does not support independent
insertion.

LCA_SqliteDatabase builds on top of SqliteIndex and LineageDB_Sqlite
(in the tax submodule) to provide a full on-disk implementation of
LCA_Database.

Using these classes
-------------------

These classes are fully integrated into sourmash loading.
Internally, use `sqlite_index.load_sqlite_index(...)` to load a specific
file; this will return the appropriate SqliteIndex, StandaloneManifestIndex,
or LCA_Database object.
Use `CollectionManifest.load_from_filename(...)` to load the manifest
directly as a manifest object.

Implementation Details
----------------------

SqliteIndex:
* Hashes with values above MAX_SQLITE_INT=2**63-1 are transformed into
  signed long longs upon insertion, and then back into ulong longs upon
  retrieval.
* Hash overlap is calculated via a SELECT.
* SqliteIndex relies on SqliteCollectionManifest for manifest functionality,
  including signature selection and picklists.
SqliteCollectionManifest:
* each object maintains info about whether it is being "managed" by a
  SqliteIndex class or not. If it is, `_insert_row(...)` cannot be
  called directly.
* `select(...)` operates directly with SQL queries, except for
  picklist selection, which involves inspect each manifest row in
  Python. In addition to being (much) simpler, this ends up being
  faster in some important real world situations, even for millions of
  rows!
* filter_on_rows and filter_on_columns also both operate in Python,
  not SQL.
* for this reason, the `locations()` method returns a superset of
  locations.  This is potentially very significant if you do a select
  with a picklist that ignores most sketches - the `locations()`
  method will ignore the picklist.
  
Limitations:
* all of these classes share a single connection object, and it could
  get confusing quickly if you simultaneously insert and query. We suggest
  separating creation and insertion. That having been said, these databases
  should work fine for many simultaneous queries; just don't write :).

TODO testing: test internal and command line for,

  • creating an index
  • loading an index
  • loading an index as a manifest (should work)
  • loading an index as a standalone manifest and inserting (should fail)
  • creating a manifest, using from CLI
  • loading a manifest as a standalone index
  • loading a manifest as a standalone index in wrong directory
  • loading a manifest as a standalone index and insert (should succeed)
  • loading/using a checked-in index
  • loading/using a checked-in manifest
  • loading a lineage db with old table
  • loading a lineage db with new table name/sourmash info
  • loading a checked-in lineage db with old table
  • loading a checked-in lineage db with new table
  • do some realistic benchmarking of SqliteIndex and LCA_Database
  • check LCA database is loaded by load_sqlite_index.
  • check 'summarize' output on all three
  • test identifiers with '.' in sql lca dbs...
  • test LCA_SqliteDatabase.open with a SqliteIndex.
  • test LCA_SqliteDatabase.open with an empty SqliteIndex
  • implement update lca DB/on disk insert stuff for lca index?
  • test LCA_SqliteDatabase with a different lineagedb class?
  • add -F sql to sig check for manifest output
  • test SaveSignatures SQLite class repr.
  • test db_outfile += '.lca.json' line in lca_db.py
  • test raise Exception(f"unknown save format for LCA_Database: '{format}'") in lca_db.py
  • test append= True in manifest.py
  • test __eq__ in CollectionManifest where manifests are NOT equal`p
  • test sqlite_utils/bad version code
  • test tax_utils/bad version code
  • test tax_utils/no version code
  • Write more about new sourmash_internal SQLite table, and also how the various sqlite things are compatible.
  • document new taxonomy foo in PR description

@codecov
Copy link

codecov bot commented Jan 24, 2022

Codecov Report

Merging #1808 (9e616dd) into latest (5da5ede) will increase coverage by 0.46%.
The diff coverage is 93.39%.

@@            Coverage Diff             @@
##           latest    #1808      +/-   ##
==========================================
+ Coverage   83.53%   83.99%   +0.46%     
==========================================
  Files         127      129       +2     
  Lines       14233    14937     +704     
  Branches     1946     2079     +133     
==========================================
+ Hits        11890    12547     +657     
- Misses       2070     2095      +25     
- Partials      273      295      +22     
Flag Coverage Δ
python 91.56% <93.39%> (+0.12%) ⬆️
rust 65.16% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
src/sourmash/sourmash_args.py 93.20% <89.65%> (-0.21%) ⬇️
src/sourmash/index/sqlite_index.py 91.72% <91.72%> (ø)
src/sourmash/manifest.py 94.81% <94.28%> (-0.28%) ⬇️
src/sourmash/cli/lca/index.py 100.00% <100.00%> (ø)
src/sourmash/cli/search.py 100.00% <100.00%> (ø)
src/sourmash/cli/sig/check.py 100.00% <100.00%> (ø)
src/sourmash/cli/sig/manifest.py 100.00% <100.00%> (ø)
src/sourmash/commands.py 88.37% <100.00%> (ø)
src/sourmash/index/__init__.py 97.07% <100.00%> (-0.03%) ⬇️
src/sourmash/lca/command_index.py 89.90% <100.00%> (+0.40%) ⬆️
... and 6 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5da5ede...9e616dd. Read the comment docs.

@mr-eyes
Copy link
Member

mr-eyes commented Apr 23, 2022

I am pulling updates from latest to see how #1976 will affect the tests on gh-actions.

@mr-eyes
Copy link
Member

mr-eyes commented Apr 24, 2022

Is building SBT SQLite index supported in the command line?
sourmash index does not contain -F {json,sql}, --database-format {json,sql}.

src/sourmash/lca/lca_db.py Outdated Show resolved Hide resolved
@ctb
Copy link
Contributor Author

ctb commented Apr 24, 2022

Is building SBT SQLite index supported in the command line? sourmash index does not contain -F {json,sql}, --database-format {json,sql}`.

sourmash index only builds SBTs; we've got an issue about changing that (#949).

SQLite databases (unlike SBT or LCA indexes) can be created with sourmash sig cat ... -o xyz.sqldb.

I agree this is confusing and inconsistent 😆 . I'll think about what to do - at the very least, better docs would be a good idea!

@ctb
Copy link
Contributor Author

ctb commented Apr 24, 2022

Is building SBT SQLite index supported in the command line? sourmash index does not contain -F {json,sql}, --database-format {json,sql}`.

I agree this is confusing and inconsistent 😆 . I'll think about what to do - at the very least, better docs would be a good idea!

updated docs in 16b8b9b, and added some suggestions in #1949.

@ctb
Copy link
Contributor Author

ctb commented Apr 24, 2022

Thanks for catching the lca index problems with running twice!

@mr-eyes
Copy link
Member

mr-eyes commented Apr 24, 2022

There are inconsistent results when running lca classify on same lca database built with both SQL and json. You can reproduce here:

curl -L https://osf.io/bw8d7/download?version=1 -o delmont-subsample-sigs.tar.gz
tar xzf delmont-subsample-sigs.tar.gz
curl -O -L https://github.com/ctb/2017-sourmash-lca/raw/master/tara-delmont-SuppTable3.csv
sourmash lca index -f tara-delmont-SuppTable3.csv delmont.lca.json delmont-subsample-sigs/*.sig
sourmash lca index -F sql -f tara-delmont-SuppTable3.csv delmont.lca.sql delmont-subsample-sigs/*.sig
sourmash lca classify --db delmont.lca.json --query delmont-subsample-sigs/TARA_RED_MAG_00003.fa.gz.sig > json_results
sourmash lca classify --db delmont.lca.sql --query delmont-subsample-sigs/TARA_RED_MAG_00003.fa.gz.sig > sql_results
diff json_results sql_results

Using the SQLite database gave no matches, while there is a match using the JSON version.

@ctb
Copy link
Contributor Author

ctb commented Apr 25, 2022

There are inconsistent results when running lca classify on same lca database built with both SQL and json.

WOW, very nice catch!

I gotta say I was pretty worried when I first saw this, but it turns out that rather than being some deep and structural flaw in how I was handling identifiers differently b/t SQL and JSON, it was just a straight up simple programming error 😅.

Fixed in 52bc90b.

@mr-eyes
Copy link
Member

mr-eyes commented Apr 25, 2022

WOW, very nice catch!

Thank you!

I gotta say I was pretty worried when I first saw this.

Likewise! and was resisting myself to debugging xD Glad it was straightforward.

@ctb
Copy link
Contributor Author

ctb commented Apr 25, 2022

Likewise! and was resisting myself to debugging xD Glad it was straightforward.

it took me longer to track it down than I expected since I was gearing up for a big debugging effort for a subtle effect 😆

@ctb ctb requested a review from mr-eyes April 26, 2022 18:45
Copy link
Member

@mr-eyes mr-eyes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was a quite big PR to review, maybe I missed things or complicated-heuristical bugs while reviewing but I did a lot of testing and code inspection that I feel confident to merge.

I think it's good to wait for merging #2000 into this PR before merging to latest.

@ctb
Copy link
Contributor Author

ctb commented Apr 26, 2022

This was a quite big PR to review, maybe I missed things or complicated-heuristical bugs while reviewing but I did a lot of testing and code inspection that I feel confident to merge.

yay, thank you!

I think it's good to wait for merging #2000 into this PR before merging to latest.

I'm going to merge it now, tho, since I haven't decided on the scope of #2000 yet; might be bigger than just a convert command!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants