[MRG] add sqlite3 implementations for Index, CollectionManifest, and LCA_Database #1808
Codecov Report
@@ Coverage Diff @@
## latest #1808 +/- ##
==========================================
+ Coverage 83.53% 83.99% +0.46%
==========================================
Files 127 129 +2
Lines 14233 14937 +704
Branches 1946 2079 +133
==========================================
+ Hits 11890 12547 +657
- Misses 2070 2095 +25
- Partials 273 295 +22
|
SqliteIndex, a sqlite-based reverse index for storing and searching scaled signatures
…o add/sqlite_index
I am pulling updates from |
Is building SBT SQLite index supported in the command line? |
SQLite databases (unlike SBT or LCA indexes) can be created with I agree this is confusing and inconsistent 😆 . I'll think about what to do - at the very least, better docs would be a good idea! |
updated docs in 16b8b9b, and added some suggestions in #1949. |
Co-authored-by: Mohamed Abuelanin <[email protected]>
Thanks for catching the |
There are inconsistent results when running:
curl -L https://osf.io/bw8d7/download?version=1 -o delmont-subsample-sigs.tar.gz
tar xzf delmont-subsample-sigs.tar.gz
curl -O -L https://github.com/ctb/2017-sourmash-lca/raw/master/tara-delmont-SuppTable3.csv
sourmash lca index -f tara-delmont-SuppTable3.csv delmont.lca.json delmont-subsample-sigs/*.sig
sourmash lca index -F sql -f tara-delmont-SuppTable3.csv delmont.lca.sql delmont-subsample-sigs/*.sig
sourmash lca classify --db delmont.lca.json --query delmont-subsample-sigs/TARA_RED_MAG_00003.fa.gz.sig > json_results
sourmash lca classify --db delmont.lca.sql --query delmont-subsample-sigs/TARA_RED_MAG_00003.fa.gz.sig > sql_results
diff json_results sql_results
Using the SQLite database gave no matches, while there is a match using the JSON version. |
WOW, very nice catch! I gotta say I was pretty worried when I first saw this, but it turns out that rather than being some deep and structural flaw in how I was handling identifiers differently b/t SQL and JSON, it was just a straight up simple programming error 😅. Fixed in 52bc90b. |
Thank you!
Likewise! I was resisting the urge to debug it myself xD Glad it was straightforward. |
it took me longer to track it down than I expected since I was gearing up for a big debugging effort for a subtle effect 😆 |
…o add/sqlite_index
This was quite a big PR to review, and I may have missed things or subtle heuristic bugs, but I did enough testing and code inspection that I feel confident merging it.
I think it would be good to wait for #2000 to be merged into this PR before merging to latest.
yay, thank you!
I'm going to merge it now, tho, since I haven't decided on the scope of #2000 yet; might be bigger than just a |
This PR adds three new SQLite-based classes to sourmash: SqliteIndex, SqliteCollectionManifest, and LCA_SqliteDatabase. These additions provide the following features (in order of their estimated importance to the future world domination of sourmash):
Fixes #1807
Fixes #1930
Fixes #948 by providing a faster LCA_Database
Fixes #821
Fixes #1111 by using SQLite rather than JSON.
Provides one solution for #909, a shared LCA index for many processes.
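The last item rests on a general SQLite property: several readers can open the same database file at once. The following is a minimal sketch of that idea using only Python's standard sqlite3 module. The table name demo_lca, its columns, and the helper functions are hypothetical and are not sourmash's actual schema or API; two independent read-only connections stand in for two separate processes.

```python
import os
import pathlib
import sqlite3
import tempfile


def build_demo_db(path):
    """Create a tiny database with a hypothetical hash -> lineage table."""
    con = sqlite3.connect(path)
    con.execute("CREATE TABLE demo_lca (hashval INTEGER PRIMARY KEY, lineage TEXT)")
    con.executemany(
        "INSERT INTO demo_lca VALUES (?, ?)",
        [(1, "d__Bacteria"), (2, "d__Archaea")],
    )
    con.commit()
    con.close()


def open_read_only(path):
    """Open the database read-only, as an independent reader would."""
    # uri=True enables SQLite's file: URI syntax; mode=ro refuses writes.
    uri = pathlib.Path(path).as_uri() + "?mode=ro"
    return sqlite3.connect(uri, uri=True)


def lookup(con, hashval):
    """Return the lineage stored for hashval, or None if it is absent."""
    row = con.execute(
        "SELECT lineage FROM demo_lca WHERE hashval = ?", (hashval,)
    ).fetchone()
    return row[0] if row else None


with tempfile.TemporaryDirectory() as tmpdir:
    db_path = os.path.join(tmpdir, "demo.sqldb")
    build_demo_db(db_path)

    # Two connections to the same file, each unaware of the other.
    reader_a = open_read_only(db_path)
    reader_b = open_read_only(db_path)
    results = (lookup(reader_a, 1), lookup(reader_b, 2), lookup(reader_b, 3))
    reader_a.close()
    reader_b.close()
```

Because neither reader loads the whole table into memory, each process pays only for the rows it queries, which is presumably what makes a shared on-disk index attractive compared with every process parsing its own copy of a JSON file.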
Highlights
SqliteIndex is a SQLite based Index class that:
SqliteCollectionManifest is a SQLite based Manifest class that:
CollectionManifest in being loadable via the CLI in a StandaloneManifestIndex as a first-class Index.
LCA_SqliteDatabase is a SQLite-based LCA class built on top of SqliteIndex and LineageDB_Sqlite that supports all of the LCA commands.
This PR introduces much of the remaining functionality from "manifests of manifests" #1652 #1685 into sourmash, after merge of StandaloneManifestIndex in #1891.
Reading and writing
All classes are fully integrated into the sourmash command-line functionality.
Collections of all three types can be read as regular ol' Index classes using sourmash.load_file_as_index(...).
SQLite databases can be output with -o <name>.sqldb, and are read natively on the command line.
SQLite manifests can be output by passing -F sql to sourmash sig manifest, and will be read natively by sourmash as StandaloneManifestIndex.
SQLite LCA databases can be output by passing -F sql to sourmash lca index.
Other additions
This PR also:
-d/--debug flag to sourmash search
LCA_Database.signatures(...) didn't pay attention to picklists.
Implementation notes
SqliteIndex is built on SqliteCollectionManifest, and SqliteCollectionManifest is usable without SqliteIndex.
SqliteIndex and SqliteCollectionManifest share the sketches table.
SqliteIndex supports a scaled of 1 - that is, direct storage of unsigned 64-bit numbers! - via the bitstring library, which converts unsigned long longs into signed long longs that sqlite can store. This is a new dependency.
SQLite databases, and backwards compatibility
This PR also adds a required table to all sourmash SQLite databases: the table sourmash_internal contains key, value pairs that document which versions of SqliteIndex, SqliteManifest, and SqliteLineage (taxonomy) are supported by the database. See the module sourmash.sqlite_utils for supporting code.
The only previous use of SQLite in sourmash was in taxonomy databases, for the sourmash.tax module, output by sourmash tax prepare. This code used the table taxonomy. Moving forward, this table has been renamed to sourmash_taxonomy, and taxonomy SQLite databases also contain the sourmash_internal table with a SqliteLineage entry. The previous use is fully supported by backwards compatibility code.
Technical notes
(the below is also enshrined in the docstring in index/sqlite_index.py) -
TODO testing: test internal and command line for,
-F sql to sig check for manifest output
SaveSignatures SQLite class repr.
db_outfile += '.lca.json' line in lca_db.py
raise Exception(f"unknown save format for LCA_Database: '{format}'") in lca_db.py
append= True in manifest.py
__eq__ in CollectionManifest where manifests are NOT equal
sourmash_internal SQLite table, and also how the various sqlite things are compatible.
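Two points from the notes above can be sketched together: storing unsigned 64-bit hash values in SQLite's signed INTEGER type, and reading a sourmash_internal-style key, value table. This is only an illustration under stated assumptions. The PR uses the bitstring library for the conversion, whereas this sketch swaps in plain integer arithmetic; the column names key and value are inferred from the "key, value pairs" wording; the demo_hashes table, the helper names, and the '1.0' version string are hypothetical placeholders.

```python
import sqlite3

MAX_SIGNED = 2**63 - 1  # largest value SQLite's INTEGER type can hold
TWO_64 = 2**64


def to_signed64(u):
    """Map an unsigned 64-bit integer to the signed integer with the same bits."""
    if not 0 <= u < TWO_64:
        raise ValueError("expected an unsigned 64-bit integer")
    return u - TWO_64 if u > MAX_SIGNED else u


def to_unsigned64(s):
    """Invert to_signed64."""
    return s + TWO_64 if s < 0 else s


def read_internal(con):
    """Return sourmash_internal as a dict, or {} if the table is absent."""
    found = con.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name = 'sourmash_internal'"
    ).fetchone()
    if found is None:
        return {}
    return dict(con.execute("SELECT key, value FROM sourmash_internal"))


con = sqlite3.connect(":memory:")

# Assumed layout: one row per component, with a placeholder version string.
con.execute("CREATE TABLE sourmash_internal (key TEXT UNIQUE, value TEXT)")
con.execute("INSERT INTO sourmash_internal VALUES ('SqliteIndex', '1.0')")

# Hypothetical table of hash values; not the PR's actual schema.
con.execute("CREATE TABLE demo_hashes (hashval INTEGER NOT NULL)")

big_hash = 2**64 - 1  # too large to bind directly as a SQLite INTEGER
con.execute("INSERT INTO demo_hashes VALUES (?)", (to_signed64(big_hash),))

(stored,) = con.execute("SELECT hashval FROM demo_hashes").fetchone()
recovered = to_unsigned64(stored)
versions = read_internal(con)
con.close()
```

The mapping is a bijection, so equality lookups on converted values behave exactly as they would on the original hashes; only ordering comparisons across the sign boundary would differ.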