[MRG] add sqlite3 implementations for Index, CollectionManifest, and `LCA_Database` (#1808)

* switch to get_matching_sketches

* change default cache size

* count overlaps in SQL?

* initial addition of 'sig fileinfo'

* finish first-draft implementation of fileinfo and get_manifest

* cleanup and move over to sourmash_args

* add manifest and length support to LCA_Database

* add rebuild/no-rebuild args

* use BitArray to convert uint to int

* cleanup

* fix the things?

* cleanup

* more cleanup

* flag when scores are diff

* fix __len__ for zipfiles, __bool__ interpretation

* add more index, etc

* more cleanup

* correct for rust panic a la zip

* commit every so often...

* add some comments

* get basic manifest-generating machinery working

* update manifest stuff

* add bitstring in support of SqliteIndex

* more cleanup

* add more tests

* add conditions to _get_matching_sketches

* remove conditions

* remove errant raise

* update structure

* some commentary

* switch over to debug_literal

* switch to debug_literal; test tricky ordering

* add LCA database test for tricky ordering

* add test for jaccard ordering to SBTs

* add LCA database test for tricky ordering

* add test for jaccard ordering to SBTs

* add bitstring to setup

* factor out CollectionManifest_Sqlite

* some basic manifests

* add sqlite manifest rows interface

* minor refactor

* support sig manifest / test it

* move row insert into manifest class

* test creation of sqlite mf

* switch to explicit moltype

* cleanup and refactoring

* cleanup

* SQLite manifests are now first class

* pip cache should be looking at setup.cfg I think?

* and tox cache should be looking at setup.cfg, too

* try again/invalidate cache

* try again

* remove print

* fix some stuff

* even more

* add 'sourmash_versions' table

* test direct sqlmf creation & loading

* improve version checking

* test various insertion errors

* fix num support in sqlite manifests (but not index)

* add explicit validation code, to be removed later

* explicit check of 'num'

* add more docs/notes/annotations for work

* rename CollectionManifest_Sqlite to SqliteCollectionManifest

* preliminary victory over rankinfo

* provide generic LCA Database functionality via sqlite

* refactor and comment

* refactor and document

* add sqlite_utils

* cleanup

* parse out SqliteIndex.create

* rm comment

* add database_format to lca index

* get sql database output working for LCA index

* get all lca tests working on SQL version of LCA_Database

* add test_index_protocol

* add tests of indices after save/load

* match Index definition of __len__ in sbt

* more index tests

* add some generic manifest tests

* define abstract base class for CollectionManifest

* fix GTDB example, sigh

* test hashval_to_idx

* add actual test for min num in rankinfo

* update 'get_lineage_assignments' in lca_db

* update comment

* make lid_to_idx and idx_to_ident private

* moar comment

* add sqlite classes to protocol tests

* adjust protocol

* update to match protocol

* add, then hide, RevIndex test

* update the LCA_Database protocol

* SqliteCollectionManifest now passes all the tests

* update row check to ignore _ prefixes

* implement remaining lca_db protocol for sqlite

* fix up rankinfo for sqlite LCA_Database

* finish testing the rest of the Index classes

* cleanup

* upd

* cleanup LCA_Database creation

* backport 08ac110

* add sqlite loading to CollectionManifest

* update manifest writing to support SQL, too

* switch to using generic manifest.write_to_filename

* catch pre-existing sqlite DBs

* remove test for now-implemented func

* work through various merge implications

* switch away from a row tuple in CollectionManifest

* more clearly separate internals of LCA_Database from public API

* add saved/loaded manifest

* add test coverage for exceptions in LazyLoadedIndex

* add docstrings to manifest code

* add docstrings / comments

* fix sig check reliance on internal manifest mechanism

* fix picklist stuff when using Sqlite manifests

* add lots of debug stmts

* remove SQLite pickset as impractical

* remove some expensive debugs

* remove sql picklist code as too slow

* comments and cleanup

* much cleanup

* re-add debug_literal

* more cleanup

* comment

* fix 'num' select

* test and document locations()

* use names in namedtuple; add containment test

* add numerical values to jaccard order tests

* cleanup

* remove redundant tests

* test scaled=1 stuff pretty explicitly

* rename 'create_from_manifest' method

* cleanup

* add required_keys check

* check manifest equality only on required keys

* add required_keys check

* add index tests for LCA_SqliteDatabase

* constructor/etc refactoring

* add scaled/downsample test

* add downsample_scaled etc

* remove unused code

* cleanup

* update comment

* rename tables to have prefix sourmash_

* update with many a test

* fix diagnostic output during sourmash index #1949

* handle bad versions of stuff

* update/simplify version checking

* add append test

* add notes about further tests

* minor comment update

* fix after merge

* update table name for lineage db

* more docs

* implement loading of LCA_SqliteDatabases at command line

* cleanup and testing

* start adding some documentation

* add location and manifest properties to LCA_SqliteDatabase

* update

* update index protocol tests to check location, manifest

* add tests for fileinfo on all sql db variants

* add test for signatures_with_location

* upd

* add test of new-style lineage db file

* upd/cleanup

* try out inheritance instead of composition

* comment

* more cleanup

* clean up LCA_SqliteDatabase

* create some more tests...

* update checklist

* refactor and cleanup

* round out the tests a bit

* allow append

* cleanup, doc

* cleanup/simplify

* support picklists in LCA_Database.signatures

* fix up @ctb in LCA tests

* cleanup @ctb in test_cmd_signature

* add tests for picklist support in LCA_database.signatures()

* many minor updates

* more tests

* add more manifest tests

* add some final? tests

* one final test

* fix typo via @mr-eyes

* remove unnecessary PARSE_DECLTYPES

* add docs for creating sqldb

* do not allow overwrite/append to existing lca database

* Update src/sourmash/lca/lca_db.py

Co-authored-by: Mohamed Abuelanin <[email protected]>

* fix bug with duplicate lineages in LCA_SqliteDatabase

* fix test broken by duplicate lineage fix

Co-authored-by: Mohamed Abuelanin <[email protected]>
ctb and mr-eyes authored Apr 26, 2022
1 parent 5da5ede commit a4afb68
Showing 39 changed files with 4,462 additions and 251 deletions.
9 changes: 5 additions & 4 deletions .github/workflows/python.yml
@@ -1,3 +1,4 @@
# note: to invalidate caches, adjust the pip-v? and tox-v? numbers below.
name: Python tests

on:
@@ -35,9 +36,9 @@ jobs:
uses: actions/cache@v3
with:
path: ${{ steps.pip-cache.outputs.dir }}
- key: ${{ runner.os }}-pip-${{ hashFiles('**/setup.py') }}
+ key: ${{ runner.os }}-pip-v2-${{ hashFiles('**/setup.cfg') }}
restore-keys: |
- ${{ runner.os }}-pip-
+ ${{ runner.os }}-pip-v2-
- name: Install dependencies
run: |
@@ -64,9 +65,9 @@ jobs:
uses: actions/cache@v3
with:
path: .tox/
- key: ${{ runner.os }}-tox-${{ hashFiles('**/setup.py') }}
+ key: ${{ runner.os }}-tox-v2-${{ hashFiles('**/setup.cfg') }}
restore-keys: |
- ${{ runner.os }}-tox-
+ ${{ runner.os }}-tox-v2-
- name: Test with tox
run: tox
55 changes: 44 additions & 11 deletions doc/command-line.md
@@ -727,7 +727,7 @@ database. It can be used to combine multiple taxonomies into a single file,
as well as change formats between CSV and sqlite3.

The following command will take in two taxonomy files and combine them into
- a single taxonomy sqlite database.
+ a single taxonomy SQLite database.

```
sourmash tax prepare --taxonomy file1.csv file2.csv -o tax.db
@@ -931,6 +931,15 @@ As of sourmash 4.2.0, `lca index` supports `--picklist`, to
can be used to index a subset of a large collection, or to
exclude a few signatures from an index being built from a large collection.

As of sourmash 4.4.0, `lca index` can produce an _on disk_ LCA
database using SQLite. To prepare such a database, use
`sourmash lca index ... -F sql`.

All sourmash commands work with either type of LCA database (the
default JSON database, and the SQLite version). SQLite databases are
larger than JSON databases on disk but are typically much faster
to load and search, and use much less memory.
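To illustrate why an on-disk SQLite database can be searched without first loading everything into memory, here is a minimal sketch of counting hash overlaps in SQL (the commit message mentions "count overlaps in SQL?"). The table names, column names, and the helpers `build_demo_db` and `count_overlaps` are hypothetical and chosen for illustration; they are not sourmash's actual schema or API.

```python
import sqlite3


def build_demo_db(conn):
    """Create a tiny hash-to-sketch database (hypothetical schema)."""
    conn.execute("CREATE TABLE sketches (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute(
        "CREATE TABLE hashes (hashval INTEGER NOT NULL, sketch_id INTEGER NOT NULL)"
    )
    # The index lets SQLite find matching hashes without scanning the table.
    conn.execute("CREATE INDEX hashes_by_val ON hashes (hashval)")

    sketches = {"genomeA": [10, 20, 30], "genomeB": [20, 30, 40, 50]}
    for name, hashvals in sketches.items():
        cur = conn.execute("INSERT INTO sketches (name) VALUES (?)", (name,))
        conn.executemany(
            "INSERT INTO hashes (hashval, sketch_id) VALUES (?, ?)",
            [(h, cur.lastrowid) for h in hashvals],
        )
    conn.commit()


def count_overlaps(conn, query_hashes):
    """Return {sketch name: number of hashes shared with the query}."""
    placeholders = ",".join("?" * len(query_hashes))
    sql = (
        "SELECT sketches.name, COUNT(*) FROM hashes "
        "JOIN sketches ON sketches.id = hashes.sketch_id "
        f"WHERE hashes.hashval IN ({placeholders}) "
        "GROUP BY sketches.name"
    )
    return dict(conn.execute(sql, list(query_hashes)))


# An in-memory database keeps the demo self-contained;
# a real database would be a file on disk.
conn = sqlite3.connect(":memory:")
build_demo_db(conn)
overlaps = count_overlaps(conn, [20, 30, 40])
```

Only the rows matching the query hashes are read, which is roughly the property the documentation describes as fast loading and low-memory search.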

### `sourmash lca rankinfo` - examine an LCA database

The `sourmash lca rankinfo` command displays k-mer specificity
@@ -1399,6 +1408,14 @@ iterating over the signatures in the input file. This can be slow for
large collections. Use `--no-rebuild-manifest` to load an existing
manifest if it is available.

As of sourmash 4.4.0, `sig manifest` can produce a manifest in a fast
on-disk format (a SQLite database). SQLite manifests can be _much_
faster when working with very large collections of signatures.
To produce a SQLite manifest, use `sourmash sig manifest ... -F sql`.

All sourmash commands that work with manifests will accept both
CSV and SQLite manifest files.

### `sourmash signature check` - compare picklists and manifests

Compare picklists and manifests across databases, and optionally output matches
@@ -1452,7 +1469,7 @@ Briefly,

None of these commands currently support searching, comparing, or indexing
signatures with multiple ksizes or moltypes at the same time; you need
- to pick the ksize and moltype to use for your search. Where possible,
+ to pick the ksize and moltype to use for your query. Where possible,
scaled values will be made compatible.

### Selecting signatures
@@ -1549,9 +1566,10 @@ In addition to `sig extract`, the following commands support
### Storing (and searching) signatures

Backing up a little, there are many ways to store and search
- signatures. `sourmash` supports storing and loading signatures from JSON
- files, directories, lists of files, Zip files, and indexed databases.
- These can all be used interchangeably for sourmash operations.
+ signatures. `sourmash` supports storing and loading signatures from
+ JSON files, directories, lists of files, Zip files, custom indexed
+ databases, and SQLite databases. These can all be used
+ interchangeably for most sourmash operations.

The simplest is one signature in a single JSON file. You can also put
many signatures in a single JSON file, either by building them that
@@ -1567,7 +1585,7 @@ signatures from zip files. You can create a compressed collection of
signatures using `zip -r collection.zip *.sig` and then specify
`collection.zip` on the command line.

- ### Saving signatures, more generally
+ ### Choosing signature output formats

(sourmash v4.1 and later)

@@ -1583,6 +1601,7 @@ This behavior is triggered by the requested output filename --
* to save to gzipped JSON signature files, use `.sig.gz`;
* to save to a Zip file collection, use `.zip`;
* to save signature files to a directory, use a name ending in `/`; the directory will be created if it doesn't exist;
* to save to a SQLite database, use `.sqldb` (as of sourmash v4.4.0).
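The filename rules visible in this hunk can be sketched as a small dispatch function. This is an illustration only: `guess_output_format` and its return labels are hypothetical names, not sourmash functions, and the sketch covers just the extensions shown in the list above plus the JSON fallback.

```python
def guess_output_format(filename):
    """Map an output filename to a format label (hypothetical helper)."""
    if filename.endswith("/"):
        return "directory"
    if filename.endswith(".sig.gz"):
        return "gzipped JSON"
    if filename.endswith(".zip"):
        return "zip"
    if filename.endswith(".sqldb"):
        return "sqlite"
    # Anything unrecognized falls back to plain JSON signatures.
    return "JSON"


chosen = {
    name: guess_output_format(name)
    for name in ("out.sig.gz", "out.zip", "outdir/", "out.sqldb", "out.txt")
}
```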

If none of these file extensions is detected, output will be written
in the JSON `.sig` format, either to the provided output filename or
@@ -1614,22 +1633,36 @@ Indexed databases can make searching signatures much faster. SBT
databases are low memory and disk-intensive databases that allow for
fast searches using a tree structure, while LCA databases are higher
memory and (after a potentially significant load time) are quite fast.
SQLite databases (new in sourmash v4.4.0) are typically larger on disk
than SBTs and LCAs, but in turn are fast to load and support very low
memory search.

(LCA databases also directly permit taxonomic searches using `sourmash lca`
functions.)

Commands that take multiple signatures or collections of signatures
- will also work with databases.
+ will also work with indexed databases.

- One limitation of indexed databases is that both SBT and LCA database
- can only contain one "type" of signature (one ksize/one moltype at one
- scaled value). If the database signature type is incompatible with the
- other signatures, sourmash will complain appropriately.
+ One limitation of indexed databases is that they are all restricted
+ to certain kinds of signatures. Both SBT and LCA databases can only
+ contain one "type" of signature (one ksize/one moltype at one scaled
+ value). SQLite databases can contain multiple ksizes and moltypes, but
+ only at one scaled value. If the database signature type is
+ incompatible with the other signatures, sourmash will complain
+ appropriately.

In contrast, signature files, zip collections, and directory
hierarchies can contain many different types of signatures, and
compatible ones will be selected automatically.

Use the `sourmash index` command to create an SBT.

Use the `sourmash lca index` command to create an LCA database; the
database can be saved in JSON or SQL format with `-F json` or `-F sql`.

Use `sourmash sig cat <list of signatures> -o <output>.sqldb` to create
a SQLite indexed database.

### Combining search databases on the command line

All of the commands in sourmash operate in "online" mode, so you can
1 change: 1 addition & 0 deletions setup.cfg
@@ -43,6 +43,7 @@ install_requires =
scipy
deprecation>=2.0.6
cachetools>=4,<6
bitstring>=3.1.9,<4
python_requires = >=3.8

[bdist_wheel]
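The commit message ties the new `bitstring` dependency to the entries "use BitArray to convert uint to int" and "add bitstring in support of SqliteIndex". A likely motivation is that SQLite `INTEGER` columns hold signed 64-bit values, while sketch hash values are unsigned 64-bit. The sketch below shows the same bit-for-bit reinterpretation using the standard-library `struct` module instead of `bitstring`; the function names are hypothetical and this is not the code from the commit.

```python
import struct


def uint64_to_int64(value):
    """Reinterpret an unsigned 64-bit integer as signed (same bit pattern)."""
    return struct.unpack("<q", struct.pack("<Q", value))[0]


def int64_to_uint64(value):
    """Inverse conversion: recover the unsigned value after loading."""
    return struct.unpack("<Q", struct.pack("<q", value))[0]


big_hash = 2**64 - 1                    # too large for a signed 64-bit column
stored = uint64_to_int64(big_hash)      # value that fits in SQLite INTEGER
restored = int64_to_uint64(stored)      # round-trips to the original hash
```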
6 changes: 6 additions & 0 deletions src/sourmash/cli/lca/index.py
@@ -59,6 +59,12 @@ def subparser(subparsers):
'--fail-on-missing-taxonomy', action='store_true',
help='fail quickly if taxonomy is not available for an identifier',
)
subparser.add_argument(
'-F', '--database-format',
help="format of output database; default is 'json'",
default='json',
choices=['json', 'sql'],
)

add_ksize_arg(subparser, 31)
add_moltype_args(subparser)
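A standalone sketch of how an option like `-F/--database-format` behaves under `argparse`: `default` supplies the value when the flag is omitted, and `choices` rejects anything else. The parser here is a hypothetical stand-in (`lca-index-demo`), not the real `sourmash lca index` subparser.

```python
import argparse

# Hypothetical standalone parser mirroring the option added in this hunk.
parser = argparse.ArgumentParser(prog="lca-index-demo")
parser.add_argument(
    "-F", "--database-format",
    help="format of output database; default is 'json'",
    default="json",
    choices=["json", "sql"],
)

default_args = parser.parse_args([])           # flag omitted
sql_args = parser.parse_args(["-F", "sql"])    # flag given

# An unsupported value makes argparse print an error and exit.
try:
    parser.parse_args(["-F", "csv"])
    rejected = False
except SystemExit:
    rejected = True
```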
4 changes: 4 additions & 0 deletions src/sourmash/cli/search.py
@@ -56,6 +56,10 @@ def subparser(subparsers):
'-q', '--quiet', action='store_true',
help='suppress non-error output'
)
subparser.add_argument(
'-d', '--debug', action='store_true',
help='output debug information'
)
subparser.add_argument(
'--threshold', metavar='T', default=0.08, type=float,
help='minimum threshold for reporting matches; default=0.08'
7 changes: 7 additions & 0 deletions src/sourmash/cli/sig/check.py
@@ -55,6 +55,13 @@ def subparser(subparsers):
help='do not require a manifest; generate dynamically if needed',
action='store_true'
)
subparser.add_argument(
'-F', '--manifest-format',
help="format of manifest output file; default is 'csv'",
default='csv',
choices=['csv', 'sql'],
)

add_ksize_arg(subparser, 31)
add_moltype_args(subparser)
add_pattern_args(subparser)
7 changes: 6 additions & 1 deletion src/sourmash/cli/sig/manifest.py
@@ -40,7 +40,12 @@ def subparser(subparsers):
'--no-rebuild-manifest', help='use existing manifest if available',
action='store_true'
)

subparser.add_argument(
'-F', '--manifest-format',
help="format of manifest output file; default is 'csv'",
default='csv',
choices=['csv', 'sql'],
)

def main(args):
import sourmash
2 changes: 1 addition & 1 deletion src/sourmash/commands.py
@@ -441,7 +441,7 @@ def search(args):
from .search import (search_databases_with_flat_query,
search_databases_with_abund_query)

- set_quiet(args.quiet)
+ set_quiet(args.quiet, args.debug)
moltype = sourmash_args.calculate_moltype(args)
picklist = sourmash_args.load_picklist(args)
pattern_search = sourmash_args.load_include_exclude_db_patterns(args)
26 changes: 13 additions & 13 deletions src/sourmash/index/__init__.py
@@ -868,6 +868,15 @@ class MultiIndex(Index):
Note: this is an in-memory collection, and does not do lazy loading:
all signatures are loaded upon instantiation and kept in memory.
There are a variety of loading functions:
* `load` takes a list of already-loaded Index objects,
together with a list of their locations.
* `load_from_directory` traverses a directory to load files within.
* `load_from_path` takes an arbitrary pathname and tries to load it
as a directory, or as a .sig file.
* `load_from_pathlist` takes a text file full of pathnames and tries
to load them all.
Concrete class; signatures held in memory; builds and uses manifests.
"""
def __init__(self, manifest, parent, *, prepend_location=False):
@@ -1212,8 +1221,7 @@ def load(cls, location, *, prefix=None):
if not os.path.isfile(location):
raise ValueError(f"provided manifest location '{location}' is not a file")

- with open(location, newline='') as fp:
- m = CollectionManifest.load_from_csv(fp)
+ m = CollectionManifest.load_from_filename(location)

if prefix is None:
prefix = os.path.dirname(location)
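A loader that accepts a filename rather than an open CSV handle has to tell CSV and SQLite manifests apart. One way to do that, shown here as a sketch and not necessarily how `load_from_filename` works, is to check for the 16-byte header that every SQLite 3 database file begins with. `looks_like_sqlite` and the file names are hypothetical.

```python
import os
import sqlite3
import tempfile

# Every SQLite 3 database file starts with this 16-byte header string.
SQLITE_MAGIC = b"SQLite format 3\x00"


def looks_like_sqlite(path):
    """Return True if the file at `path` starts with the SQLite header."""
    with open(path, "rb") as fp:
        return fp.read(len(SQLITE_MAGIC)) == SQLITE_MAGIC


tmpdir = tempfile.mkdtemp()
sql_path = os.path.join(tmpdir, "manifest.db")
csv_path = os.path.join(tmpdir, "manifest.csv")

# Write a small SQLite file; creating a table forces the header to disk.
conn = sqlite3.connect(sql_path)
conn.execute("CREATE TABLE demo (x INTEGER)")
conn.commit()
conn.close()

# Write a small CSV file for comparison.
with open(csv_path, "w", newline="") as fp:
    fp.write("md5,name\nabc123,demo\n")

sql_detected = looks_like_sqlite(sql_path)
csv_detected = looks_like_sqlite(csv_path)
```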
@@ -1245,20 +1253,12 @@ def _signatures_with_internal(self):
manifest in this class.
"""
# collect all internal locations
- iloc_to_rows = defaultdict(list)
- for row in self.manifest.rows:
- iloc = row['internal_location']
- iloc_to_rows[iloc].append(row)
-
- # iterate over internal locations, selecting relevant sigs
- for iloc, iloc_rows in iloc_to_rows.items():
- # prepend with prefix?
+ picklist = self.manifest.to_picklist()
+ for iloc in self.manifest.locations():
+ # prepend location with prefix?
if not iloc.startswith('/') and self.prefix:
iloc = os.path.join(self.prefix, iloc)

- sub_mf = CollectionManifest(iloc_rows)
- picklist = sub_mf.to_picklist()

idx = sourmash.load_file_as_index(iloc)
idx = idx.select(picklist=picklist)
for ss in idx.signatures():
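The refactoring in this hunk stops grouping manifest rows by location and instead iterates over `self.manifest.locations()`, filtering each loaded index with a single picklist built from the whole manifest. The sketch below imitates the location handling with plain dictionaries. `locations` and `resolve` are hypothetical stand-ins written for illustration, not the sourmash methods.

```python
import os

# Hypothetical manifest rows; only the field used here is realistic.
rows = [
    {"internal_location": "a.sig", "md5": "111"},
    {"internal_location": "a.sig", "md5": "222"},
    {"internal_location": "/abs/b.sig", "md5": "333"},
]


def locations(rows):
    """Yield each distinct internal location once, in first-seen order."""
    seen = set()
    for row in rows:
        loc = row["internal_location"]
        if loc not in seen:
            seen.add(loc)
            yield loc


def resolve(iloc, prefix):
    """Prepend the prefix to relative locations, as the loop in the diff does."""
    if not iloc.startswith("/") and prefix:
        return os.path.join(prefix, iloc)
    return iloc


distinct = list(locations(rows))
resolved = [resolve(loc, "/data") for loc in distinct]
```

Each file is opened once even when several manifest rows point at it, and absolute paths are left alone.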