SBT loading in memory #475

phiweger · 2018-05-19T12:30:44Z

Is there a way to load an SBT into memory once and then keep it there for various queries? I know that a Redis backend was toyed with at some point, but I am unsure if this was integrated into v2.0.

Thank you,
Adrian

luizirber · 2018-05-20T00:33:03Z

Not yet... I need something like this for wort, and was talking with @psdehal (since they implemented this for KBase, but there are still issues to solve).

phiweger · 2018-05-20T10:52:10Z

What is wort?

luizirber · 2018-05-20T20:03:44Z

@phiweger it's a webservice for computing/retrieving/searching sourmash signatures. I don't have much yet, but an overview is available at https://github.com/dib-lab/wort/blob/master/docs/arch.md and it is online (and horrible to navigate...) at https://wort.oxli.org

ctb · 2020-06-20T15:32:49Z

kind of related to #909, sharing an LCA index once for many processes.

phiweger · 2020-09-30T11:29:44Z

I now run into this problem a lot, especially when querying many signatures as part of larger workflows. The workflow manager will usually start many processes searching signatures, but with any reasonably sized SBT this crashes pretty quickly bc/ it tries to load one SBT into memory for each process. @ctb if I use the API to load the SBT and then python multiprocess queries against it, what will happen? ;)

phiweger · 2020-09-30T11:46:22Z

@luizirber how do you manage queries w/ wort? I assume you don't load the index once for each query?

luizirber · 2020-09-30T16:50:29Z

> how do you manage queries w/ wort? I assume you don't load the index once for each query?

So, the SRA search is cheating =]
I load all the queries in memory, and then each thread processes a chunk of the metagenome sigs by loading each metagenome sig, comparing it to all query sigs, and unloading the metagenome sig. In this way the memory consumption is pretty low.
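
To make the pattern above concrete, here is a minimal stdlib-only sketch. It is not wort's actual code: the JSON-list-of-hashes file format, the function names, and the plain set-based containment score are all assumptions made for illustration. The point is only the memory pattern: the small query sketches stay resident, while each large sketch is loaded, compared, and dropped.

```python
import json
from concurrent.futures import ThreadPoolExecutor


def load_sketch(path):
    """Load one sketch from disk (hypothetical format: a JSON list of integer hashes)."""
    with open(path) as fh:
        return frozenset(json.load(fh))


def containment(query, target):
    """Fraction of the query's hashes that are also present in the target."""
    return len(query & target) / len(query) if query else 0.0


def search_chunk(paths, queries, threshold):
    """Compare every in-memory query against each on-disk sketch in this chunk."""
    hits = []
    for path in paths:
        target = load_sketch(path)  # only one large sketch is held per worker
        for name, query in queries.items():
            score = containment(query, target)
            if score >= threshold:
                hits.append((path, name, score))
        del target  # drop the large sketch before loading the next one
    return hits


def search_all(paths, queries, threshold=0.5, n_workers=4):
    """Split the on-disk sketches into chunks and search each chunk in its own thread."""
    chunks = [paths[i::n_workers] for i in range(n_workers)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        results = pool.map(
            search_chunk,
            chunks,
            [queries] * n_workers,
            [threshold] * n_workers,
        )
        return sorted(hit for chunk_hits in results for hit in chunk_hits)
```

Peak memory here is roughly the query set plus one loaded sketch per worker. Note that threads in CPython will not speed up the set intersections themselves; the sketch only shows the load/compare/unload structure.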

> if I use the API to load the SBT and then python multiprocess queries against it, what will happen? ;)

It will probably mostly-sorta-kinda work. I'm a bit nervous because the SBT code loads data from disk dynamically, and in a multithreaded context this can lead to data races and other weirdness (there is no locking at any point).
(Incidentally, a large push for Rust in sourmash was exactly for this sort of case, but the SBT impl in Rust is not complete enough for general use yet =/)
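
For illustration only, this is roughly what locking around dynamic loads could look like. It is a generic sketch, not sourmash's SBT code: `LazyNodeStore` and its `loader` callable are hypothetical names, and the lock only protects threads inside a single process (separate processes would each keep their own cache).

```python
import threading


class LazyNodeStore:
    """Load nodes on first access and cache them, guarded by a lock.

    `loader` is a hypothetical callable mapping a node id to its data,
    standing in for "read this node from disk".
    """

    def __init__(self, loader):
        self._loader = loader
        self._cache = {}
        self._lock = threading.Lock()

    def get(self, node_id):
        # Holding the lock for the check and the load means two threads can
        # never load the same node at once or see a half-filled cache entry.
        with self._lock:
            if node_id not in self._cache:
                self._cache[node_id] = self._loader(node_id)
            return self._cache[node_id]
```

A single coarse lock like this serializes all loads, which is simple but can become a bottleneck; per-node locks would be a finer-grained alternative.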

luizirber · 2020-09-30T16:53:00Z

the latest branch has a --cache-size parameter in sourmash gather that can help with controlling how much memory is used: #1161

luizirber · 2020-11-10T05:53:33Z

hey, I think greyhound and #1226 actually help solve this too

ctb · 2021-06-25T20:51:12Z

with sourmash v4.1.0 the memory usage of SBTs has dramatically decreased; see #1370 (comment) specifically.

for in-memory single-process stuff, LCA_Database is a good choice, and it's fairly fast to create them dynamically for small to medium sized databases.

Finally, for read only SBTs, I very much doubt there would be any problems with sharing them between processes from disk.

phiweger · 2021-06-26T11:08:56Z

@ctb when I load an LCA db into mem with sourmash.load_file_as_index() and then use the .search() method, are LCA and SBT interchangeable? Like, when would I use one over the other? Thank you for clarifying.

ctb · 2021-06-26T12:58:30Z

Yep, they are interchangeable from an API perspective!

We have some (minimal :) documentation here,

https://sourmash.readthedocs.io/en/latest/command-line.html#indexed-databases

and there's an example of using/constructing an in-memory LCA_Database here:

https://github.com/dib-lab/charcoal/blob/latest/charcoal/compare_taxonomy.py#L177

Note that the lca_db.insert(...) function used there takes an optional identifier as an argument, but if the signatures are named sensibly you don't need to pass that in.

luizirber added the sbt label Jun 7, 2018

ctb mentioned this issue Jun 26, 2021

add explicit documentation about database types/formats #1293

Closed

ctb added the faq things to add to an FAQ or docs label Mar 30, 2022

ctb mentioned this issue May 3, 2022

[MRG] add advanced database docs #2025

Merged

ctb closed this as completed in #2025 May 4, 2022
