split database construction and release processes; provide database catalogs #1569

Closed
ctb opened this issue Jun 5, 2021 · 9 comments

Comments

@ctb
Contributor

ctb commented Jun 5, 2021

Over in sourmash_databases, we have pipelines that sketch genomes and produce zipfile collections at various ksizes, moltypes, etc.

Separately, we have been taking these zipfile collections and constructing .sbt.zip and lca.json.gz indexed databases from them. This is really nice and easy now! (sourmash index out.sbt.zip in.zip)

I think we should split these processes formally and automate the latter with snakemake. The latter process would:

  • take as input a CSV or YAML file containing the collection name, the zipfile info with ksize and moltype, and their taxonomy spreadsheets;
  • produce all standard indices (currently sbt.zip and lca.json.gz);
  • create content catalogs and validate content lists with sourmash sig describe;
  • upload the latest version of the databases;
  • (maybe) produce a database catalog that could be used to do things like
    • search for available databases/releases with a new sourmash subcommand
    • automatically download them
    • find databases that have a particular accession or genome in them (based on the catalog, not the signatures ;)
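
A minimal sketch of the index-building step in the list above, assuming a config CSV with hypothetical columns `name`, `zipfile`, and `taxonomy`. It only builds command lines and does not run them. The `sourmash index out.sbt.zip in.zip` form is the one quoted earlier in this comment; the argument order used for `sourmash lca index` (taxonomy CSV, output, input) is an assumption and should be checked against the installed version. The file names in the example config are illustrative only.

```python
import csv
import io


def index_commands(config_csv_text):
    """Return the sourmash command lines needed to index each collection.

    Assumes hypothetical CSV columns: name, zipfile, taxonomy.
    """
    commands = []
    for row in csv.DictReader(io.StringIO(config_csv_text)):
        name, zipfile, taxonomy = row["name"], row["zipfile"], row["taxonomy"]
        # SBT index, following `sourmash index out.sbt.zip in.zip`.
        commands.append(["sourmash", "index", f"{name}.sbt.zip", zipfile])
        # LCA index; argument order is an assumption, check `sourmash lca index -h`.
        commands.append(
            ["sourmash", "lca", "index", taxonomy, f"{name}.lca.json.gz", zipfile]
        )
    return commands


# Illustrative config; these file names are placeholders, not real releases.
EXAMPLE_CONFIG = """\
name,zipfile,taxonomy
gtdb-rs202.genomic.k31,gtdb-rs202.genomic.k31.zip,gtdb-rs202.taxonomy.csv
"""

if __name__ == "__main__":
    for cmd in index_commands(EXAMPLE_CONFIG):
        print(" ".join(cmd))
```

In a snakemake workflow each returned command would more naturally become its own rule, so that a failed index build can be rerun without redoing the others.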

re #991 (distributed as bdbags?) and #1511 (what databases should we provide?) and maybe also #1352 (manifests)

@ctb
Contributor Author

ctb commented Jun 5, 2021

oh, and I think we might need database versions, too...? or at least md5sum hashes so we can tell people they have the wrong version of a database.
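
A minimal sketch of the md5sum check mentioned above, using only the Python standard library. The function names are hypothetical, and where the expected hash would be published (for example, in a database catalog) is left open.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the md5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fp:
        # Read in chunks so multi-gigabyte databases do not need to fit in memory.
        for chunk in iter(lambda: fp.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_database(path, expected_md5):
    """Return True if the local file matches the published md5 hash."""
    return md5_of_file(path) == expected_md5.lower()
```

A mismatch would tell the user that their local copy is not the release they think it is, without needing an explicit version number inside the file.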

@ctb
Contributor Author

ctb commented Jun 12, 2021

note to self: distribute outputs of sourmash sig describe --csv catalog.csv, which will then be useful as picklists :)
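
A minimal sketch of how a distributed catalog could answer "which entries match this accession?" without opening any signatures. It assumes the catalog CSV has `name` and `md5` columns; the exact column set written by `sourmash sig describe --csv` should be checked, and the md5 values below are placeholders rather than real hashes.

```python
import csv
import io


def find_accession(catalog_csv_text, accession):
    """Return (name, md5) for catalog rows whose name begins with the accession.

    Assumes the catalog has `name` and `md5` columns and that the accession
    is the first whitespace-separated word of the name.
    """
    hits = []
    for row in csv.DictReader(io.StringIO(catalog_csv_text)):
        if row["name"].split(" ")[0] == accession:
            hits.append((row["name"], row["md5"]))
    return hits


# Placeholder catalog: the md5 values are made up for illustration.
CATALOG = """\
name,md5,ksize,moltype
GCF_000018865.1 s__Chloroflexus aurantiacus,placeholder-md5-a,31,DNA
CP001941 example record,placeholder-md5-b,31,DNA
"""

if __name__ == "__main__":
    print(find_accession(CATALOG, "GCF_000018865.1"))
```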

@ctb
Contributor Author

ctb commented Jun 26, 2021

see Snakefile etc, #1511 (comment)

I think it would be good to provide some minimal benchmarks with each database/release in terms of memory usage and so on, too.

@ctb
Contributor Author

ctb commented Mar 16, 2022

Progress!

I think the next step will be to add identifier filtering for the genbank script.

Using the latest code in https://github.com/ctb/2022-sourmash-sketchfrom, all of the below examples produce a CSV file that's compatible with sourmash sketch fromfile. 🎉

They also do the Right Thing with respect to names, so the sequences end up being named properly. 🎉

make a fromfile CSV from genbank genome/protein files

% ./genbank-to-fromfile.py ncbi-assemblies/* -o xyz.csv -t gtdb-rs202.taxonomy.v2.db 
processing file 'ncbi-assemblies/GCF_000018865.1_ASM1886v1_genomic.fna.gz'
(new record for name 'GCF_000018865.1 s__Chloroflexus aurantiacus')
processing file 'ncbi-assemblies/GCF_000018865.1_ASM1886v1_protein.faa.gz'
(merging into existing record)
---
wrote 1 entries to 'xyz.csv'
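
A rough sketch of the merge behavior shown in the log above, where a genome file and a protein file with the same accession end up in one record. This is not the code in genbank-to-fromfile.py: it assumes NCBI-style file names, uses the bare accession as the name (the real script also adds the species from the taxonomy database), and assumes the `name`, `genome_filename`, and `protein_filename` columns that `sourmash sketch fromfile` reads, as I understand the format.

```python
import csv
import io
import os
import re

# NCBI assembly accessions such as GCF_000018865.1 at the start of a filename.
ACCESSION = re.compile(r"^(GC[AF]_\d+\.\d+)")


def build_fromfile_rows(paths):
    """Group genome (.fna) and protein (.faa) files by assembly accession."""
    records = {}
    for path in paths:
        match = ACCESSION.match(os.path.basename(path))
        if match is None:
            continue  # not an NCBI-style assembly filename; skip it
        ident = match.group(1)
        # The first file seen creates the record; later files merge into it.
        rec = records.setdefault(
            ident, {"name": ident, "genome_filename": "", "protein_filename": ""}
        )
        if "_genomic.fna" in path:
            rec["genome_filename"] = path
        elif "_protein.faa" in path:
            rec["protein_filename"] = path
    return list(records.values())


def write_fromfile_csv(rows):
    """Serialize rows to CSV text with the three assumed fromfile columns."""
    out = io.StringIO()
    writer = csv.DictWriter(
        out,
        fieldnames=["name", "genome_filename", "protein_filename"],
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()
```

Called with the two `ncbi-assemblies/GCF_000018865.1_*` paths from the log, `build_fromfile_rows` returns a single record that carries both file names.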

make a fromfile CSV from FASTA files based on record names

note: fasta-to-fromfile.py autodetects sequence type.

% ./fasta-to-fromfile.py podar-ref/[12].fa -o podar.csv
processing file 'podar-ref/1.fa'
(new record for identifier 'CP001941' moltype=DNA)
processing file 'podar-ref/2.fa'
(new record for identifier 'CP001071' moltype=DNA)
---
wrote 2 entries to 'podar.csv'

make a fromfile CSV from FASTA files based on filename

% ./fasta-to-fromfile.py podar-ref/[12].fa -o podar.csv --ident-from-filename
processing file 'podar-ref/1.fa'
(new record for identifier '1' moltype=DNA)
processing file 'podar-ref/2.fa'
(new record for identifier '2' moltype=DNA)
---
wrote 2 entries to 'podar.csv'

@bluegenes
Contributor

Really excited about this!!

Genbank-to-fromfile got me thinking about downloading the FASTA files; have you thought about generating a fromfile CSV via the genbank-genomes style information, with download URLs included?

Perhaps what I'm thinking of is that we would like to generate a csv like this for the download/prepare FASTA files side of the workflow. Since it would contain the info we need for sketch fromfile, we could then also use it here.

Would this be better over in sourmash_databases? It's not always as simple as download --> sketch, since sometimes .faa files don't exist or assemblies get updated. So we probably want to be able to check these cases at some point while/before building the databases.

@ctb
Contributor Author

ctb commented Mar 20, 2022

> Really excited about this!!
>
> Genbank-to-fromfile got me thinking about downloading the FASTA files; have you thought about generating a fromfile CSV via the genbank-genomes style information, with download URLs included?

We already have code in genome-grist to download the genomes based on accession, which leads me to think in two directions:

  • one is that we can further specialize the genbank-based workflow to deal with finding URLs, etc.
  • the other is that (like the fromfile building stuff and genome-grist) this doesn't belong in sourmash per se.

> Perhaps what I'm thinking of is that we would like to generate a csv like this for the download/prepare FASTA files side of the workflow. Since it would contain the info we need for sketch fromfile, we could then also use it here.
>
> Would this be better over in sourmash_databases? It's not always as simple as download --> sketch, since sometimes .faa files don't exist or assemblies get updated. So we probably want to be able to check these cases at some point while/before building the databases.

I think this should be part of a separate workflow (but having the issue here is fine :).

The high latency involved in downloading lots of remote files makes it a whole different ballgame. But it sure would be nice to have automatic genome downloading, proteome preparation, etc.!

@mr-eyes
Member

mr-eyes commented Mar 20, 2022

Don't know if this helps, but I recently automated the download of genomes by accession through the new NCBI API. Here's what I did:

wget -nc https://api.ncbi.nlm.nih.gov/datasets/v1/genome/accession/GCA_019454045.1/download -O GCA_019454045.1.zip
unzip GCA_019454045.1.zip -d GCA_019454045.1
# Concatenate, because the extracted directory might contain multiple files (e.g. one file per chromosome).
cat GCA_019454045.1/ncbi_dataset/data/GCA_019454045.1/*fna > GCA_019454045.1.fna
rm -rf GCA_019454045.1/
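
For use inside a Python workflow, the unzip-and-concatenate part of the commands above could be sketched roughly as follows. The URL template is copied from the `wget` line; the function names are hypothetical, the download itself is left out, and the archive layout (`ncbi_dataset/data/<accession>/*fna`) is taken from the `cat` command above rather than from the API documentation.

```python
import glob
import os
import shutil
import tempfile
import zipfile

# URL template copied from the wget command above.
API = "https://api.ncbi.nlm.nih.gov/datasets/v1/genome/accession/{acc}/download"


def download_url(accession):
    """Build the download URL for one assembly accession."""
    return API.format(acc=accession)


def extract_and_concat(zip_path, accession, out_path):
    """Unzip a downloaded archive and concatenate its *fna files into one FASTA."""
    with tempfile.TemporaryDirectory() as workdir:
        with zipfile.ZipFile(zip_path) as zf:
            zf.extractall(workdir)
        pattern = os.path.join(workdir, "ncbi_dataset", "data", accession, "*fna")
        with open(out_path, "wb") as out:
            # Sorted so the output order is deterministic across runs.
            for fna in sorted(glob.glob(pattern)):
                with open(fna, "rb") as fp:
                    shutil.copyfileobj(fp, out)
    # The temporary directory is removed on exit, like the `rm -rf` step.
    return out_path
```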

@ctb
Contributor Author

ctb commented Mar 21, 2022

> Don't know if that helps, recently, I automated the download of genomes by accessions through the new NCBI API.

very cool!! We should probably change genome-grist to use this.

This doesn't change my hot take that it is productive to separate:

(1) high latency/one-time efforts like downloading new genomes and proteomes
(2) big-compute one-time efforts like computing protein sets for genomes where they do not yet exist
(3) big-compute/big correlation but infrequent efforts like computing new sketches for very large collections
(4) annoying integrative efforts to produce new databases that correctly represent all of the above

I think we have a handle on most of these as separate processes and think that combining them into one big workflow would make them frustrating and hard to debug.

eventually we will probably want to automate more of this for diff or patch databases, a la #985

and of course there are other sketching targets to think about.

@ctb
Contributor Author

ctb commented May 1, 2022

closing in favor of #2015.

@ctb ctb closed this as completed May 1, 2022