-
Notifications
You must be signed in to change notification settings - Fork 79
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Change .rst to .md Fix/#340 #356
Changes from 5 commits
5b8fb8c
6dd2507
3298f45
68954c6
cde3dd3
ed8e536
332a910
e0472db
02b2ce7
1a079bf
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,104 +1,106 @@ | ||
============================ | ||
``sourmash`` Python examples | ||
============================ | ||
|
||
A very simple example: two k-mers | ||
--------------------------------- | ||
# `sourmash` Python examples | ||
|
||
|
||
## A very simple example: two k-mers | ||
|
||
Define two sequences: | ||
|
||
Define two sequences: | ||
``` | ||
>>> seq1 = "ATGGCA" | ||
>>> seq2 = "AGAGCA" | ||
|
||
``` | ||
Create two minhashes using 3-mers, and add the sequences: | ||
|
||
``` | ||
>>> import sourmash_lib | ||
>>> E1 = sourmash_lib.MinHash(n=20, ksize=3) | ||
>>> E1.add_sequence(seq1) | ||
|
||
``` | ||
``` | ||
>>> E2 = sourmash_lib.MinHash(n=20, ksize=3) | ||
>>> E2.add_sequence(seq2) | ||
|
||
``` | ||
One of the 3-mers (out of 7) overlaps, so Jaccard index is 1/7: | ||
|
||
``` | ||
>>> round(E1.jaccard(E2), 2) | ||
0.14 | ||
|
||
``` | ||
and of course the minhashes match themselves: | ||
|
||
``` | ||
>>> E1.jaccard(E1) | ||
1.0 | ||
|
||
``` | ||
We can add sequences and query at any time -- | ||
|
||
``` | ||
>>> E1.add_sequence(seq2) | ||
>>> x = E1.jaccard(E2) | ||
>>> round(x, 3) | ||
0.571 | ||
``` | ||
## Consuming files | ||
|
||
Consuming files | ||
--------------- | ||
|
||
Suppose we want to create MinHash sketches from genomes -- | ||
|
||
``` | ||
>>> import glob, pprint | ||
>>> genomes = glob.glob('data/GCF*.fna.gz') | ||
>>> genomes = list(sorted(genomes)) | ||
>>> pprint.pprint(genomes) | ||
['data/GCF_000005845.2_ASM584v2_genomic.fna.gz', | ||
'data/GCF_000006945.1_ASM694v1_genomic.fna.gz', | ||
'data/GCF_000783305.1_ASM78330v1_genomic.fna.gz'] | ||
|
||
``` | ||
We have to read them in (here using screed), but then they can be fed | ||
into 'add_sequence' directly; here we set 'force=True' in ``add_sequence`` | ||
into `add_sequence` directly; here we set `force=True` in `add_sequence` | ||
to skip over k-mers containing characters other than ACTG, rather than | ||
raising an exception. | ||
|
||
(Note, just for speed reasons, we'll truncate the sequences to 50kb in length.) | ||
|
||
``` | ||
>>> import screed | ||
>>> minhashes = [] | ||
>>> for g in genomes: | ||
... E = sourmash_lib.MinHash(n=500, ksize=31) | ||
... for record in screed.open(g): | ||
... E.add_sequence(record.sequence[:50000], True) | ||
... minhashes.append(E) | ||
|
||
``` | ||
And now the minhashes can be compared against each other: | ||
|
||
``` | ||
>>> import sys | ||
>>> for i, e in enumerate(minhashes): | ||
... _ = sys.stdout.write(genomes[i][:20] + ' ') | ||
... for j, e2 in enumerate(minhashes): | ||
... x = e.jaccard(minhashes[j]) | ||
... _ = sys.stdout.write(str(round(x, 3)) + ' ') | ||
... _= sys.stdout.write('\n') | ||
data/GCF_000005845.2 1.0 0.0 0.0 | ||
data/GCF_000006945.1 0.0 1.0 0.0 | ||
data/GCF_000783305.1 0.0 0.0 1.0 | ||
|
||
data/GCF_000005845.2 1.0 0.0 0.0 | ||
data/GCF_000006945.1 0.0 1.0 0.0 | ||
data/GCF_000783305.1 0.0 0.0 1.0 | ||
``` | ||
Note that the comparisons are quite quick; most of the time is spent in | ||
making the minhashes, which can be saved and loaded easily. | ||
|
||
Saving and loading signature files | ||
---------------------------------- | ||
## Saving and loading signature files | ||
|
||
``` | ||
>>> from sourmash_lib import signature | ||
>>> sig1 = signature.SourmashSignature('[email protected]', minhashes[0], | ||
... name=genomes[0][:20]) | ||
>>> with open('/tmp/genome1.sig', 'wt') as fp: | ||
... signature.save_signatures([sig1], fp) | ||
|
||
Here, ``/tmp/genome1.sig`` is a JSON file that can now be loaded and | ||
``` | ||
Here, `/tmp/genome1.sig` is a JSON file that can now be loaded and | ||
compared -- first, load: | ||
|
||
``` | ||
>>> sigfp = open('/tmp/genome1.sig', 'rt') | ||
>>> siglist = list(signature.load_signatures(sigfp)) | ||
>>> loaded_sig = siglist[0] | ||
|
||
``` | ||
then compare: | ||
|
||
``` | ||
>>> loaded_sig.minhash.jaccard(sig1.minhash) | ||
1.0 | ||
>>> sig1.minhash.jaccard(loaded_sig.minhash) | ||
1.0 | ||
``` |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,30 @@ | ||
# `sourmash` Python API | ||
|
||
|
||
The primary programmatic way of interacting with `sourmash` is via | ||
its Python API. (The core MinHash Python API closely mirrors the | ||
underlying C++ code, but for now this is undocumented.) | ||
|
||
Contents | ||
* [`MinHash`: basic MinHash sketch functionality](# `MinHash`: basic MinHash sketch functionality) | ||
* [`SourmashSignature`: save and load MinHash sketches in JSON](# `SourmashSignature`: save and load MinHash sketches in JSON) | ||
* [`sourmash_lib.fig`: make plots and figures](# `sourmash_lib.fig`: make plots and figures) | ||
|
||
## `MinHash`: basic MinHash sketch functionality | ||
|
||
``` | ||
.. automodule:: sourmash_lib | ||
:members: | ||
``` | ||
## `SourmashSignature`: save and load MinHash sketches in JSON | ||
|
||
``` | ||
.. automodule:: sourmash_lib.signature | ||
:members: | ||
``` | ||
## `sourmash_lib.fig`: make plots and figures | ||
|
||
``` | ||
.. automodule:: sourmash_lib.fig | ||
:members: | ||
``` |
This file was deleted.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,138 @@ | ||
|
||
# Using sourmash from the command line | ||
|
||
|
||
From the command line, sourmash can be used to compute [MinHash | ||
sketches][0] from DNA | ||
sequences, compare them to each other, and plot the results. This | ||
allows you to estimate sequence similarity quickly and accurately. | ||
|
||
Please see the [mash software][1] http://mash.readthedocs.io/en/latest/__ | ||
and the [mash paper (Ondov et al., 2016)][2] | ||
for background | ||
information on how and why MinHash sketches work. | ||
|
||
______ | ||
|
||
sourmash uses a subcommand syntax, so all commands start with | ||
`sourmash` followed by a subcommand specifying the action to be | ||
taken. | ||
|
||
Contents | ||
|
||
* [An example](## An example) | ||
* [The `sourmash` command and its subcommands](## The sourmash command and its subcommands) | ||
* [`sourmash compute`](### `sourmash compute`) | ||
|
||
* [`sourmash compare`](### `sourmash compare`) | ||
|
||
* [`sourmash plot`](### `sourmash plot`) | ||
|
||
## An example | ||
|
||
|
||
Grab three bacterial genomes from NCBI: | ||
``` | ||
curl -L -O ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/Escherichia_coli/reference/GCF_000005845.2_ASM584v2/GCF_000005845.2_ASM584v2_genomic.fna.gz | ||
curl -L -O ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/Salmonella_enterica/reference/GCF_000006945.1_ASM694v1/GCF_000006945.1_ASM694v1_genomic.fna.gz | ||
curl -L -O ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/Sphingobacteriaceae_bacterium_DW12/latest_assembly_versions/GCF_000783305.1_ASM78330v1/GCF_000783305.1_ASM78330v1_genomic.fna.gz | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Not a huge deal, but since other changes are required, should remove the leading spaces here. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. And in the other code blocks in this file. |
||
``` | ||
Compute signatures for each: | ||
``` | ||
sourmash compute *.fna.gz | ||
``` | ||
This will produce three `.sig` files containing MinHash signatures at k=31. | ||
|
||
Next, compare all the signatures to each other: | ||
``` | ||
sourmash compare *.sig -o cmp | ||
``` | ||
Finally, plot a dendrogram: | ||
``` | ||
sourmash plot cmp | ||
``` | ||
This will output two files, `cmp.dendro.png` and `cmp.matrix.png`, | ||
containing a clustering & dendrogram of the sequences, as well as a | ||
similarity matrix and heatmap. | ||
|
||
## The `sourmash` command and its subcommands | ||
|
||
|
||
To get a list of subcommands, run `sourmash` without any arguments. | ||
|
||
There are three main subcommands: `compute`, `compare`, and `plot`. | ||
|
||
### `sourmash compute` | ||
|
||
|
||
The `compute` subcommand computes and saves MinHash sketches for | ||
each sequence in one or more sequence files. It takes as input FASTA | ||
or FASTQ files, and these files can be uncompressed or compressed with | ||
gzip or bzip2. The output will be one or more JSON signature files | ||
that can be used with `sourmash compare`. | ||
|
||
Usage: | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Also a copy/paste mistake. Instead of typing this out fully, this should be auto-generated with the There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. My bad, I was mistaken. This is not a directive. |
||
``` | ||
sourmash compute filename [ filename2 ... ] | ||
``` | ||
Optional arguments: | ||
``` | ||
--ksizes K1[,K2,K3] -- one or more k-mer sizes to use; default is 31 | ||
--force -- recompute existing signatures; convert non-DNA characters to N | ||
--output -- save all the signatures to this file; can be '-' for stdout. | ||
``` | ||
### `sourmash compare` | ||
|
||
|
||
The `compare` subcommand compares one or more signature files | ||
(created with `compute`) using estimated [Jaccard index][3]. | ||
The default output | ||
is a text display of a similarity matrix where each entry `[i, j]` | ||
contains the estimated Jaccard index between input signature `i` and | ||
input signature `j`. The output matrix can be saved to a file | ||
with `--output` and used with the `sourmash plot` subcommand. | ||
|
||
Usage: | ||
``` | ||
sourmash compare file1.sig [ file2.sig ... ] | ||
``` | ||
|
||
Options: | ||
``` | ||
--output -- save the distance matrix to this file (as a numpy binary matrix) | ||
--ksize -- do the comparisons at this k-mer size. | ||
``` | ||
|
||
### `sourmash plot` | ||
|
||
|
||
The `plot` subcommand produces two plots -- a dendrogram and a | ||
dendrogram+matrix -- from a distance matrix computed by `sourmash compare | ||
--output <matrix>`. The default output is two PNG files. | ||
|
||
Usage: | ||
``` | ||
sourmash plot <matrix> | ||
``` | ||
|
||
Options: | ||
``` | ||
--pdf -- output PDF files. | ||
--labels -- display the signature names (by default, the filenames) on the plot | ||
--indices -- turn off index display on the plot. | ||
--vmax -- maximum value (default 1.0) for heatmap. | ||
--vmin -- minimum value (deafult 0.0) for heatmap. | ||
--subsample=<N> -- plot a maximum of <N> samples, randomly chosen. | ||
--subsample-seed=<seed> -- seed for pseudorandom number generator. | ||
``` | ||
Example figures: | ||
|
||
<img src="https://github.com/dib-lab/sourmash/blob/master/doc/_static/cmp.matrix.png?raw=true" style="width:60%" /> | ||
|
||
<img src="https://github.com/dib-lab/sourmash/blob/master/doc/_static/cmp.dendro.png?raw=true" style="width:60%" /> | ||
|
||
|
||
[0]:https://en.wikipedia.org/wiki/MinHash | ||
[1]:http://mash.readthedocs.io/en/latest/__ | ||
[2]:http://biorxiv.org/content/early/2015/10/26/029827 | ||
[3]:https://en.wikipedia.org/wiki/Jaccard_index |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like this block isn't formatted correctly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you probably want the
.. contents::
directive here. You've manually pasted in some links that should be dynamically generated by the documentation build systemThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I may have been mistaken here too. If the
.. contents::
directive doesn't work in Markdown, we'll have to figure out how to add anchors to the different sections—or simply just delete the contents block!