Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change .rst to .md Fix/#340 #356

Merged
merged 10 commits into from
Oct 27, 2017
70 changes: 36 additions & 34 deletions doc/api-example.rst → doc/api-example.md
Original file line number Diff line number Diff line change
@@ -1,104 +1,106 @@
============================
``sourmash`` Python examples
============================

A very simple example: two k-mers
---------------------------------
# `sourmash` Python examples


## A very simple example: two k-mers

Define two sequences:

Define two sequences:
```
>>> seq1 = "ATGGCA"
>>> seq2 = "AGAGCA"

```
Create two minhashes using 3-mers, and add the sequences:

```
>>> import sourmash_lib
>>> E1 = sourmash_lib.MinHash(n=20, ksize=3)
>>> E1.add_sequence(seq1)

```
```
>>> E2 = sourmash_lib.MinHash(n=20, ksize=3)
>>> E2.add_sequence(seq2)

```
One of the 3-mers (out of 7) overlaps, so Jaccard index is 1/7:

```
>>> round(E1.jaccard(E2), 2)
0.14

```
and of course the minhashes match themselves:

```
>>> E1.jaccard(E1)
1.0

```
We can add sequences and query at any time --

```
>>> E1.add_sequence(seq2)
>>> x = E1.jaccard(E2)
>>> round(x, 3)
0.571
```
## Consuming files

Consuming files
---------------

Suppose we want to create MinHash sketches from genomes --

```
>>> import glob, pprint
>>> genomes = glob.glob('data/GCF*.fna.gz')
>>> genomes = list(sorted(genomes))
>>> pprint.pprint(genomes)
['data/GCF_000005845.2_ASM584v2_genomic.fna.gz',
'data/GCF_000006945.1_ASM694v1_genomic.fna.gz',
'data/GCF_000783305.1_ASM78330v1_genomic.fna.gz']

```
We have to read them in (here using screed), but then they can be fed
into 'add_sequence' directly; here we set 'force=True' in ``add_sequence``
into `add_sequence` directly; here we set `force=True` in `add_sequence`
to skip over k-mers containing characters other than ACTG, rather than
raising an exception.

(Note, just for speed reasons, we'll truncate the sequences to 50kb in length.)

```
>>> import screed
>>> minhashes = []
>>> for g in genomes:
... E = sourmash_lib.MinHash(n=500, ksize=31)
... for record in screed.open(g):
... E.add_sequence(record.sequence[:50000], True)
... minhashes.append(E)

```
And now the minhashes can be compared against each other:

```
>>> import sys
>>> for i, e in enumerate(minhashes):
... _ = sys.stdout.write(genomes[i][:20] + ' ')
... for j, e2 in enumerate(minhashes):
... x = e.jaccard(minhashes[j])
... _ = sys.stdout.write(str(round(x, 3)) + ' ')
... _= sys.stdout.write('\n')
data/GCF_000005845.2 1.0 0.0 0.0
data/GCF_000006945.1 0.0 1.0 0.0
data/GCF_000783305.1 0.0 0.0 1.0

data/GCF_000005845.2 1.0 0.0 0.0
data/GCF_000006945.1 0.0 1.0 0.0
data/GCF_000783305.1 0.0 0.0 1.0
```
Note that the comparisons are quite quick; most of the time is spent in
making the minhashes, which can be saved and loaded easily.

Saving and loading signature files
----------------------------------
## Saving and loading signature files

```
>>> from sourmash_lib import signature
>>> sig1 = signature.SourmashSignature('[email protected]', minhashes[0],
... name=genomes[0][:20])
>>> with open('/tmp/genome1.sig', 'wt') as fp:
... signature.save_signatures([sig1], fp)

Here, ``/tmp/genome1.sig`` is a JSON file that can now be loaded and
```
Here, `/tmp/genome1.sig` is a JSON file that can now be loaded and
compared -- first, load:

```
>>> sigfp = open('/tmp/genome1.sig', 'rt')
>>> siglist = list(signature.load_signatures(sigfp))
>>> loaded_sig = siglist[0]

```
then compare:

```
>>> loaded_sig.minhash.jaccard(sig1.minhash)
1.0
>>> sig1.minhash.jaccard(loaded_sig.minhash)
1.0
```
30 changes: 30 additions & 0 deletions doc/api.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,30 @@
# `sourmash` Python API


The primary programmatic way of interacting with `sourmash` is via
its Python API. (The core MinHash Python API closely mirrors the
underlying C++ code, but for now this is undocumented.)

Contents
* [`MinHash`: basic MinHash sketch functionality](# `MinHash`: basic MinHash sketch functionality)
* [`SourmashSignature`: save and load MinHash sketches in JSON](# `SourmashSignature`: save and load MinHash sketches in JSON)
* [`sourmash_lib.fig`: make plots and figures](# `sourmash_lib.fig`: make plots and figures)

## `MinHash`: basic MinHash sketch functionality

```
.. automodule:: sourmash_lib
:members:
```
## `SourmashSignature`: save and load MinHash sketches in JSON

```
.. automodule:: sourmash_lib.signature
:members:
```
## `sourmash_lib.fig`: make plots and figures

```
.. automodule:: sourmash_lib.fig
:members:
```
29 changes: 0 additions & 29 deletions doc/api.rst

This file was deleted.

138 changes: 138 additions & 0 deletions doc/command-line.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,138 @@

# Using sourmash from the command line


From the command line, sourmash can be used to compute [MinHash
sketches][0] from DNA
sequences, compare them to each other, and plot the results. This
allows you to estimate sequence similarity quickly and accurately.

Please see the [mash software][1] http://mash.readthedocs.io/en/latest/__
and the [mash paper (Ondov et al., 2016)][2]
for background
information on how and why MinHash sketches work.

______

sourmash uses a subcommand syntax, so all commands start with
`sourmash` followed by a subcommand specifying the action to be
taken.

Contents

* [An example](## An example)
* [The `sourmash` command and its subcommands](## The sourmash command and its subcommands)
* [`sourmash compute`](### `sourmash compute`)

* [`sourmash compare`](### `sourmash compare`)

* [`sourmash plot`](### `sourmash plot`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like this block isn't formatted correctly.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you probably want the .. contents:: directive here. You've manually pasted in some links that should be dynamically generated by the documentation build system

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I may have been mistaken here too. If the .. contents:: directive doesn't work in Markdown, we'll have to figure out how to add anchors to the different sections—or simply just delete the contents block!


## An example


Grab three bacterial genomes from NCBI:
```
curl -L -O ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/Escherichia_coli/reference/GCF_000005845.2_ASM584v2/GCF_000005845.2_ASM584v2_genomic.fna.gz
curl -L -O ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/Salmonella_enterica/reference/GCF_000006945.1_ASM694v1/GCF_000006945.1_ASM694v1_genomic.fna.gz
curl -L -O ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/Sphingobacteriaceae_bacterium_DW12/latest_assembly_versions/GCF_000783305.1_ASM78330v1/GCF_000783305.1_ASM78330v1_genomic.fna.gz
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a huge deal, but since other changes are required, should remove the leading spaces here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And in the other code blocks in this file.

```
Compute signatures for each:
```
sourmash compute *.fna.gz
```
This will produce three `.sig` files containing MinHash signatures at k=31.

Next, compare all the signatures to each other:
```
sourmash compare *.sig -o cmp
```
Finally, plot a dendrogram:
```
sourmash plot cmp
```
This will output two files, `cmp.dendro.png` and `cmp.matrix.png`,
containing a clustering & dendrogram of the sequences, as well as a
similarity matrix and heatmap.

## The `sourmash` command and its subcommands


To get a list of subcommands, run `sourmash` without any arguments.

There are three main subcommands: `compute`, `compare`, and `plot`.

### `sourmash compute`


The `compute` subcommand computes and saves MinHash sketches for
each sequence in one or more sequence files. It takes as input FASTA
or FASTQ files, and these files can be uncompressed or compressed with
gzip or bzip2. The output will be one or more JSON signature files
that can be used with `sourmash compare`.

Usage:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also a copy/paste mistake. Instead of typing this out fully, this should be auto-generated with the .. Usage:: directive. That way changes to the code are automatically reflected in the documentation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My bad, I was mistaken. This is not a directive.

```
sourmash compute filename [ filename2 ... ]
```
Optional arguments:
```
--ksizes K1[,K2,K3] -- one or more k-mer sizes to use; default is 31
--force -- recompute existing signatures; convert non-DNA characters to N
--output -- save all the signatures to this file; can be '-' for stdout.
```
### `sourmash compare`


The `compare` subcommand compares one or more signature files
(created with `compute`) using estimated [Jaccard index][3].
The default output
is a text display of a similarity matrix where each entry `[i, j]`
contains the estimated Jaccard index between input signature `i` and
input signature `j`. The output matrix can be saved to a file
with `--output` and used with the `sourmash plot` subcommand.

Usage:
```
sourmash compare file1.sig [ file2.sig ... ]
```

Options:
```
--output -- save the distance matrix to this file (as a numpy binary matrix)
--ksize -- do the comparisons at this k-mer size.
```

### `sourmash plot`


The `plot` subcommand produces two plots -- a dendrogram and a
dendrogram+matrix -- from a distance matrix computed by `sourmash compare
--output <matrix>`. The default output is two PNG files.

Usage:
```
sourmash plot <matrix>
```

Options:
```
--pdf -- output PDF files.
--labels -- display the signature names (by default, the filenames) on the plot
--indices -- turn off index display on the plot.
--vmax -- maximum value (default 1.0) for heatmap.
--vmin -- minimum value (deafult 0.0) for heatmap.
--subsample=<N> -- plot a maximum of <N> samples, randomly chosen.
--subsample-seed=<seed> -- seed for pseudorandom number generator.
```
Example figures:

<img src="https://github.com/dib-lab/sourmash/blob/master/doc/_static/cmp.matrix.png?raw=true" style="width:60%" />

<img src="https://github.com/dib-lab/sourmash/blob/master/doc/_static/cmp.dendro.png?raw=true" style="width:60%" />


[0]:https://en.wikipedia.org/wiki/MinHash
[1]:http://mash.readthedocs.io/en/latest/__
[2]:http://biorxiv.org/content/early/2015/10/26/029827
[3]:https://en.wikipedia.org/wiki/Jaccard_index
Loading