Comparing metagenomes help? #294

Closed
rachelporetsky opened this issue Jul 13, 2017 · 5 comments


@rachelporetsky

rachelporetsky commented Jul 13, 2017

I have data from a single water sample that was filtered through a big filter followed by a small filter to capture attached vs. free-living microbes. We assembled the sequences from both sets of filters together as well as separately and our assemblies are different.

  1. Do you recommend comparing both reads and assemblies to GenBank microbial genomes? Or to each other (i.e., assemblies to assemblies and reads to reads) or reads from one to assemblies from the other?
  2. You mentioned k-mer trimming first to get Jaccard distances when comparing reads to reads-- how do I do this with sourmash?
@ctb
Contributor

ctb commented Feb 26, 2018

Partially done in #419, "a practical guide."

@ctb
Contributor

ctb commented Apr 4, 2020

I think this has been addressed now.

@ctb ctb closed this as completed Apr 4, 2020
@ReneKat

ReneKat commented Jun 7, 2020

Hello @ctb and sourmash Team!

I have been through the tutorials and Practical Guide which have all been extremely helpful. However, I have reached a snag I was hoping to get some guidance on.

I want to use sourmash to compare 40 metagenomic environmental water samples: 20 sampling sites over 2 seasons. I have assembled the reads from each sample using metaSPAdes and computed signatures for each assembly using combinations of k = 21, 31, 51 and scaled = 10, 100, 1000, 10000:

sourmash compute -k 21 --scaled=1000 ${prefix}_MS_scaffolds.fasta --merge ${prefix} -o ${prefix}_k21_1000.sig

sourmash compare *.sig -k 21 -o ./samples01_21_1000
sourmash plot --labels samples01_21_1000

It is known from 16S DNAseq data that some sites should cluster, especially the same site sampled at different time periods. However, my matrix plot shows barely any clustering of sites, regardless of the k-mer + scaled combination.
Coverage for each sample is low, averaging only 3X, with per-sample N50 between 550 bp and 1200 bp.

Is it possible that I'm computing the assembly signatures wrong? I was unsure if I should compute signatures on the reads or the contigs.

sourmash sig describe BR1_2018_k21_1000.sig

== This is sourmash version 3.3.0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

loaded 1 signatures total.

signature filename: BR1_2018_k21_1000.sig
signature: BR1_2018
source file: BR1_2018_MS_scaffolds.fasta
md5: 13ffdebef827e79bd5a3e92ec4431ae0
k=21 molecule=DNA num=0 scaled=1000 seed=42 track_abundance=0
size: 3679
signature license: CC0

Please let me know if more information is needed. I appreciate your assistance in using sourmash for my project.

Best Regards,
René

@ctb
Contributor

ctb commented Jun 22, 2020

hi @ReneKat sorry for the delay in responding - in the future, just file a new issue, that way it pops up when I'm going through the issue tracker :)

@ctb ctb reopened this Jun 22, 2020
@ctb
Contributor

ctb commented Jun 22, 2020

a few thoughts --

  • assembly may be eliminating a lot of your sample, due to the low coverage
  • Jaccard similarity is really stringent compared to 16S: you're taking into account not just the shared k-mers, but all of the k-mers (including those that are not shared). with metagenomes, we tend to see the behavior I think you're describing.
  • when you say your samples are "barely clustering", is that a description of the dendrogram lengths or of the shading in the plot matrix? The latter can be adjusted with --vmax and I often go down to a vmax of 0.1 in order to see trace similarities.
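To illustrate the Jaccard point, here is a minimal sketch on toy data. It is not sourmash's implementation: the sequences are random, the helper names (`kmers`, `jaccard`, `random_dna`) are made up for this example, and it compares full k-mer sets rather than scaled sketches. Two "samples" share an identical 1 kb region but each also carries 4 kb of unshared sequence.

```python
import random


def kmers(seq, k=21):
    """Return the set of all k-length substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity: shared k-mers divided by ALL k-mers in either set."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def random_dna(n, rng):
    """Hypothetical helper: a random DNA string of length n."""
    return "".join(rng.choices("ACGT", k=n))


rng = random.Random(42)
shared = random_dna(1000, rng)              # region present in both samples
sample_a = shared + random_dna(4000, rng)   # plus sequence unique to A
sample_b = shared + random_dna(4000, rng)   # plus sequence unique to B

a, b = kmers(sample_a), kmers(sample_b)
print(f"shared k-mers: {len(a & b)}")
print(f"Jaccard similarity: {jaccard(a, b):.3f}")
```

Even though the shared region is identical in both samples, the unshared k-mers inflate the denominator, so the Jaccard value comes out at roughly 0.1. This is the kind of faint signal that a lower `--vmax` helps make visible in the plot.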

I would suggest sticking with k=21 and a scaled of 1000, and applying that to the reads rather than the assembly. If you e-mail me the .sig files at [email protected] I can poke at them a bit and see if I can find something you're missing.
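As a rough sanity check on what `scaled=1000` means, the sketch below mimics the idea of keeping only hashes below a threshold of (hash space / scaled), so that about 1 in `scaled` distinct k-mers is retained. It rests on some assumptions: the hash function here is a stand-in (`hashlib.blake2b`), sourmash's own hashing and k-mer canonicalization differ, and the function names are hypothetical. The last line applies the same arithmetic to the `size: 3679` reported by `sourmash sig describe` earlier in the thread.

```python
import hashlib
import random

MAX_HASH = 2**64  # size of the 64-bit hash space used in this sketch


def hash_kmer(kmer):
    """Stand-in 64-bit hash; NOT the hash function sourmash uses."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def scaled_sketch(seq, k=21, scaled=1000):
    """Keep only k-mer hashes below MAX_HASH / scaled (about 1/scaled of them)."""
    threshold = MAX_HASH // scaled
    sketch = set()
    for i in range(len(seq) - k + 1):
        h = hash_kmer(seq[i:i + k])
        if h < threshold:
            sketch.add(h)
    return sketch


def estimate_distinct_kmers(sketch_size, scaled):
    """Approximate number of distinct k-mers behind a sketch of this size."""
    return sketch_size * scaled


rng = random.Random(1)
seq = "".join(rng.choices("ACGT", k=200_000))  # toy random "genome"
sketch = scaled_sketch(seq, k=21, scaled=100)
print(len(sketch), estimate_distinct_kmers(len(sketch), 100))

# The signature above reports size 3679 at scaled=1000:
print(estimate_distinct_kmers(3679, 1000))
```

By this estimate the BR1_2018 assembly signature represents on the order of 3.7 million distinct 21-mers. That is only an approximation, but it may help when judging how much of a sample survived assembly compared with a signature computed from the reads.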

@ctb ctb closed this as completed Jan 10, 2021