Fixed handling of ambiguous nucleotides, issue #137 #138

swamidass · 2017-02-26T23:01:13Z

Fixes #137 by ignoring DNA k-mers that contain sequence other than ACGT.

Is it mergeable?
make test Did it pass the tests?
make coverage Is the new code covered?
Did it change the command-line interface? Only additions are allowed
without a major version increment. Changing file formats also requires a
major version number increment.
Was a spellchecker run on the source code and documentation after
changes were made?

This change does change the kmers that sourmash will use (throwing out the ambiguous ones), so it will change the signatures of files with ambiguous kmers. Therefore I did increment the version to 1.1.1 as a signal to users that there is a change.

swamidass · 2017-02-26T23:02:23Z

Also, on my machine, there are 21 failed tests, but this was before the change to the code.

swamidass · 2017-02-26T23:15:40Z

Looks like two small changes needs to be made to the tests (because bad kmers are now being excluded). This is easy to do. If you approve otherwise, I will make the change.

ctb · 2017-02-27T14:37:15Z

A couple of notes --

first, you can leave the checklists as they are, github will make them clickable!

second, we should fix the master branch tests on your system before proceeding. note that they are passing on Travis for both 2.7 and 3.5 (https://travis-ci.org/dib-lab/sourmash) so either we have an error in our install documentation or something wonky is going on on your system.

third, could you add a test (I suggest to test__minhash.py, something along the lines of test_compare_1) that illustrates the new behavior? It should fail if the patch to add_sequence is not present, of course.

thx!

codecov-io · 2017-02-28T03:43:56Z

Codecov Report

Merging #138 into master will decrease coverage by -0.08%.
The diff coverage is 87.5%.

@@            Coverage Diff             @@
##           master     #138      +/-   ##
==========================================
- Coverage   87.59%   87.52%   -0.08%     
==========================================
  Files          18       18              
  Lines        2338     2332       -6     
  Branches       51       52       +1     
==========================================
- Hits         2048     2041       -7     
  Misses        282      282              
- Partials        8        9       +1

Impacted Files	Coverage Δ
sourmash_lib/kmer_min_hash.hh	`90.28% <87.5%> (-0.88%)`	❌

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4062ddf...fb7a6c1. Read the comment docs.

swamidass · 2017-02-28T03:45:32Z

So it looks like the fix works now on Travis. As a separate issue, (even before this fix), the build does not pass on my MacOS laptop. Not sure how to fix that.

ctb · 2017-02-28T05:55:36Z

Thanks, I'll take a look! re your tests, can you post some of the error messages from the failing tests?

swamidass · 2017-02-28T06:06:02Z

See #139. Seems like a lot of save/load errors (which probably totally breaks sourmash here.

ctb · 2017-03-01T04:59:47Z

@swamidass please see swamidass#1 - I realized that the _forcedna logic was no longer useful, although it introduces new behavior where valid k-mers are added before the exception is raised. What do you think?

We could always add a _checkdna call before the for loop, but then we'd be looking at the entire sequence. shrug

Update PR to remove some now-unnecessary code, add a test

ctb · 2017-03-01T05:03:59Z

@betatim @luizirber could you take a quick look to make sure we didn't miss something? I think this is ready for review & merge.

swamidass · 2017-03-01T05:04:28Z

Up to you in that behavior. I merged it in, but that behavior needs to be documented. It might be better of any not standard nucleotide just gets skipped, rather than raising an exception. That is probably the best behaivior. What do you think?

betatim · 2017-03-01T06:44:30Z

sourmash_lib/test__minhash.py

    with pytest.raises(ValueError):
-        mh.add_sequence('ATGR')
+        mh.add_sequence('ATGR') 


spurious whitespace.

Fixed in swamidass#2

betatim · 2017-03-01T06:46:24Z

sourmash_lib/test__minhash.py

@@ -176,14 +177,27 @@ def test_basic_dna_bad_2(track_abundance):

 def test_basic_dna_bad_force(track_abundance):
    # test behavior on bad DNA
-    mh = MinHash(1, 4, track_abundance=track_abundance)
+    mh = MinHash(100, 4, track_abundance=track_abundance)


For my curiosity: why the change from 1 to 100?

Added comment in swamidass#2 - we want to store multiple hashes.

setup.py

betatim · 2017-03-01T07:03:05Z

sourmash_lib/kmer_min_hash.hh

-    std::string _checkdna(const char * s, bool force=false) const {
-        std::string seq = s;
-        const size_t seqsize = strlen(s);
+    bool _checkdna(std::string seq) const {


-> const std::string seq

Fixed in swamidass#2

betatim · 2017-03-01T07:06:54Z

sourmash_lib/kmer_min_hash.hh


-        for (size_t i=0; i < seqsize; ++i) {
+        for (size_t i=0; i < seq.length(); ++i) {


Would be nice to switch to for (auto b : seq) {...} as we touch this code. Up to you.

Not done yet.

betatim · 2017-03-01T07:09:03Z

There should be some docs to update for this change. If not we should add to the docs what the new behaviour is.

I've looked at the code/technical part. Whether this is the thing to do or not for science I'll leave to someone else.

betatim · 2017-03-01T07:13:59Z

Did some quick scanning of the docs and found these places that mention ACGTN characters and behaviour:

betatim · 2017-03-01T07:17:43Z

Can't see the output of travis ATM, I think this is because they use S3 to store them. Will check back later.

ctb · 2017-03-01T13:49:52Z

Skipping non-standard nucleotides is the default behavior from the command line 'compute' call now. The API does not do so by default. That seems OK to me.

swamidass · 2017-03-07T03:12:36Z

What are we waiting for here?

Given the agreement about incrementing version. The only addition change I could make is changing the increment from 1.1 -> 1.1.1 to either 1.2 or 2.0.

Please let me know what you suggest, or please merge the change. Thanks.

ctb · 2017-03-07T11:51:17Z

hi @swamidass, @betatim put in a few comments above. Those should be addressed (or discussed if you disagree). I'd be happy to do that, but I'm in traveling and also in the middle of writing a grant, so it may take me another few days.

Is there any particular hurry? You (and others) can use this branch as-is and we're not trying to make a new release any time soon.

betatim · 2017-03-08T14:54:05Z

I am -1 on updating the version. IMO this should happen "as we tag" the next release.

I'd still update the docs to make them more clear on what happens. "ignore non ACGT" is kinda vague, especially if we care at the level of "ignore per kmer, ignore per read". 'here we set ‘force=True’ in add_sequence to ignore non-ACTGN characters' -> 'here we set ‘force=True’ in add_sequence to what-we-actually-do when encountering a non-ACTGN character in a kmer' or some such.

ctb · 2017-03-21T13:08:36Z

I decided to update the version to 2.0-alpha, mostly to indicate that it is alpha :)

ctb · 2017-03-21T13:09:24Z

@swamidass if you merge in swamidass#2, I think we're ready for merge.

ctb · 2017-03-21T13:14:21Z

I also updated the docs in swamidass#2 so that it's clear that k-mers containing non-ACGT are skipped in Python.

betatim · 2017-03-21T14:14:43Z

👍

Fixed handing of ambiguous nucleotides, issue sourmash-bio#137

b6d0c7f

swamidass added 2 commits February 27, 2017 21:08

added tests

2896e9e

error on test.

48a7052

swamidass mentioned this pull request Feb 28, 2017

Ambiguous Nucleotide handling question? #137

Closed

ctb added 3 commits February 28, 2017 20:37

Merge branch 'master' of github.com:dib-lab/sourmash into sw_master

1fecd6e

remove now-unnecessary _forcedna function

e992859

added test for bad k-mers in middle of sequence

e2b0f9e

ctb mentioned this pull request Mar 1, 2017

Update PR to remove some now-unnecessary code, add a test swamidass/sourmash#1

Merged

Merge pull request #1 from dib-lab/fix/bad_kmers

fb7a6c1

Update PR to remove some now-unnecessary code, add a test

betatim reviewed Mar 1, 2017

View reviewed changes

setup.py Show resolved Hide resolved

betatim reviewed Mar 1, 2017

View reviewed changes

ctb mentioned this pull request Mar 21, 2017

Update handling of ambiguous nucleotides swamidass/sourmash#2

Closed

5 tasks

ctb mentioned this pull request Mar 27, 2017

Fix handling of ambiguous nucleotides #148

Merged

5 tasks

betatim merged commit fb7a6c1 into sourmash-bio:master Mar 27, 2017

ctb mentioned this pull request Jan 12, 2019

Update author list for 4.0 #622

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fixed handling of ambiguous nucleotides, issue #137 #138

Fixed handling of ambiguous nucleotides, issue #137 #138

swamidass commented Feb 26, 2017 •

edited

Loading

swamidass commented Feb 26, 2017

swamidass commented Feb 26, 2017

ctb commented Feb 27, 2017

codecov-io commented Feb 28, 2017 •

edited

Loading

swamidass commented Feb 28, 2017

ctb commented Feb 28, 2017

swamidass commented Feb 28, 2017

ctb commented Mar 1, 2017

ctb commented Mar 1, 2017

swamidass commented Mar 1, 2017

betatim Mar 1, 2017

ctb Mar 21, 2017

betatim Mar 1, 2017

ctb Mar 21, 2017

betatim Mar 1, 2017

ctb Mar 21, 2017

betatim Mar 1, 2017

ctb Mar 21, 2017

betatim commented Mar 1, 2017

betatim commented Mar 1, 2017

betatim commented Mar 1, 2017

ctb commented Mar 1, 2017 via email

swamidass commented Mar 7, 2017

ctb commented Mar 7, 2017

betatim commented Mar 8, 2017

ctb commented Mar 21, 2017

ctb commented Mar 21, 2017

ctb commented Mar 21, 2017

betatim commented Mar 21, 2017


		for (size_t i=0; i < seqsize; ++i) {
		for (size_t i=0; i < seq.length(); ++i) {

Fixed handling of ambiguous nucleotides, issue #137 #138

Fixed handling of ambiguous nucleotides, issue #137 #138

Conversation

swamidass commented Feb 26, 2017 • edited Loading

swamidass commented Feb 26, 2017

swamidass commented Feb 26, 2017

ctb commented Feb 27, 2017

codecov-io commented Feb 28, 2017 • edited Loading

Codecov Report

swamidass commented Feb 28, 2017

ctb commented Feb 28, 2017

swamidass commented Feb 28, 2017

ctb commented Mar 1, 2017

ctb commented Mar 1, 2017

swamidass commented Mar 1, 2017

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

betatim commented Mar 1, 2017

betatim commented Mar 1, 2017

betatim commented Mar 1, 2017

ctb commented Mar 1, 2017 via email

swamidass commented Mar 7, 2017

ctb commented Mar 7, 2017

betatim commented Mar 8, 2017

ctb commented Mar 21, 2017

ctb commented Mar 21, 2017

ctb commented Mar 21, 2017

betatim commented Mar 21, 2017

swamidass commented Feb 26, 2017 •

edited

Loading

codecov-io commented Feb 28, 2017 •

edited

Loading