Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixed handling of ambiguous nucleotides, issue #137 #138

Merged
merged 7 commits into from
Mar 27, 2017

Conversation

swamidass
Copy link
Contributor

@swamidass swamidass commented Feb 26, 2017

Fixes #137 by ignoring DNA k-mers that contain sequence other than ACGT.

  • Is it mergeable?
  • make test Did it pass the tests?
  • make coverage Is the new code covered?
  • Did it change the command-line interface? Only additions are allowed
    without a major version increment. Changing file formats also requires a
    major version number increment.
  • Was a spellchecker run on the source code and documentation after
    changes were made?

This change does change the kmers that sourmash will use (throwing out the ambiguous ones), so it will change the signatures of files with ambiguous kmers. Therefore I did increment the version to 1.1.1 as a signal to users that there is a change.

@swamidass
Copy link
Contributor Author

Also, on my machine, there are 21 failed tests, but this was before the change to the code.

@swamidass
Copy link
Contributor Author

Looks like two small changes needs to be made to the tests (because bad kmers are now being excluded). This is easy to do. If you approve otherwise, I will make the change.

@ctb
Copy link
Contributor

ctb commented Feb 27, 2017

A couple of notes --

first, you can leave the checklists as they are, github will make them clickable!

second, we should fix the master branch tests on your system before proceeding. note that they are passing on Travis for both 2.7 and 3.5 (https://travis-ci.org/dib-lab/sourmash) so either we have an error in our install documentation or something wonky is going on on your system.

third, could you add a test (I suggest to test__minhash.py, something along the lines of test_compare_1) that illustrates the new behavior? It should fail if the patch to add_sequence is not present, of course.

thx!

@codecov-io
Copy link

codecov-io commented Feb 28, 2017

Codecov Report

Merging #138 into master will decrease coverage by -0.08%.
The diff coverage is 87.5%.

@@            Coverage Diff             @@
##           master     #138      +/-   ##
==========================================
- Coverage   87.59%   87.52%   -0.08%     
==========================================
  Files          18       18              
  Lines        2338     2332       -6     
  Branches       51       52       +1     
==========================================
- Hits         2048     2041       -7     
  Misses        282      282              
- Partials        8        9       +1
Impacted Files Coverage Δ
sourmash_lib/kmer_min_hash.hh 90.28% <87.5%> (-0.88%)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4062ddf...fb7a6c1. Read the comment docs.

@swamidass
Copy link
Contributor Author

So it looks like the fix works now on Travis. As a separate issue, (even before this fix), the build does not pass on my MacOS laptop. Not sure how to fix that.

@ctb
Copy link
Contributor

ctb commented Feb 28, 2017

Thanks, I'll take a look! re your tests, can you post some of the error messages from the failing tests?

@swamidass
Copy link
Contributor Author

See #139. Seems like a lot of save/load errors (which probably totally breaks sourmash here.

@ctb
Copy link
Contributor

ctb commented Mar 1, 2017

@swamidass please see swamidass#1 - I realized that the _forcedna logic was no longer useful, although it introduces new behavior where valid k-mers are added before the exception is raised. What do you think?

We could always add a _checkdna call before the for loop, but then we'd be looking at the entire sequence. shrug

Update PR to remove some now-unnecessary code, add a test
@ctb
Copy link
Contributor

ctb commented Mar 1, 2017

@betatim @luizirber could you take a quick look to make sure we didn't miss something? I think this is ready for review & merge.

@swamidass
Copy link
Contributor Author

Up to you in that behavior. I merged it in, but that behavior needs to be documented. It might be better of any not standard nucleotide just gets skipped, rather than raising an exception. That is probably the best behaivior. What do you think?

with pytest.raises(ValueError):
mh.add_sequence('ATGR')
mh.add_sequence('ATGR')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

spurious whitespace.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in swamidass#2

@@ -176,14 +177,27 @@ def test_basic_dna_bad_2(track_abundance):

def test_basic_dna_bad_force(track_abundance):
# test behavior on bad DNA
mh = MinHash(1, 4, track_abundance=track_abundance)
mh = MinHash(100, 4, track_abundance=track_abundance)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For my curiosity: why the change from 1 to 100?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added comment in swamidass#2 - we want to store multiple hashes.

setup.py Show resolved Hide resolved
std::string _checkdna(const char * s, bool force=false) const {
std::string seq = s;
const size_t seqsize = strlen(s);
bool _checkdna(std::string seq) const {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

-> const std::string seq

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in swamidass#2


for (size_t i=0; i < seqsize; ++i) {
for (size_t i=0; i < seq.length(); ++i) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be nice to switch to for (auto b : seq) {...} as we touch this code. Up to you.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not done yet.

@betatim
Copy link
Contributor

betatim commented Mar 1, 2017

There should be some docs to update for this change. If not we should add to the docs what the new behaviour is.

I've looked at the code/technical part. Whether this is the thing to do or not for science I'll leave to someone else.

@betatim
Copy link
Contributor

betatim commented Mar 1, 2017

Did some quick scanning of the docs and found these places that mention ACGTN characters and behaviour:

@betatim
Copy link
Contributor

betatim commented Mar 1, 2017

Can't see the output of travis ATM, I think this is because they use S3 to store them. Will check back later.

@ctb
Copy link
Contributor

ctb commented Mar 1, 2017 via email

@swamidass
Copy link
Contributor Author

What are we waiting for here?

Given the agreement about incrementing version. The only addition change I could make is changing the increment from 1.1 -> 1.1.1 to either 1.2 or 2.0.

Please let me know what you suggest, or please merge the change. Thanks.

@ctb
Copy link
Contributor

ctb commented Mar 7, 2017

hi @swamidass, @betatim put in a few comments above. Those should be addressed (or discussed if you disagree). I'd be happy to do that, but I'm in traveling and also in the middle of writing a grant, so it may take me another few days.

Is there any particular hurry? You (and others) can use this branch as-is and we're not trying to make a new release any time soon.

@betatim
Copy link
Contributor

betatim commented Mar 8, 2017

I am -1 on updating the version. IMO this should happen "as we tag" the next release.

I'd still update the docs to make them more clear on what happens. "ignore non ACGT" is kinda vague, especially if we care at the level of "ignore per kmer, ignore per read". 'here we set ‘force=True’ in add_sequence to ignore non-ACTGN characters' -> 'here we set ‘force=True’ in add_sequence to what-we-actually-do when encountering a non-ACTGN character in a kmer' or some such.

@ctb
Copy link
Contributor

ctb commented Mar 21, 2017

I decided to update the version to 2.0-alpha, mostly to indicate that it is alpha :)

@ctb
Copy link
Contributor

ctb commented Mar 21, 2017

@swamidass if you merge in swamidass#2, I think we're ready for merge.

@ctb
Copy link
Contributor

ctb commented Mar 21, 2017

I also updated the docs in swamidass#2 so that it's clear that k-mers containing non-ACGT are skipped in Python.

@betatim
Copy link
Contributor

betatim commented Mar 21, 2017

👍

@ctb ctb mentioned this pull request Mar 27, 2017
5 tasks
@betatim betatim merged commit fb7a6c1 into sourmash-bio:master Mar 27, 2017
@ctb ctb mentioned this pull request Jan 12, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants