Fixed handling of ambiguous nucleotides, issue #137 #138
Also, on my machine, there are 21 failed tests, but this was before the change to the code.
Looks like two small changes need to be made to the tests (because bad kmers are now being excluded). This is easy to do. If you approve otherwise, I will make the change.
A couple of notes -- first, you can leave the checklists as they are, github will make them clickable! second, we should fix the master branch tests on your system before proceeding. note that they are passing on Travis for both 2.7 and 3.5 (https://travis-ci.org/dib-lab/sourmash) so either we have an error in our install documentation or something wonky is going on on your system. third, could you add a test (I suggest to thx!
Codecov Report
@@            Coverage Diff             @@
##           master     #138      +/-   ##
==========================================
- Coverage   87.59%   87.52%   -0.08%
==========================================
  Files          18       18
  Lines        2338     2332       -6
  Branches       51       52       +1
==========================================
- Hits         2048     2041       -7
  Misses        282      282
- Partials        8        9       +1
Continue to review full report at Codecov.
So it looks like the fix works now on Travis. As a separate issue (even before this fix), the build does not pass on my macOS laptop. Not sure how to fix that.
Thanks, I'll take a look! re your tests, can you post some of the error messages from the failing tests?
See #139. Seems like a lot of save/load errors (which probably totally breaks sourmash here).
@swamidass please see swamidass#1 - I realized that the We could always add a
Update PR to remove some now-unnecessary code, add a test
@betatim @luizirber could you take a quick look to make sure we didn't miss something? I think this is ready for review & merge.
Up to you on that behavior. I merged it in, but that behavior needs to be documented. It might be better if any non-standard nucleotide just gets skipped, rather than raising an exception. That is probably the best behavior. What do you think?
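The two behaviors under discussion (raise on a non-standard nucleotide, or skip it) can be sketched with a toy class. This is only an illustration, not sourmash's implementation: `ToySketch` and `VALID_BASES` are hypothetical names, and the `force` flag merely mirrors the `force` argument to `add_sequence` mentioned in this thread.

```python
VALID_BASES = frozenset("ACGT")


class ToySketch:
    """Toy stand-in for a k-mer sketch; it only collects k-mers in a set."""

    def __init__(self, ksize):
        self.ksize = ksize
        self.kmers = set()

    def add_sequence(self, seq, force=False):
        for i in range(len(seq) - self.ksize + 1):
            kmer = seq[i:i + self.ksize]
            if not VALID_BASES.issuperset(kmer):
                if force:
                    continue  # skip only the offending k-mer
                raise ValueError("invalid DNA character in k-mer: " + kmer)
            self.kmers.add(kmer)


strict = ToySketch(4)
try:
    strict.add_sequence("ATGR")  # 'R' is an ambiguity code
except ValueError as exc:
    print("rejected:", exc)

lenient = ToySketch(4)
lenient.add_sequence("ATGRACGTA", force=True)
print(sorted(lenient.kmers))  # ['ACGT', 'CGTA']
```

In this toy version a strict call that raises partway through keeps the k-mers added before the bad one; whether that is acceptable is a separate design question.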
     with pytest.raises(ValueError):
-        mh.add_sequence('ATGR')
+        mh.add_sequence('ATGR')
spurious whitespace.
Fixed in swamidass#2
@@ -176,14 +177,27 @@ def test_basic_dna_bad_2(track_abundance):

 def test_basic_dna_bad_force(track_abundance):
     # test behavior on bad DNA
-    mh = MinHash(1, 4, track_abundance=track_abundance)
+    mh = MinHash(100, 4, track_abundance=track_abundance)
For my curiosity: why the change from 1 to 100?
Added comment in swamidass#2 - we want to store multiple hashes.
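One way to read "we want to store multiple hashes": a sketch that keeps a single hash reports a size of 1 whether one or several k-mers were accepted, so a test cannot tell them apart. The toy bottom-n sketch below illustrates that under stated assumptions: `toy_hash` and `bottom_n` are hypothetical helpers, and the SHA-1-based hash is an arbitrary stand-in, not the hash sourmash uses.

```python
import hashlib


def toy_hash(kmer):
    """Stable 64-bit hash of a k-mer (arbitrary choice for this sketch)."""
    return int.from_bytes(hashlib.sha1(kmer.encode()).digest()[:8], "big")


def bottom_n(seq, n, ksize):
    """Keep the n smallest hashes over the ACGT-only k-mers of seq."""
    hashes = set()
    for i in range(len(seq) - ksize + 1):
        kmer = seq[i:i + ksize]
        if set(kmer) <= set("ACGT"):  # skip k-mers with ambiguous bases
            hashes.add(toy_hash(kmer))
    return sorted(hashes)[:n]


seq = "ATGNCGATTGA"  # four k-mers contain 'N', four are clean
print(len(bottom_n(seq, 1, 4)))    # capped at 1, hides how many were kept
print(len(bottom_n(seq, 100, 4)))  # shows every k-mer that was kept
```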
-    std::string _checkdna(const char * s, bool force=false) const {
-        std::string seq = s;
-        const size_t seqsize = strlen(s);
+    bool _checkdna(std::string seq) const {
-> const std::string seq
Fixed in swamidass#2
-        for (size_t i=0; i < seqsize; ++i) {
+        for (size_t i=0; i < seq.length(); ++i) {
Would be nice to switch to for (auto b : seq) {...} as we touch this code. Up to you.
Not done yet.
There should be some docs to update for this change. If not we should add to the docs what the new behaviour is. I've looked at the code/technical part. Whether this is the thing to do or not for science I'll leave to someone else.
Did some quick scanning of the docs and found these places that mention ACGTN characters and behaviour:
Can't see the output of travis ATM, I think this is because they use S3 to store them. Will check back later.
Skipping non-standard nucleotides is the default behavior from the command
line 'compute' call now. The API does not do so by default. That seems OK
to me.
What are we waiting for here? Given the agreement about incrementing version. The only additional change I could make is changing the increment from 1.1 -> 1.1.1 to either 1.2 or 2.0. Please let me know what you suggest, or please merge the change. Thanks.
hi @swamidass, @betatim put in a few comments above. Those should be addressed (or discussed if you disagree). I'd be happy to do that, but I'm traveling and also in the middle of writing a grant, so it may take me another few days. Is there any particular hurry? You (and others) can use this branch as-is and we're not trying to make a new release any time soon.
I am -1 on updating the version. IMO this should happen "as we tag" the next release. I'd still update the docs to make them more clear on what happens. "ignore non ACGT" is kinda vague, especially if we care at the level of "ignore per kmer, ignore per read". 'here we set ‘force=True’ in add_sequence to ignore non-ACTGN characters' -> 'here we set ‘force=True’ in add_sequence to what-we-actually-do when encountering a non-ACTGN character in a kmer' or some such. |
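The difference between "ignore per kmer" and "ignore per read" can be made concrete with a small sketch. The helper names (`kmers`, `keep_per_kmer`, `keep_per_read`) are hypothetical and exist only for illustration; neither function is taken from sourmash.

```python
VALID = set("ACGT")


def kmers(seq, ksize):
    """All overlapping k-mers of seq, in order."""
    return [seq[i:i + ksize] for i in range(len(seq) - ksize + 1)]


def keep_per_kmer(reads, ksize):
    """Drop only the k-mers that contain a non-ACGT character."""
    return [k for r in reads for k in kmers(r, ksize) if set(k) <= VALID]


def keep_per_read(reads, ksize):
    """Drop every k-mer of any read that contains a non-ACGT character."""
    return [k for r in reads if set(r) <= VALID for k in kmers(r, ksize)]


reads = ["ACGTAC", "ACGTNACGA"]
print(keep_per_kmer(reads, 4))  # ['ACGT', 'CGTA', 'GTAC', 'ACGT', 'ACGA']
print(keep_per_read(reads, 4))  # ['ACGT', 'CGTA', 'GTAC']
```

With per-k-mer filtering the clean k-mers on either side of the `N` in the second read are still used; with per-read filtering the whole read is lost.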
I decided to update the version to 2.0-alpha, mostly to indicate that it is alpha :)
@swamidass if you merge in swamidass#2, I think we're ready for merge.
I also updated the docs in swamidass#2 so that it's clear that k-mers containing non-ACGT are skipped in Python.
👍
Fixes #137 by ignoring DNA k-mers that contain sequence other than ACGT.
make test: Did it pass the tests?
make coverage: Is the new code covered?
without a major version increment. Changing file formats also requires a
major version number increment.
changes were made?
This change does change the kmers that sourmash will use (throwing out the ambiguous ones), so it will change the signatures of files with ambiguous kmers. Therefore I did increment the version to 1.1.1 as a signal to users that there is a change.