Oxidize lca_db #1131

Closed
wants to merge 20 commits into from

Conversation

erikyoung85
Contributor

  • Is it mergeable?
  • make test Did it pass the tests?
  • make coverage Is the new code covered?
  • Did it change the command-line interface? Only additions are allowed
    without a major version increment. Changing file formats also requires a
    major version number increment.
  • Was a spellchecker run on the source code and documentation after
    changes were made?

@erikyoung85 erikyoung85 linked an issue Jul 27, 2020 that may be closed by this pull request
@codecov

codecov bot commented Jul 27, 2020

Codecov Report

Merging #1131 into latest will decrease coverage by 1.09%.
The diff coverage is 74.76%.

@@            Coverage Diff             @@
##           latest    #1131      +/-   ##
==========================================
- Coverage   84.13%   83.04%   -1.10%     
==========================================
  Files          99      101       +2     
  Lines        9218     9948     +730     
==========================================
+ Hits         7756     8261     +505     
- Misses       1462     1687     +225     
Flag Coverage Δ
#rusttests 69.16% <72.02%> (+1.25%) ⬆️

Impacted Files                       Coverage Δ
src/core/src/errors.rs               0.00% <0.00%> (ø)
src/core/src/ffi/lca_db.rs           0.00% <0.00%> (ø)
src/core/src/ffi/mod.rs              0.00% <ø> (ø)
src/core/src/index/mod.rs            61.53% <ø> (ø)
src/core/src/sketch/minhash.rs       92.60% <52.17%> (-0.59%) ⬇️
src/core/src/index/lca_db.rs         86.42% <86.42%> (ø)
sourmash/lca/lca_db.py               95.93% <98.92%> (-1.28%) ⬇️
sourmash/lca/command_gather.py       84.00% <100.00%> (-0.82%) ⬇️
sourmash/lca/command_rankinfo.py     86.36% <100.00%> (-2.53%) ⬇️
... and 4 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7a8e5ac...d89edef. Read the comment docs.

sourmash/lca/lca_db.py (outdated, resolved)
@luizirber
Member

This is an amazing start!

Extra comment: the cbindgen check failed, I think you wrote the include/sourmash.h manually? You can run make include/sourmash.h and it will use cbindgen to update the headers. More info in the dev docs

@erikyoung85
Contributor Author

Thank you for all the help! Also yes I had just forgotten to run make include/sourmash.h one last time to update it before pushing so I thought I could just manually edit it in GitHub. Good to know that's not how it works though!

@luizirber
Member

Thank you for all the help! Also yes I had just forgotten to run make include/sourmash.h one last time to update it before pushing so I thought I could just manually edit it in GitHub. Good to know that's not how it works though!

(I put the CI check because I was always messing it up and forgetting to update include/sourmash.h 😸 )

@luizirber
Member

I saw you started implementing _signatures on the Rust side, that will be... annoying to send across the FFI. You can see the signature loading code on the Rust side and on the Python side.

Moreover, I think it shouldn't be sent across the FFI in general. Something that happens here is that the data is copied and each copy lives in one language, so if it is not copied (and stays on the Rust side) we use half the memory...

More concretely, what I'm trying to say is: Don't expose the Python methods starting with _, because they are internal anyway. Focus on the public methods (save, load, search, insert, gather), and you probably won't need to expose so much to Python.

Do you agree @ctb?

@ctb
Contributor

ctb commented Jul 27, 2020 via email

@luizirber
Member

More concretely, what I'm trying to say is: Don't expose the Python methods starting with _, because they are internal anyway. Focus on the public methods (save, load, search, insert, gather), and you probably won't need to expose so much to Python.

It is also worth pointing out that these methods are exactly the methods defined in the Index abc, as well as the equivalent on the Rust side (the Index trait)

There isn't a fully defined and working index in the Rust side, but the BIGSI prototype might be a good place to check (because it is simpler than the SBT variants)

@erikyoung85
Contributor Author

So, I know my code is kind of all over the place right now, but I am having some trouble. If you run py.test tests/test_lca.py::test_summarize_to_root (with the insert(...) not all messed up from other testing), you can see in the captured stdout call that the JSON that my Rust function produces is this:

{"ksize":31,"scaled":10000,"filename":"","moltype":"DNA","_next_index":0,"_next_lid":0,"ident_to_name":{},"ident_to_idx":{},"idx_to_lid":{},"lineage_to_lid":0,"lid_to_lineage":{},"hashval_to_idx":{}}

While the python save function produces this json:

OrderedDict([('version', '2.1'), ('type', 'sourmash_lca'), ('license', 'CC0'), ('ksize', 31), ('scaled', 10000), ('moltype', 'DNA'), ('lid_to_lineage', {0: (LineagePair(rank='superkingdom', name='Bacteria'), LineagePair(rank='phylum', name='Actinobacteria'), LineagePair(rank='class', name='Actinobacteria')), 1: (LineagePair(rank='superkingdom', name='Archaea'), LineagePair(rank='phylum', name='Euryarcheoata'), LineagePair(rank='class', name='unassigned'), LineagePair(rank='order', name='unassigned'), LineagePair(rank='family', name='novelFamily_I'))}), ('hashval_to_idx', {677636055878: [0], 10532770555280: [0], 21323707116253: [0], 23650087715625: [0], 26172103046563: [0], 26416025663777: [0], 32827272301675: [0], 34027997892899: [0], 36621909859455: [0], 38477163578670: [0], 65570422009776: [0], 69948055805402: [0], 72322920475417: [0, 1], 73205278946163: [0], 80487443103162: [0, 1], 86005523509604: [0], 88674429742050: [0], 122275342327700: [0], 122657192016487: [0], 124669419421325: [0], 127379942450416: [0], 135982155721885: [0], 141167198372387: [0], 143233610564679: [0], 146037835579825: [0], 149088436199138: [0], 154296186590415: [0], 163566698849352: [0], 168006515341078: [0], 173613484730321: [0], 175927328346756: [0], 187818979044317: [0], 197497269852505: [0], 225323352564973: [0], 229529589927513: [0], 231592490063953: [0], 246083359950825: [0], 246930056190473: [0], 258171325550743: [0], 266872653019683: [0], 271393622792488: [0], 292237592777531: [0], 294463687174458: [0, 1], 301097385911100: [0], 306907279723793: [0], 315971345465538: [0], 322118929556542: [0, 1], 327926498479880: [0], 327996008288290: [0], 329949136401679: [0, 1], 333501170444664: [0], 340840028907497: [0], 356484201703406: [0], 360520478298478: [0], 373965689901528: [0], 380496788496979: [0], 384971045114358: [0], 408455476549437: [0], 418537032654770: [0], 452668469764051: [0], 454571148747378: [0], 456097783235814: [0], 460967446199335: [0], 464303542748290: [0], 467354489395525: 
[0], 469368479161635: [0], 484121529768895: [0, 1], 492272432625540: [0], 498431667066412: [0], 503091991500871: [0], 524057246334361: [0], 524169839913687: [0], 554072107807437: [0, 1], 559120046152961: [0, 1], 566268904069704: [0], 568931529038649: [0], 570493161295337: [0], 579322859508178: [0], 585895542804007: [0], 594897004146314: [0, 1], 605809484677014: [0], 606853396955387: [0], 608726514018487: [0], 620285740132375: [0], 628263812008065: [0], 635967823738359: [0], 638041616271469: [0], 653341905911334: [0], 657318659722919: [0], 657693323602493: [0], 672257243390674: [0], 677189579262144: [0], 679765700966803: [0], 688698632142544: [0], 694165516352500: [0], 719640443241057: [0], 723051842737704: [0], 733512402336336: [0], 734277585479279: [0], 737687428617228: [0, 1], 740928265911180: [0], 742732499035659: [0], 749922904334092: [0, 1], 761749290661344: [0], 762080967926036: [0, 1], 777446941363616: [0], 790512741455544: [0], 799943682765610: [0], 826682986140101: [0], 831617511989752: [0], 834125117437721: [0], 845723944231464: [0], 853068169026704: [0], 861633725225086: [0], 869793640345830: [0], 872222674810488: [0], 873141197518956: [0], 875896752893151: [0], 885089476028731: [0], 888795178891357: [0, 1], 904473638200328: [0], 905926272987663: [0], 909715607504390: [0, 1], 910999338164070: [0], 928211917105988: [0], 935996944505539: [0], 950484443151310: [0], 978361670639418: [0], 988698601944372: [0], 988841326130276: [0], 997097293575672: [0], 997886867252960: [0], 1004777540454115: [0], 1010247183930614: [0, 1], 1014647470950684: [0, 1], 1015633919552961: [0], 1021701031915167: [0, 1], 1039974078460375: [0], 1042253594541629: [0], 1044243227196291: [0], 1051254261312122: [0], 1053068789146557: [0], 1068524729760038: [0], 1069259522151029: [0], 1086655873752672: [0], 1098268603781474: [0], 1099460416099821: [0], 1130616772766051: [0, 1], 1132179190652581: [0], 1140460425857898: [0], 1157770929298680: [0], 1177500624327827: [0], 1201583916038072: 
[0], 1213950438122195: [0], 1218496832319367: [0, 1], 1233697606659939: [0, 1], 1240024106232118: [0], 1260373750402862: [0], 1261298382176817: [0], 1266632373732023: [0], 1297677436162984: [0], 1308937182195042: [0], 1313357506193451: [0], 1314088474185236: [0], 1319215544180508: [0, 1], 1326659730519238: [0], 1332638989279287: [0], 1341489747051762: [0, 1], 1344666390510803: [0], 1358507616166027: [0], 1364526514789664: [0], 1374414410845462: [0], 1379178749359157: [0], 1411795827653081: [0], 1418877090127558: [0], 1422284639820829: [0], 1442391681236064: [0], 1446070912935402: [0], 1451794179742379: [0, 1], 1455176926535326: [0, 1], 1457250282380994: [0, 1], 1471239952699448: [0], 1472018286353104: [0], 1475115102361638: [0, 1], 1485858379436109: [0], 1491132968778757: [0], 1496058912626479: [0], 1514651028551503: [0], 1516231737669559: [0], 1530243797902427: [0, 1], 1538210010050896: [0], 1539188691646823: [0], 1541823434890218: [0], 1544692009598139: [0], 1556646134540043: [0], 1569299506943639: [0], 1577063564886051: [0], 1582327585753236: [0], 1585942565641532: [0], 1588845643148084: [0], 1589471802228022: [0], 1591852458231680: [0], 1595066965487245: [0], 1600603268174083: [0], 1611179808275435: [0], 1617880700081966: [0], 1629175680014346: [0], 1637207156102717: [0], 1644692101510079: [0], 1664361934814586: [0], 1667011068974873: [0], 1675760629311224: [0], 1676162180088113: [0], 1684327880293652: [0], 1687010869501551: [0], 1691473560692150: [0], 1697538954643862: [0], 1700678324584221: [0], 1714715924642011: [0], 1717791883295450: [0], 1718007860258699: [0], 1720151630947142: [0], 1734959730182571: [0], 1747972312176784: [0], 1752607074283337: [0], 1765213439204150: [0], 1772588244577862: [0], 1780614184394756: [0], 1782595500119702: [0], 1790497036442465: [0], 1803697611060216: [0], 1823356122153390: [0], 1826317190915077: [0], 1826910040381509: [0], 6424817699567: [1], 17230694741932: [1], 32421217493157: [1], 45488378809738: [1], 127742693439050: [1], 
164517649390910: [1], 201217244473599: [1], 217162288193096: [1], 246762037247108: [1], 262234175368412: [1], 262791161237335: [1], 272505325039917: [1], 277324884532355: [1], 311662175632721: [1], 325215098473953: [1], 387002232314619: [1], 393699253399567: [1], 394768770772846: [1], 466206522727832: [1], 470414528207885: [1], 493893874011097: [1], 527453245826455: [1], 530485938179261: [1], 567229686862002: [1], 585288394533148: [1], 601062183481693: [1], 606163033754391: [1], 608365859974000: [1], 618559876554536: [1], 620744022654408: [1], 624273633519077: [1], 629748800966904: [1], 633336539901794: [1], 647044146011998: [1], 671953628767928: [1], 675747991790996: [1], 693762035805281: [1], 706674702168116: [1], 715624515455477: [1], 756901471752548: [1], 783726333190507: [1], 803651776423768: [1], 809919776498321: [1], 812600135047036: [1], 826810085321987: [1], 844109680777579: [1], 852703926930534: [1], 924584188428876: [1], 928438986469378: [1], 930829744244346: [1], 964792911995463: [1], 982450057864598: [1], 1016575128012012: [1], 1021878034998483: [1], 1025065552182260: [1], 1044109405188955: [1], 1065415986674745: [1], 1078919352502205: [1], 1090666954848274: [1], 1091646104337602: [1], 1093474603859665: [1], 1106791433936686: [1], 1111422879658727: [1], 1130365356387445: [1], 1130766066256384: [1], 1131216747256357: [1], 1161901545764752: [1], 1178273951045225: [1], 1204826369953091: [1], 1223711830342248: [1], 1234577273172292: [1], 1236387731573904: [1], 1261327144420299: [1], 1269832376859503: [1], 1304467121664073: [1], 1319846165007988: [1], 1322204539499908: [1], 1392533349661369: [1], 1469673662960476: [1], 1474383633719149: [1], 1476581670318628: [1], 1556621206408019: [1], 1568529305793924: [1], 1580804078113372: [1], 1604635532663041: [1], 1612129657060372: [1], 1640572833119287: [1], 1645079393158673: [1], 1690300038267201: [1], 1720974846911383: [1], 1725597171425578: [1], 1739165618551272: [1], 1745458554036611: [1], 1746094489285024: [1], 
1789484202623589: [1], 1794577466518143: [1], 1809451167939609: [1], 1829011946314518: [1], 1829968336033223: [1]}), ('ident_to_name', {'TARA_MED_MAG_00029': 'TARA_MED_MAG_00029', 'TOBG_MED-875': 'TOBG_MED-875'}), ('ident_to_idx', {'TARA_MED_MAG_00029': 0, 'TOBG_MED-875': 1}), ('idx_to_lid', {0: 0, 1: 1})]) 

What I interpret from this is either my serialize function is wrong, or the dictionaries in python are not transferring to the HashMaps in rust. I've scoured the internet and other structures you have already built in rust but haven't been able to make anything work. Is there something I'm missing?

Also passing the lineage parameter in the insert(...) function in lca_db.py has been giving me some grief. I think it is because it is a bunch of LineagePair tuples which contain strings. Would the best thing to do to be able to pass this as an argument be to convert all of these strings to bytes? Depending on how long the list of tuples there are I would assume that would take quite a bit of time to do in Python but maybe not :)

Any and all help is very very welcome

@luizirber luizirber changed the base branch from master to latest August 5, 2020 02:15
@luizirber
Member

So, I know my code is kind of all over the place right now, but I am having some trouble. If you run py.test tests/test_lca.py::test_summarize_to_root (with the insert(...) not all messed up from other testing), you can see in the captured stdout call that the JSON that my Rust function produces is this:

{"ksize":31,"scaled":10000,"filename":"","moltype":"DNA","_next_index":0,"_next_lid":0,"ident_to_name":{},"ident_to_idx":{},"idx_to_lid":{},"lineage_to_lid":0,"lid_to_lineage":{},"hashval_to_idx":{}}

Note here that _next_index and _next_lid are internal values and don't need to be saved, since they can be computed from other information (_next_index = max(ident_to_idx.values()) + 1, for example). And some info is redundant (lineage_to_lid can be computed from lid_to_lineage).
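
To illustrate the point above, here is a minimal Python sketch of rebuilding the internal counters and the reverse mapping after loading. The helper name rebuild_internal_state and the toy data are assumptions for illustration only (they are not part of sourmash), and lineages are assumed to be stored as tuples so they can be used as dictionary keys.

```python
def rebuild_internal_state(ident_to_idx, lid_to_lineage):
    """Recompute values that do not need to be saved (hypothetical helper)."""
    # Next free index is one past the largest index in use (0 for an empty DB).
    next_index = max(ident_to_idx.values(), default=-1) + 1
    next_lid = max(lid_to_lineage.keys(), default=-1) + 1
    # Invert lid -> lineage; lineages must be hashable (tuples) to be keys.
    lineage_to_lid = {lineage: lid for lid, lineage in lid_to_lineage.items()}
    return next_index, next_lid, lineage_to_lid


# Toy data shaped like the Python dump earlier in this thread.
ident_to_idx = {"TARA_MED_MAG_00029": 0, "TOBG_MED-875": 1}
lid_to_lineage = {
    0: (("superkingdom", "Bacteria"), ("phylum", "Actinobacteria")),
    1: (("superkingdom", "Archaea"), ("phylum", "Euryarcheoata")),
}
next_index, next_lid, lineage_to_lid = rebuild_internal_state(
    ident_to_idx, lid_to_lineage
)
```

Since everything here is derived from fields that are already saved, none of it needs to appear in the serialized JSON.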

We are still missing a proper json schema for what fields are in each index (hopefully in #578?), but here is a list of the keys that should be present (using a LCA DB from the test data):

$ jq keys <(zcat tests/test-data/prot/protein.lca.json.gz)
[
  "hashval_to_idx",
  "ident_to_idx",
  "ident_to_name",
  "idx_to_lid",
  "ksize",
  "license",
  "lid_to_lineage",
  "moltype",
  "scaled",
  "type",
  "version"
]

What I interpret from this is either my serialize function is wrong, or the dictionaries in python are not transferring to the HashMaps in rust.

While the load method in Python should be able to load data saved in Rust (or Rust load the data saved in Python), I wouldn't worry too much about making them interoperate (because only the Rust one will be left after this PR).

Maybe it's easier to write some tests on the Rust side that "roundtrip" an existing LCA DB. By roundtrip I mean loading a .lca.json.gz file, saving it, loading again (in another variable) and comparing if you get the same data.

Something like this (put this at the end of src/core/src/index/lca_db.rs):

#[cfg(test)]
mod test {
    use std::io::{Seek, SeekFrom};
    use std::path::PathBuf;

    use super::LcaDB;

    #[test]
    fn lca_roundtrip() {
        // Locate the test database relative to the crate root.
        let mut filename = PathBuf::from(env!("CARGO_MANIFEST_DIR"));
        filename.push("../../tests/test-data/lca/delmont-1.lca.json");

        let lcadb = LcaDB::load(filename).unwrap();

        // Save to a temporary file...
        let mut tmpfile = tempfile::NamedTempFile::new().unwrap();
        lcadb.save(tmpfile.path()).unwrap();

        // Rewind the file handle (carried over from the SBT test).
        tmpfile.seek(SeekFrom::Start(0)).unwrap();

        // ...and load it again into a second variable.
        let lcadb_2 = LcaDB::load(tmpfile.path()).unwrap();

        // Both copies should hold the same data.
        assert_eq!(lcadb.ksize, lcadb_2.ksize);
        assert_eq!(lcadb.scaled, lcadb_2.scaled);
        assert_eq!(lcadb.moltype, lcadb_2.moltype);
        assert_eq!(lcadb.idx_to_lid, lcadb_2.idx_to_lid);
        assert_eq!(lcadb.hashval_to_idx, lcadb_2.hashval_to_idx);
    }
}

(this is copied from the SBT tests, might need some adaptation...)
Also note that putting tests in src/core/src/index/lca_db.rs allows accessing private fields of the struct, which is good for testing behavior that you don't necessarily want exposed in the public interface (you could check if _next_index is the same, for example, without making it pub)
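
For comparison, the same load, save, load again, compare idea can be sketched on the Python side. This is only an illustration under stated assumptions: save_db and load_db are hypothetical helpers (not the sourmash API), and the dictionary is a toy stand-in for a real LCA database from tests/test-data.

```python
import gzip
import json
import os
import tempfile


def save_db(db, path):
    # Write the dict as gzip-compressed JSON text.
    with gzip.open(path, "wt", encoding="utf-8") as fp:
        json.dump(db, fp)


def load_db(path):
    # Read gzip-compressed JSON text back into a dict.
    with gzip.open(path, "rt", encoding="utf-8") as fp:
        return json.load(fp)


# Toy stand-in for an LCA database.
original = {"ksize": 31, "scaled": 10000, "moltype": "DNA",
            "idx_to_lid": {0: 0, 1: 1}}

with tempfile.TemporaryDirectory() as tmpdir:
    first = os.path.join(tmpdir, "first.lca.json.gz")
    second = os.path.join(tmpdir, "second.lca.json.gz")
    save_db(original, first)
    loaded = load_db(first)
    save_db(loaded, second)
    reloaded = load_db(second)

# The two loaded copies agree, but integer keys came back as strings.
roundtrip_ok = loaded == reloaded
keys_are_strings = list(loaded["idx_to_lid"]) == ["0", "1"]
```

JSON object keys are always strings, so comparing a loaded copy against the in-memory dict directly would fail for fields such as idx_to_lid unless the keys are converted back on load. Comparing two loaded copies, as the roundtrip test does, sidesteps that.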

I've scoured the internet and other structures you have already built in rust but haven't been able to make anything work. Is there something I'm missing?

Because we need to invert lid_to_lineage to generate lineage_to_lid, it is useful to have something that can be used as a key in a HashMap. Representing lineages with a HashMap doesn't work (HashMap does not implement Hash, so it can't be a key), but a BTreeMap does. I suggest using

type LineagePairs = BTreeMap<String, String>; 

and then define lid_to_lineage and lineage_to_lid as

lid_to_lineage: HashMap<u32, LineagePairs>,
lineage_to_lid: HashMap<LineagePairs, u32>,  

Also passing the lineage parameter in the insert(...) function in lca_db.py has been giving me some grief. I think it is because it is a bunch of LineagePair tuples which contain strings.

Yeah, that's going to be annoying... I think the best example is save_signatures on the Python side and signatures_save_buffer on the Rust side.
You pretty much will have to build a pointer-of-pointers on the Python side, and rebuild it on the Rust side with the appropriate type.

That said, I suggest again defining some tests on the Rust side that call .insert and pass a Lineage, and once they work in Rust then try to expose to Python and make the Python tests work. I tend to have a hard time if I need to think about all the layers at the same time, and once I have something working on the Rust side it is easier to go back and refactor to fit better what is expected in the FFI...

Would the best thing to do to be able to pass this as an argument be to convert all of these strings to bytes? Depending on how long the list of tuples there are I would assume that would take quite a bit of time to do in Python but maybe not :)

That may be a good idea! It will probably be easier to pass a JSON (as bytes) as the lineage parameter, instead of building structs and figuring out many string allocations.
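
A rough sketch of what the Python side of that could look like, assuming the lineage is sent as a JSON object mapping rank to name. The LineagePair namedtuple below is a stand-in defined for this example, and lineage_to_json_bytes is a hypothetical helper, not an existing sourmash function.

```python
import json
from collections import namedtuple

# Hypothetical stand-in for sourmash's LineagePair.
LineagePair = namedtuple("LineagePair", ["rank", "name"])


def lineage_to_json_bytes(lineage):
    """Encode (rank, name) pairs as a UTF-8 JSON object mapping rank -> name."""
    return json.dumps(dict(lineage), sort_keys=True).encode("utf-8")


lineage = (
    LineagePair("superkingdom", "Bacteria"),
    LineagePair("phylum", "Actinobacteria"),
)
payload = lineage_to_json_bytes(lineage)
# The receiving side would parse the bytes back into a rank -> name mapping.
decoded = json.loads(payload.decode("utf-8"))
```

One trade-off to keep in mind: a rank -> name object (like a BTreeMap keyed by rank) does not keep the taxonomic order of the ranks. If that order matters, sending a JSON list of [rank, name] pairs would preserve it.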

I hope this is helpful, and please continue asking great questions =]

@@ -291,6 +309,7 @@ def save(self, db_name):
save_d['idx_to_lid'] = self.idx_to_lid
save_d['lid_to_lineage'] = self.lid_to_lineage

print("\n\nPYTHON:\n", save_d, "\n\n")
Member

If you want to compare the JSON generated by both, it might be better to use json.dumps here, print the generated buffer, and then write the buffer to fp. At this point save_d is still a Python dict, so while it is comparable it is not exactly JSON yet.
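
A minimal sketch of that suggestion, with a toy save_d and an in-memory buffer standing in for the real output file. The sort_keys=True argument is an addition here (not part of the suggestion) that makes the output easier to diff against another serializer.

```python
import io
import json

# Toy stand-in for the dict built in save().
save_d = {"version": "2.1", "type": "sourmash_lca", "ksize": 31,
          "idx_to_lid": {0: 0, 1: 1}}

# Serialize once, so what is printed is exactly what gets written.
buf = json.dumps(save_d, sort_keys=True)
print("PYTHON:", buf)

fp = io.StringIO()  # stand-in for the real output file
fp.write(buf)
```

Printing the serialized buffer instead of the dict also shows details that only appear after serialization, such as integer keys becoming strings.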

use crate::sketch::minhash::{KmerMinHash, HashFunctions, max_hash_for_scaled};
use crate::signature::{Signature, SigsTrait};
use crate::Error;
use crate::ffi::lca_db::AcceptedLineagePair;
Member

Bringing things from crate::ffi is usually a sign that whatever is defined there should probably be here instead. (Same with std::ffi and std::os::raw::c_char, they should rarely be used outside crate::ffi)

}
}

pub fn c_char_to_string(char_ptr: *const c_char) -> String {
Member

This can be a free function (not really part of LcaDB, as it doesn't take self as first parameter).

(and this conversion should be happening in crate::ffi too =])

Makefile (outdated, resolved)
@erikyoung85
Contributor Author

Thank you so much that helped a ton especially the passing lineage as a json idea!! I have gotten insert working on the rust side now and am just trying to make the ffi work. I have successfully put lineage into a json string, passed it as a parameter, and converted it to a LineagePairs BTreeMap on the rust side. The problem is, when it returns to python after executing the rust insert function, all the changes to self have been erased. Is there some sort of conversion function that I need to implement or something?

@luizirber
Member

Thank you so much that helped a ton especially the passing lineage as a json idea!! I have gotten insert working on the rust side now and am just trying to make the ffi work.

🎉

I have successfully put lineage into a json string, passed it as a parameter, and converted it to a LineagePairs BTreeMap on the rust side. The problem is, when it returns to python after executing the rust insert function, all the changes to self have been erased. Is there some sort of conversion function that I need to implement or something?

Hmm, can you push your code so I can take a look? The changes shouldn't be erased...

@erikyoung85
Contributor Author

Thanks for helping, this is the test I was using to get my print and println! outputs: py.test tests/test_lca.py::test_summarize_to_root

@luizirber
Member

Thanks for helping, this is the test I was using to get my print and println! outputs: py.test tests/test_lca.py::test_summarize_to_root

Before I go into the code, do you know about dbg! in Rust? It's pretty useful, since it also prints the line where it is

sourmash/lca/lca_db.py (outdated, resolved)
@erikyoung85
Contributor Author

erikyoung85 commented Aug 8, 2020

I did not know about that but that's cool especially that it formats it for you too! I keep finding new useful stuff about rust like everyday.

@luizirber
Member

I did not know about that but that's cool! I keep finding new useful stuff about rust like everyday. Question: is dbg! similar to the debug function in python? I've seen that one around also.

dbg! comes from the Rust stdlib, debug is from our internal logging module. I think the closest equivalent to debug in Rust would be the debug! macro from the log crate.

@luizirber
Member

luizirber commented Aug 26, 2020

(note that you want to merge the latest branch changes, not master =])

@erikyoung85
Contributor Author

(note that you want to merge the latest branch changes, not master =])

Ah, thank you :)

@erikyoung85
Contributor Author

So I'm trying to set up a testing environment and I'm trying to use the sourmash resources that were linked in your blog post... however I keep getting this error when I try to make or snakemake:

(sourmash_resources) erik@MacBook-Pro sourmash_resources % make
snakemake --use-conda -j1
Building DAG of jobs...
Nothing to be done.
Password:
tee: /sys/devices/system/cpu/cpufreq/boost: No such file or directory
1
sudo: cpupower: command not found
CalledProcessError in line 204 of /Users/erik/Desktop/Sourmash/sourmash_resources/Snakefile:
Command 'set -euo pipefail;  scripts/cpu_freq_benchmark' returned non-zero exit status 1.
  File "/Users/erik/Desktop/Sourmash/sourmash_resources/Snakefile", line 204, in __onstart
make: *** [all] Error 1

I'm using the Conda environment created with the environment.yml file.
Any idea of what this means? Haven't found anything on the internet but I will keep searching/trying stuff.

@luizirber
Member

So I'm trying to set up a testing environment and I'm trying to use the sourmash resources that were linked in your blog post... however I keep getting this error when I try to make or snakemake:

I'm using the Conda environment created with the environment.yml file.
Any idea of what this means? Haven't found anything on the internet but I will keep searching/trying stuff.

Ah, remove these lines at the bottom of the Snakefile. I use them to set my CPU (AMD) to maximum frequency, to avoid variations due to frequency scaling.

I highly recommend replacing this line with a smaller selection of branches (maybe just 3.5.0, latest and oxidize_lca_db). Note that you need to create environments in envs/ to match branch names (it would be so cool to have dynamic conda envs in snakemake...), for 3.5.0 (which is released on bioconda) you can check the environment for 3.3.1 as an example, and for latest and oxidize_lca_db you can check the master env.

(these should probably be instructions in the README of that repo 😓)

@erikyoung85
Contributor Author

It seems like it's a lot faster with hashval_to_idx exposed as a cached property instead of making a bunch of small exposed functions to replace code that uses hashval_to_idx. Is there another option that you have used before? Or anything else that you can think of?
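
For readers following along, this is roughly what the cached-property approach looks like in Python. It is a sketch only: LcaDbWrapper and fake_fetch are hypothetical names, with fake_fetch standing in for a single expensive call across the FFI, and functools.cached_property needs Python 3.8 or newer.

```python
from functools import cached_property


class LcaDbWrapper:
    """Hypothetical wrapper; `fetch` stands in for one expensive FFI call."""

    def __init__(self, fetch):
        self._fetch = fetch

    @cached_property
    def hashval_to_idx(self):
        # Runs once per instance; later accesses reuse the stored dict.
        return self._fetch()


calls = []


def fake_fetch():
    # Record each call so the caching behavior can be observed.
    calls.append(1)
    return {677636055878: [0], 72322920475417: [0, 1]}


db = LcaDbWrapper(fake_fetch)
first = db.hashval_to_idx
second = db.hashval_to_idx
```

The catch is that the cached dict goes stale once the underlying database changes (after an insert, for example). With functools.cached_property, `del db.hashval_to_idx` drops the cached value so the next access fetches it again.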

P.S. I think the 4 failing tests are coming from the latest branch merge. It looks to me like they have nothing to do with the code in oxidize_lca_db but maybe I'm wrong :)

search.svg (outdated, resolved)
sourmash/lca/command_index.py (outdated, resolved)
lineage = lca_db.lid_to_lineage[lid]
assignments[hashval].add(lineage)
lineage = lca_db._get_lineage_from_idx(idx)
assignments[hashval].add(lineage)
Contributor

what happens here if there is no lineage?

sourmash/signature.py (outdated, resolved)
tests/test-data/search.svg (outdated, resolved)
@@ -792,6 +792,43 @@ impl KmerMinHash {
Ok(new_mh)
}

pub fn downsample_scaled(&self, new_scaled: u64) -> Result<KmerMinHash, Error> {
Contributor

note: this might be something we want to spread throughout the code base after this. maybe create an issue?

@ctb
Contributor

ctb commented Sep 16, 2020

hi @erikyoung85 looks like there is still some cleanup to do (removing the search.svg files, for example). Happy to review after that!

@erikyoung85 erikyoung85 requested a review from ctb September 16, 2020 16:53
@erikyoung85
Contributor Author

Whoops looks like it wasn't tracking those deletions and I didn't notice. Doing it directly on GitHub worked fine :). Thanks for letting me know!

@ctb
Contributor

ctb commented Apr 21, 2022

closing in favor of #1808.

@ctb ctb closed this Apr 21, 2022
@ctb ctb deleted the oxidize_lca_db branch August 20, 2022 15:38
Development

Successfully merging this pull request may close these issues.

Oxidize parts of LCA_Database?
3 participants