-
Notifications
You must be signed in to change notification settings - Fork 80
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Oxidize lca_db #1131
Oxidize lca_db #1131
Conversation
Codecov Report
@@ Coverage Diff @@
## latest #1131 +/- ##
==========================================
- Coverage 84.13% 83.04% -1.10%
==========================================
Files 99 101 +2
Lines 9218 9948 +730
==========================================
+ Hits 7756 8261 +505
- Misses 1462 1687 +225
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report at Codecov.
|
This is an amazing start! Extra comment: the cbindgen check failed, I think you wrote the |
Thank you for all the help! Also yes I had just forgotten to run |
(I put the CI check because I was always messing it up and forgetting to update |
I saw you started implementing Moreover, I think it shouldn't be sent across the FFI in general. Something that happens here is that the data is copied and each copy lives in one language, so if it is not copied (and stays on the Rust side) we use half the memory... More concretely, what I'm trying to say is: Don't expose the Python methods starting with Do you agreee @ctb? |
On Mon, Jul 27, 2020 at 02:11:30PM -0700, Luiz Irber wrote:
Do you agreee @ctb?
+1
|
It is also worth pointing out that these methods are exactly the methods defined in the Index abc, as well as the equivalent on the Rust side (the Index trait) There isn't a fully defined and working index in the Rust side, but the BIGSI prototype might be a good place to check (because it is simpler than the SBT variants) |
So, I know my code is kind of all over the place right now, but I am having some trouble. If you run {"ksize":31,"scaled":10000,"filename":"","moltype":"DNA","_next_index":0,"_next_lid":0,"ident_to_name":{},"ident_to_idx":{},"idx_to_lid":{},"lineage_to_lid":0,"lid_to_lineage":{},"hashval_to_idx":{}} While the python save function produces this json: OrderedDict([('version', '2.1'), ('type', 'sourmash_lca'), ('license', 'CC0'), ('ksize', 31), ('scaled', 10000), ('moltype', 'DNA'), ('lid_to_lineage', {0: (LineagePair(rank='superkingdom', name='Bacteria'), LineagePair(rank='phylum', name='Actinobacteria'), LineagePair(rank='class', name='Actinobacteria')), 1: (LineagePair(rank='superkingdom', name='Archaea'), LineagePair(rank='phylum', name='Euryarcheoata'), LineagePair(rank='class', name='unassigned'), LineagePair(rank='order', name='unassigned'), LineagePair(rank='family', name='novelFamily_I'))}), ('hashval_to_idx', {677636055878: [0], 10532770555280: [0], 21323707116253: [0], 23650087715625: [0], 26172103046563: [0], 26416025663777: [0], 32827272301675: [0], 34027997892899: [0], 36621909859455: [0], 38477163578670: [0], 65570422009776: [0], 69948055805402: [0], 72322920475417: [0, 1], 73205278946163: [0], 80487443103162: [0, 1], 86005523509604: [0], 88674429742050: [0], 122275342327700: [0], 122657192016487: [0], 124669419421325: [0], 127379942450416: [0], 135982155721885: [0], 141167198372387: [0], 143233610564679: [0], 146037835579825: [0], 149088436199138: [0], 154296186590415: [0], 163566698849352: [0], 168006515341078: [0], 173613484730321: [0], 175927328346756: [0], 187818979044317: [0], 197497269852505: [0], 225323352564973: [0], 229529589927513: [0], 231592490063953: [0], 246083359950825: [0], 246930056190473: [0], 258171325550743: [0], 266872653019683: [0], 271393622792488: [0], 292237592777531: [0], 294463687174458: [0, 1], 301097385911100: [0], 306907279723793: [0], 315971345465538: [0], 322118929556542: [0, 1], 327926498479880: [0], 327996008288290: [0], 329949136401679: [0, 1], 333501170444664: [0], 340840028907497: [0], 356484201703406: [0], 360520478298478: [0], 373965689901528: [0], 380496788496979: [0], 384971045114358: [0], 408455476549437: [0], 418537032654770: [0], 452668469764051: [0], 454571148747378: [0], 456097783235814: [0], 460967446199335: [0], 464303542748290: [0], 467354489395525: [0], 469368479161635: [0], 484121529768895: [0, 1], 492272432625540: [0], 498431667066412: [0], 503091991500871: [0], 524057246334361: [0], 524169839913687: [0], 554072107807437: [0, 1], 559120046152961: [0, 1], 566268904069704: [0], 568931529038649: [0], 570493161295337: [0], 579322859508178: [0], 585895542804007: [0], 594897004146314: [0, 1], 605809484677014: [0], 606853396955387: [0], 608726514018487: [0], 620285740132375: [0], 628263812008065: [0], 635967823738359: [0], 638041616271469: [0], 653341905911334: [0], 657318659722919: [0], 657693323602493: [0], 672257243390674: [0], 677189579262144: [0], 679765700966803: [0], 688698632142544: [0], 694165516352500: [0], 719640443241057: [0], 723051842737704: [0], 733512402336336: [0], 734277585479279: [0], 737687428617228: [0, 1], 740928265911180: [0], 742732499035659: [0], 749922904334092: [0, 1], 761749290661344: [0], 762080967926036: [0, 1], 777446941363616: [0], 790512741455544: [0], 799943682765610: [0], 826682986140101: [0], 831617511989752: [0], 834125117437721: [0], 845723944231464: [0], 853068169026704: [0], 861633725225086: [0], 869793640345830: [0], 872222674810488: [0], 873141197518956: [0], 875896752893151: [0], 885089476028731: [0], 888795178891357: [0, 1], 904473638200328: [0], 905926272987663: [0], 909715607504390: [0, 1], 910999338164070: [0], 928211917105988: [0], 935996944505539: [0], 950484443151310: [0], 978361670639418: [0], 988698601944372: [0], 988841326130276: [0], 997097293575672: [0], 997886867252960: [0], 1004777540454115: [0], 1010247183930614: [0, 1], 1014647470950684: [0, 1], 1015633919552961: [0], 1021701031915167: [0, 1], 1039974078460375: [0], 1042253594541629: [0], 1044243227196291: [0], 1051254261312122: [0], 1053068789146557: [0], 1068524729760038: [0], 1069259522151029: [0], 1086655873752672: [0], 1098268603781474: [0], 1099460416099821: [0], 1130616772766051: [0, 1], 1132179190652581: [0], 1140460425857898: [0], 1157770929298680: [0], 1177500624327827: [0], 1201583916038072: [0], 1213950438122195: [0], 1218496832319367: [0, 1], 1233697606659939: [0, 1], 1240024106232118: [0], 1260373750402862: [0], 1261298382176817: [0], 1266632373732023: [0], 1297677436162984: [0], 1308937182195042: [0], 1313357506193451: [0], 1314088474185236: [0], 1319215544180508: [0, 1], 1326659730519238: [0], 1332638989279287: [0], 1341489747051762: [0, 1], 1344666390510803: [0], 1358507616166027: [0], 1364526514789664: [0], 1374414410845462: [0], 1379178749359157: [0], 1411795827653081: [0], 1418877090127558: [0], 1422284639820829: [0], 1442391681236064: [0], 1446070912935402: [0], 1451794179742379: [0, 1], 1455176926535326: [0, 1], 1457250282380994: [0, 1], 1471239952699448: [0], 1472018286353104: [0], 1475115102361638: [0, 1], 1485858379436109: [0], 1491132968778757: [0], 1496058912626479: [0], 1514651028551503: [0], 1516231737669559: [0], 1530243797902427: [0, 1], 1538210010050896: [0], 1539188691646823: [0], 1541823434890218: [0], 1544692009598139: [0], 1556646134540043: [0], 1569299506943639: [0], 1577063564886051: [0], 1582327585753236: [0], 1585942565641532: [0], 1588845643148084: [0], 1589471802228022: [0], 1591852458231680: [0], 1595066965487245: [0], 1600603268174083: [0], 1611179808275435: [0], 1617880700081966: [0], 1629175680014346: [0], 1637207156102717: [0], 1644692101510079: [0], 1664361934814586: [0], 1667011068974873: [0], 1675760629311224: [0], 1676162180088113: [0], 1684327880293652: [0], 1687010869501551: [0], 1691473560692150: [0], 1697538954643862: [0], 1700678324584221: [0], 1714715924642011: [0], 1717791883295450: [0], 1718007860258699: [0], 1720151630947142: [0], 1734959730182571: [0], 1747972312176784: [0], 1752607074283337: [0], 1765213439204150: [0], 1772588244577862: [0], 1780614184394756: [0], 1782595500119702: [0], 1790497036442465: [0], 1803697611060216: [0], 1823356122153390: [0], 1826317190915077: [0], 1826910040381509: [0], 6424817699567: [1], 17230694741932: [1], 32421217493157: [1], 45488378809738: [1], 127742693439050: [1], 164517649390910: [1], 201217244473599: [1], 217162288193096: [1], 246762037247108: [1], 262234175368412: [1], 262791161237335: [1], 272505325039917: [1], 277324884532355: [1], 311662175632721: [1], 325215098473953: [1], 387002232314619: [1], 393699253399567: [1], 394768770772846: [1], 466206522727832: [1], 470414528207885: [1], 493893874011097: [1], 527453245826455: [1], 530485938179261: [1], 567229686862002: [1], 585288394533148: [1], 601062183481693: [1], 606163033754391: [1], 608365859974000: [1], 618559876554536: [1], 620744022654408: [1], 624273633519077: [1], 629748800966904: [1], 633336539901794: [1], 647044146011998: [1], 671953628767928: [1], 675747991790996: [1], 693762035805281: [1], 706674702168116: [1], 715624515455477: [1], 756901471752548: [1], 783726333190507: [1], 803651776423768: [1], 809919776498321: [1], 812600135047036: [1], 826810085321987: [1], 844109680777579: [1], 852703926930534: [1], 924584188428876: [1], 928438986469378: [1], 930829744244346: [1], 964792911995463: [1], 982450057864598: [1], 1016575128012012: [1], 1021878034998483: [1], 1025065552182260: [1], 1044109405188955: [1], 1065415986674745: [1], 1078919352502205: [1], 1090666954848274: [1], 1091646104337602: [1], 1093474603859665: [1], 1106791433936686: [1], 1111422879658727: [1], 1130365356387445: [1], 1130766066256384: [1], 1131216747256357: [1], 1161901545764752: [1], 1178273951045225: [1], 1204826369953091: [1], 1223711830342248: [1], 1234577273172292: [1], 1236387731573904: [1], 1261327144420299: [1], 1269832376859503: [1], 1304467121664073: [1], 1319846165007988: [1], 1322204539499908: [1], 1392533349661369: [1], 1469673662960476: [1], 1474383633719149: [1], 1476581670318628: [1], 1556621206408019: [1], 1568529305793924: [1], 1580804078113372: [1], 1604635532663041: [1], 1612129657060372: [1], 1640572833119287: [1], 1645079393158673: [1], 1690300038267201: [1], 1720974846911383: [1], 1725597171425578: [1], 1739165618551272: [1], 1745458554036611: [1], 1746094489285024: [1], 1789484202623589: [1], 1794577466518143: [1], 1809451167939609: [1], 1829011946314518: [1], 1829968336033223: [1]}), ('ident_to_name', {'TARA_MED_MAG_00029': 'TARA_MED_MAG_00029', 'TOBG_MED-875': 'TOBG_MED-875'}), ('ident_to_idx', {'TARA_MED_MAG_00029': 0, 'TOBG_MED-875': 1}), ('idx_to_lid', {0: 0, 1: 1})]) What I interpret from this is either my serialize function is wrong, or the dictionaries in python are not transferring to the HashMaps in rust. I've scoured the internet and other structures you have already built in rust but haven't been able to make anything work. Is there something I'm missing? Also passing the Any and all help is very very welcome |
note here that We are still missing a proper json schema for what fields are in each index (hopefully in #578?), but here is a list of the keys that should be present (using a LCA DB from the test data):
While the Maybe it's easier to write some tests on the Rust side that "roundtrip" an existing LCA DB. By roundtrip I mean loading a Something like this (put this at the end of #[cfg(test)]
mod test {
use std::fs::File;
use std::io::{Seek, SeekFrom};
use std::path::PathBuf;
use super::LcaDB;
#[test]
fn lca_roundtrip() {
let mut filename = PathBuf::from(env!("CARGO_MANIFEST_DIR"));
filename.push("../../tests/test-data/lca/delmont-1.lca.json");
let lcadb = LcaDB::load(filename).unwrap();
let mut tmpfile = tempfile::NamedTempFile::new().unwrap();
lcadb.save(tmpfile.path()).unwrap();
tmpfile.seek(SeekFrom::Start(0)).unwrap();
let lcadb_2 = LcaDB::load(tmpfile.path()).unwrap();
assert_eq!(lcadb.ksize, lcadb_2.ksize);
assert_eq!(lcadb.scaled, lcadb_2.scaled);
assert_eq!(lcadb.moltype, lcadb_2.moltype);
assert_eq!(lcadb.idx_to_lid, lcadb_2.idx_to_lid);
assert_eq!(lcadb.hashval_to_idx, lcadb_2.hashval_to_idx);
}
} (this is copied from the SBT tests, might need some adaptation...)
Because we need to invert type LineagePairs = BTreeMap<String, String>; and then define lid_to_lineage: HashMap<u32, LineagePairs>,
lineage_to_lid: HashMap<LineagePairs, u32>,
Yeah, that's going to be annoying... I think the best example is That said, I suggest again defining some tests on the Rust side that call
That may be a good idea! It will probably be easier to pass a JSON (as bytes) as the lineage parameter, instead of building structs and figuring out many string allocations. I hope this is helpful, and please continue asking great questions =] |
sourmash/lca/lca_db.py
Outdated
@@ -291,6 +309,7 @@ def save(self, db_name): | |||
save_d['idx_to_lid'] = self.idx_to_lid | |||
save_d['lid_to_lineage'] = self.lid_to_lineage | |||
|
|||
print("\n\nPYTHON:\n", save_d, "\n\n") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you want to compare the JSON generated by both, it might be better to use json.dumps
here, print the generated buffer, and then write the buffer to fp
. At this point save_d
is still a Python dict, so while it is comparable it is not exactly JSON yet.
src/core/src/index/lca_db.rs
Outdated
use crate::sketch::minhash::{KmerMinHash, HashFunctions, max_hash_for_scaled}; | ||
use crate::signature::{Signature, SigsTrait}; | ||
use crate::Error; | ||
use crate::ffi::lca_db::AcceptedLineagePair; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bringing things from crate::ffi
is usually a sign that whatever is defined there should probably be here instead. (Same with std::ffi
and std::os::raw::c_char
, they should rarely be used outside crate::ffi
)
src/core/src/index/lca_db.rs
Outdated
} | ||
} | ||
|
||
pub fn c_char_to_string(char_ptr: *const c_char) -> String { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be a free function (not really part of LcaDB
, as it doesn't take self
as first parameter).
(and this conversion should be happening in crate::ffi
too =])
Thank you so much that helped a ton especially the passing |
🎉
Hmm, can you push your code so I can take a look? The changes shouldn't be erased... |
Thanks for helping, this is the test I was using to get my |
Before I go into the code, do you know about |
I did not know about that but that's cool especially that it formats it for you too! I keep finding new useful stuff about rust like everyday. |
|
53a6e5d
to
da9cbca
Compare
(note that you want to merge the |
Ah, thank you :) |
So I'm trying to set up a testing environment and I'm trying to use the sourmash resources that were linked in your blog post... however I keep getting this error when I try to
I'm using the Conda environment created with the |
Ah, remove these lines at the bottom of the Snakefile. I use them to set my CPU (AMD) to maximum frequency, to avoid variations due to frequency scaling. I highly recommend replacing this line with a smaller selection of branches (maybe just (these should probably be instructions in the README of that repo 😓) |
It seems like its a lot faster with the hashval_to_idx exposed as a cached property instead of making a bunch small exposed functions to replace code that uses hashval_to_idx. Is there another option that you have used before? Or anything else that you can think of? Ps. I think the 4 failed tests that are failing are coming from the |
sourmash/lca/command_rankinfo.py
Outdated
lineage = lca_db.lid_to_lineage[lid] | ||
assignments[hashval].add(lineage) | ||
lineage = lca_db._get_lineage_from_idx(idx) | ||
assignments[hashval].add(lineage) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what happens here if there is no lineage?
@@ -792,6 +792,43 @@ impl KmerMinHash { | |||
Ok(new_mh) | |||
} | |||
|
|||
pub fn downsample_scaled(&self, new_scaled: u64) -> Result<KmerMinHash, Error> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
note: this might be something we want to spread throughout the code base after this. maybe create an issue?
…into oxidize_lca_db
hi @erikyoung85 looks like there is still some cleanup to do (removing the search.svg files, for example). Happy to review after that! |
Whoops looks like it wasn't tracking those deletions and I didn't notice. Doing it directly on GitHub worked fine :). Thanks for letting me know! |
closing in favor of #1808. |
make test
Did it pass the tests?make coverage
Is the new code covered?without a major version increment. Changing file formats also requires a
major version number increment.
changes were made?