GTDB.r220 taxID vs scientific name #20

shannonmargaret · 2024-11-15T17:57:46Z

I am trying to figure out how to merge the read counts in the outputs from centrifuger-kreport. I’ve separated the report by taxonomic level and now I’m trying to merge the counts across samples. I believe these are the column IDs for the kreport format.

root_relAbund
root_fragCount
direct_fragCount
rank_code
taxID
scientific_name

I thought I could use the scientific_name, but I’m realizing that these are not unique: there are multiple taxIDs per scientific name. I am trying to decide if it would be reasonable to take the sum of the root_fragCount values for all rows with the same scientific name. For example, here are the first 10 matches of Bacillota_I using the gtdb.r220 index.

The first taxID recruits most of the fragments but there are many fragments mapping to the other taxIDs with the same scientific name.

0.34 64338 0 P 10316302 Bacillota_I
0.01 1877 0 P 10315291 Bacillota_I
0.01 1609 0 P 10316628 Bacillota_I
0.00 870 0 P 10055357 Bacillota_I
0.00 857 0 P 10077854 Bacillota_I
0.00 853 0 P 10085662 Bacillota_I
0.00 587 0 P 10315035 Bacillota_I
0.00 273 0 P 10257521 Bacillota_I
0.00 239 0 P 10221036 Bacillota_I
0.00 239 0 P 10223213 Bacillota_I

Thank you for your help!

mourisl · 2024-11-15T18:47:28Z

Thank you for finding this issue! In GTDB, each scientific name should be unique and appear only once at each taxonomy rank in the taxonomy tree. I have found the bug and fixed the code, and am rebuilding the index now. Meanwhile, taking the sum of the counts by scientific name is a good workaround. I think this issue will also affect Centrifuger's accuracy for multi-classified reads.
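As a rough illustration of that workaround, here is a minimal Python sketch that sums root_fragCount over rows sharing the same rank_code and scientific_name, then lines the totals up across samples. It assumes the six columns are in the order listed in the question and are tab-separated (falling back to whitespace splitting otherwise). The function names `sum_by_name` and `merge_samples` are hypothetical helpers, not part of Centrifuger.

```python
from collections import defaultdict


def sum_by_name(lines):
    """Sum root_fragCount over rows sharing (rank_code, scientific_name).

    Assumed column order (from the question above, not verified against
    the tool): root_relAbund, root_fragCount, direct_fragCount,
    rank_code, taxID, scientific_name.
    """
    totals = defaultdict(int)
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue
        fields = line.split("\t")
        if len(fields) < 6:
            # Fall back to whitespace splitting; the name is everything
            # from the sixth field onward, so names with spaces survive.
            fields = line.split(None, 5)
        if len(fields) < 6:
            continue  # malformed row
        try:
            count = int(fields[1])
        except ValueError:
            continue  # header row or non-numeric count
        rank_code = fields[3].strip()
        name = fields[5].strip()
        totals[(rank_code, name)] += count
    return dict(totals)


def merge_samples(per_sample):
    """Align per-sample totals into one table.

    per_sample maps a sample label to the dict returned by sum_by_name.
    Returns (sorted sample labels, {(rank_code, name): [count per sample]}),
    with 0 for taxa absent from a sample.
    """
    samples = sorted(per_sample)
    keys = sorted({key for totals in per_sample.values() for key in totals})
    table = {key: [per_sample[s].get(key, 0) for s in samples] for key in keys}
    return samples, table


if __name__ == "__main__":
    # First three rows from the example in the question.
    rows = [
        "0.34\t64338\t0\tP\t10316302\tBacillota_I",
        "0.01\t1877\t0\tP\t10315291\tBacillota_I",
        "0.01\t1609\t0\tP\t10316628\tBacillota_I",
    ]
    print(sum_by_name(rows))  # {('P', 'Bacillota_I'): 67824}
```

Keying on the pair (rank_code, scientific_name) rather than the name alone keeps the sum within a single rank, in case the same name appears at more than one rank.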

shannonmargaret · 2024-11-20T17:04:50Z

This is all fixed with the rebuilt GTDB database! Thank you Li for your speedy response and the quick fix!!

mourisl · 2024-11-20T19:30:10Z

Thank you for testing! I have also updated the index on Dropbox. Only the ".2.cfr" (taxonomy tree) and ".4.cfr" (index building information) files need to be updated.

mourisl closed this as completed Nov 20, 2024

mourisl mentioned this issue Nov 20, 2024

Fix an issue of creating redundant intermediate nodes in the taxonomy tree #21

Merged
