You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
I am trying to figure out how to merge the read counts in the outputs from centrifuger-kreport. I’ve separated the report by taxonomic level and now I’m trying to merge the counts across samples. I believe these are the column IDs for the kreport format.
root_relAbund
root_fragCount
direct_fragCount
rank_code
taxID
scientific_name
I thought I could use the scientific_name, but I’m realizing that these are not unique and there are multiple taxIDs per scientific name. I am trying to decide if it would be reasonable take the sum of root_fragCounts for all rows with the same scientific name. For example, here are the first 10 matches of Bacillota_I using the gtdb.r220 index.
The first taxID recruits most of the fragments but there are many fragments mapping to the other taxIDs with the same scientific name.
0.34 64338 0 P 10316302 Bacillota_I
0.01 1877 0 P 10315291 Bacillota_I
0.01 1609 0 P 10316628 Bacillota_I
0.00 870 0 P 10055357 Bacillota_I
0.00 857 0 P 10077854 Bacillota_I
0.00 853 0 P 10085662 Bacillota_I
0.00 587 0 P 10315035 Bacillota_I
0.00 273 0 P 10257521 Bacillota_I
0.00 239 0 P 10221036 Bacillota_I
0.00 239 0 P 10223213 Bacillota_I
Thank you for your help!
The text was updated successfully, but these errors were encountered:
Thank you for finding this issue! In GTDB, the scientific name should be unique and show up only once at each taxonomy rank in the taxonomy tree. I have found the bug and fixed the code, and am rebuilding the index now. Meanwhile, taking sum based on the scientific name is a good solution. I think this issue will also affect Centrifuger's accuracy for multi-classified reads.
Thank you for the testing! I have also updated the index on the dropbox. Only the ".2.cfr" (for taxonomy tree) and ".4.cfr" (index building information) files need to be updated.
I am trying to figure out how to merge the read counts in the outputs from centrifuger-kreport. I’ve separated the report by taxonomic level and now I’m trying to merge the counts across samples. I believe these are the column IDs for the kreport format.
I thought I could use the scientific_name, but I’m realizing that these are not unique and there are multiple taxIDs per scientific name. I am trying to decide if it would be reasonable take the sum of root_fragCounts for all rows with the same scientific name. For example, here are the first 10 matches of Bacillota_I using the gtdb.r220 index.
The first taxID recruits most of the fragments but there are many fragments mapping to the other taxIDs with the same scientific name.
0.34 64338 0 P 10316302 Bacillota_I
0.01 1877 0 P 10315291 Bacillota_I
0.01 1609 0 P 10316628 Bacillota_I
0.00 870 0 P 10055357 Bacillota_I
0.00 857 0 P 10077854 Bacillota_I
0.00 853 0 P 10085662 Bacillota_I
0.00 587 0 P 10315035 Bacillota_I
0.00 273 0 P 10257521 Bacillota_I
0.00 239 0 P 10221036 Bacillota_I
0.00 239 0 P 10223213 Bacillota_I
Thank you for your help!
The text was updated successfully, but these errors were encountered: