Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GTDB.r220 taxID vs scientific name #20

Closed
shannonmargaret opened this issue Nov 15, 2024 · 3 comments
Closed

GTDB.r220 taxID vs scientific name #20

shannonmargaret opened this issue Nov 15, 2024 · 3 comments

Comments

@shannonmargaret
Copy link

I am trying to figure out how to merge the read counts in the outputs from centrifuger-kreport. I’ve separated the report by taxonomic level and now I’m trying to merge the counts across samples. I believe these are the column IDs for the kreport format.

  1. root_relAbund
  2. root_fragCount
  3. direct_fragCount
  4. rank_code
  5. taxID
  6. scientific_name

I thought I could use the scientific_name, but I’m realizing that these are not unique and there are multiple taxIDs per scientific name. I am trying to decide if it would be reasonable take the sum of root_fragCounts for all rows with the same scientific name. For example, here are the first 10 matches of Bacillota_I using the gtdb.r220 index.

The first taxID recruits most of the fragments but there are many fragments mapping to the other taxIDs with the same scientific name.

0.34 64338 0 P 10316302 Bacillota_I
0.01 1877 0 P 10315291 Bacillota_I
0.01 1609 0 P 10316628 Bacillota_I
0.00 870 0 P 10055357 Bacillota_I
0.00 857 0 P 10077854 Bacillota_I
0.00 853 0 P 10085662 Bacillota_I
0.00 587 0 P 10315035 Bacillota_I
0.00 273 0 P 10257521 Bacillota_I
0.00 239 0 P 10221036 Bacillota_I
0.00 239 0 P 10223213 Bacillota_I

Thank you for your help!

@mourisl
Copy link
Owner

mourisl commented Nov 15, 2024

Thank you for finding this issue! In GTDB, the scientific name should be unique and show up only once at each taxonomy rank in the taxonomy tree. I have found the bug and fixed the code, and am rebuilding the index now. Meanwhile, taking sum based on the scientific name is a good solution. I think this issue will also affect Centrifuger's accuracy for multi-classified reads.

@shannonmargaret
Copy link
Author

This is all fixed with the rebuilt GTDB database! Thank you Li for your speedy response and the quick fix!!

@mourisl
Copy link
Owner

mourisl commented Nov 20, 2024

Thank you for the testing! I have also updated the index on the dropbox. Only the ".2.cfr" (for taxonomy tree) and ".4.cfr" (index building information) files need to be updated.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants