What is the meaning of "no rank", "unclassified", and "species" in the seqID column of Centrifuger classification results? #9
Comments
A read can be mapped to multiple genomes, and by default Centrifuger reports the lowest common ancestor (LCA) of the taxonomy IDs for the read. You are right that when score and 2ndBestScore are equal, it suggests a multi-mapped read. Perhaps the taxonomy tree structure is disjoint, so the lowest common ancestor is 0 as a placeholder. If you want more fine-grained classification results, you can increase the value of the "-k" option, and Centrifuger will report up to "k" hits for a read. |
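To illustrate the LCA behavior described above, here is a minimal Python sketch. It is not Centrifuger's actual code: the `parent` map, the function names, and the toy taxonomy IDs are all hypothetical, and the placeholder value 0 simply mirrors the explanation in this comment.

```python
def lineage(taxid, parent):
    """Return the path from taxid up to its root using a child -> parent map."""
    path = [taxid]
    while taxid in parent and parent[taxid] != taxid:
        taxid = parent[taxid]
        path.append(taxid)
    return path


def lowest_common_ancestor(taxids, parent, placeholder=0):
    """Return the deepest node shared by all lineages, or the placeholder."""
    if not taxids:
        return placeholder
    paths = [lineage(t, parent) for t in taxids]
    common = set(paths[0])
    for path in paths[1:]:
        common &= set(path)
    # paths[0] is ordered leaf -> root, so the first shared node is the lowest.
    for node in paths[0]:
        if node in common:
            return node
    return placeholder  # disjoint trees: no shared ancestor


# Hypothetical toy taxonomy: 10 and 11 share parent 100; 20 sits in a separate tree.
parent = {10: 100, 11: 100, 100: 1, 1: 1, 20: 200, 200: 200}
print(lowest_common_ancestor([10, 11], parent))
print(lowest_common_ancestor([10, 20], parent))
```

In this toy example the first call finds the shared parent, while the second call has no shared ancestor and falls back to the placeholder, which is the situation suspected for the reads with taxID 0.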
I re-ran the data with
Notable lines are marked with #. Reads that previously had no rank and taxID 0 now have a seqID and taxID, e.g. A00289:600:HKWHJDSX2:3:1101:12174:1000. Reads that previously had species now have a seqID (accession no.), e.g. A00289:600:HKWHJDSX2:3:1101:5403:1016.
|
centrifuger-kreport is based on taxonomy IDs, so all reads without valid taxonomy information (taxID 0) are regarded as unclassified reads. It's a bit strange, though: LSTP01000043.1 should have the taxonomy ID 166010. Could you please check the line |
"LSTP01000043.1" is not in the seqid_to_taxid.map file. Actually, there are 169695 warnings about nonexistent taxids in the nohup.out of the centrifuger-build step. The seqID "LSTP01000043.1" belongs to "GCA_001579705.1", and in "assembly_summary_filtered.txt" the "GCA_001579705.1" line has "166010" as the refseq_category taxid and species_taxid. |
Those files were indeed not decompressed completely during the download. Does the same warning message show up if you rerun centrifuger-download? On our server, I did not get any warning messages. |
After a second try, there are no warnings in centrifuger-download. But there are some warnings during centrifuger-build.
The second is taxonomy id doesn't exist for "accession number". The reason is some errors (concatenated accessions) in seqid_to_taxid.map.
Like this
The third warning is "accession no. is filtered due to its short length". Actually, CAKKKW010020968.1 is 399 bp, which is longer than 11 bp. Judging from the comments in #3 (comment), is this because the downloaded genome has been dustmasked?
|
Thank you for providing the detailed output information. I think the concatenated rows may be due to multithreading. Each genome file in your case has many entries, so it is more likely to trigger this bug, which leaves the mapping file in the wrong format. Could you please rerun "centrifuger-download"? (Sorry about that.) I'll look into how to solve the race condition in bash tomorrow. I think Centrifuge may also suffer from this issue. Thank you for reporting these errors! |
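For anyone wanting to check a mapping file for this kind of damage, a rough Python sketch follows. It assumes that each line of seqid_to_taxid.map should hold exactly one sequence ID and one numeric taxonomy ID separated by a tab; the function name and the sample accessions are hypothetical.

```python
def find_malformed_lines(lines):
    """Return (line_number, text) for lines that are not 'seqID<TAB>taxID'."""
    bad = []
    for number, raw in enumerate(lines, start=1):
        text = raw.rstrip("\n")
        fields = text.split("\t")
        # A well-formed line has two fields, and the second is a plain integer.
        if len(fields) != 2 or not fields[0] or not fields[1].isdigit():
            bad.append((number, text))
    return bad


# Hypothetical sample: the second line mimics two rows written on top of each other.
sample = [
    "ACC_A.1\t111\n",
    "ACC_B.1\t222ACC_C.1\t333\n",
    "ACC_D.1\t444\n",
]
for number, text in find_malformed_lines(sample):
    print(f"line {number}: {text!r}")
```

To scan a real file, an open file handle can be passed instead of the sample list, e.g. `find_malformed_lines(open("seqid_to_taxid.map"))`. This check only catches rows whose field count or taxID is broken; two accessions fused without an intervening taxID would still pass.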
For the short sequences that got filtered, this is because their nucleotides are in lower case, which is usually due to a masker. Centrifuger directly ignores lower-case nucleotides. |
Sorry, I forgot to add: please rerun "centrifuger-download" with the option "-P 1". |
Thanks. I added "-P 1" to centrifuger-download and centrifuger-build. It is also the default behavior. This time, all the results are normal, except for a longer running time and one kind of warning ("is filtered due to its short length (could be from masker)!") in the centrifuger-build logs. |
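The effect of soft-masking on the length filter can be sketched roughly as below. This is an illustration only, not Centrifuger's implementation: it assumes that only upper-case bases count toward a sequence's usable length, and `min_length` is a hypothetical parameter because the real threshold is not stated in this thread.

```python
def effective_length(sequence):
    """Count upper-case bases; lower-case (soft-masked) bases are skipped."""
    return sum(1 for base in sequence if base.isupper())


def is_filtered(sequence, min_length):
    """Return True if too few unmasked bases remain (min_length is hypothetical)."""
    return effective_length(sequence) < min_length


# A 399 bp sequence that is almost entirely soft-masked.
masked = "acgt" * 99 + "ACG"
print(len(masked), effective_length(masked))
print(is_filtered(masked, min_length=12))  # 12 is an arbitrary illustrative cutoff
```

Under these assumptions, a sequence can be hundreds of bases long in the FASTA file and still be dropped, because almost none of its bases are upper case.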
The multithreading for centrifuger-build should be fine. I think I've found a solution to the multithreading/multiprocess issue for centrifuger-download. I have pushed the updated code to the centrifuger_download branch. If you have time, please check out this version and give it a try. I'll do some more tests and update Centrifuger and Centrifuge later. Thank you! |
@mourisl I'm still confused about the taxa that appear in the seqID column. First, have a look at the results (-k 4). There is no "no rank" in seqID now, but some taxa still appear. As noted previously, a read can be mapped to multiple genomes, and by default Centrifuger reports the lowest common ancestor of the taxonomy IDs for the read. However, species is the lowest rank, so why are there many more species than sequence accessions in seqID?
Is "A00289:600:HKWHJDSX2:3:1101:5855:1047" mapped to multigenome? I used the
Some examples of multi-mapped reads: "A00289:600:HKWHJDSX2:3:1101:1226:1078", "A00289:600:HKWHJDSX2:3:1101:25735:1094".
taxID 6238 is a species taxon.
|
I only downloaded one taxon (29833) of fungi with centrifuger-genome-download. However, the read "A00289:600:HKWHJDSX2:3:1621:11089:36385" gets the order taxID (4892). Why not taxID 29833?
|
Sorry for the confusion. The "-k 4" means Centrifuger will report up to 4 results for a read. If a read maps to more than 4 genomes, e.g. 5, Centrifuger will try to merge the results to some ancestor level to reduce the number of reported items to no more than 4. More specifically, Centrifuger tries the species level, then the genus level, and so on until the number of entries is no more than -k. If all 5 genomes are strains of the same species, the results will still be merged to the species level. If 2 of the genomes are under one species and 3 of them are from another species, I think the reported result will be the two species. Internally, infraorder and order are on the same rank. Can you share the sequence of A00289:600:HKWHJDSX2:3:1621:11089:36385? I'll look into why it was not merged into the species level. Thank you! |
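The merging described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not Centrifuger's code: the rank list, the `parent` and `rank_of` maps, the function names, and the toy taxonomy IDs are all hypothetical, and the fallback placeholder stands in for whatever is reported when no rank brings the count under "-k".

```python
RANKS = ["species", "genus", "family", "order", "class", "phylum", "kingdom"]


def ancestor_at_rank(taxid, rank, parent, rank_of):
    """Walk up from taxid; return the first node with the given rank, else None."""
    node, seen = taxid, set()
    while node is not None and node not in seen:
        if rank_of.get(node) == rank:
            return node
        seen.add(node)
        node = parent.get(node)
    return None


def reduce_hits(taxids, k, parent, rank_of, placeholder=0):
    """Promote hits rank by rank until at most k distinct entries remain."""
    current = list(dict.fromkeys(taxids))  # de-duplicate, keep order
    if len(current) <= k:
        return current
    for rank in RANKS:
        promoted = [ancestor_at_rank(t, rank, parent, rank_of) or t for t in taxids]
        current = list(dict.fromkeys(promoted))
        if len(current) <= k:
            return current
    return [placeholder]


# Hypothetical toy tree: two strains under species 101, three under species 102,
# and both species under genus 10.
parent = {1001: 101, 1002: 101, 1003: 102, 1004: 102, 1005: 102, 101: 10, 102: 10}
rank_of = {t: "strain" for t in (1001, 1002, 1003, 1004, 1005)}
rank_of.update({101: "species", 102: "species", 10: "genus"})

hits = [1001, 1002, 1003, 1004, 1005]
print(reduce_hits(hits, 4, parent, rank_of))
print(reduce_hits(hits, 1, parent, rank_of))
```

With five strain-level hits and a limit of 4, the sketch stops at the species level and reports the two species; with a limit of 1 it has to climb one level further to the shared genus.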
Thanks for detailed explanation. One more question about @A00289:600:HKWHJDSX2:3:1621:11089:36385 1:N:0:AGAGTGTACG+TCTCACTTGC @A00289:600:HKWHJDSX2:3:1621:11089:36385 2:N:0:AGAGTGTACG+TCTCACTTGC
|
Thank you for sharing the file. I think this index is still for fungi and invertebrate together. For the -k 1 issue, I looked into the taxonomy structure: their lowest common ancestor is 2759 at the superkingdom level (the clade level for taxonomy ID 33154 is more or less ignored because clade is not a standard taxonomy rank). However, they have ancestors 33208 and 4751 at the kingdom level, which is internally regarded as the same level as the superkingdom level. Therefore, Centrifuger will try to move to the level above the kingdom/superkingdom level, which leads to the no_rank level. Hope this helps. |
I have updated Centrifuger's behavior in this case in the new release (https://github.com/mourisl/centrifuger/releases/tag/v1.0.4). If the LCA promotes to a no_rank level, the taxonomy ID will be 1 instead of 0 to distinguish from unclassified reads. Thank you for reporting this issue! |
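To see how many reads fall into each case after this change, something like the following Python sketch could be used. It assumes a tab-separated classification output with a header row containing a `taxID` column, as in the columns discussed in this thread; the `readID` column name, the sample rows, and the category labels are hypothetical.

```python
import csv
import io
from collections import Counter


def tally_assignments(tsv_text):
    """Count reads by taxID category: 0, 1, or any other taxonomy ID."""
    counts = Counter()
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        taxid = int(row["taxID"])
        if taxid == 0:
            counts["unclassified"] += 1      # no taxonomy information
        elif taxid == 1:
            counts["lca_at_no_rank"] += 1    # LCA promoted to a no_rank level
        else:
            counts["classified"] += 1
    return counts


# Hypothetical rows using the column names mentioned in this issue.
sample = (
    "readID\tseqID\ttaxID\tscore\t2ndBestScore\thitLength\tqueryLength\n"
    "read_1\tunclassified\t0\t0\t0\t0\t300\n"
    "read_2\tno rank\t1\t900\t900\t60\t300\n"
    "read_3\tspecies\t6238\t1600\t0\t80\t300\n"
)
print(dict(tally_assignments(sample)))
```

The split between taxID 0 and taxID 1 follows the description of v1.0.4 above; how downstream tools treat taxID 1 is not covered by this sketch.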
Hello, some reads, e.g. A00289:600:HKWHJDSX2:3:1101:12174:1000, have score, 2ndBestScore, hitLength, and queryLength values, but the seqID is no rank and the taxID is 0. Why? In my opinion, this read should have been assigned to a genome, but I can't find the seqID of the genome with
centrifuger-inspect -x /data/Centrifuger_db/genbank_PRN --summary | grep "no rank\|species"
What is the meaning of species in seqID? I also can't find any genome that contains the word species.
The unclassified value in seqID may mean that the read was not assigned to any genome.
Besides, the reads with no rank in seqID have a taxID of 0. Does this mean these reads will be counted as unclassified in centrifuger-kreport? I noticed that the reads with no rank have the same value for score and 2ndBestScore. Does this mean the reads have multiple hits?