Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

what the meaning of "no rank", "unclassified", "species" in seqID column of centrifuger classification results? #9

Closed
permia opened this issue May 15, 2024 · 17 comments

Comments

@permia
Copy link

permia commented May 15, 2024

Hello, some reads, e.g. A00289:600:HKWHJDSX2:3:1101:12174:1000, have score, 2ndBestScore, hitLength, queryLength, but the seqID is no rank and the taxID is 0. Why? In my opinion, this read would have been assigned to a genome, but I don't find the seqID of the genome by centrifuger-inspect -x /data/Centrifuger_db/genbank_PRN --summary | grep "no rank\|species".

What's the means of species in seqID? And I also don't find any genome that contain the word species.

readID	seqID	taxID	score	2ndBestScore	hitLength	queryLength	numMatches
A00289:600:HKWHJDSX2:3:1101:2031:1000	JAKKPZ010000723.1	166010	3490	0	112	300	1
A00289:600:HKWHJDSX2:3:1101:3604:1000	JAKKPZ010000038.1	166010	21194	0	275	300	1
A00289:600:HKWHJDSX2:3:1101:5864:1000	JAKKPZ010000723.1	166010	8489	0	153	300	1
A00289:600:HKWHJDSX2:3:1101:8395:1000	JAKKPZ010000723.1	166010	16481	0	242	300	1
A00289:600:HKWHJDSX2:3:1101:10221:1000	JAKKPZ010000001.1	166010	25556	0	256	300	1
A00289:600:HKWHJDSX2:3:1101:12174:1000	no rank	0	25205	25205	298	300	1
A00289:600:HKWHJDSX2:3:1101:16405:1000	JAKKPZ010000011.1	166010	34330	0	292	300	1
A00289:600:HKWHJDSX2:3:1101:4083:1016	no rank	0	64	64	23	300	1
A00289:600:HKWHJDSX2:3:1101:4607:1016	unclassified	0	0	0	0	300	1
A00289:600:HKWHJDSX2:3:1101:5403:1016	species	166010	4849	4849	247	300	1
A00289:600:HKWHJDSX2:3:1101:9579:1016	unclassified	0	0	0	0	300	1

The unclassified value in seqID may mean this read doesn't assiged any genomes.

Besides, the reads with no rank in seqID have a 0 taxID. Does this means this reads would be classifed as unclassified in the centrifuger-kreport. I noticed that the reads with no rank have same vale of the score and 2ndBestScore. Is this means the reads have multiple hits?

@mourisl
Copy link
Owner

mourisl commented May 15, 2024

A read can be mapped to multiple genomes, and Centrifuger by default will report the lowest common ancestor of the taxonomy IDs for the read. You are right that when the score and 2ndBestScore equals, it suggests a multiple-mapped reads. Perhaps the taxonomy tree structure is disjoint, and there lowest common ancestor is 0 as a place holder.

If you want more fine-grained classification results, you can increase the value for the "-k" option, and it will report up to "k" hits for a read.

@permia
Copy link
Author

permia commented May 16, 2024

I re-ran the data with -k 4.

nohup centrifuger \
-t 10 \
-k 4 \
-1 /data5/nematode/rawdata/J_nematode_FRRL220009728-1a_1.clean.fq.gz \
-2 /data5/nematode/rawdata/J_nematode_FRRL220009728-1a_2.clean.fq.gz \
-x /data/Centrifuger_db/genbank_PRN \
2> /data5/nematode/rawdata/genbank_PRN_output_centrifuger.log \
> /data5/nematode/rawdata/genbank_PRN_output.tsv &

Notable lines are marked with #.

Read with no rank and 0 (before) now have the seqID and taxID, e.g. A00289:600:HKWHJDSX2:3:1101:12174:1000
, but some hits also have 0 as the taxID. I'm not sure centrifuger-kreport will handle it properly, that is classifing the reads to taxid 166010 than taxid 0 (I have not done this step yet).

Read with species (before) now have the seqID (accession no.). e.g A00289:600:HKWHJDSX2:3:1101:5403:1016

readID	seqID	taxID	score	2ndBestScore	hitLength	queryLength	numMatches
A00289:600:HKWHJDSX2:3:1101:2031:1000	JAKKPZ010000723.1	166010	3490	0	112	300	1
A00289:600:HKWHJDSX2:3:1101:3604:1000	JAKKPZ010000038.1	166010	21194	0	275	300	1
A00289:600:HKWHJDSX2:3:1101:5864:1000	JAKKPZ010000723.1	166010	8489	0	153	300	1
A00289:600:HKWHJDSX2:3:1101:8395:1000	JAKKPZ010000723.1	166010	16481	0	242	300	1
A00289:600:HKWHJDSX2:3:1101:10221:1000	JAKKPZ010000001.1	166010	25556	0	256	300	1
#A00289:600:HKWHJDSX2:3:1101:12174:1000	JAKKPZ010000003.1	166010	25205	25205	298	300	2
#A00289:600:HKWHJDSX2:3:1101:12174:1000	LSTP01000043.1	0	25205	25205	298	300	2
A00289:600:HKWHJDSX2:3:1101:16405:1000	JAKKPZ010000011.1	166010	34330	0	292	300	1
#A00289:600:HKWHJDSX2:3:1101:4083:1016	OU923107.1	54126	64	64	23	300	3
#A00289:600:HKWHJDSX2:3:1101:4083:1016	OU921494.1	54126	64	64	23	300	3
#A00289:600:HKWHJDSX2:3:1101:4083:1016	OU920625.1	0	64	64	23	300	3
A00289:600:HKWHJDSX2:3:1101:4607:1016	unclassified	0	0	0	0	300	1
#A00289:600:HKWHJDSX2:3:1101:5403:1016	LSTP01000017.1	166010	4849	4849	247	300	2
#A00289:600:HKWHJDSX2:3:1101:5403:1016	JAKKPZ010000011.1	166010	4849	4849	247	300	2
A00289:600:HKWHJDSX2:3:1101:9579:1016	unclassified	0	0	0	0	300	1

@mourisl
Copy link
Owner

mourisl commented May 16, 2024

centrifuger-kreport is based on taxonomy IDs, so all the reads without an effective taxonomy information (taxID 0) will be regarded as unclassified reads.

It's a bit strange though, LSTP01000043.1 should have the taxonomy ID 166010. Could you please check the line
for "LSTP01000043.1" in the seqid_to_taxid.map file?

@permia
Copy link
Author

permia commented May 16, 2024

centrifuger-kreport is based on taxonomy IDs, so all the reads without an effective taxonomy information (taxID 0) will be regarded as unclassified reads.

It's a bit strange though, LSTP01000043.1 should have the taxonomy ID 166010. Could you please check the line for "LSTP01000043.1" in the seqid_to_taxid.map file?

"LSTP01000043.1" is not in seqid_to_taxid.map file. Actually, there are 169695 warnings for unexisted taxid in nohup.out of the centrifuger-build step.
e.g. WARNING: taxonomy id doesn't exist for "LSTP01000043.1"!

The seqID "LSTP01000043.1" belongs to "GCA_001579705.1". And in the "assembly_summary_filtered.txt", the "GCA_001579705.1" line has "166010" as the refseq_category taxid and species_taxid.

@mourisl
Copy link
Owner

mourisl commented May 16, 2024

Those files are indeed not decompressed completely when downloading the files. Do the same warning message show up if you rerun centrifuger-download?

On our server, I did not get any warning message.

@permia
Copy link
Author

permia commented May 17, 2024

Those files are indeed not decompressed completely when downloading the files. Do the same warning message show up if you rerun centrifuger-download?

On our server, I did not get any warning message.

After a second try, there are no warning in centrifuger-download. But there some warning during centrifuger-build.
The first is that taxID is not in the taxonomy tree. I still don't known any reason about it.

Warning: 5 is not in the taxonomy tree
Warning: 2994 is not in the taxonomy tree
Warning: 4126 is not in the taxonomy tree
Warning: 5487 is not in the taxonomy tree
Warning: 626567 is not in the taxonomy tree
Warning: 62654126 is not in the taxonomy tree
Warning: 6010017261 is not in the taxonomy tree
Warning: 6201024933 is not in the taxonomy tree
Warning: 626501000357 is not in the taxonomy tree

The second is taxonomy id doesn't exist for "accession number". The reason is some errors (concatened accesion) in seqid_to_taxid.map.

WARNING: taxonomy id doesn't exist for GL379786.1!
WARNING: taxonomy id doesn't exist for GL379977.1!
WARNING: taxonomy id doesn't exist for GL380367.1!
WARNING: taxonomy id doesn't exist for GL380757.1!
WARNING: taxonomy id doesn't exist for GL381342.1!
WARNING: taxonomy id doesn't exist for GL381732.1!
ommited lines...
WARNING: taxonomy id doesn't exist for CAKKKW010020968.1!
WARNING: taxonomy id doesn't exist for CAKKKW010026733.1!
WARNING: taxonomy id doesn't exist for CAKKKW010028705.1!
WARNING: taxonomy id doesn't exist for CAKKKW010029919.1!
WARNING: taxonomy id doesn't exist for CAKKKW010030677.1!
ommited lines...

Like this

grep "GL379786.1" /data/Centrifuger_db/library/seqid2taxid.map
>CAKKKW01002096GL379786.1    135651

grep "CAKKKW01002096" /data/Centrifuger_db/library/seqid2taxid.map
>CAKKKW010020960.1    54126
>CAKKKW010020961.1    54126
>CAKKKW010020962.1    54126
>CAKKKW010020963.1    54126
>CAKKKW010020964.1    54126
>CAKKKW010020965.1    54126
>CAKKKW010020966.1    54126
>CAKKKW010020967.1    54126
>CAKKKW01002096GL379786.1    135651
>CAKKKW010020969.1    54126

The third waring is "accession no. is filtered due to its short length". Actually, the CAKKKW010020968.1 is 399 bp longer than 11 bp. Seen from the comments #3 (comment), is this because the downloaded genome has been dustmasked?

grep "CAKKKW010020968" /data/Centrifuger_db/nohup.out
>WARNING: taxonomy id doesn't exist for CAKKKW010020968.1!
>WARNING: CAKKKW010020968.1 is filtered due to its short length (could be from masker)!

@mourisl
Copy link
Owner

mourisl commented May 17, 2024

Thank you for providing the detailed output information. I think the concatenated rows may be due to the multiple threading. Each genome file in your case has many entries, so it is more likely to trigger this bug and the format for the mapping file is wrong. Could you please rerun the "centrifuger-download"? (Sorry about that). I'll look into how to solve the racing issue in bash tomorrow. I think Centrifuge may also suffer from this issue. Thank you for reporting these errors!

@mourisl
Copy link
Owner

mourisl commented May 17, 2024

For the short sequences that got filtered, this is because their nucleotide is in the lower case, which is usually due to some masker. Centrifuger directly ignores those lower-case nucleotides.

@mourisl
Copy link
Owner

mourisl commented May 17, 2024

Sorry, I forgot to add rerun the "centrifuger-download" with the option "-P 1"....

@permia
Copy link
Author

permia commented May 17, 2024

Sorry, I forgot to add rerun the "centrifuger-download" with the option "-P 1"....

Thanks. I added "-P 1" in centrifuger-download and centrifuger-build. It's also default behavior. This time, all the running results are normal, except longer runing time and one kind of warning ("is filtered due to its short length (could be from masker)!") in logs of centrifuger-build.

@mourisl
Copy link
Owner

mourisl commented May 17, 2024

The multithreading for centrifuger-build should be fine. I think I've found a solution to the multithreading/multiprocess issue for centrifuger-download. I have updated the code to the centrifuger_download branch. If you are free, please checkout this version and give it a try. I'll do some more tests and update Centrifuger and Centrifuge later. Thank you!

@permia
Copy link
Author

permia commented May 18, 2024

@mourisl I'm still confused about the taxa that appear in the seqID.

Firstly, have a look at the results (-k 4). There are no "no rank" in seqID now but also with some taxa. As noted previously, a read can be mapped to multiple genomes, and Centrifuger by default will report the lowest common ancestor of the taxonomy IDs for the read. However, the species is the smallest taxID, why there are a lot of species in seqID than the seq accession?

# filter the second row without a seqID
awk '$2 ~ /[^0-9]$/' /data5/nematode/rawdata/unmapped_bwa/genbank_PRN_output.tsv \
| awk '{ count[$2]++ } END { for (item in count) print item, count[item] }'

# seqID 1
# genus 7307
# species 1875373
# order 2
# class 6
# unclassified 8981972
# infraorder 316

Is "A00289:600:HKWHJDSX2:3:1101:5855:1047" mapped to multigenome? I used the -k 4, which will report upto 4 distinct, primary assignments for each read pair. However, there is only one records.
But as mentioned previously, when the score and 2ndBestScore equals, it suggests a multiple-mapped reads. The score and 2ndBestScore of "A00289:600:HKWHJDSX2:3:1101:5855:1047" equals.

grep "A00289:600:HKWHJDSX2:3:1101:5855:1047" /data5/nematode/rawdata/unmapped_bwa/genbank_PRN_output.tsv
# A00289:600:HKWHJDSX2:3:1101:5855:1047   species 6238    9245    9245    243     300     1

some example of multiple-mapped reads, e.g. "A00289:600:HKWHJDSX2:3:1101:1226:1078", "A00289:600:HKWHJDSX2:3:1101:25735:1094"

awk '$3 ~ /6238/' /data5/nematode/rawdata/unmapped_bwa/genbank_PRN_output.tsv | head -n 10
# A00289:600:HKWHJDSX2:3:1101:5855:1047   species 6238    9245    9245    243     300     1
# A00289:600:HKWHJDSX2:3:1101:1226:1078   JAYKUC010000053.1       6238    14088   14088   284     300     3
# A00289:600:HKWHJDSX2:3:1101:1226:1078   CP092624.1      6238    14088   14088   284     300     3
# A00289:600:HKWHJDSX2:3:1101:1226:1078   JAYKUB010000025.1       6238    14088   14088   284     300     3
# A00289:600:HKWHJDSX2:3:1101:18475:1078  species 6238    21480   21480   281     300     1
# A00289:600:HKWHJDSX2:3:1101:25735:1094  JAYKUA010000030.1       6238    3385    3385    107     300     2
# A00289:600:HKWHJDSX2:3:1101:25735:1094  JAYKUB010000034.1       6238    3385    3385    107     300     2
# A00289:600:HKWHJDSX2:3:1101:26223:1094  JAYKUA010000030.1       6238    3385    3385    107     300     2
# A00289:600:HKWHJDSX2:3:1101:26223:1094  JAYKUB010000034.1       6238    3385    3385    107     300     2
# A00289:600:HKWHJDSX2:3:1101:27407:1110  JAYKUA010000036.1       6238    22423   22423   301     300     2

taxID 6238 is a species taxon.

grep -w "6238" /data/Centrifuger_db/taxonomy/names.dmp
# 6238    |       Caenorhabditis briggsae AF16    |               |       includes        |
# 6238    |       Caenorhabditis briggsae Dougherty & Nigon, 1949 |               |       authority       |
# 6238    |       Caenorhabditis briggsae |               |       scientific name |

@permia
Copy link
Author

permia commented May 18, 2024

I only download one taxa (29833) of fungi in centrifuger-genome-download. However, the read "A00289:600:HKWHJDSX2:3:1621:11089:36385" gets the order taxID (4892). Why not taxID (29833)?

awk '$2 ~ /\<order\>/' /data5/nematode/rawdata/unmapped_bwa/genbank_PRN_output.tsv
# A00289:600:HKWHJDSX2:3:1621:11089:36385 order   4892    121     121     26      300     4
# A00289:600:HKWHJDSX2:3:2205:27887:14904 order   4892    81      81      24      300     4

grep "A00289:600:HKWHJDSX2:3:1621:11089:36385" /data5/nematode/rawdata/unmapped_bwa/genbank_PRN_output.tsv
# A00289:600:HKWHJDSX2:3:1621:11089:36385 order   4892    121     121     26      300     4
# A00289:600:HKWHJDSX2:3:1621:11089:36385 infraorder      6249    121     121     26      300     4
# A00289:600:HKWHJDSX2:3:1621:11089:36385 infraorder      33283   121     121     26      300     4
# A00289:600:HKWHJDSX2:3:1621:11089:36385 infraorder      2301119 121     121     26      300     4

grep -w "4892" /data/Centrifuger_db/taxonomy/names.dmp
# 4892    |       budding yeasts  |               |       blast name      |
# 4892    |       Endomycetales   |               |       synonym |
# 4892    |       Saccharomycetales       |               |       scientific name |
grep -w "6249" /data/Centrifuger_db/taxonomy/names.dmp
# 6249    |       Ascaridida      |               |       synonym |
# 6249    |       Ascaridomorpha  |               |       scientific name |

grep -w "33283" /data/Centrifuger_db/taxonomy/names.dmp
# 33283   |       Tylenchida      |               |       synonym |
# 33283   |       Tylenchomorpha  |               |       scientific name |

grep -w "2301119" /data/Centrifuger_db/taxonomy/names.dmp
# 2301119 |       Rhabditomorpha  |               |       scientific name |

@mourisl
Copy link
Owner

mourisl commented May 18, 2024

Sorry for the confusion. The "-k 4" means Centrifuger will report up to 4 results for a read. If a read mapped to more than 4 genomes, i.e. 5, Centrifuger will try to merge the results to some ancestor level to reduce the reported items to be no more than 4. More specifically, Centrifuger tries species level, genus level, and so on so forth until the number of entries is no more than -k. If all 5 genomes are strains from the same species, the results will be still be merged to the species level. If 2 of the genomes are under the same species, 3 of them are from another species, I think the reported result will be the two species. Internally, the infraorder and order is on the same rank.

Can you share the sequence of A00289:600:HKWHJDSX2:3:1621:11089:36385? I'll look into the issue why it was not merged into the species level. Thank you!

@permia
Copy link
Author

permia commented May 19, 2024

Sorry for the confusion. The "-k 4" means Centrifuger will report up to 4 results for a read. If a read mapped to more than 4 genomes, i.e. 5, Centrifuger will try to merge the results to some ancestor level to reduce the reported items to be no more than 4. More specifically, Centrifuger tries species level, genus level, and so on so forth until the number of entries is no more than -k. If all 5 genomes are strains from the same species, the results will be still be merged to the species level. If 2 of the genomes are under the same species, 3 of them are from another species, I think the reported result will be the two species. Internally, the infraorder and order is on the same rank.

Can you share the sequence of A00289:600:HKWHJDSX2:3:1621:11089:36385? I'll look into the issue why it was not merged into the species level. Thank you!

Thanks for detailed explanation. One more question about -k 2 to -k 1, why merge the "class" taxon to "no rank"? ( Is it that the taxonomy tree structure is disjoint, and there lowest common ancestor is 0 as a place holder.)

@A00289:600:HKWHJDSX2:3:1621:11089:36385 1:N:0:AGAGTGTACG+TCTCACTTGC
TGGGCGACTGTGGGAGGGTCAGCATCTTCCCAATCCACGGCGATGGAGAGAAGGGGGGGGGGGGGGCGAGCAATTGAAGGGCAGATGGGCACAAACCAACATCCCATAAATACAATATTACACCCCAAACCAATACAACTATACTCCTAT
+
F,,:,FFFFFFFF,F:F:FFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFF:FFF,F,FFF::FF:FF:,F,F,F,,FFFF:,F,FFF:,,,,,,,F:F,,FFF,F,,F,FF:,,FFF,F::FF,:,F,:F,,,F,,F,::,FF,F:,,

@A00289:600:HKWHJDSX2:3:1621:11089:36385 2:N:0:AGAGTGTACG+TCTCACTTGC
CGAGGGACGAGGGACGAGGACGGGAAGAGGAAGAGGAAGAGGAAGAGGAAGAGGAGGAGAGGAGGACTGGTGAGGGGTGGAAGAGTGTCTCGAAGGGACGATCGAGTGCGCCCCTCTGCCCCGCCACTGCTCGGGCGGCGCCCCCCCCCC
+
FFFFFFFFFF,FFFFFFF:FFFFFFFFFF:FF:FFFFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFF:FFFF:FFFFFFFFFFFFFFF:FFFFFF:FFFFF,FF:,F::,::F,:F,:,F,F,,,,,::,,,:,,:F:,::

-k 8
A00289:600:HKWHJDSX2:3:1621:11089:36385 SRHO01000003.1 29833 121 121 26 300 7
A00289:600:HKWHJDSX2:3:1621:11089:36385 JAJLSB010000003.1 29833 121 121 26 300 7
A00289:600:HKWHJDSX2:3:1621:11089:36385 JAPEZO010000024.1 6239 121 121 26 300 7
A00289:600:HKWHJDSX2:3:1621:11089:36385 LYYD01000171.1 6265 121 121 26 300 7
A00289:600:HKWHJDSX2:3:1621:11089:36385 JAKKQA010001249.1 166010 121 121 26 300 7
A00289:600:HKWHJDSX2:3:1621:11089:36385 CAUEEP010000219.1 2138241 121 121 26 300 7
A00289:600:HKWHJDSX2:3:1621:11089:36385 CAUEEP010000724.1 2138241 121 121 26 300 7

-k 6
A00289:600:HKWHJDSX2:3:1621:11089:36385 species 6239 121 121 26 300 5
A00289:600:HKWHJDSX2:3:1621:11089:36385 species 6265 121 121 26 300 5
A00289:600:HKWHJDSX2:3:1621:11089:36385 species 29833 121 121 26 300 5
A00289:600:HKWHJDSX2:3:1621:11089:36385 species 166010 121 121 26 300 5
A00289:600:HKWHJDSX2:3:1621:11089:36385 species 2138241 121 121 26 300 5

-k 4
A00289:600:HKWHJDSX2:3:1621:11089:36385 order 4892 121 121 26 300 4
A00289:600:HKWHJDSX2:3:1621:11089:36385 infraorder 6249 121 121 26 300 4
A00289:600:HKWHJDSX2:3:1621:11089:36385 infraorder 33283 121 121 26 300 4
A00289:600:HKWHJDSX2:3:1621:11089:36385 infraorder 2301119 121 121 26 300 4

-k 2
A00289:600:HKWHJDSX2:3:1621:11089:36385 class 4891 121 121 26 300 2
A00289:600:HKWHJDSX2:3:1621:11089:36385 class 119089 121 121 26 300 2

-k 1
A00289:600:HKWHJDSX2:3:1621:11089:36385 no rank 0 121 121 26 300 1

@mourisl
Copy link
Owner

mourisl commented May 19, 2024

Thank you for sharing the file. I think this index is still for the fungi,invertebrate together.

For the -k 1 issue, I looked into the taxonomy structure, their lowest common ancestor is 2759 at the superkingdom level (the clade level for the taxonomy ID 33154 is kind of ignored because clade is not a standard taxonomy rank). However, they have ancestors 33208 and 4751 at the kingdom level, which are internally regarded as the same level as the superkingdom level. Therefore, it will try to move to the level above the kingdom/superkingdom level, which leads to no_rank level. Hope this helps.

@permia permia closed this as completed May 20, 2024
@mourisl
Copy link
Owner

mourisl commented Jul 3, 2024

I have updated Centrifuger's behavior in this case in the new release (https://github.com/mourisl/centrifuger/releases/tag/v1.0.4). If the LCA promotes to a no_rank level, the taxonomy ID will be 1 instead of 0 to distinguish from unclassified reads. Thank you for reporting this issue!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants