Some warning messages during centrifuger-download. "unexpected end of file", "invalid compressed data--format violated", "invalid compressed data--length error" #8

Closed
permia opened this issue May 15, 2024 · 9 comments


permia commented May 15, 2024

Hi, when I run the following command to download genomes, the log contains some warning messages. Is this a problem I need to fix? Centrifuger (v1.0.1-r89) was installed via conda.

nohup centrifuger-download \
-P 10 \
-o library \
-d 'fungi,invertebrate' \
-a 'Any' \
-t '54126,166010,6238,2138241,1611254,31234,70226,6239,2598192,135651,29833,289476,2878363,96644,34508,6265' \
genbank \
2> download.log \
> seqid2taxid.map &

The logs (partial):

Downloading https://ftp.ncbi.nlm.nih.gov/genomes/genbank/invertebrate/assembly_summary.txt ...
Downloading 95 invertebrate genomes at assembly level Any ... (will take a while)

gzip: library/invertebrate/GCA_000975215.1_Cael_CB4856_1.0_genomic.fna.gz: unexpected end of file

Progress : [----------------------------------------] 1% 1/95
gzip: library/invertebrate/GCA_029581135.1_ASM2958113v1_genomic.fna.gz: unexpected end of file

Progress : [----------------------------------------] 2% 2/95
gzip: library/invertebrate/GCA_000939815.1_C_elegans_Bristol_N2_v1_5_4_genomic.fna.gz: unexpected end of file

Progress : [#---------------------------------------] 3% 3/95
gzip: library/invertebrate/GCA_037024065.1_ASM3702406v1_genomic.fna.gz: unexpected end of file

Progress : [#---------------------------------------] 4% 4/95
gzip: library/invertebrate/GCA_021491975.1_ASM2149197v1_genomic.fna.gz: unexpected end of file

Progress : [##--------------------------------------] 5% 5/95
gzip: library/invertebrate/GCA_000002985.3_WBcel235_genomic.fna.gz: unexpected end of file

Progress : [##--------------------------------------] 6% 6/95
gzip: library/invertebrate/GCA_022453885.1_ASM2245388v1_genomic.fna.gz: unexpected end of file

Progress : [##--------------------------------------] 7% 7/95
gzip: library/invertebrate/GCA_000004555.3_CB4_genomic.fna.gz: unexpected end of file

Progress : [###-------------------------------------] 8% 8/95
gzip: library/invertebrate/GCA_037024035.1_ASM3702403v1_genomic.fna.gz: unexpected end of file

Progress : [###-------------------------------------] 9% 9/95
gzip: library/invertebrate/GCA_037024025.1_ASM3702402v1_genomic.fna.gz: unexpected end of file

Progress : [####------------------------------------] 10% 10/95
Progress : [----------------------------------------] 1% 1/95
Progress : [----------------------------------------] 2% 2/95
Progress : [#---------------------------------------] 3% 3/95
Progress : [#---------------------------------------] 4% 4/95
Progress : [##--------------------------------------] 5% 5/95
Progress : [##--------------------------------------] 6% 6/95
Progress : [##--------------------------------------] 7% 7/95
Progress : [###-------------------------------------] 8% 8/95
Progress : [###-------------------------------------] 9% 9/95
Progress : [####------------------------------------] 10% 10/95
gzip: library/invertebrate/GCA_016989285.1_JU2526_Canu_genomic.fna.gz: invalid compressed data--format violated

Progress : [####------------------------------------] 11% 11/95
gzip: library/invertebrate/GCA_013403715.1_ASM1340371v1_genomic.fna.gz: invalid compressed data--format violated

Progress : [####------------------------------------] 12% 12/95
gzip: library/invertebrate/GCA_004526295.1_ASM452629v1_genomic.fna.gz: invalid compressed data--format violated

Progress : [#####-----------------------------------] 13% 13/95
gzip: library/invertebrate/GCA_016989115.1_NIC526_Canu_genomic.fna.gz: invalid compressed data--format violated

Progress : [#####-----------------------------------] 14% 14/95
gzip: library/invertebrate/GCA_016989125.1_XZ1516_Canu_genomic.fna.gz: invalid compressed data--format violated

Progress : [######----------------------------------] 15% 15/95
gzip: library/invertebrate/GCA_016989275.1_JU310_Canu_genomic.fna.gz: invalid compressed data--format violated

Progress : [######----------------------------------] 16% 16/95
gzip: library/invertebrate/GCA_900160655.1_spades_ilmn_draft_assembly.fasta.gz_genomic.fna.gz: invalid compressed data--crc error

gzip: library/invertebrate/GCA_900160655.1_spades_ilmn_draft_assembly.fasta.gz_genomic.fna.gz: invalid compressed data--length error

......... (etc)
Owner

mourisl commented May 15, 2024

Could you please download one of the genomic files directly, e.g. with "curl -s -o test.fna.gz https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/975/215/GCA_000975215.1_Cael_CB4856_1.0/GCA_000975215.1_Cael_CB4856_1.0_genomic.fna.gz", and check whether test.fna.gz is intact?

Author

permia commented May 15, 2024

Could you please download one of the genomic files directly, e.g. with "curl -s -o test.fna.gz https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/975/215/GCA_000975215.1_Cael_CB4856_1.0/GCA_000975215.1_Cael_CB4856_1.0_genomic.fna.gz", and check whether test.fna.gz is intact?

$ curl -s -o test.fna.gz https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/975/215/GCA_000975215.1_Cael_CB4856_1.0/GCA_000975215.1_Cael_CB4856_1.0_genomic.fna.gz
$ gunzip -t /data5/nematode/rawdata/unmapped_reads/test.fna.gz
$ echo $?
0

I checked the integrity of the downloaded fna.gz file with gunzip -t, and the exit status shows the file is OK. I also checked some of the files that centrifuger-download warned about, and they are OK too.

$ gunzip -t /data/Centrifuger_db/library/invertebrate/GCA_000975215.1_Cael_CB4856_1.0_genomic.fna.gz
$ gunzip -t /data/Centrifuger_db/library/invertebrate/GCA_013403715.1_ASM1340371v1_genomic.fna.gz
$ gunzip -t /data/Centrifuger_db/library/invertebrate/GCA_900160655.1_spades_ilmn_draft_assembly.fasta.gz_genomic.fna.gz
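
The spot checks above can be extended to every downloaded file. The following is a minimal sketch, not part of centrifuger-download: the function name check_gz_dir is hypothetical, and it assumes the genomes are stored as *.fna.gz files somewhere under the given directory, as in the paths shown in this thread.

```shell
# Print the path of every *.fna.gz file under directory "$1" that fails `gzip -t`.
# check_gz_dir is a hypothetical helper name, not a Centrifuger command.
check_gz_dir() {
    find "$1" -type f -name '*.fna.gz' -exec sh -c '
        for f in "$@"; do
            # gzip -t exits non-zero for truncated or corrupt archives
            gzip -t "$f" 2>/dev/null || printf "%s\n" "$f"
        done
    ' sh {} +
}

# Example call, using the library path from this thread:
# check_gz_dir /data/Centrifuger_db/library
```

An empty result means every archive passed the test; any path printed is a candidate for re-downloading.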

Owner

mourisl commented May 15, 2024

This is very strange... Does the seqid2taxid.map look right? For example, does it have about 109 rows? What is the content of the line containing "CM003206.1"?

Owner

mourisl commented May 15, 2024

Oh, I just noticed the other issue with classification results is also from you. I guess this is a strange issue from gzip then...

Author

permia commented May 16, 2024

This is very strange... Does the seqid2taxid.map look right? For example, does it have about 109 rows? What is the content of the line containing "CM003206.1"?

Why would seqid2taxid.map contain only about 109 rows? Mine has far more:
cat /data/Centrifuger_db/seqid2taxid.map | wc # 514610 1029221 13289800

I also found duplicated records in seqid2taxid.map. This could be an issue with the GenBank data itself.
cat /data/Centrifuger_db/seqid2taxid.map | sort | uniq | wc # 345392 690785 8931928

For example, this sequence ID appears 3215 times:
cat /data/Centrifuger_db/seqid2taxid.map | grep "JAWIRP010000020.1" | wc # 3215 6430 83590

The line containing "CM003206.1":
cat /data/Centrifuger_db/seqid2taxid.map | grep "CM003206" # CM003206.1 6239
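
The duplicate check above can be wrapped into two small helpers. This is only a sketch: the function names map_dup_summary and map_conflicts are hypothetical, and it assumes each line of the map holds a sequence ID followed by a taxonomy ID separated by whitespace, as the "CM003206.1 6239" line suggests.

```shell
# Report total lines, distinct lines, and the number of redundant copies.
# map_dup_summary is a hypothetical helper name.
map_dup_summary() {
    total=$(wc -l < "$1" | tr -d ' ')
    unique=$(sort -u "$1" | wc -l | tr -d ' ')
    printf 'total=%s unique=%s duplicated=%s\n' \
        "$total" "$unique" "$((total - unique))"
}

# List sequence IDs that are mapped to more than one distinct taxonomy ID.
# Exact duplicate lines are collapsed first, so they are not reported here.
map_conflicts() {
    sort -u "$1" | awk '{ print $1 }' | sort | uniq -d
}

# Example calls, using the path from this thread:
# map_dup_summary /data/Centrifuger_db/seqid2taxid.map
# map_conflicts   /data/Centrifuger_db/seqid2taxid.map
```

Separating the two cases helps tell harmless repeated lines (same ID, same taxid) from conflicting assignments, which would be more worrying.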

Owner

mourisl commented May 16, 2024

That's my mistake. I just tried to download some of the vertebrate and fungi genomes for the test. Seems the parsing for the seqid2taxid.map file was correct.

Author

permia commented May 17, 2024

That's my mistake. I just tried to download some of the vertebrate and fungi genomes for the test. Seems the parsing for the seqid2taxid.map file was correct.

I updated Centrifuger to v1.0.3-r119 because centrifuger-download was not responding. The underlying cause turned out to be NCBI maintenance yesterday.
I re-downloaded the data with centrifuger-download after the maintenance. Now there are no warning messages, and the download is faster than before.

The seqid2taxid.map file also seems correct.

sort /data/Centrifuger_db/library/seqid2taxid.map | uniq | wc # 515080 1030166 13289800
sort /data/Centrifuger_db/library/seqid2taxid.map | wc # 515080 1030166 13289800

Could the previous download issue be due to a slow network?

Owner

mourisl commented May 17, 2024

I think you are right, the download was slow and interrupted due to the maintenance from NCBI (where did you find the notice? just curious.) Glad it works out today!
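
For anyone who hits interrupted downloads like this, one way to guard against them outside of centrifuger-download is to fetch a file with curl, test it with gzip -t, and retry on failure. The sketch below is an assumption-laden illustration, not how centrifuger-download works internally: fetch_and_verify is a hypothetical helper, and the attempt count and delay are arbitrary defaults.

```shell
# Download URL "$1" to path "$2", keeping the file only if `gzip -t` accepts it.
# "$3" = maximum attempts (default 3), "$4" = seconds between attempts (default 5).
# fetch_and_verify is a hypothetical helper name, not a Centrifuger command.
fetch_and_verify() {
    url=$1
    out=$2
    tries=${3:-3}
    delay=${4:-5}
    i=1
    while [ "$i" -le "$tries" ]; do
        # --fail makes curl exit non-zero on HTTP errors instead of saving the error page
        if curl -s --fail -o "$out" "$url" && gzip -t "$out" 2>/dev/null; then
            return 0
        fi
        rm -f "$out"        # discard the truncated or corrupt file
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}

# Example call, using the URL from this thread:
# fetch_and_verify \
#   https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/975/215/GCA_000975215.1_Cael_CB4856_1.0/GCA_000975215.1_Cael_CB4856_1.0_genomic.fna.gz \
#   test.fna.gz
```

A non-zero return after all attempts suggests the server side is the problem, for example during a maintenance window, rather than a one-off network glitch.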

Author

permia commented May 17, 2024

I think you are right, the download was slow and interrupted due to the maintenance from NCBI (where did you find the notice? just curious.) Glad it works out today!

The same centrifuger-download script did not work the second time I ran it, which was frustrating, so I updated Centrifuger. It still did not work yesterday. Eventually I found that the problem was with the download itself, so I stopped trying until the next day. I can download the data now. However, the page still shows "Notice: Upcoming Maintenance Downtime", which is odd.

I will close this issue. There are also some problems with parsing seqid_to_taxid.map, which I will describe in #9.

@permia permia closed this as completed May 17, 2024