Skip to content

An accurate and ultra-fast hybrid genome assembler

License

Notifications You must be signed in to change notification settings

sailfish009/wengan

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

46 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

HitCount

Wengan

An accurate and ultra-fast genome assembler

Version: 0.2 (18/05/2020)

Table of Contents

SYNOPSIS

# Assembling Oxford Nanopore and Illumina reads with WenganM
 wengan.pl -x ontraw -a M -s lib1.fwd.fastq.gz,lib1.rev.fastq.gz -l ont.fastq.gz -p asm1 -t 20 -g 3000

# Assembling PacBio reads and Illumina reads with WenganA
 wengan.pl -x pacraw -a A -s lib1.fwd.fastq.gz,lib1.rev.fastq.gz -l pac.fastq.gz -p asm2 -t 20 -g 3000

# Assembling ultra-long Nanopore reads and BGI reads with WenganM
 wengan.pl -x ontlon -a M -s lib2.fwd.fastq.gz,lib2.rev.fastq.gz -l ont.fastq.gz -p asm3 -t 20 -g 3000

# Hybrid long-read only assembly of PacBio Circular Consensus Sequence and Nanopore data with WenganM
 wengan.pl -x ccsont -a M -l ont.fastq.gz -b ccs.fastq.gz -p asm4 -t 20 -g 3000

# Assembling ultra-long Nanopore reads and Illumina reads with WenganD (need a high memory machine 600GB)
 wengan.pl -x ontlon -a D -s lib2.fwd.fastq.gz,lib2.rev.fastq.gz -l ont.fastq.gz -p asm5 -t 20 -g 3000

# Assembling pacraw reads with pre-assembled short-read contigs from Minia3
 wengan.pl -x pacraw -a M -s lib1.fwd.fastq.gz,lib1.rev.fastq.gz -l pac.fastq.gz -p asm6 -t 20 -g 3000 -c contigs.minia.fa

# Assembling pacraw reads with pre-assembled short-read contigs from Abyss
 wengan.pl -x pacraw -a A -s lib1.fwd.fastq.gz,lib1.rev.fastq.gz -l pac.fastq.gz -p asm7 -t 20 -g 3000 -c contigs.abyss.fa

# Assembling pacraw reads with pre-assembled short-read contigs from DiscovarDenovo
 wengan.pl -x pacraw -a D -s lib1.fwd.fastq.gz,lib1.rev.fastq.gz -l pac.fastq.gz -p asm8 -t 20 -g 3000 -c contigs.disco.fa

Description

Wengan is a new genome assembler that, unlike most of the current long-reads assemblers, avoids entirely the all-vs-all read comparison. The key idea behind Wengan is that long-read alignments can be inferred by building paths on a sequence graph. To achieve this, Wengan builds a new sequence graph called the Synthetic Scaffolding Graph (SSG). The SSG is built from a spectrum of synthetic mate-pair libraries extracted from raw long-reads. Longer alignments are then built by performing a transitive reduction of the edges. Another distinct feature of Wengan is that it performs self-validation by following the read information. Wengan identifies miss-assemblies at different steps of the assembly process. For more information about the algorithmic ideas behind Wengan, please read the preprint available in bioRxiv.

Short-read assembly

Wengan uses a de Bruijn graph assembler to build the assembly backbone from short-read data. Currently, Wengan can use Minia3, Abyss2 or DiscoVarDenovo. The recommended short-read coverage is 50-60X of 2 x 150bp or 2 x 250bp reads.

WenganM [M]

This Wengan mode uses the Minia3 short-read assembler. This is the fastest mode of Wengan and can assemble a complete human genome in less than 210 CPU hours (~50GB of RAM).

WenganA [A]

This Wengan mode uses the Abyss2 short-read assembler. This is the lowest memory mode of Wengan and can assemble a complete human genome with less than 40GB of RAM (~900 CPU hours). This assembly mode takes ~2 days when using 20 CPUs on a single machine.

WenganD [D]

This Wengan mode uses the DiscovarDenovo short-read assembler. This is the greedier memory mode of Wengan and for assembling a complete human genome needs about 600GB of RAM (~900 CPU hours). This assembly mode takes ~2 days when using 20 CPUs on a single machine.

Long-read presets

The presets define several variables of the Wengan pipeline execution and depend on the long-read technology used to sequence the genome. The recommended long-read coverage is 30X.

ontlon

preset for raw ultra-long-reads from Oxford Nanopore, typically with an N50 > 50kb.

ontraw

preset for raw Nanopore reads typically with an N50 ~[15kb-40kb].

pacraw

preset for raw long-reads from Pacific Bioscience (PacBio) typically with an N50 ~[8kb-60kb].

pacccs (experimental)

preset for Circular Consensus Sequences from Pacific Bioscience (PacBio) typically with an N50 ~[15kb]. This type of data is not fully supported yet.

Wengan demo

The repository wengan_demo contains a small dataset and instructions to test Wengan v0.2.

#fetch the demo dataset
git clone https://github.com/adigenova/wengan_demo.git

Wengan benchmark

Genome Long reads Short reads Wengan Mode NG50 (Mb) CPU (h) RAM (GB) Fasta file
2x150bp 50X (GIAB:rs1 , rs2) WenganA 23.08 671 45 asm
NA12878 ONT 35X (rel5) 2x150bp 50X (GIAB:rs1 , rs2) WenganM 16.67 185 53 asm
2x250bp 60X (ENA:rs1 , rs2) WenganD 33.13 550 622 asm
HG00073 PAC 90X (ENA:rl1) 2x250bp 63X (ENA:rs1 , rs2) WenganD 29.2 800 644 asm
NA24385 ONT 60X (GIAB:rl1) 2x250bp 70X (GIAB:rs1) WenganD 48.8 910 650 asm
CHM13 ONT 50X (T2T:rel2) 2x250bp 66X (ENA:rs1 , rs2) WenganD 57.4 1027 647 asm

The assemblies generated using Wengan can be downloaded from Zenodo. All the assemblies were ran as described in the Wengan preprint. NG50 was computed using a genome size of 3.14Gb.

Wengan components

Getting the latest source code

Instructions

It is recommended to use/download the latest binary release (Linux) from : https://github.com/adigenova/wengan/releases

Building Wengan from source

To compile Wengan run the following command:

#fetch Wengan and its components
git clone --recursive https://github.com/adigenova/wengan.git wengan

There are specific instructions for each Wengan component. After compilation you have to copy the binaries to wengan-dir/bin.

Requirements

c++ compiler; compilation was tested with gcc version GCC/7.3.0-2.30 (Linux) and clang-1000.11.45.5 (Mac OSX). cmake 3.2+.

Specific component source code versions used to build Wengan v0.2

  1. abyss commit d4b4b5d
  2. discovarexp-51885 commit f827bab
  3. minia commit 017d23e
  4. fastmin-sg commit 861b061
  5. intervalmiss commit 11be8b42
  6. liger commit 63a044b0
  7. seqtk commit 2efd0c8

Limitations

1.- Genomes larger than 4Gb are not supported yet.

About the name

Wengan is a Mapudungun word. Mapudungun is the language of the Mapuche people, the largest indigenous inhabitants of south-central Chile. Wengan means "Making the path".

Citation

Alex Di Genova, Elena Buena-Atienza, Stephan Ossowski, Marie-France Sagot. Wengan: Efficient and high quality hybrid de novo assembly of human genomes. BioRxiv, link

About

An accurate and ultra-fast hybrid genome assembler

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Languages

  • Perl 100.0%