Wengan

An accurate and ultra-fast genome assembler

Version: 0.2 (18/05/2020)

SYNOPSIS

# Assembling Oxford Nanopore and Illumina reads with WenganM
 wengan.pl -x ontraw -a M -s lib1.fwd.fastq.gz,lib1.rev.fastq.gz -l ont.fastq.gz -p asm1 -t 20 -g 3000

# Assembling PacBio reads and Illumina reads with WenganA
 wengan.pl -x pacraw -a A -s lib1.fwd.fastq.gz,lib1.rev.fastq.gz -l pac.fastq.gz -p asm2 -t 20 -g 3000

# Assembling ultra-long Nanopore reads and BGI reads with WenganM
 wengan.pl -x ontlon -a M -s lib2.fwd.fastq.gz,lib2.rev.fastq.gz -l ont.fastq.gz -p asm3 -t 20 -g 3000

# Hybrid long-read only assembly of PacBio Circular Consensus Sequence and Nanopore data with WenganM
 wengan.pl -x ccsont -a M -l ont.fastq.gz -b ccs.fastq.gz -p asm4 -t 20 -g 3000

# Assembling ultra-long Nanopore reads and Illumina reads with WenganD (needs a high-memory machine, about 600GB of RAM)
 wengan.pl -x ontlon -a D -s lib2.fwd.fastq.gz,lib2.rev.fastq.gz -l ont.fastq.gz -p asm5 -t 20 -g 3000

# Assembling pacraw reads with pre-assembled short-read contigs from Minia3
 wengan.pl -x pacraw -a M -s lib1.fwd.fastq.gz,lib1.rev.fastq.gz -l pac.fastq.gz -p asm6 -t 20 -g 3000 -c contigs.minia.fa

# Assembling pacraw reads with pre-assembled short-read contigs from Abyss
 wengan.pl -x pacraw -a A -s lib1.fwd.fastq.gz,lib1.rev.fastq.gz -l pac.fastq.gz -p asm7 -t 20 -g 3000 -c contigs.abyss.fa

# Assembling pacraw reads with pre-assembled short-read contigs from DiscovarDenovo
 wengan.pl -x pacraw -a D -s lib1.fwd.fastq.gz,lib1.rev.fastq.gz -l pac.fastq.gz -p asm8 -t 20 -g 3000 -c contigs.disco.fa

Description

Wengan is a new genome assembler that, unlike most current long-read assemblers, entirely avoids the all-vs-all read comparison. The key idea behind Wengan is that long-read alignments can be inferred by building paths on a sequence graph. To achieve this, Wengan builds a new sequence graph called the Synthetic Scaffolding Graph (SSG). The SSG is built from a spectrum of synthetic mate-pair libraries extracted from raw long reads. Longer alignments are then built by performing a transitive reduction of the edges. Another distinctive feature of Wengan is that it performs self-validation by following the read information, which lets it identify misassemblies at different steps of the assembly process. For more information about the algorithmic ideas behind Wengan, please read the preprint available on bioRxiv.

Short-read assembly

Wengan uses a de Bruijn graph assembler to build the assembly backbone from short-read data. Currently, Wengan can use Minia3, Abyss2 or DiscovarDenovo. The recommended short-read coverage is 50-60X of 2 x 150bp or 2 x 250bp reads.

WenganM [M]

This Wengan mode uses the Minia3 short-read assembler. This is the fastest mode of Wengan and can assemble a complete human genome in less than 210 CPU hours (~50GB of RAM).

WenganA [A]

This Wengan mode uses the Abyss2 short-read assembler. This is the lowest memory mode of Wengan and can assemble a complete human genome with less than 40GB of RAM (~900 CPU hours). This assembly mode takes ~2 days when using 20 CPUs on a single machine.

WenganD [D]

This Wengan mode uses the DiscovarDenovo short-read assembler. This is the most memory-hungry mode of Wengan: assembling a complete human genome needs about 600GB of RAM (~900 CPU hours). This assembly mode takes ~2 days when using 20 CPUs on a single machine.
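
The resource figures quoted for the three modes can be turned into a quick sanity check before launching a run. The following is a minimal sketch, not part of Wengan: the function names `min_ram_gb` and `wall_hours` are made up for illustration, the RAM values are the approximate human-genome figures from this section, and the wall-clock estimate assumes CPU hours divide evenly across threads, which is an idealisation.

```shell
#!/bin/sh
# Approximate RAM (GB) quoted in this README for a human genome, per mode.
# Hypothetical helper, not shipped with Wengan.
min_ram_gb() {
  case "$1" in
    M) echo 50 ;;
    A) echo 40 ;;
    D) echo 600 ;;
    *) echo "unknown mode: $1" >&2; return 1 ;;
  esac
}

# Wall-clock hours (rounded up), assuming CPU hours split evenly
# across threads; real pipelines usually scale less than linearly.
wall_hours() {
  cpu_hours=$1
  threads=$2
  echo $(( (cpu_hours + threads - 1) / threads ))
}

# Example: WenganA is quoted at ~900 CPU hours, checked here on 20 CPUs.
echo "WenganA: $(min_ram_gb A) GB RAM, ~$(wall_hours 900 20) h on 20 CPUs"
```

Under these assumptions, 900 CPU hours on 20 CPUs come to 45 hours, which is consistent with the "~2 days" quoted for WenganA and WenganD.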

Long-read presets

The presets define several variables of the Wengan pipeline execution and depend on the long-read technology used to sequence the genome. The recommended long-read coverage is 30X.
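
To check a read set against the recommended coverage, one can divide the total number of sequenced bases by the genome size. The sketch below is an illustration rather than a Wengan tool: `fastq_coverage` is a hypothetical helper, it assumes FASTQ records of exactly four lines each, and the genome size is given in bases.

```shell
#!/bin/sh
# Sum the bases in a FASTQ stream (the sequence is line 2 of every
# 4-line record) and divide by the genome size in bases.
# Prints the fold coverage with one decimal place.
fastq_coverage() {
  genome_size_bp=$1
  awk -v g="$genome_size_bp" '
    NR % 4 == 2 { bases += length($0) }
    END { printf "%.1f\n", bases / g }'
}

# Usage with a real file (the file name is a placeholder):
#   gzip -dc ont.fastq.gz | fastq_coverage 3000000000
```

A result close to 30 for the long reads (or 50-60 for the short reads) matches the recommendations given in this README.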

ontlon

preset for raw ultra-long reads from Oxford Nanopore, typically with an N50 > 50kb.

ontraw

preset for raw Nanopore reads typically with an N50 ~[15kb-40kb].

pacraw

preset for raw long reads from Pacific Biosciences (PacBio), typically with an N50 ~[8kb-60kb].

pacccs (experimental)

preset for Circular Consensus Sequences from Pacific Biosciences (PacBio), typically with an N50 of ~15kb. This type of data is not fully supported yet.

Wengan demo

The repository wengan_demo contains a small dataset and instructions to test Wengan v0.2.

#fetch the demo dataset
git clone https://github.com/adigenova/wengan_demo.git

Wengan benchmark

Genome	Long reads	Short reads	Wengan Mode	NG50 (Mb)	CPU (h)	RAM (GB)	Fasta file
NA12878	ONT 35X (rel5)	2x150bp 50X (GIAB:rs1, rs2)	WenganA	23.08	671	45	asm
NA12878	ONT 35X (rel5)	2x150bp 50X (GIAB:rs1, rs2)	WenganM	16.67	185	53	asm
NA12878	ONT 35X (rel5)	2x250bp 60X (ENA:rs1, rs2)	WenganD	33.13	550	622	asm
HG00073	PAC 90X (ENA:rl1)	2x250bp 63X (ENA:rs1, rs2)	WenganD	29.2	800	644	asm
NA24385	ONT 60X (GIAB:rl1)	2x250bp 70X (GIAB:rs1)	WenganD	48.8	910	650	asm
CHM13	ONT 50X (T2T:rel2)	2x250bp 66X (ENA:rs1, rs2)	WenganD	57.4	1027	647	asm

The assemblies generated using Wengan can be downloaded from Zenodo. All the assemblies were run as described in the Wengan preprint. NG50 was computed using a genome size of 3.14Gb.
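
For reference, NG50 is the length of the contig at which the cumulative length of contigs, taken from longest to shortest, first reaches half of the genome size (rather than half of the assembly size, as in N50). The sketch below shows one way to compute it; it is not the script used for the benchmark, and the helper names `fasta_lengths` and `ng50` are assumptions made for illustration.

```shell
#!/bin/sh
# Print the length of each sequence in a FASTA stream, one per line.
fasta_lengths() {
  awk '/^>/ { if (len) print len; len = 0; next }
       { len += length($0) }
       END { if (len) print len }'
}

# Read one contig length per line on stdin and print the NG50 for the
# given genome size (both in bases). Prints 0 when the contigs cover
# less than half of the genome.
ng50() {
  genome_size=$1
  sort -rn | awk -v g="$genome_size" '
    { sum += $1; if (sum >= g / 2) { print $1; found = 1; exit } }
    END { if (!found) print 0 }'
}

# Usage with the genome size quoted above (the file name is a placeholder):
#   fasta_lengths < asm.fa | ng50 3140000000
```

Because the threshold depends on the genome size and not on the assembly, NG50 values are comparable across assemblies of the same genome.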

Wengan components

A de Bruijn graph assembler (Minia, Abyss or DiscovarDenovo)
FastMIN-SG
IntervalMiss
Liger

Getting the latest source code

Instructions

It is recommended to download and use the latest binary release (Linux) from: https://github.com/adigenova/wengan/releases

Building Wengan from source

To build Wengan from source, first fetch the code and its components with the following command:

#fetch Wengan and its components
git clone --recursive https://github.com/adigenova/wengan.git wengan

Each Wengan component has its own build instructions. After compilation, copy the resulting binaries to wengan-dir/bin.
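
The copy step can be scripted once the components are built. The following is only a sketch: `copy_binaries` is a hypothetical helper, and the build directory in the usage comment is a placeholder, since the real output location depends on each component's build instructions.

```shell
#!/bin/sh
# Copy every executable regular file found directly inside a build
# directory into the destination bin directory (created if missing).
copy_binaries() {
  build_dir=$1
  bin_dir=$2
  mkdir -p "$bin_dir"
  find "$build_dir" -maxdepth 1 -type f -perm -u+x -exec cp {} "$bin_dir"/ \;
}

# Hypothetical usage; replace the first path with a component's real
# build output directory:
#   copy_binaries components/some_component/build wengan-dir/bin
```

Filtering on the executable bit keeps object files and logs out of the bin directory, but it will also pick up any executable script in the build directory, so the result is worth a quick look.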

Requirements

A C++ compiler; compilation was tested with GCC 7.3.0-2.30 (Linux) and clang-1000.11.45.5 (Mac OSX).
CMake 3.2+.

Specific component source code versions used to build Wengan v0.2

abyss commit d4b4b5d
discovarexp-51885 commit f827bab
minia commit 017d23e
fastmin-sg commit 861b061
intervalmiss commit 11be8b42
liger commit 63a044b0
seqtk commit 2efd0c8

Limitations

1.- Genomes larger than 4Gb are not supported yet.

About the name

Wengan is a Mapudungun word. Mapudungun is the language of the Mapuche people, the largest indigenous group of south-central Chile. Wengan means "Making the path".

Citation

Alex Di Genova, Elena Buena-Atienza, Stephan Ossowski, Marie-France Sagot. Wengan: Efficient and high quality hybrid de novo assembly of human genomes. bioRxiv preprint.
