Sparse binary relation representations for genome graph annotation
Mikhail Karasikov, Harun Mustafa, Amir Joudaki, Sara Javadzadeh-No, Gunnar Rätsch, and André Kahles. Sparse Binary Relation Representations for Genome Graph Annotation. Apr 2020. Journal of Computational Biology, 27(4), 626-639. http://doi.org/10.1089/cmb.2019.0324
This repository implements the following schemes for representing graph annotation:
- Column-major compressed
- Row-major flat
- Rainbowfish
- BinRel-WT
- BRWT
- Multi-BRWT
- ...
This repository is no longer maintained. Check out the MetaGraph project for a significantly optimized and scaled-up implementation of Multi-BRWT as well as many other graph annotation representations.
As an underlying graph structure, the following representations are implemented:
- Hash-based de Bruijn graph
- Complete de Bruijn graph, taking constant space
The table below shows the final sizes of two compressed binary relations:
- Kingsford with 3.7 billion rows and 2,652 columns, density ~0.3%
- RefSeq (family) with 1 billion rows and 3,173 columns, density ~3.8%
Method | Kingsford, Gb | RefSeq, Gb |
---|---|---|
Column | 36.6 | 80.2 |
Flat | 41.2 | 121.6 |
BinRel-WT | 49.6 | N/A |
BinRel-WT (sdsl) | 31.4 | 150.6 |
Rainbowfish | 23.2 | 136.6 |
BRWT | 14.1 | 57.2 |
Multi-BRWT | 9.9 | 43.6 |
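To compare the schemes at a glance, the sizes above can be expressed relative to the smallest representation for each dataset. The snippet below is only an illustrative calculation on the numbers from the table; the dictionary layout and the helper name `ratios_to_best` are made up for this example and are not part of the repository.

```python
# Sizes copied from the table above (units as given there); None marks N/A.
SIZES = {
    "Column":           (36.6, 80.2),
    "Flat":             (41.2, 121.6),
    "BinRel-WT":        (49.6, None),
    "BinRel-WT (sdsl)": (31.4, 150.6),
    "Rainbowfish":      (23.2, 136.6),
    "BRWT":             (14.1, 57.2),
    "Multi-BRWT":       (9.9, 43.6),
}


def ratios_to_best(sizes, dataset_index):
    """Return {method: size / smallest size} for one dataset, skipping N/A entries."""
    known = {
        method: values[dataset_index]
        for method, values in sizes.items()
        if values[dataset_index] is not None
    }
    best = min(known.values())
    return {method: value / best for method, value in known.items()}


if __name__ == "__main__":
    for index, dataset in enumerate(("Kingsford", "RefSeq")):
        print(dataset)
        ranked = sorted(ratios_to_best(SIZES, index).items(), key=lambda kv: kv[1])
        for method, ratio in ranked:
            print(f"  {method:<18} {ratio:.2f}x")
```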
- cmake 3.6.1
- GNU GCC with C++17 (gcc-8 or higher) or LLVM Clang (clang-7 or higher)
- HTSlib
- boost
- folly (optional)
All can be installed with brew or linuxbrew:
brew install gcc autoconf automake libtool cmake make htslib
brew install --build-from-source boost
(optional) brew install --build-from-source double-conversion gflags glog lz4 snappy zstd folly
brew install gcc@8
Then set the environment variables so that cmake uses gcc-8:
echo "\
# Use gcc-8 with cmake
export CC=\"\$(which gcc-8)\"
export CXX=\"\$(which g++-8)\"
" >> $( [[ "$OSTYPE" == "darwin"* ]] && echo ~/.bash_profile || echo ~/.bashrc )
brew install llvm libomp autoconf automake libtool cmake make htslib boost folly
Then set the environment variables for Clang and OpenMP:
echo "\
# OpenMP
export LDFLAGS=\"\$LDFLAGS -L$(brew --prefix libomp)/lib\"
export CPPFLAGS=\"\$CPPFLAGS -I$(brew --prefix libomp)/include\"
# Clang C++ flags
export LDFLAGS=\"\$LDFLAGS -L$(brew --prefix llvm)/lib -Wl,-rpath,$(brew --prefix llvm)/lib\"
export CPPFLAGS=\"\$CPPFLAGS -I$(brew --prefix llvm)/include\"
export CXXFLAGS=\"\$CXXFLAGS -stdlib=libc++\"
# Path to Clang
export PATH=\"$(brew --prefix llvm)/bin:\$PATH\"
# Use Clang with cmake
export CC=\"\$(which clang)\"
export CXX=\"\$(which clang++)\"
" >> $( [[ "$OSTYPE" == "darwin"* ]] && echo ~/.bash_profile || echo ~/.bashrc )
git clone --recursive https://github.com/ratschlab/genome_graph_annotation
- make sure all submodules are downloaded:
git submodule update --init --recursive
- install third-party libraries from external-libraries/ following the corresponding instructions, or simply run the following script:
git submodule update --init --recursive
pushd external-libraries/sdsl-lite
./install.sh $(pwd)
popd
pushd external-libraries/libmaus2
cmake -DCMAKE_INSTALL_PREFIX:PATH=$(pwd) .
make -j $(($(getconf _NPROCESSORS_ONLN) - 1))
make install
popd
- go to the build directory
mkdir -p build && cd build
- compile:
cmake .. && make -j $(($(getconf _NPROCESSORS_ONLN) - 1))
- run unit tests
./unit_tests
- Linking against dynamic libraries in Anaconda when compiling libmaus2
- make sure that packages like Anaconda are not listed in the exported environment variables
- -DCMAKE_BUILD_TYPE=[Debug|Release|Profile] -- build modes (Release by default)
- -DBUILD_STATIC=[ON|OFF] -- link statically (OFF by default)
- -DWITH_AVX=[ON|OFF] -- compile with support for the AVX instructions (ON by default)
- Build de Bruijn graph from Fasta files, FastQ files, or KMC k-mer counters:
./annograph build
- Annotate graph using the column compressed annotation:
./annograph annotate
- Transform the built annotation to a different annotation scheme:
./annograph transform_anno
- Merge annotations (optional):
./annograph merge_anno
- Query annotated graph:
./annograph classify
./annograph build -k 12 -o tiny_example ../tests/data/tiny.fa
./annograph annotate -i tiny_example --anno-filename -o tiny_example ../tests/data/tiny.fa
./annograph classify -i tiny_example -a tiny_example.column.annodbg ../tests/data/tiny.fa
./annograph stats -a tiny_example tiny_example
For real benchmarking scripts, see scripts.
Compressed simulated binary relation matrices can be generated using the script experiments/run_benchmarks.py. Given a column count $N_COLUMNS, the simulation mode $MODE selects one of three available simulation modes:
- norepl: uniformly random matrix of size 1,000,000 x $N_COLUMNS,
- uniform_rows: 200,000 rows of size $N_COLUMNS duplicated 5 times to form a 1,000,000 x $N_COLUMNS matrix, and
- uniform_columns: $N_COLUMNS / 5 columns of size 1,000,000 duplicated 5 times to form a 1,000,000 x $N_COLUMNS matrix.
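The three simulation modes can be illustrated at toy scale with a short pure-Python sketch. This is not the generator used by run_benchmarks.py: the function names, the `density` parameter (the text above does not state one), the seed, and the order in which duplicated rows are stacked are all assumptions made for illustration.

```python
import random


def random_matrix(n_rows, n_cols, density, rng):
    """Uniformly random 0/1 matrix as a list of rows; each bit is 1 with probability `density`."""
    return [[int(rng.random() < density) for _ in range(n_cols)] for _ in range(n_rows)]


def simulate(mode, n_rows, n_cols, density=0.1, copies=5, seed=0):
    """Build a small matrix following one of the three simulation modes."""
    rng = random.Random(seed)
    if mode == "norepl":
        return random_matrix(n_rows, n_cols, density, rng)
    if mode == "uniform_rows":
        if n_rows % copies:
            raise ValueError("n_rows must be divisible by copies")
        base = random_matrix(n_rows // copies, n_cols, density, rng)
        # Stack the block of distinct rows `copies` times (ordering is an assumption).
        return [row[:] for _ in range(copies) for row in base]
    if mode == "uniform_columns":
        if n_cols % copies:
            raise ValueError("n_cols must be divisible by copies")
        base = random_matrix(n_rows, n_cols // copies, density, rng)
        # Place `copies` copies of each row's distinct columns side by side.
        return [row * copies for row in base]
    raise ValueError(f"unknown mode: {mode!r}")


if __name__ == "__main__":
    for mode in ("norepl", "uniform_rows", "uniform_columns"):
        matrix = simulate(mode, n_rows=10, n_cols=10)
        print(mode, len(matrix), "x", len(matrix[0]))
```

With n_rows=1,000,000 and copies=5 the shapes match the description above, although a list of lists is far too slow and memory-hungry at that scale; a packed bit representation would be needed in practice.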
The compressor $METHOD can be one of: brwt, bin_rel_wt, bin_rel_wt_sdsl, column, rbfish, flat.
An experiment can then be run with the command
run_benchmarks.py $METHOD $MODE $N_COLUMNS $N_THREADS
When . is passed in place of $METHOD and/or $MODE, all methods/modes are run.
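The effect of passing . can be sketched as a wildcard expansion over the method and mode names listed above. The helper names `expand` and `experiments` are hypothetical and the actual argument handling inside run_benchmarks.py may differ; only the method and mode names come from this README.

```python
from itertools import product

# Names taken from the lists in this README.
METHODS = ["brwt", "bin_rel_wt", "bin_rel_wt_sdsl", "column", "rbfish", "flat"]
MODES = ["norepl", "uniform_rows", "uniform_columns"]


def expand(value, choices):
    """Treat "." as "all choices"; otherwise validate and return the single value."""
    if value == ".":
        return list(choices)
    if value not in choices:
        raise ValueError(f"unknown value {value!r}; expected '.' or one of {choices}")
    return [value]


def experiments(method, mode):
    """List every (method, mode) pair selected by the two arguments."""
    return list(product(expand(method, METHODS), expand(mode, MODES)))


if __name__ == "__main__":
    for method, mode in experiments("brwt", "."):
        print(method, mode)
```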
For method brwt, additional parameters can be passed at the end of the command. These can be one of
- --arity <N>: generate a BRWT of arity N, or
- --greedy 1 --relax <N>: greedy optimization of column arrangement before construction of a BRWT of maximum arity N
All resulting matrices are saved to the simulate folder in the directory where the script is run.
To reproduce the simulated matrix experiment results from the manuscript, run the following commands
for N_COLUMNS in 500 1000 3000; do
run_benchmarks.py . . $N_COLUMNS $N_THREADS
run_benchmarks.py brwt . $N_COLUMNS $N_THREADS --greedy 1 --relax 7
done
To plot all data for the figures in the experiments, run the command
run_benchmarks.py plot $N_COLUMNS
An alternative set of methods can be passed as subsequent arguments if desired, for example
run_benchmarks.py plot $N_COLUMNS brwt_arity_2 brwt_greedy_relax_6 bin_rel_wt