Merge remote-tracking branch 'origin/master' into rgfa2

glennhickey committed Nov 21, 2023
2 parents 55e2770 + ab6acf2 commit d30c5c2
Showing 36 changed files with 413 additions and 203 deletions.
1 change: 1 addition & 0 deletions .gitlab-ci.yml
@@ -29,6 +29,7 @@ test-job:
- export ASAN_OPTIONS="detect_leaks=0"
- CGL_DEBUG=ultra make -j 8
- CACTUS_BINARIES_MODE=local SON_TRACE_DATASETS=$(pwd)/cactusTestData CACTUS_TEST_CHOICE=normal make test
- pip install -U newick
- make -j 8 hal_test
# rebuild without all the debug flags
- make clean
12 changes: 6 additions & 6 deletions BIN-INSTALL.md
@@ -6,18 +6,18 @@ pre-compile binary, static linked distribution.
## Extracting
If you have not already done so, extract the distribution and cd into the cactus directory:
```
tar -xzf cactus-bin-v2.6.8.tar.gz
cd cactus-bin-v2.6.8
tar -xzf cactus-bin-v2.6.13.tar.gz
cd cactus-bin-v2.6.13
```

## Setup

To build and activate a Python virtualenv, follow these steps. This requires Python version >= 3.7 (so Ubuntu 18.04 users should use `-p python3.8` below):
```
virtualenv -p python3 venv-cactus-v2.6.8
printf "export PATH=$(pwd)/bin:\$PATH\nexport PYTHONPATH=$(pwd)/lib:\$PYTHONPATH\n" >> venv-cactus-v2.6.8/bin/activate
source venv-cactus-v2.6.8/bin/activate
python3 -m pip install -U setuptools pip
virtualenv -p python3 venv-cactus-v2.6.13
printf "export PATH=$(pwd)/bin:\$PATH\nexport PYTHONPATH=$(pwd)/lib:\$PYTHONPATH\n" >> venv-cactus-v2.6.13/bin/activate
source venv-cactus-v2.6.13/bin/activate
python3 -m pip install -U setuptools pip wheel
python3 -m pip install -U .
python3 -m pip install -U -r ./toil-requirement.txt
```
3 changes: 3 additions & 0 deletions Dockerfile
@@ -87,6 +87,9 @@ RUN chmod 777 /opt/cactus/wrapper.sh
# log the memory usage (with --realTimeLogging) for local commands
ENV CACTUS_LOG_MEMORY 1

# remember we're in a docker to help with error handling
ENV CACTUS_INSIDE_CONTAINER 1

# UCSC convention is to work in /data
RUN mkdir -p /data
WORKDIR /data
21 changes: 2 additions & 19 deletions Makefile
@@ -92,24 +92,6 @@ unitTests = \

#paffyTests \ # This is removed for now

# these are slow, but added to CI here since hal no longer has its own
halTests = \
hal4dExtractTest \
halAlignmentTreesTest \
halBottomSegmentTest \
halColumnIteratorTest \
halGappedSegmentIteratorTest \
halGenomeTest \
halHdf5Tests \
halLiftoverTests \
halMafTests \
halMappedSegmentTest \
halMetaDataTest \
halRearrangementTest \
halSequenceTest \
halTopSegmentTest \
halValidateTest

# if running travis or gitlab, we want output to go to stdout/stderr so it can
# be seen in the log file, as opposed to individual files, which are much
# easier to read when running tests in parallel.
@@ -133,7 +115,8 @@ testLogDir = ${testOutDir}/logs
test: ${testModules:%=%_runtest} ${unitTests:%=%_run_unit_test}
test_blast: ${testModules:%=%_runtest_blast}
test_nonblast: ${testModules:%=%_runtest_nonblast}
hal_test: ${halTests:%=%_run_unit_test}
hal_test:
cd ${CWD}/submodules/hal && make test

# run one test and save output
%_runtest: ${versionPy}
4 changes: 2 additions & 2 deletions README.md
@@ -54,14 +54,14 @@ virtualenv -p python3 cactus_env
echo "export PATH=$(pwd)/bin:\$PATH" >> cactus_env/bin/activate
echo "export PYTHONPATH=$(pwd)/lib:\$PYTHONPATH" >> cactus_env/bin/activate
source cactus_env/bin/activate
python3 -m pip install -U setuptools pip
python3 -m pip install -U setuptools pip wheel
python3 -m pip install -U .
python3 -m pip install -U -r ./toil-requirement.txt
```

If you have Docker installed, you can now run Cactus. All binaries, such as `lastz` and `cactus-consolidated` will be run via Docker. Singularity binaries can be used in place of docker binaries with the `--binariesMode singularity` flag. Note, you must use Singularity 2.3 - 2.6 or Singularity 3.1.0+. Singularity 3 versions below 3.1.0 are incompatible with cactus (see [issue #55](https://github.com/ComparativeGenomicsToolkit/cactus/issues/55) and [issue #60](https://github.com/ComparativeGenomicsToolkit/cactus/issues/60)).

By default, cactus will use the image, `quay.io/comparative-genomics-toolkit/cactus:<CACTUS_COMMIT>` when running binaries. This is usually okay, but can be overridden with the `CACTUS_DOCKER_ORG` and `CACTUS_DOCKER_TAG` environment variables. For example, to use GPU release 2.4.4, run `export CACTUS_DOCKER_TAG=v2.4.4-gpu` before running cactus.
By default, cactus will use the image corresponding to the latest release when running docker binaries. This is usually okay, but can be overridden with the `CACTUS_DOCKER_ORG` and `CACTUS_DOCKER_TAG` environment variables. For example, to use GPU release 2.4.4, run `export CACTUS_DOCKER_TAG=v2.4.4-gpu` before running cactus.
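As a shell sketch of the override described above (the org value shown is the usual default and the tag pins the GPU 2.4.4 release; adjust both to your needs):

```shell
# Override which image cactus uses for docker binaries.
# CACTUS_DOCKER_ORG shown here is the usual default org;
# the tag pins the GPU 2.4.4 release mentioned above.
export CACTUS_DOCKER_ORG=quay.io/comparative-genomics-toolkit
export CACTUS_DOCKER_TAG=v2.4.4-gpu
```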

### Compiling Binaries Locally
In order to compile the binaries locally and not use a Docker image, you need some dependencies installed. On Ubuntu (we've tested on 20.04 and 22.04), you can look at the [Cactus Dockerfile](./Dockerfile) for guidance. To obtain the `apt-get` command:
48 changes: 48 additions & 0 deletions ReleaseNotes.md
@@ -1,3 +1,51 @@
# Release 2.6.13 2023-11-15

This release fixes an issue where Toil could ask for far too much memory for minigraph construction
- Cut the default minigraph construction memory estimate in half
- Add `--mgMemory` option to override the minigraph construction memory estimate no matter what
- Exit with a clear error message (instead of a more cryptic crash) when a user tries to run container binaries in a container
- Fix double Toil delete that seems to cause a fatal error in some environments
- Fix `gfaffix` regular expression bug that could cause paths other than the reference to be protected from collapse

# Release 2.6.12 2023-11-07

This release contains fixes for some recent regressions:

- Include more portable (at least on Ubuntu) `gfaffix` binary.
- Fix error where gpu support on singularity is completely broken.
- Fix `export_hal` and `export_vg` job memory estimates when `--consMemory` not provided.

# Release 2.6.11 2023-10-31

This release fixes a bug introduced in v2.6.10 that prevents diploid samples from working with `cactus-pangenome`

- Remove stray `assert False` from diploid mash distance that was accidentally included in previous release

# Release 2.6.10 2023-10-30

This release contains bug fixes for MAF export and the pangenome pipeline

- Patch `taffy` to fix bug where sometimes length fields in output MAF can be wrong when using `cactus-hal2maf --filterGapCausingDupes`
- Fix regression in `cactus-graphmap-split` / `cactus-pangenome` so that small, poorly aligned reference contigs (like some tiny unplaced GRCh38 contigs) do not get unintentionally filtered out. These contigs do not help the graph in any way, but the tool should do what it says and make a component for every single reference contig no matter what, which it now does.
- Cactus will now terminate with a clear error message if any other `--batchSystem` than `single_machine` is attempted from *inside* a docker container.
- Mash distance order into `minigraph` construction fixed so that haplotypes from the same sample are always added contiguously in the presence of ties.
- CI fixed to run all `hal` tests, and not just a subset.
- `pip install wheel` added to installation instructions, as apparently that's needed to install Cactus with some (newer?) Pythons.

# Release 2.6.9 2023-10-20

This release contains some bug fixes and changes to docker image uploads

- GFAffix updated to latest release
- CI no longer pushes a docker image to quay.io for every single commit.
- CPU docker release now made locally as done for GPU
- `--binariesMode docker` will automatically point to release image (using GPU one as appropriate)
- `--consMemory` overrides `hal2vg` memory as well
- `--defaultMemory` defaults to `4Gi` when using docker binaries
- SegAlign job memory specification increased to something more realistic
- `--lastzMemory` option added to override SegAlign memory -- highly recommended on SLURM
- chromosome (.vg / .og) outputs from pangenome pipeline will have ref paths of form `GRCh38#0#chr1` instead of `GRCh38#chr1` to be more consistent with full-genome indexes (and PanSN in general)

# Release 2.6.8 2023-09-28

This release includes several bug fixes for the pangenome pipeline
2 changes: 1 addition & 1 deletion api/tests/cactusParamsTest.c
@@ -27,7 +27,7 @@ static void testCactusParams(CuTest *testCase) {
CuAssertTrue(testCase, length >= 3);
CuAssertIntEquals(testCase, l[0], 2);
CuAssertIntEquals(testCase, l[1], 32);
CuAssertIntEquals(testCase, l[2], 512);
CuAssertIntEquals(testCase, l[2], 256);

// Test moving the root of the search
cactusParams_set_root(p, 1, "caf");
4 changes: 2 additions & 2 deletions build-tools/downloadMafTools
@@ -51,7 +51,7 @@ export HTSLIB_LIBS="$(pwd)/libhts.a -lbz2 -ldeflate -lm -lpthread -lz -llzma -pt
cd ${mafBuildDir}
git clone https://github.com/ComparativeGenomicsToolkit/taffy.git
cd taffy
git checkout c75ce895b7975e7ac17cb1ce964db3016615de47
git checkout ee50639be3d86451590de8ea4d3a7a037eeaf427
git submodule update --init --recursive
export HALDIR=${CWD}/submodules/hal
make -j ${numcpu}
@@ -66,7 +66,7 @@ fi
cd ${mafBuildDir}
git clone https://github.com/ComparativeGenomicsToolkit/mafTools.git
cd mafTools
git checkout 837b8f27c7bf781c7cbee3972b94e91aa6a77790
git checkout b88cd313cb18764d87bc801fbbbb00f982c1a48f
find . -name "*.mk" | xargs sed -ie "s/-Werror//g"
find . -name "Makefile*" | xargs sed -ie "s/-Werror//g"
# hack in flags support
14 changes: 7 additions & 7 deletions build-tools/downloadPangenomeTools
@@ -156,7 +156,7 @@ fi
cd ${pangenomeBuildDir}
git clone https://github.com/ComparativeGenomicsToolkit/cactus-gfa-tools.git
cd cactus-gfa-tools
git checkout 9b26caa961d6e72ad3747e5c2ce81cdf1e9b63c3
git checkout 0c17bc4ae9a7cf174fa40805cde7f8f1f6de8225
make -j ${numcpu}
if [[ $STATIC_CHECK -ne 1 || $(ldd paf2lastz | grep so | wc -l) -eq 0 ]]
then
@@ -279,7 +279,7 @@ fi
# vg
cd ${pangenomeBuildDir}
#wget -q https://github.com/vgteam/vg/releases/download/v1.51.0/vg
wget -q http://public.gi.ucsc.edu/~hickey/vg-patch/vg.9df2a056197cafd817cf48c76cf662dd775d265d -O vg
wget -q http://public.gi.ucsc.edu/~hickey/vg-patch/vg.98e3b7c867eb64178298535b076189ef7fda5031 -O vg
chmod +x vg
if [[ $STATIC_CHECK -ne 1 || $(ldd vg | grep so | wc -l) -eq 0 ]]
then
@@ -290,12 +290,12 @@ fi

# gfaffix
cd ${pangenomeBuildDir}
wget -q https://github.com/marschall-lab/GFAffix/releases/download/0.1.5/GFAffix-0.1.5_linux_x86_64.tar.gz
tar xzf GFAffix-0.1.5_linux_x86_64.tar.gz
chmod +x GFAffix-0.1.5_linux_x86_64/gfaffix
if [[ $STATIC_CHECK -ne 1 || $(ldd GFAffix-0.1.5_linux_x86_64/gfaffix | grep so | wc -l) -eq 0 ]]
wget -q https://github.com/marschall-lab/GFAffix/releases/download/0.1.5b/GFAffix-0.1.5b_linux_x86_64.tar.gz
tar xzf GFAffix-0.1.5b_linux_x86_64.tar.gz
chmod +x GFAffix-0.1.5b_linux_x86_64/gfaffix
if [[ $STATIC_CHECK -ne 1 || $(ldd GFAffix-0.1.5b_linux_x86_64/gfaffix | grep so | wc -l) -eq 0 ]]
then
mv GFAffix-0.1.5_linux_x86_64/gfaffix ${binDir}
mv GFAffix-0.1.5b_linux_x86_64/gfaffix ${binDir}
else
exit 1
fi
3 changes: 2 additions & 1 deletion build-tools/makeCpuDockerRelease
@@ -24,10 +24,11 @@ git checkout "${REL_TAG}"
git submodule update --init --recursive

docker build . -f Dockerfile -t ${dockname}:${REL_TAG}
docker tag ${dockname}:${REL_TAG} ${dockname}:latest

read -p "Are you sure you want to push ${dockname}:${REL_TAG} to quay?" yn
case $yn in
[Yy]* ) docker push ${dockname}:${REL_TAG}; break;;
[Yy]* ) docker push ${dockname}:${REL_TAG} && docker push ${dockname}:latest ; break;;
[Nn]* ) exit;;
* ) echo "Please answer yes or no.";;
esac
1 change: 1 addition & 0 deletions doc/pangenome.md
@@ -128,6 +128,7 @@ The Minigraph-Cactus pipeline is run via the `cactus-pangenome` command. It cons
**Before running large jobs, it is important to consider the following options:**

* `--mgCores` the number of cores for `minigraph` construction (default: all available)
* `--mgMemory` the amount of memory for `minigraph` construction. The default estimate can be quite conservative (i.e. high), so if it is too high for your system, you can lower it with this option (default: estimate based on input size).
* `--mapCores` the number of cores for each `minigraph` mapping job (default: up to 6)
* `--consCores` the number of cores for each `cactus-consolidated` job (default: all available)
* `--consMemory` the amount of memory for each `cactus-consolidated` job. By default, it is estimated from the data, but a wrong estimate can be catastrophic on [SLURM](./progressive.md#running-on-a-cluster). Consider setting this to the maximum memory you have available when running on a cluster to be extra safe (this seems to be more of an issue for non-human data)
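Putting these options together, a hedged sketch of an invocation (the file names, core counts, and memory values below are illustrative, not recommendations):

```shell
# sketch of a cactus-pangenome run with explicit resource caps
# (paths, sample name, and sizes are hypothetical)
cactus-pangenome ./js ./examples/evolverPrimates.txt \
    --outDir primates-pg --outName primates-pg --reference simHuman \
    --mgCores 8 --mgMemory 64Gi \
    --mapCores 4 \
    --consCores 16 --consMemory 128Gi
```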
21 changes: 18 additions & 3 deletions doc/progressive.md
@@ -170,12 +170,12 @@ The Cactus Docker image contains everything you need to run Cactus (python envir

```
wget -q https://raw.githubusercontent.com/ComparativeGenomicsToolkit/cactus/master/examples/evolverMammals.txt -O evolverMammals.txt
docker run -v $(pwd):/data --rm -it quay.io/comparative-genomics-toolkit/cactus:v2.6.8 cactus /data/jobStore /data/evolverMammals.txt /data/evolverMammals.hal
docker run -v $(pwd):/data --rm -it quay.io/comparative-genomics-toolkit/cactus:v2.6.13 cactus /data/jobStore /data/evolverMammals.txt /data/evolverMammals.hal
```

Or you can proceed interactively by running
```
docker run -v $(pwd):/data --rm -it quay.io/comparative-genomics-toolkit/cactus:v2.6.8 bash
docker run -v $(pwd):/data --rm -it quay.io/comparative-genomics-toolkit/cactus:v2.6.13 bash
cactus /data/jobStore /data/evolverMammals.txt /data/evolverMammals.hal
```
@@ -204,14 +204,24 @@ export TOIL_SLURM_ARGS="--nice=5000"

to avoid making too many enemies.

You can (and probably should) use the `--batchLogsDir` option to enable more SLURM logging. You must pass it a directory that already exists. For example:

```
mkdir -p batch-logs
cactus ./js ./examples/evolverMammals.txt evolverMammals.hal --batchSystem slurm --batchLogsDir batch-logs
```

You'll want to clean out this directory after a successful run.


You cannot run `cactus --batchSystem slurm` from *inside* the Cactus docker container, because the container doesn't include SLURM. Therefore, in order to use SLURM, you must be able to `pip install` Cactus inside a virtualenv on the head node. You can still use `--binariesMode docker` or `--binariesMode singularity` to run cactus *binaries* from a container, but the Cactus Python module needs to be installed locally.
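A minimal sketch of that head-node setup (paths are illustrative; it assumes you are in an extracted Cactus source or binary directory):

```shell
# install the Cactus python module in a virtualenv on the SLURM head node
virtualenv -p python3 venv-cactus
source venv-cactus/bin/activate
python3 -m pip install -U setuptools pip wheel
python3 -m pip install -U .
python3 -m pip install -U -r ./toil-requirement.txt

# schedule jobs with SLURM, but run the binaries from a container
cactus ./js ./examples/evolverMammals.txt evolverMammals.hal \
    --batchSystem slurm --binariesMode singularity
```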

**IMPORTANT**

To run Progressive Cactus with CPU (default) lastz, you should increase the chunk size. This will divide the input assemblies into fewer pieces, resulting in fewer jobs on the cluster.

```
cp cactus-bin-v2.6.8/src/cactus/cactus_progressive_config.xml ./config-slurm.xml
cp cactus-bin-v2.6.13/src/cactus/cactus_progressive_config.xml ./config-slurm.xml
sed -i config-slurm.xml -e 's/blast chunkSize="30000000"/blast chunkSize="90000000"/g'
sed -i config-slurm.xml -e 's/dechunkBatchSize="1000"/dechunkBatchSize="200"/g'
```
@@ -340,6 +350,11 @@ We've tested SegAlign on Nvidia V100 and A10G GPUs. See the Terra example above

Please [cite SegAlign](https://doi.ieeecomputersociety.org/10.1109/SC41405.2020.00043).

### Using GPU Acceleration on a Cluster

Since `SegAlign` is only released in the GPU-enabled docker image, running it from that image is usually easiest. On a cluster, this typically means using `--binariesMode docker --gpu <N>`: cactus is installed locally in your virtual environment and can run SLURM commands like `sbatch` (which aren't available in the Cactus container), while SegAlign itself runs inside Docker.

**Important**: Consider using `--lastzMemory` when using GPU acceleration on a cluster. Like `--consMemory`, it lets you override the amount of memory Toil requests which can help with errors if Cactus's automatic estimate is either too low (cluster evicts the job) or too high (cluster cannot schedule the job).
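Combining the flags above, a hedged sketch of a GPU-accelerated cluster run (the GPU count and memory value are illustrative):

```shell
# GPU lastz via the docker image, SLURM for scheduling,
# with an explicit SegAlign memory cap (values are examples only)
cactus ./js ./examples/evolverMammals.txt evolverMammals.hal \
    --batchSystem slurm \
    --binariesMode docker \
    --gpu 2 \
    --lastzMemory 32Gi
```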

## Pre-Alignment Checklist

6 changes: 4 additions & 2 deletions examples/evolverPrimates.txt
@@ -1,4 +1,6 @@
simHuman https://raw.githubusercontent.com/ComparativeGenomicsToolkit/cactusTestData/master/evolver/primates/loci1/simHuman.chr6
(simOrang:0.00993,((simChimp:0.00272,simHuman:0.00269)cb:0.00415,simGorilla:0.00644)hcb:0.00046);

simOrang https://raw.githubusercontent.com/ComparativeGenomicsToolkit/cactusTestData/master/evolver/primates/loci1/simOrang.chr6
simChimp https://raw.githubusercontent.com/ComparativeGenomicsToolkit/cactusTestData/master/evolver/primates/loci1/simChimp.chr6
simHuman https://raw.githubusercontent.com/ComparativeGenomicsToolkit/cactusTestData/master/evolver/primates/loci1/simHuman.chr6
simGorilla https://raw.githubusercontent.com/ComparativeGenomicsToolkit/cactusTestData/master/evolver/primates/loci1/simGorilla.chr6
simOrang https://raw.githubusercontent.com/ComparativeGenomicsToolkit/cactusTestData/master/evolver/primates/loci1/simOrang.chr6
2 changes: 1 addition & 1 deletion setup.py
@@ -24,7 +24,7 @@ def run(self):

setup(
name = "Cactus",
version = "2.6.8",
version = "2.6.13",
author = "Benedict Paten",
package_dir = {'': 'src'},
packages = find_packages(where='src'),
10 changes: 7 additions & 3 deletions src/cactus/blast/cactus_blast.py
@@ -19,6 +19,7 @@
from cactus.shared.common import enableDumpStack
from cactus.shared.common import cactus_override_toil_options
from cactus.shared.version import cactus_commit
from cactus.progressive.cactus_prepare import human2bytesN

from cactus.paf.local_alignment import sanitize_then_make_paf_alignments

@@ -60,8 +61,11 @@ def main():
parser.add_argument("--binariesMode", choices=["docker", "local", "singularity"],
help="The way to run the Cactus binaries", default=None)
parser.add_argument("--gpu", nargs='?', const='all', default=None, help="toggle on GPU-enabled lastz, and specify number of GPUs (all available if no value provided)")
parser.add_argument("--lastzCores", type=int, default=None, help="Number of cores for each lastz job, only relevant when running with --gpu")

parser.add_argument("--lastzCores", type=int, default=None, help="Number of cores for each lastz/segalign job, only relevant when running with --gpu")
parser.add_argument("--lastzMemory", type=human2bytesN,
help="Memory in bytes for each lastz/segalign job (defaults to an estimate based on the input data size). "
"Standard suffixes like K, Ki, M, Mi, G or Gi are supported (default=bytes))", default=None)

options = parser.parse_args()

setupBinaries(options)
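The `--lastzMemory` value above is parsed by `human2bytesN` into bytes, accepting suffixes like `K`, `Ki`, `M`, `Mi`, `G`, `Gi`. As a rough illustration of that suffix scheme (this sketch is hypothetical and is not the actual Cactus implementation):

```shell
# convert a human-readable size like 4Gi or 500M to bytes
# (illustrative only; cactus uses its own human2bytesN parser)
to_bytes() {
  local v=$1
  case $v in
    *Ki) echo $(( ${v%Ki} * 1024 )) ;;                  # binary kilo
    *Mi) echo $(( ${v%Mi} * 1024 * 1024 )) ;;           # binary mega
    *Gi) echo $(( ${v%Gi} * 1024 * 1024 * 1024 )) ;;    # binary giga
    *K)  echo $(( ${v%K} * 1000 )) ;;                   # decimal kilo
    *M)  echo $(( ${v%M} * 1000000 )) ;;                # decimal mega
    *G)  echo $(( ${v%G} * 1000000000 )) ;;             # decimal giga
    *)   echo "$v" ;;                                   # plain bytes
  esac
}

to_bytes 4Gi   # prints 4294967296
```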
@@ -95,7 +99,7 @@ def runCactusBlastOnly(options):
# load up the seqfile and figure out the outgroups and schedule
config_node = ET.parse(options.configFile).getroot()
config_wrapper = ConfigWrapper(config_node)
config_wrapper.substituteAllPredefinedConstantsWithLiterals()
config_wrapper.substituteAllPredefinedConstantsWithLiterals(options)
# apply gpu override
config_wrapper.initGPU(options)
mc_tree, input_seq_map, og_candidates = parse_seqfile(options.seqFile, config_wrapper)
10 changes: 7 additions & 3 deletions src/cactus/cactus_progressive_config.xml
@@ -57,6 +57,9 @@
<!-- trimOutgroups Remove outgroup sequences that don't have an alignment to an ingroup sequence-->
<!-- outputSecondaryAlignments Include secondary alignments in the output. If included CAF will use these -->
<!-- dechunkBatchSize Parallelize paf_dechunks into batches of at most this size-->
<!-- pickIngroupPrimaryAlignmentsSeparatelyToOutgroups Separately make ingroups pick their primary alignment to
other ingroups without outgroups, then get the outgroups to pick their primary alignment to the ingroups. If 0
get every sequence to pick its primary alignment without regard to if the other sequence is an ingroup or outgroup -->
<blast chunkSize="30000000"
overlapSize="10000"
mapper="lastz"
@@ -77,10 +80,11 @@
trimIngroups="1"
trimOutgroups="1"
trimMinSize="100"
trimFlanking="10"
trimFlanking="100"
trimOutgroupFlanking="2000"
outputSecondaryAlignments="0"
dechunkBatchSize="1000"
pickIngroupPrimaryAlignmentsSeparatelyToOutgroups="1"
>

<!-- The following are parametrised to produce the same results as the default settings,
@@ -131,7 +135,7 @@
<!-- minimumBlockHomologySupport TODO-->
<!-- writeInputAlignmentsTo Debug option to write the alignment chains fed to CAF to the specified path. Off by default.-->
<caf
deannealingRounds="2 32 512"
deannealingRounds="2 32 256"
trim="3"
blockTrim="5"
minimumBlockDegree="2"
@@ -165,7 +169,7 @@
three="1024"
four="512"
five="512"
default="512"
default="256"
/>
</caf>
