Merge remote-tracking branch 'origin/master' into rgfa2

glennhickey committed Nov 21, 2023
2 parents 55e2770 + ab6acf2 commit d30c5c2
Showing 36 changed files with 413 additions and 203 deletions.
1 change: 1 addition & 0 deletions .gitlab-ci.yml
@@ -29,6 +29,7 @@ test-job:
- export ASAN_OPTIONS="detect_leaks=0"
- CGL_DEBUG=ultra make -j 8
- CACTUS_BINARIES_MODE=local SON_TRACE_DATASETS=$(pwd)/cactusTestData CACTUS_TEST_CHOICE=normal make test
- pip install -U newick
- make -j 8 hal_test
# rebuild without all the debug flags
- make clean
12 changes: 6 additions & 6 deletions BIN-INSTALL.md
@@ -6,18 +6,18 @@ pre-compile binary, static linked distribution.
## Extracting
If you have not already done so, extract the distribution and cd into the cactus directory:
```
tar -xzf cactus-bin-v2.6.8.tar.gz
cd cactus-bin-v2.6.8
tar -xzf cactus-bin-v2.6.13.tar.gz
cd cactus-bin-v2.6.13
```

## Setup

To build and activate a Python virtualenv, follow these steps. This requires Python version >= 3.7 (so Ubuntu 18.04 users should use `-p python3.8` below):
```
virtualenv -p python3 venv-cactus-v2.6.8
printf "export PATH=$(pwd)/bin:\$PATH\nexport PYTHONPATH=$(pwd)/lib:\$PYTHONPATH\n" >> venv-cactus-v2.6.8/bin/activate
source venv-cactus-v2.6.8/bin/activate
python3 -m pip install -U setuptools pip
virtualenv -p python3 venv-cactus-v2.6.13
printf "export PATH=$(pwd)/bin:\$PATH\nexport PYTHONPATH=$(pwd)/lib:\$PYTHONPATH\n" >> venv-cactus-v2.6.13/bin/activate
source venv-cactus-v2.6.13/bin/activate
python3 -m pip install -U setuptools pip wheel
python3 -m pip install -U .
python3 -m pip install -U -r ./toil-requirement.txt
```
3 changes: 3 additions & 0 deletions Dockerfile
@@ -87,6 +87,9 @@ RUN chmod 777 /opt/cactus/wrapper.sh
# log the memory usage (with --realTimeLogging) for local commands
ENV CACTUS_LOG_MEMORY 1

# remember we're in a docker to help with error handling
ENV CACTUS_INSIDE_CONTAINER 1

# UCSC convention is to work in /data
RUN mkdir -p /data
WORKDIR /data
21 changes: 2 additions & 19 deletions Makefile
@@ -92,24 +92,6 @@ unitTests = \

#paffyTests \ # This is removed for now

# these are slow, but added to CI here since hal no longer has its own
halTests = \
hal4dExtractTest \
halAlignmentTreesTest \
halBottomSegmentTest \
halColumnIteratorTest \
halGappedSegmentIteratorTest \
halGenomeTest \
halHdf5Tests \
halLiftoverTests \
halMafTests \
halMappedSegmentTest \
halMetaDataTest \
halRearrangementTest \
halSequenceTest \
halTopSegmentTest \
halValidateTest

# if running travis or gitlab, we want output to go to stdout/stderr so it can
# be seen in the log file, as opposed to individual files, which are much
# easier to read when running tests in parallel.
@@ -133,7 +115,8 @@ testLogDir = ${testOutDir}/logs
test: ${testModules:%=%_runtest} ${unitTests:%=%_run_unit_test}
test_blast: ${testModules:%=%_runtest_blast}
test_nonblast: ${testModules:%=%_runtest_nonblast}
hal_test: ${halTests:%=%_run_unit_test}
hal_test:
cd ${CWD}/submodules/hal && make test

# run one test and save output
%_runtest: ${versionPy}
4 changes: 2 additions & 2 deletions README.md
@@ -54,14 +54,14 @@ virtualenv -p python3 cactus_env
echo "export PATH=$(pwd)/bin:\$PATH" >> cactus_env/bin/activate
echo "export PYTHONPATH=$(pwd)/lib:\$PYTHONPATH" >> cactus_env/bin/activate
source cactus_env/bin/activate
python3 -m pip install -U setuptools pip
python3 -m pip install -U setuptools pip wheel
python3 -m pip install -U .
python3 -m pip install -U -r ./toil-requirement.txt
```

If you have Docker installed, you can now run Cactus. All binaries, such as `lastz` and `cactus-consolidated` will be run via Docker. Singularity binaries can be used in place of docker binaries with the `--binariesMode singularity` flag. Note, you must use Singularity 2.3 - 2.6 or Singularity 3.1.0+. Singularity 3 versions below 3.1.0 are incompatible with cactus (see [issue #55](https://github.com/ComparativeGenomicsToolkit/cactus/issues/55) and [issue #60](https://github.com/ComparativeGenomicsToolkit/cactus/issues/60)).

By default, cactus will use the image, `quay.io/comparative-genomics-toolkit/cactus:<CACTUS_COMMIT>` when running binaries. This is usually okay, but can be overridden with the `CACTUS_DOCKER_ORG` and `CACTUS_DOCKER_TAG` environment variables. For example, to use GPU release 2.4.4, run `export CACTUS_DOCKER_TAG=v2.4.4-gpu` before running cactus.
By default, cactus will use the image corresponding to the latest release when running docker binaries. This is usually okay, but can be overridden with the `CACTUS_DOCKER_ORG` and `CACTUS_DOCKER_TAG` environment variables. For example, to use GPU release 2.4.4, run `export CACTUS_DOCKER_TAG=v2.4.4-gpu` before running cactus.
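As a shell sketch of the override described above (the org value shown is the usual default and the tag pins the GPU 2.4.4 release; adjust both to your needs):

```shell
# Override which image cactus uses for docker binaries.
# CACTUS_DOCKER_ORG shown here is the usual default org;
# the tag pins the GPU 2.4.4 release mentioned above.
export CACTUS_DOCKER_ORG=quay.io/comparative-genomics-toolkit
export CACTUS_DOCKER_TAG=v2.4.4-gpu
```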

### Compiling Binaries Locally
In order to compile the binaries locally and not use a Docker image, you need some dependencies installed. On Ubuntu (we've tested on 20.04 and 22.04), you can look at the [Cactus Dockerfile](./Dockerfile) for guidance. To obtain the `apt-get` command:
48 changes: 48 additions & 0 deletions ReleaseNotes.md
@@ -1,3 +1,51 @@
# Release 2.6.13 2023-11-15

This release fixes an issue where Toil could ask for far too much memory for minigraph construction
- Cut the default minigraph construction memory estimate in half
- Add `--mgMemory` option to override the minigraph construction memory estimate no matter what
- Exit with a clear error message (instead of a more cryptic crash) when a user tries to run container binaries in a container
- Fix double Toil delete that seems to cause a fatal error in some environments
- Fix `gfaffix` regular expression bug that could cause paths other than the reference to be protected from collapse

# Release 2.6.12 2023-11-07

This release contains fixes for some recent regressions:

- Include more portable (at least on Ubuntu) `gfaffix` binary.
- Fix error where gpu support on singularity is completely broken.
- Fix `export_hal` and `export_vg` job memory estimates when `--consMemory` not provided.

# Release 2.6.11 2023-10-31

This release fixes a bug introduced in v2.6.10 that prevents diploid samples from working with `cactus-pangenome`

- Remove stray `assert False` from diploid mash distance that was accidentally included in previous release

# Release 2.6.10 2023-10-30

This release contains bug fixes for MAF export and the pangenome pipeline

- Patch `taffy` to fix bug where sometimes length fields in output MAF can be wrong when using `cactus-hal2maf --filterGapCausingDupes`
- Fix regression in `cactus-graphmap-split` / `cactus-pangenome` so that small, poorly aligned reference contigs (like some tiny unplaced GRCh38 contigs) do not get unintentionally filtered out. These contigs do not help the graph in any way, but the tool should do what it says and make a component for every single reference contig no matter what, which it now does.
- Cactus will now terminate with a clear error message if any other `--batchSystem` than `single_machine` is attempted from *inside* a docker container.
- Mash distance order into `minigraph` construction fixed so that haplotypes from the same sample are always added contiguously in the presence of ties.
- CI fixed to run all `hal` tests, and not just a subset.
- `pip install wheel` added to installation instructions, as apparently that's needed to install Cactus with some (newer?) Pythons.

# Release 2.6.9 2023-10-20

This release contains some bug fixes and changes to docker image uploads

- GFAffix updated to latest release
- CI no longer pushes a docker image to quay.io for every single commit.
- CPU docker release now made locally as done for GPU
- `--binariesMode docker` will automatically point to release image (using GPU one as appropriate)
- `--consMemory` overrides `hal2vg` memory as well
- `--defaultMemory` defaults to `4Gi` when using docker binaries
- SegAlign job memory specification increased to something more realistic
- `--lastzMemory` option added to override SegAlign memory -- highly recommended on SLURM
- chromosome (.vg / .og) outputs from pangenome pipeline will have ref paths of form `GRCh38#0#chr1` instead of `GRCh38#chr1` to be more consistent with full-genome indexes (and PanSN in general)

# Release 2.6.8 2023-09-28

This release includes several bug fixes for the pangenome pipeline
2 changes: 1 addition & 1 deletion api/tests/cactusParamsTest.c
@@ -27,7 +27,7 @@ static void testCactusParams(CuTest *testCase) {
CuAssertTrue(testCase, length >= 3);
CuAssertIntEquals(testCase, l[0], 2);
CuAssertIntEquals(testCase, l[1], 32);
CuAssertIntEquals(testCase, l[2], 512);
CuAssertIntEquals(testCase, l[2], 256);

// Test moving the root of the search
cactusParams_set_root(p, 1, "caf");
4 changes: 2 additions & 2 deletions build-tools/downloadMafTools
@@ -51,7 +51,7 @@ export HTSLIB_LIBS="$(pwd)/libhts.a -lbz2 -ldeflate -lm -lpthread -lz -llzma -pt
cd ${mafBuildDir}
git clone https://github.com/ComparativeGenomicsToolkit/taffy.git
cd taffy
git checkout c75ce895b7975e7ac17cb1ce964db3016615de47
git checkout ee50639be3d86451590de8ea4d3a7a037eeaf427
git submodule update --init --recursive
export HALDIR=${CWD}/submodules/hal
make -j ${numcpu}
@@ -66,7 +66,7 @@ fi
cd ${mafBuildDir}
git clone https://github.com/ComparativeGenomicsToolkit/mafTools.git
cd mafTools
git checkout 837b8f27c7bf781c7cbee3972b94e91aa6a77790
git checkout b88cd313cb18764d87bc801fbbbb00f982c1a48f
find . -name "*.mk" | xargs sed -ie "s/-Werror//g"
find . -name "Makefile*" | xargs sed -ie "s/-Werror//g"
# hack in flags support
14 changes: 7 additions & 7 deletions build-tools/downloadPangenomeTools
@@ -156,7 +156,7 @@ fi
cd ${pangenomeBuildDir}
git clone https://github.com/ComparativeGenomicsToolkit/cactus-gfa-tools.git
cd cactus-gfa-tools
git checkout 9b26caa961d6e72ad3747e5c2ce81cdf1e9b63c3
git checkout 0c17bc4ae9a7cf174fa40805cde7f8f1f6de8225
make -j ${numcpu}
if [[ $STATIC_CHECK -ne 1 || $(ldd paf2lastz | grep so | wc -l) -eq 0 ]]
then
@@ -279,7 +279,7 @@ fi
# vg
cd ${pangenomeBuildDir}
#wget -q https://github.com/vgteam/vg/releases/download/v1.51.0/vg
wget -q http://public.gi.ucsc.edu/~hickey/vg-patch/vg.9df2a056197cafd817cf48c76cf662dd775d265d -O vg
wget -q http://public.gi.ucsc.edu/~hickey/vg-patch/vg.98e3b7c867eb64178298535b076189ef7fda5031 -O vg
chmod +x vg
if [[ $STATIC_CHECK -ne 1 || $(ldd vg | grep so | wc -l) -eq 0 ]]
then
@@ -290,12 +290,12 @@ fi

# gfaffix
cd ${pangenomeBuildDir}
wget -q https://github.com/marschall-lab/GFAffix/releases/download/0.1.5/GFAffix-0.1.5_linux_x86_64.tar.gz
tar xzf GFAffix-0.1.5_linux_x86_64.tar.gz
chmod +x GFAffix-0.1.5_linux_x86_64/gfaffix
if [[ $STATIC_CHECK -ne 1 || $(ldd GFAffix-0.1.5_linux_x86_64/gfaffix | grep so | wc -l) -eq 0 ]]
wget -q https://github.com/marschall-lab/GFAffix/releases/download/0.1.5b/GFAffix-0.1.5b_linux_x86_64.tar.gz
tar xzf GFAffix-0.1.5b_linux_x86_64.tar.gz
chmod +x GFAffix-0.1.5b_linux_x86_64/gfaffix
if [[ $STATIC_CHECK -ne 1 || $(ldd GFAffix-0.1.5b_linux_x86_64/gfaffix | grep so | wc -l) -eq 0 ]]
then
mv GFAffix-0.1.5_linux_x86_64/gfaffix ${binDir}
mv GFAffix-0.1.5b_linux_x86_64/gfaffix ${binDir}
else
exit 1
fi
3 changes: 2 additions & 1 deletion build-tools/makeCpuDockerRelease
@@ -24,10 +24,11 @@ git checkout "${REL_TAG}"
git submodule update --init --recursive

docker build . -f Dockerfile -t ${dockname}:${REL_TAG}
docker tag ${dockname}:${REL_TAG} ${dockname}:latest

read -p "Are you sure you want to push ${dockname}:${REL_TAG} to quay?" yn
case $yn in
[Yy]* ) docker push ${dockname}:${REL_TAG}; break;;
[Yy]* ) docker push ${dockname}:${REL_TAG} && docker push ${dockname}:latest ; break;;
[Nn]* ) exit;;
* ) echo "Please answer yes or no.";;
esac
1 change: 1 addition & 0 deletions doc/pangenome.md
@@ -128,6 +128,7 @@ The Minigraph-Cactus pipeline is run via the `cactus-pangenome` command. It cons
**Before running large jobs, it is important to consider the following options:**

* `--mgCores` the number of cores for `minigraph` construction (default: all available)
* `--mgMemory` the amount of memory for `minigraph` construction. The default estimate can be quite conservative (i.e. high), so if it is too high for your system, you can lower it with this option (default: estimate based on input size).
* `--mapCores` the number of cores for each `minigraph` mapping job (default: up to 6)
* `--consCores` the number of cores for each `cactus-consolidated` job (default: all available)
* `--consMemory` the amount of memory for each `cactus-consolidated` job. By default, it is estimated from the data, but a wrong estimate can be catastrophic on [SLURM](./progressive.md#running-on-a-cluster). Consider setting this to the maximum memory you have available when running on a cluster to be extra safe (this seems to be more of an issue for non-human data)
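Putting these options together, a hedged sketch of an invocation (the file names, core counts, and memory values below are illustrative, not recommendations):

```shell
# sketch of a cactus-pangenome run with explicit resource caps
# (paths, sample name, and sizes are hypothetical)
cactus-pangenome ./js ./examples/evolverPrimates.txt \
    --outDir primates-pg --outName primates-pg --reference simHuman \
    --mgCores 8 --mgMemory 64Gi \
    --mapCores 4 \
    --consCores 16 --consMemory 128Gi
```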
21 changes: 18 additions & 3 deletions doc/progressive.md
@@ -170,12 +170,12 @@ The Cactus Docker image contains everything you need to run Cactus (python envir

```
wget -q https://raw.githubusercontent.com/ComparativeGenomicsToolkit/cactus/master/examples/evolverMammals.txt -O evolverMammals.txt
docker run -v $(pwd):/data --rm -it quay.io/comparative-genomics-toolkit/cactus:v2.6.8 cactus /data/jobStore /data/evolverMammals.txt /data/evolverMammals.hal
docker run -v $(pwd):/data --rm -it quay.io/comparative-genomics-toolkit/cactus:v2.6.13 cactus /data/jobStore /data/evolverMammals.txt /data/evolverMammals.hal
```

Or you can proceed interactively by running
```
docker run -v $(pwd):/data --rm -it quay.io/comparative-genomics-toolkit/cactus:v2.6.8 bash
docker run -v $(pwd):/data --rm -it quay.io/comparative-genomics-toolkit/cactus:v2.6.13 bash
cactus /data/jobStore /data/evolverMammals.txt /data/evolverMammals.hal
```
@@ -204,14 +204,24 @@ export TOIL_SLURM_ARGS="--nice=5000"

to avoid making too many enemies.

You can (and probably should) use the `--batchLogsDir` option to enable more SLURM logging. You must pass it a directory that already exists. For example:

```
mkdir -p batch-logs
cactus ./js ./examples/evolverMammals.txt evolverMammals.hal --batchSystem slurm --batchLogsDir batch-logs
```

You'll want to clean out this directory after a successful run.


You cannot run `cactus --batchSystem slurm` from *inside* the Cactus docker container, because the container doesn't include SLURM. Therefore, in order to use SLURM, you must be able to `pip install` Cactus inside a virtualenv on the head node. You can still use `--binariesMode docker` or `--binariesMode singularity` to run cactus *binaries* from a container, but the Cactus Python module needs to be installed locally.
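A minimal sketch of that head-node setup (paths are illustrative; it assumes you are in an extracted Cactus source or binary directory):

```shell
# install the Cactus python module in a virtualenv on the SLURM head node
virtualenv -p python3 venv-cactus
source venv-cactus/bin/activate
python3 -m pip install -U setuptools pip wheel
python3 -m pip install -U .
python3 -m pip install -U -r ./toil-requirement.txt

# schedule jobs with SLURM, but run the binaries from a container
cactus ./js ./examples/evolverMammals.txt evolverMammals.hal \
    --batchSystem slurm --binariesMode singularity
```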

**IMPORTANT**

To run Progressive Cactus with CPU (default) lastz, you should increase the chunk size. This will divide the input assemblies into fewer pieces, resulting in fewer jobs on the cluster.

```
cp cactus-bin-v2.6.8/src/cactus/cactus_progressive_config.xml ./config-slurm.xml
cp cactus-bin-v2.6.13/src/cactus/cactus_progressive_config.xml ./config-slurm.xml
sed -i config-slurm.xml -e 's/blast chunkSize="30000000"/blast chunkSize="90000000"/g'
sed -i config-slurm.xml -e 's/dechunkBatchSize="1000"/dechunkBatchSize="200"/g'
```
@@ -340,6 +350,11 @@ We've tested SegAlign on Nvidia V100 and A10G GPUs. See the Terra example above

Please [cite SegAlign](https://doi.ieeecomputersociety.org/10.1109/SC41405.2020.00043).

### Using GPU Acceleration on a Cluster

Since `SegAlign` is only released in the GPU-enabled docker image, running it from that image is usually easiest. On a cluster, this typically means using `--binariesMode docker --gpu <N>`: cactus is installed locally in your virtual environment and can run SLURM commands like `sbatch` (which aren't available in the Cactus container), while SegAlign itself runs inside Docker.

**Important**: Consider using `--lastzMemory` when using GPU acceleration on a cluster. Like `--consMemory`, it lets you override the amount of memory Toil requests which can help with errors if Cactus's automatic estimate is either too low (cluster evicts the job) or too high (cluster cannot schedule the job).
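Combining the flags above, a hedged sketch of a GPU-accelerated cluster run (the GPU count and memory value are illustrative):

```shell
# GPU lastz via the docker image, SLURM for scheduling,
# with an explicit SegAlign memory cap (values are examples only)
cactus ./js ./examples/evolverMammals.txt evolverMammals.hal \
    --batchSystem slurm \
    --binariesMode docker \
    --gpu 2 \
    --lastzMemory 32Gi
```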

## Pre-Alignment Checklist

6 changes: 4 additions & 2 deletions examples/evolverPrimates.txt
@@ -1,4 +1,6 @@
simHuman https://raw.githubusercontent.com/ComparativeGenomicsToolkit/cactusTestData/master/evolver/primates/loci1/simHuman.chr6
(simOrang:0.00993,((simChimp:0.00272,simHuman:0.00269)cb:0.00415,simGorilla:0.00644)hcb:0.00046);

simOrang https://raw.githubusercontent.com/ComparativeGenomicsToolkit/cactusTestData/master/evolver/primates/loci1/simOrang.chr6
simChimp https://raw.githubusercontent.com/ComparativeGenomicsToolkit/cactusTestData/master/evolver/primates/loci1/simChimp.chr6
simHuman https://raw.githubusercontent.com/ComparativeGenomicsToolkit/cactusTestData/master/evolver/primates/loci1/simHuman.chr6
simGorilla https://raw.githubusercontent.com/ComparativeGenomicsToolkit/cactusTestData/master/evolver/primates/loci1/simGorilla.chr6
simOrang https://raw.githubusercontent.com/ComparativeGenomicsToolkit/cactusTestData/master/evolver/primates/loci1/simOrang.chr6
2 changes: 1 addition & 1 deletion setup.py
@@ -24,7 +24,7 @@ def run(self):

setup(
name = "Cactus",
version = "2.6.8",
version = "2.6.13",
author = "Benedict Paten",
package_dir = {'': 'src'},
packages = find_packages(where='src'),
10 changes: 7 additions & 3 deletions src/cactus/blast/cactus_blast.py
@@ -19,6 +19,7 @@
from cactus.shared.common import enableDumpStack
from cactus.shared.common import cactus_override_toil_options
from cactus.shared.version import cactus_commit
from cactus.progressive.cactus_prepare import human2bytesN

from cactus.paf.local_alignment import sanitize_then_make_paf_alignments

@@ -60,8 +61,11 @@ def main():
parser.add_argument("--binariesMode", choices=["docker", "local", "singularity"],
help="The way to run the Cactus binaries", default=None)
parser.add_argument("--gpu", nargs='?', const='all', default=None, help="toggle on GPU-enabled lastz, and specify number of GPUs (all available if no value provided)")
parser.add_argument("--lastzCores", type=int, default=None, help="Number of cores for each lastz job, only relevant when running with --gpu")

parser.add_argument("--lastzCores", type=int, default=None, help="Number of cores for each lastz/segalign job, only relevant when running with --gpu")
parser.add_argument("--lastzMemory", type=human2bytesN,
help="Memory in bytes for each lastz/segalign job (defaults to an estimate based on the input data size). "
"Standard suffixes like K, Ki, M, Mi, G or Gi are supported (default=bytes))", default=None)

options = parser.parse_args()

setupBinaries(options)
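The `--lastzMemory` value above is parsed by `human2bytesN` into bytes, accepting suffixes like `K`, `Ki`, `M`, `Mi`, `G`, `Gi`. As a rough illustration of that suffix scheme (this sketch is hypothetical and is not the actual Cactus implementation):

```shell
# convert a human-readable size like 4Gi or 500M to bytes
# (illustrative only; cactus uses its own human2bytesN parser)
to_bytes() {
  local v=$1
  case $v in
    *Ki) echo $(( ${v%Ki} * 1024 )) ;;                  # binary kilo
    *Mi) echo $(( ${v%Mi} * 1024 * 1024 )) ;;           # binary mega
    *Gi) echo $(( ${v%Gi} * 1024 * 1024 * 1024 )) ;;    # binary giga
    *K)  echo $(( ${v%K} * 1000 )) ;;                   # decimal kilo
    *M)  echo $(( ${v%M} * 1000000 )) ;;                # decimal mega
    *G)  echo $(( ${v%G} * 1000000000 )) ;;             # decimal giga
    *)   echo "$v" ;;                                   # plain bytes
  esac
}

to_bytes 4Gi   # prints 4294967296
```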
@@ -95,7 +99,7 @@ def runCactusBlastOnly(options):
# load up the seqfile and figure out the outgroups and schedule
config_node = ET.parse(options.configFile).getroot()
config_wrapper = ConfigWrapper(config_node)
config_wrapper.substituteAllPredefinedConstantsWithLiterals()
config_wrapper.substituteAllPredefinedConstantsWithLiterals(options)
# apply gpu override
config_wrapper.initGPU(options)
mc_tree, input_seq_map, og_candidates = parse_seqfile(options.seqFile, config_wrapper)
10 changes: 7 additions & 3 deletions src/cactus/cactus_progressive_config.xml
@@ -57,6 +57,9 @@
<!-- trimOutgroups Remove outgroup sequences that don't have an alignment to an ingroup sequence-->
<!-- outputSecondaryAlignments Include secondary alignments in the output. If included CAF will use these -->
<!-- dechunkBatchSize Parallelize paf_dechunks into batches of at most this size-->
<!-- pickIngroupPrimaryAlignmentsSeparatelyToOutgroups Separately make ingroups pick their primary alignment to
other ingroups without outgroups, then get the outgroups to pick their primary alignment to the ingroups. If 0
get every sequence to pick its primary alignment without regard to if the other sequence is an ingroup or outgroup -->
<blast chunkSize="30000000"
overlapSize="10000"
mapper="lastz"
@@ -77,10 +80,11 @@
trimIngroups="1"
trimOutgroups="1"
trimMinSize="100"
trimFlanking="10"
trimFlanking="100"
trimOutgroupFlanking="2000"
outputSecondaryAlignments="0"
dechunkBatchSize="1000"
pickIngroupPrimaryAlignmentsSeparatelyToOutgroups="1"
>

<!-- The following are parametrised to produce the same results as the default settings,
@@ -131,7 +135,7 @@
<!-- minimumBlockHomologySupport TODO-->
<!-- writeInputAlignmentsTo Debug option to write the alignment chains fed to CAF to the specified path. Off by default.-->
<caf
deannealingRounds="2 32 512"
deannealingRounds="2 32 256"
trim="3"
blockTrim="5"
minimumBlockDegree="2"
@@ -165,7 +169,7 @@
three="1024"
four="512"
five="512"
default="512"
default="256"
/>
</caf>
