Dockstore New Developer Tutorial
Please note, this wiki tutorial was moved to the main Dockstore.org site and has been updated. See https://dockstore.org/docs/getting-started
This guide will walk you through a real example of adding a tool to the Dockstore and using it. I will also point out tips and tricks along the way to help you avoid common issues that make tools more difficult to use and less portable. A future tutorial will show you how to register a workflow (in CWL or WDL).
You need accounts with a few online services along with some software installed on your development host.
The first step is to establish accounts with key services, if you haven't already:
You may alternatively (or additionally) sign up for accounts at the following:
These aren't required since GitHub and Quay provide these same services. This tutorial will focus on GitHub and Quay.
This tutorial assumes you are using a Linux host running Ubuntu 14.04 and have the following installed:
- Python 2.7
- Java 1.8
- Docker
- cwltool: install it with pip (the Dockstore will direct you to do this when you register): pip install setuptools==24.0.3 && pip install cwl-runner cwltool==1.0.20160316150250 schema-salad==1.7.20160316150109 avro==1.7.7
- Dockstore CLI: you will be prompted to download this when you register
The process you use to install these really depends on the system you are using. See the bottom of this tutorial (Install Tips) for the specific commands I used on a fresh Ubuntu 14.04 VM.
To check that things are configured correctly, open a terminal and execute the following. You should see very similar output. If there are errors, make sure you followed the setup instructions carefully:
$> java -version
java version "1.8.0_91"
Java(TM) SE Runtime Environment (build 1.8.0_91-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.91-b14, mixed mode)
$> docker --version
Docker version 1.11.1, build 5604cbe
$> cwltool --version
/usr/local/bin/cwltool 1.0.20160316150250
$> dockstore --version
Dockstore version 0.4-beta.1
You are running the latest stable version...
...
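The same check can be scripted. The following is a minimal sketch and not part of the Dockstore tooling: check_tools is a hypothetical helper name, and it only tests whether each command is on the PATH, not whether its version matches the output above.

```shell
# Report which of the given commands are missing from PATH.
# Prints one line per command and returns 1 if any is missing, 0 otherwise.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Example call, using the tool names from the prerequisites list:
#   check_tools java docker cwltool dockstore
```

A non-zero return value makes it easy to stop a setup script early when a prerequisite is absent.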
In addition to the tools mentioned above, you will probably want an editor capable of syntax highlighting Dockerfiles, such as Atom.
Dockstore is a registry which means we don't actually store your source files, build your Docker images, or host them for the community. We are purely a place where you can register, and describe how to call, your Docker-based tools. For this reason we depend on GitHub (or Bitbucket) and Quay (or DockerHub) to provide these services.
The first step is to establish the Git source repository you will work out of. This will contain your Dockerfile, which describes how to build a Docker image that has all your tools, reference files, configs, etc. installed in it. Next, the source repository will contain a Dockstore.cwl (or Dockstore.wdl) that describes the tool installed inside the Docker image and how to execute it.
For this tutorial I'm going to use my own personal git repository located at:
https://github.com/briandoconnor/dockstore-tool-bamstats
This will be used to create a Docker-based version of the bamstats command, a simple tool that provides statistics on BAM files.
The process of creating a git repository on GitHub is beyond the scope of this tutorial, but detailed directions can be found on GitHub's help page.
You can follow along on this tutorial by "forking" my repository above into your own GitHub account. Or you can create your own git repository for another tool that you want to share on Dockstore.
With a repository established in GitHub, the next step is to create the Docker image with BAMStats correctly installed. You need to create a Dockerfile, which contains the instructions necessary for creating a Docker image that contains all the dependencies of BAMStats along with the executable itself.
Here's my sample Dockerfile:
#############################################################
# Dockerfile to build a sample tool container for BAMStats
#############################################################
# Set the base image to Ubuntu
FROM ubuntu:14.04
# File Author / Maintainer
MAINTAINER Brian OConnor <[email protected]>
# Setup packages
USER root
RUN apt-get -m update && apt-get install -y wget unzip openjdk-7-jre zip
# get the tool and install it in /usr/local/bin
RUN wget -q http://downloads.sourceforge.net/project/bamstats/BAMStats-1.25.zip
RUN unzip BAMStats-1.25.zip && \
rm BAMStats-1.25.zip && \
mv BAMStats-1.25 /opt/
COPY bin/bamstats /usr/local/bin/
RUN chmod a+x /usr/local/bin/bamstats
# switch back to the ubuntu user so this tool (and the files written) are not owned by root
RUN groupadd -r -g 1000 ubuntu && useradd -r -g ubuntu -u 1000 ubuntu
USER ubuntu
# by default /bin/bash is executed
CMD ["/bin/bash"]
This Dockerfile has a lot going on in it. There are good tutorials online about the details of Dockerfile and its syntax. An excellent resource is the Docker website itself, including the Best practices for writing Dockerfiles webpage. I'll highlight some sections below:
FROM ubuntu:14.04
This uses the Ubuntu 14.04 base distribution. How do I know to use ubuntu:14.04? This comes from either a search on Ubuntu's home page for their "official" Docker images, or you can simply go to DockerHub or Quay and search for whatever base image you like. You can extend anything you find there, so if you come across an image that contains most of what you want you can use it as the base here. Just be aware of its source; I tend to stick with "official", basic images for security reasons.
MAINTAINER Brian OConnor <[email protected]>
You should include your name and contact information.
USER root
RUN apt-get -m update && apt-get install -y wget unzip openjdk-7-jre zip
RUN wget -q http://downloads.sourceforge.net/project/bamstats/BAMStats-1.25.zip
RUN unzip BAMStats-1.25.zip && \
rm BAMStats-1.25.zip && \
mv BAMStats-1.25 /opt/
This switches to the root user to perform software installs. It installs the required packages with apt-get, then downloads BAMStats, unzips it, and installs it in the correct location, which here is /opt.
This is why Docker is so powerful. On HPC systems the above process might take days or weeks of working with a sys admin to install dependencies on all compute nodes. Here I can control and install whatever I like inside my Docker image, correctly configuring the environment for my tool and avoiding the time to setup these dependencies in the places I want to run. This greatly simplifies the install process for other users that you share your tool with as well.
COPY bin/bamstats /usr/local/bin/
RUN chmod a+x /usr/local/bin/bamstats
This copies the local helper script bamstats from the git checkout directory to /usr/local/bin. This is an important example: it shows how to use COPY to copy files from the git directory structure into the Docker image. After copying to /usr/local/bin, the script is made executable by all users.
RUN groupadd -r -g 1000 ubuntu && useradd -r -g ubuntu -u 1000 ubuntu
USER ubuntu
# by default /bin/bash is executed
CMD ["/bin/bash"]
The user ubuntu is created and switched to in order to make file ownership easier, and the default command for this Docker image is set to /bin/bash, which is a typical default.
One important thing to note: this Dockerfile only scratches the surface. Take a look at Best practices for writing Dockerfiles for a really terrific in-depth look at writing Dockerfiles.
Now that you've created the Dockerfile, the next step is to build the image. The docker command line is used for this:
$> docker build -t quay.io/briandoconnor/dockstore-tool-bamstats:1.25-3 .
The . is the build context, the directory where docker looks for the Dockerfile, which is the current directory here. The -t parameter is the "tag" that this Docker image will be called locally when it's cached on your host. A few things to point out: the quay.io part of the tag typically denotes that this was built on Quay.io (which we will see in the next section). I'm manually specifying this tag so it will match the Quay.io-built version. This allows me to build and test locally and then, eventually, switch over to the Quay.io-built version. The next part of the tag, briandoconnor/dockstore-tool-bamstats, denotes the name of the tool, which is derived from the organization and repository name on GitHub. Finally, 1.25-3 denotes a version string; typically you want to sync that with releases on GitHub. In this case I'm working on release 1.25-3, so this is on a release branch. However, the most recent release via GitHub is the previous version, 1.25-2. The ramifications of this will come up in the Quay section below.
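To make the three parts of the tag concrete, here is a small sketch that splits an image reference with POSIX parameter expansion. The function name split_image_ref is hypothetical, and the sketch assumes the simple registry/organization/repository:version shape used in this tutorial.

```shell
# Split an image reference of the form registry/org/repo:version
# into its three parts using POSIX parameter expansion.
split_image_ref() {
    ref=$1
    version=${ref##*:}      # text after the last ':'
    path=${ref%:*}          # everything before the last ':'
    registry=${path%%/*}    # text before the first '/'
    name=${path#*/}         # org/repo that follows the registry
    printf 'registry=%s\nname=%s\nversion=%s\n' "$registry" "$name" "$version"
}

split_image_ref quay.io/briandoconnor/dockstore-tool-bamstats:1.25-3
# prints:
# registry=quay.io
# name=briandoconnor/dockstore-tool-bamstats
# version=1.25-3
```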
Really, you could use whatever you want for the tag but, practically, you want this to match what Quay will use, aka your next release, so that's what I'm doing here. The tool should build normally and should exit without errors. You should see something like:
Successfully built 01a7ccf55063
Check that the tool is now in your local Docker image cache:
$> docker images | grep bamstats
quay.io/briandoconnor/dockstore-tool-bamstats 1.25-3 01a7ccf55063 2 minutes ago 538.3 MB
Great! This looks fine!
OK, so you've built the image. Now what?!
The next step is to test the tool directly via Docker to ensure that your Dockerfile is valid and correctly installed the tool. If you were developing a new tool there might be multiple rounds of docker build, followed by testing with docker run, before you get your Dockerfile right. Here I'm executing the Docker image, launching it as a container (make sure you launch on a host with at least 8GB of RAM and dozens of GB of disk space!):
$> docker run -it -v `pwd`:/home/ubuntu quay.io/briandoconnor/dockstore-tool-bamstats:1.25-3 /bin/bash
You'll be dropped into a bash shell which works just like the Linux environments you normally work in. I'll come back to what -v is doing in a bit. The goal now is to exercise the tool and make sure it works as you expect. BAMStats is a very simple tool that generates some reports and statistics for a BAM file. Let's run it on some test data from the 1000 Genomes project:
# this is inside the running Docker container
$> cd /home/ubuntu
$> wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/NA12878/alignment/NA12878.chrom20.ILLUMINA.bwa.CEU.low_coverage.20121211.bam
# if the above doesn't work here's an alternative location
$> wget https://s3.amazonaws.com/oconnor-test-bucket/sample-data/NA12878.chrom20.ILLUMINA.bwa.CEU.low_coverage.20121211.bam
$> /usr/local/bin/bamstats 4 NA12878.chrom20.ILLUMINA.bwa.CEU.low_coverage.20121211.bam
What's really going on here? The bamstats command above is a simple script I wrote to make it easier to call BAMStats. This is what I used the COPY command to move into the Docker image via the Dockerfile. Here's the script's contents:
#!/bin/bash
set -euf -o pipefail
java -Xmx$1g -jar /opt/BAMStats-1.25/BAMStats-1.25.jar -i $2 -o bamstats_report.html -v html
zip -r bamstats_report.zip bamstats_report.html bamstats_report.html.data
rm -rf bamstats_report.html bamstats_report.html.data
You can see it just executes the BAMStats jar, passing in the GB of memory and the BAM file while collecting the output HTML report as a zip file followed by cleanup.
An important thing to note: the output is written to whatever the current directory is. This is the correct place to put your output, since the CWL tool described later assumes that outputs are all located in the current working directory in which it executes your command.
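A wrapper like this can also check its arguments before launching Java, so that a bad call fails with a readable message. The sketch below is an assumption about how such checks might look, not part of the script in the repository; validate_args is a hypothetical name.

```shell
# Hypothetical pre-flight checks for a wrapper like bamstats:
# $1 must be a positive integer (GB of memory), $2 an existing file.
# Returns 0 when both arguments look valid and 2 otherwise.
validate_args() {
    if [ "$#" -ne 2 ]; then
        echo "usage: bamstats <mem_gb> <input.bam>" >&2
        return 2
    fi
    case $1 in
        ''|*[!0-9]*|0)
            echo "mem_gb must be a positive integer: $1" >&2
            return 2
            ;;
    esac
    [ -f "$2" ] || { echo "input file not found: $2" >&2; return 2; }
}
```

In a wrapper script this could be called as validate_args "$@" || exit 2 before the java line.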
The -v parameter used earlier mounts the current working directory into /home/ubuntu, which was the directory we worked in when running /usr/local/bin/bamstats above. The net effect is that when you exit the Docker container you're left with a bamstats_report.zip file in the current directory. This is a key point: it shows you how files are retrieved from inside a Docker container.
You can now unzip and examine the bamstats_report.zip file on your computer to see what type of reports are created by this tool. For example, here's a snippet:
At this point you have a working Docker image. You could use the docker push command to send it to Quay or DockerHub and share it with others. However, what you lose is a standardized way to describe how to run your tool. That's what the CWL descriptor and Dockstore provide. We think it's valuable, and there's an increasing number of tools designed to work with CWL, so there are benefits to not just stopping here.
At this point we have validated that the Docker image is good and the BAMStats tool works as expected. The next step is to describe how to call the BAMStats tool using the Common Workflow Language. This is a human- and machine-readable format that describes how tools can be called inside a Docker image.
Here's the Dockstore.cwl for the BAMStats tool:
#!/usr/bin/env cwl-runner
class: CommandLineTool
id: "BAMStats"
label: "BAMStats tool"
cwlVersion: cwl:draft-3
description: |
    A Docker container for the BAMStats command. See the [BAMStats](http://bamstats.sourceforge.net/) website for more information.
    ```
    Usage:
    # fetch CWL
    $> dockstore cwl --entry quay.io/briandoconnor/dockstore-tool-bamstats:1.25-3 > Dockstore.cwl
    # make a runtime JSON template and edit it (or use the content of sample_configs.json in this git repo)
    $> dockstore convert cwl2json --cwl Dockstore.cwl > Dockstore.json
    # run it locally with the Dockstore CLI
    $> dockstore launch --entry quay.io/briandoconnor/dockstore-tool-bamstats:1.25-3 \
        --json Dockstore.json
    ```
dct:creator:
  "@id": "http://orcid.org/0000-0002-7681-6415"
  foaf:name: Brian O'Connor
  foaf:mbox: "mailto:[email protected]"
requirements:
  - class: DockerRequirement
    dockerPull: "quay.io/briandoconnor/dockstore-tool-bamstats:1.25-3"
hints:
  - class: ResourceRequirement
    coresMin: 1
    ramMin: 4092
    outdirMin: 512000
    description: "the process requires at least 4G of RAM"
inputs:
  - id: "#mem_gb"
    type: int
    default: 4
    description: "The memory, in GB, for the reporting tool"
    inputBinding:
      position: 1
  - id: "#bam_input"
    type: File
    description: "The BAM file used as input, it must be sorted."
    format: "http://edamontology.org/format_2572"
    inputBinding:
      position: 2
outputs:
  - id: "#bamstats_report"
    type: File
    format: "http://edamontology.org/format_3615"
    outputBinding:
      glob: bamstats_report.zip
    description: "A zip file that contains the HTML report and various graphics."
baseCommand: ["bash", "/usr/local/bin/bamstats"]
There's a lot going on here. Let's break it down. The CWL is actually recognized and parsed by Dockstore (when we register this later). By default it recognizes Dockstore.cwl but you can customize this if you need to. One of the most important items below is the CWL version: you should label your CWL with the version you are using so tools that can't run this version can error out appropriately.
class: CommandLineTool
id: "BAMStats"
label: "BAMStats tool"
cwlVersion: cwl:draft-3
description: "A Docker container for the BAMStats command. See the BAMStats website for more information."
These items are recommended and the description is actually parsed and displayed in the Dockstore page. Here's an example:
In the code above you can see how to have an extended description which is quite useful.
dct:creator:
  "@id": "http://orcid.org/0000-0002-7681-6415"
  foaf:name: Brian O'Connor
  foaf:mbox: "mailto:[email protected]"
This section includes the tool author referenced by Dockstore. It's open to your interpretation whether that is the person who registers the tool, the person who made the Docker image, or the developer of the original tool. I'm biased towards the person who registers the tool since that is likely to be the primary contact when asking questions about how the tool was set up.
requirements:
  - class: DockerRequirement
    dockerPull: "quay.io/briandoconnor/dockstore-tool-bamstats:1.25-3"
This section links the Docker image used to this CWL. Notice it's exactly the same as the -t tag you used when building your image.
hints:
  - class: ResourceRequirement
    coresMin: 1
    ramMin: 4092
    outdirMin: 512000
    description: "the process requires at least 4G of RAM"
This may or may not be honoured by the tool calling this CWL but at least it gives you a place to declare computational requirements.
inputs:
  - id: "#mem_gb"
    type: int
    default: 4
    description: "The memory, in GB, for the reporting tool"
    inputBinding:
      position: 1
  - id: "#bam_input"
    type: File
    description: "The BAM file used as input, it must be sorted."
    format: "http://edamontology.org/format_2572"
    inputBinding:
      position: 2
These are the items from the inputs section. Notice a few things. First, the #bam_input matches bam_input in the sample parameterization JSON. Also, you can control the position of the variable, it can have a type (int or File here), and, for tools that require a prefix (--prefix) before a parameter, you can use the prefix key:value in the inputBinding section. Finally, I'm using the format field to specify a file format via the EDAM ontology.
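To see how position shapes the final command line, the sketch below imitates the ordering step in plain shell. It is a deliberate simplification and not how cwltool is implemented: build_command is a hypothetical helper, /tmp/sample.bam is a placeholder path, and prefixes and quoting are ignored.

```shell
# Read "<position> <value>" lines on stdin, order them by position,
# and append the values to the base command given as $1.
build_command() {
    base=$1
    # Sort numerically by position, drop the position column,
    # then join the remaining values with single spaces.
    args=$(sort -n | cut -d' ' -f2- | tr '\n' ' ')
    echo "$base ${args% }"
}

# The bindings are listed out of order on purpose.
printf '2 /tmp/sample.bam\n1 4\n' | build_command "bash /usr/local/bin/bamstats"
# prints: bash /usr/local/bin/bamstats 4 /tmp/sample.bam
```

The memory value lands first and the BAM path second, matching the argument order that the bamstats wrapper script expects.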
outputs:
  - id: "#bamstats_report"
    type: File
    format: "http://edamontology.org/format_3615"
    outputBinding:
      glob: bamstats_report.zip
    description: "A zip file that contains the HTML report and various graphics."
Finally, the outputs section defines the output files. In this case it says that in the current working directory there will be a file called bamstats_report.zip. When running this tool with CWL tools the file will be copied out of the container to a location you specify in your parameter JSON file. We'll walk through an example in the next section.
Finally, the baseCommand is the actual command that will be executed; in this case it's the wrapper script I wrote for bamstats.
baseCommand: ["bash", "/usr/local/bin/bamstats"]
So at this point you've created a Docker-based tool and have described how to call that tool using CWL. Let's test running BAMStats using the Dockstore command line and descriptor rather than just directly calling it via Docker. This will test that the CWL correctly describes how to run your tool.
First thing I'll do is create a completely local dataset and JSON parameterization file:
$> wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/NA12878/alignment/NA12878.chrom20.ILLUMINA.bwa.CEU.low_coverage.20121211.bam
# alternative location if the above URL doesn't work
$> wget https://s3.amazonaws.com/oconnor-test-bucket/sample-data/NA12878.chrom20.ILLUMINA.bwa.CEU.low_coverage.20121211.bam
$> mv NA12878.chrom20.ILLUMINA.bwa.CEU.low_coverage.20121211.bam /tmp/
This downloads to my current directory and then moves the file to /tmp. I could choose another location, it really doesn't matter, but we need the full path when dealing with the parameter JSON file. I'm using a sample I checked in already, sample_configs.local.json:
{
"bam_input": {
"class": "File",
"path": "/tmp/NA12878.chrom20.ILLUMINA.bwa.CEU.low_coverage.20121211.bam"
},
"bamstats_report": {
"class": "File",
"path": "/tmp/bamstats_report.zip"
}
}
Tip: the Dockstore CLI can handle inputs at HTTPS, FTP, and S3 URLs but that's beyond the scope of this tutorial.
You can see in the above that I give the full path to the input under bam_input and the full path to the output under bamstats_report.
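If you keep your data somewhere other than /tmp, a parameter file with the same two keys can be generated rather than edited by hand. This is only a sketch: write_params is a hypothetical helper, and the absolute-path check reflects the full-path requirement mentioned above.

```shell
# Print a parameter JSON document for the BAMStats tool.
# $1: absolute path to the input BAM, $2: absolute path for the output zip.
# Returns 1 without printing JSON if either path is not absolute.
write_params() {
    case $1 in /*) ;; *) echo "input path must be absolute: $1" >&2; return 1 ;; esac
    case $2 in /*) ;; *) echo "output path must be absolute: $2" >&2; return 1 ;; esac
    cat <<EOF
{
  "bam_input": {
    "class": "File",
    "path": "$1"
  },
  "bamstats_report": {
    "class": "File",
    "path": "$2"
  }
}
EOF
}

# Example call (my_params.json is a placeholder file name):
#   write_params /tmp/NA12878.chrom20.ILLUMINA.bwa.CEU.low_coverage.20121211.bam /tmp/bamstats_report.zip > my_params.json
```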
At this point, let's run the tool with our local inputs and outputs via the JSON config file:
$> dockstore tool launch --entry Dockstore.cwl --local-entry --json sample_configs.local.json
Creating directories for run of Dockstore launcher at: ./datastore//launcher-1e43745b-3127-4c56-8204-1e56abb81df2
Provisioning your input files to your local machine
Downloading: #bam_input from /tmp/NA12878.chrom20.ILLUMINA.bwa.CEU.low_coverage.20121211.bam into directory: /home/ubuntu/gitroot/dockstore-tool-bamstats/./datastore/launcher-1e43745b-3127-4c56-8204-1e56abb81df2/inputs/91155c9c-fd3b-4edf-871d-b31019ffa0f2
Calling out to cwltool to run your tool
cwltool stdout:
{
"bamstats_report": {
"size": 32012,
"path": "/home/ubuntu/gitroot/dockstore-tool-bamstats/./datastore/launcher-1e43745b-3127-4c56-8204-1e56abb81df2/outputs/bamstats_report.zip",
"checksum": "sha1$b3882afae65e54081727a2fef0d3b7bdb9aa22e6",
"class": "File"
}
}
cwltool stderr:
/usr/local/bin/cwltool 1.0.20160316150250
[job 140138530869072] /home/ubuntu/gitroot/dockstore-tool-bamstats/./datastore/launcher-1e43745b-3127-4c56-8204-1e56abb81df2/outputs/$ docker run -i --volume=/tmp/NA12878.chrom20.ILLUMINA.bwa.CEU.low_coverage.20121211.bam:/var/lib/cwl/job563598407_tmp/NA12878.chrom20.ILLUMINA.bwa.CEU.low_coverage.20121211.bam:ro --volume=/home/ubuntu/gitroot/dockstore-tool-bamstats/datastore/launcher-1e43745b-3127-4c56-8204-1e56abb81df2/outputs:/var/spool/cwl:rw --volume=/tmp/tmpZ8IdIg:/tmp:rw --workdir=/var/spool/cwl --read-only=true --user=1000 --rm --env=TMPDIR=/tmp quay.io/briandoconnor/dockstore-tool-bamstats:1.25-3 bash /usr/local/bin/bamstats 4 /var/lib/cwl/job563598407_tmp/NA12878.chrom20.ILLUMINA.bwa.CEU.low_coverage.20121211.bam
Total time: 12 seconds
adding: bamstats_report.html (deflated 50%)
adding: bamstats_report.html.data/ (stored 0%)
adding: bamstats_report.html.data/20_Coverage_cumulativeHistogram.png (deflated 14%)
adding: bamstats_report.html.data/20_Coverage_boxAndWhisker.png (deflated 12%)
adding: bamstats_report.html.data/Coverage_boxAndWhisker.png (deflated 1%)
adding: bamstats_report.html.data/20_Coverage_histogram.png (deflated 13%)
adding: bamstats_report.html.data/20_Coverage.html (deflated 60%)
Final process status is success
Saving copy of cwltool stdout to: /home/ubuntu/gitroot/dockstore-tool-bamstats/./datastore/launcher-1e43745b-3127-4c56-8204-1e56abb81df2/outputs/.cwltool.stdout.txt
Saving copy of cwltool stderr to: /home/ubuntu/gitroot/dockstore-tool-bamstats/./datastore/launcher-1e43745b-3127-4c56-8204-1e56abb81df2/outputs/.cwltool.stderr.txt
Provisioning your output files to their final destinations
Uploading: #bamstats_report from /home/ubuntu/gitroot/dockstore-tool-bamstats/./datastore/launcher-1e43745b-3127-4c56-8204-1e56abb81df2/outputs/bamstats_report.zip to : /tmp/bamstats_report.zip
[##################################################] 100%
So that's a lot of information but you can see the process was a success. We get output from the command we ran and also see the file being moved to the correct output location:
$> ls -lth /tmp/bamstats_report.zip
-rw-rw-r-- 1 ubuntu ubuntu 32K Jun 16 02:14 /tmp/bamstats_report.zip
The output looks fine, just what we'd expect.
So what's going on here? What's the Dockstore CLI doing? It can best be summed up with this image:
The command line first provisions files. In our case, the files were local so no provisioning was needed, but as the Tip above mentioned, these can be various URLs. After provisioning, the Docker image is pulled and run via the cwltool command line. This uses the Dockstore.cwl and parameterization JSON file (sample_configs.local.json) to construct the underlying docker run command. Finally, the Dockstore CLI provisions files back. In this case it's just a file copy to /tmp/bamstats_report.zip, but it could copy the result to a destination in S3, for example.
Tip: you can use --debug to get much more information during this run, including the actual call to cwltool (which can be super helpful in debugging):
cwltool --non-strict --enable-net --outdir /home/ubuntu/gitroot/dockstore-tool-bamstats/./datastore/launcher-08852137-71c1-4b75-b2fc-16ab7ca3243b/outputs/ /home/ubuntu/gitroot/dockstore-tool-bamstats/Dockstore.cwl /home/ubuntu/gitroot/dockstore-tool-bamstats/./datastore/launcher-08852137-71c1-4b75-b2fc-16ab7ca3243b/workflow_params.json
Tip: the dockstore CLI automatically creates a datastore directory in the current working directory where you execute the command and uses it for inputs/outputs. It can get quite large depending on the tool/inputs/outputs being used. Plan accordingly, e.g. execute the dockstore CLI in a directory located on a partition with sufficient storage.
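One way to plan for this is to check free space before launching. The sketch below uses the POSIX df -P output format; free_kb and has_free_gb are hypothetical helper names, and the threshold you pass is your own estimate rather than anything Dockstore reports.

```shell
# Print the free space, in kilobytes, on the partition holding directory $1.
free_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Succeed only if directory $1 has at least $2 gigabytes free.
has_free_gb() {
    [ "$(free_kb "$1")" -ge $(( $2 * 1024 * 1024 )) ]
}

# Example call, requiring 50 GB free in the current directory:
#   has_free_gb . 50 || echo "not enough space for ./datastore" >&2
```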
At this point we've successfully created our tool in Docker, tested it, written a CWL that describes how to run it, and tested running this via the Dockstore command line. All of this work has been done locally, so if we encounter problems along the way it's fast to perform debug cycles, fixing problems as we go. At this point we're confident that the tool is ready to share with others and bug free. It's time to release 1.25-3.
Releasing will tag your GitHub repository with a version tag so you can always get back to this particular release. I'm going to use the tag 1.25-3, which you can see referenced in my Docker image tag and also my CWL file. GitHub makes it very easy to release:
I click on "releases" in my GitHub project page and then follow the directions to create a new release. Simple as that!
Tip: HubFlow is an excellent way to manage the lifecycle of releases on GitHub. Take a look!
Now that you've perfected the Dockerfile, built the image on your local host, tested running the Docker container and the tool packaged inside, and released this version on GitHub, it's time to push the image to a place where others can use it. For this you can use DockerHub but we prefer Quay.io since it integrates really nicely with Dockstore.
You can manually docker push the image you have already built, but the most reliable and transparent thing you can do is link your GitHub repository (and the Dockerfile contained within) to Quay. This will cause Quay to automatically build the Docker image every time there is a change.
Log onto Quay now and set up a new repository (click the "+" icon).
You must match the name to what I was using previously, so in this case it's briandoconnor/dockstore-tool-bamstats. Also, Dockstore currently only works with Public repositories.
Notice I'm selecting "Link to a GitHub Repository Push", this is because we want Quay to automatically build our Docker image
every time we update the repository on GitHub. Very slick!
It will automatically prompt you to set up a "build trigger" after GitHub authenticates you. Here I select the GitHub repo for briandoconnor/dockstore-tool-bamstats.
It will then ask if there are particular branches you want to build, I typically just let it build everything:
So every time you commit to your GitHub repo, Quay automatically builds and tags a Docker image. If this is overkill for you, consider setting up particular build trigger regular expressions at this step.
It will then ask you where your Dockerfile is located. Since the Dockerfile is in the root directory of this GitHub repo you can just click next:
At this point you can confirm your settings and "Create Trigger" followed by "Run Trigger Now" to actually perform the build of the Docker images. Build it for 1.25-3 and any or all other branches. Typically, I build each release, while develop (aka latest) is built the next time I check in on that branch.
In my example I should see 1.25-3 listed in the "tags" for this Quay Docker repository:
And I do, so this Docker image has been built successfully by Quay and is ready for sharing with the community.
So this is great, we've Docker-ized the BAMStats tool, described it with CWL, built and tested it locally, and hooked the GitHub repo up to Quay to have that service automatically build and host the Docker image. The next step is to register it on Dockstore to make finding and sharing this tool easier.
Log into the Dockstore now using your GitHub account: https://dockstore.org
Now click on "My Tools" in the upper-right corner.
Generally, you should hit the "Refresh All Tools" button to make sure Dockstore has examined your latest repositories on Quay. Do this especially if you created a new repository like we did here.
Now select the briandoconnor/dockstore-tool-bamstats repository and click "Publish". The tool is now listed on Dockstore!
You can also click on the "Versions" tab and should notice that 1.25-3 is present and Valid=Yes. If any versions are invalid, it is likely due to a problem with the path to the Dockstore.cwl, Dockerfile, or Dockstore.wdl (if used) files. In BAMStats I used the default values of Dockstore.cwl and Dockerfile in the root repo directory, so this wasn't an issue.
This is the simple part. Now that we've successfully registered the tool on Dockstore you can just send around a link, for example to the BAMStats tool I just registered:
https://www.dockstore.org/containers/quay.io/briandoconnor/dockstore-tool-bamstats
And reproduced here below:
This includes several useful items:
- sample command usage and documentation
- author information
- links to GitHub and Quay
- sharing links to email, tweet, etc
- a discussion comments section
- details about the versions of the tool available (click "Versions" tab)
- the Dockerfile (click "Dockerfile" tab)
- the CWL descriptor (click the "Descriptor" tab)
The last item gives users information on all the parameterization for this tool and the expected outputs.
Almost all tools on Dockstore follow the same model as BAMStats. They are Docker-based, described with CWL (or in some cases WDL), and can be run via the dockstore command line interface. Typically, the way one runs the tool follows the pattern shown for BAMStats:
# fetch CWL
$> dockstore tool cwl --entry quay.io/briandoconnor/dockstore-tool-bamstats:1.25-3 > Dockstore.cwl
# make a runtime JSON template and edit it (or use the content of sample_configs.json in this git repo)
$> dockstore tool convert cwl2json --cwl Dockstore.cwl > Dockstore.json
# run it locally with the Dockstore CLI
$> dockstore tool launch --entry quay.io/briandoconnor/dockstore-tool-bamstats:1.25-3 \
--json Dockstore.json
Since CWL and Docker are both standards, the latter wildly successful and the former gaining adoption in the genomics community, we expect other tools and systems will become available that directly use tools available via Dockstore. Commercial platforms, such as Seven Bridges and DNAStack, are very exciting for us since they would open up Dockstore-based tools to a large audience and to platforms capable of running large-scale analysis.
This tutorial was quite long and involved but, hopefully, the final outcome is simple and desirable to use. That being said, we only scratched the surface and there are many issues to explore still, in particular when wrapping more complex tools.
I've organized some of the most important tips we've learned in working with Docker in the PCAWG project which saw the use of many different Docker-based tools and workflows.
- sudo and the docker command: make sure you set up the docker command on your system so you don't need sudo
- don't use sudo inside your Docker-based tools/scripts
- try to use the default user in the container, e.g. USER ubuntu when using Ubuntu
- try to not run as USER root inside your container (it can make outputs unreadable)
- don't call Docker-inside-Docker (it's possible but causes Docker client/server issues)
- don't depend on changes to hostname or /etc/hosts, Docker will interfere with this
- don't design your Docker container to take directories filled with files as inputs, be explicit about input and output files
- keep your Docker images small
- cwltool (which we use to run tools) is restrictive and locks down much of / as read only, use the current working directory or $TMPDIR for file writes
- the Dockstore CLI uses ./datastore for temp files, so if you're processing large files make sure the partition hosting the current directory is large
- you need to "collect" output from your tools/workflows inside Docker and drop it into the current working directory in order for CWL to "find" it and pull it back outside of the container
- related to this, it's often easiest to write a simple wrapper script that maps the command line arguments specified by CWL to however your tool expects to be parameterized. This script can handle moving output to the current working directory and renaming if need be
- genomics workflows work with large data files, this can have a few ramifications:
  - do not "package" large data reference files in your Docker image. Instead, treat them as "inputs" so they can be staged outside and mounted into the running container
  - the $TMPDIR variable can be used as a scratch space inside your container. Make sure your host running Docker has sufficient scratch space for processing your genomics data.
- you can use a single Docker image with multiple tools, each of them registered via a different CWL
- you can use a Git repository with multiple CWL files
- related to the two above, you can use non-standard file paths if you customize your registrations in the Version tab of Dockstore
- WDL files, we talked about CWL but WDL works too
- workflows can be registered in Dockstore but were outside the scope of this tutorial. Since it doesn't involve Docker, this is considerably easier/simpler than what we focused on above
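Two of the tips above, using $TMPDIR for scratch files and collecting outputs into the current working directory, can be combined in a wrapper. The following is a hedged sketch of that pattern only: run_in_scratch is a hypothetical function, and it assumes the wrapped command writes its result under a known file name.

```shell
# Run a command inside a throwaway directory under $TMPDIR, then move only
# the named result file into the current working directory.
# $1: name of the result file the command produces; remaining args: the command.
run_in_scratch() {
    result_name=$1
    shift
    scratch=$(mktemp -d "${TMPDIR:-/tmp}/scratch.XXXXXX") || return 1
    status=0
    # The subshell keeps the caller's working directory unchanged.
    ( cd "$scratch" && "$@" ) && mv "$scratch/$result_name" "./$result_name" || status=$?
    # Intermediate files are discarded along with the scratch directory.
    rm -rf "$scratch"
    return "$status"
}

# Example call (my_tool and report.zip are placeholders):
#   run_in_scratch report.zip my_tool --out report.zip
```

Because only the final file ends up in the working directory, the CWL glob in the outputs section has exactly one thing to find.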
Here's a list of commands I used to install the various dependencies on a fresh Ubuntu 14.04 box:
# directions for a fresh Ubuntu 14.04 VM
# java setup
sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java8-installer
# docker setup, see https://docs.docker.com/engine/installation/linux/ubuntulinux/
sudo apt-get install apt-transport-https ca-certificates
sudo apt-key adv --keyserver hkp://p80.pool.sks-keyservers.net:80 --recv-keys 58118E89F3A912897C070ADBF76221572C52609D
sudo vim /etc/apt/sources.list.d/docker.list
sudo apt-get update
sudo apt-get install docker-engine
# make sure you follow directions to add ubuntu to docker group!!
docker run hello-world
# python/pip/cwltools setup
sudo apt-get install python-pip
sudo pip install setuptools==24.0.3
sudo pip install cwl-runner cwltool==1.0.20160316150250 schema-salad==1.7.20160316150109 avro==1.7.7
# WARNING! I had to install this too
sudo pip install typing
# dockstore CLI setup
wget https://github.com/ga4gh/dockstore/releases/download/0.4-beta.4/dockstore
sudo mv `pwd`/dockstore /usr/local/bin/
sudo chmod a+x /usr/local/bin/dockstore
mkdir ~/.dockstore
vim ~/.dockstore/config
# checkout the git repo which has the bamstats example
sudo apt-get install git
mkdir -p gitroot/briandoconnor/
cd gitroot/briandoconnor/
git clone https://github.com/briandoconnor/dockstore-tool-bamstats.git
cd dockstore-tool-bamstats