CWIC (Cloud Workstation In Container)

The CWIC App (platform login required to access this link) lets you:

  • Use a file-based, portable, interactive analysis environment familiar to Unix users

    • Read and list DNAnexus files through a familiar file system interface without any extra download commands.

    • Configure your environment the way you like it and save it in an external, versioned container registry.

    • Restart your work on the cloud, local HPC, or your laptop.

  • Scale out and parallelize your interactive analysis with a single command

    • A single qsub-like command lets you scale out on DNAnexus cloud infrastructure

    • Under the hood, you launch a fleet of workers that pull the most recent docker image from the registry configured in the credential file, execute the specified command on the worker, and output the results to cloud storage.

This document will guide you through setting up and using CWIC. You should be familiar with the DNAnexus system and have the dx command line tool installed on your local computer environment.

NOTE: Permission is required to access the beta release of the CWIC App. Please contact DNAnexus for more information.

Configuring SSH

If you haven't already, configure your account to allow SSH connections by running dx ssh_config. For more information on configuring your account and connecting to jobs, click here.

Setting up a docker repository account

CWIC uses a Docker Hub or Quay.io repository to save the docker images of your environment.

Setting up a Docker Hub account

To get started, go to Docker Hub and sign in or create an account. Every Docker Hub account is given one free private repository. It is highly recommended to use a private repository as this will be your working environment.

Once you have a Docker Hub account, go to your “Account Settings”, then “Security” and create a new access token.

You can give the new access token a descriptive name.

You will need to copy the token and save it securely or use it immediately.

Note that the token is shown only once at the time of creation. If you lose this token, you will need to delete the lost token and create a new one.

The access token will be needed for the credentials file below.

Setting up a Quay.io account

To get started, go to Quay.io and sign in or create an account. You or your organization needs to set up a paid Quay.io account so you can use a private Quay.io repository to store your working environment. Public Quay.io docker image repositories are free, but we recommend using a private image repository if you don’t want to share your working environment with the world.

Once you have signed in to Quay.io, click on your user name, then go to your “Account Settings”, then click on “Generate Encrypted Password” to create a new access token.

Enter your Quay.io password:

Then copy the Quay.io access token, which will be needed for the CWIC credentials file below.

Create a DNAnexus Authentication Token

A DNAnexus authentication token allows you to run batch jobs within a CWIC session. Follow these instructions to create a DNAnexus authentication token.

Creating a credentials file

Create a file with the template below and fill in your Docker Hub username, token and organization, and the DNAnexus authentication token in the appropriate places.

{
  "docker_registry": {
    "registry": "docker.io",
    "username": "<YOUR_DOCKER_REGISTRY_USERNAME>",
    "token": "<YOUR_DOCKER_REGISTRY_TOKEN>",
    "organization": "<YOUR_DOCKER_REGISTRY_ORG>"
  },
  "dnanexus": {
    "token": "<YOUR_DNANEXUS_AUTHENTICATION_TOKEN>"
  }
}

NOTES:

  • If you would rather use a quay.io repository, specify "registry": "quay.io", and use your quay credentials in the credentials file instead.

  • Docker Hub automatically creates an organization named after your Docker Hub username when you create a Docker Hub account. You can create more Docker Hub organizations through the Docker Hub web interface, and see all the organizations in the upper left dropdown under the “Repositories” tab. Use one of these organizations for the YOUR_DOCKER_REGISTRY_ORG argument in the credentials file.
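As a minimal sketch (not an official tool), the template can also be written from shell variables so that tokens are not typed into an editor. The variable names REGISTRY, REGISTRY_USER, REGISTRY_TOKEN, REGISTRY_ORG, DX_TOKEN and CREDS_FILE are hypothetical and exist only for this illustration; the JSON keys are the ones from the template above. Tokens containing a double quote or backslash would need JSON escaping, which this sketch does not attempt.

```shell
# Hypothetical variable names; export real values before running.
REGISTRY="${REGISTRY:-docker.io}"                       # or quay.io
REGISTRY_USER="${REGISTRY_USER:-<YOUR_DOCKER_REGISTRY_USERNAME>}"
REGISTRY_TOKEN="${REGISTRY_TOKEN:-<YOUR_DOCKER_REGISTRY_TOKEN>}"
REGISTRY_ORG="${REGISTRY_ORG:-<YOUR_DOCKER_REGISTRY_ORG>}"
DX_TOKEN="${DX_TOKEN:-<YOUR_DNANEXUS_AUTHENTICATION_TOKEN>}"
CREDS_FILE="${CREDS_FILE:-cwic_creds.json}"

# Fill the template from the variables above.
cat > "$CREDS_FILE" <<EOF
{
  "docker_registry": {
    "registry": "$REGISTRY",
    "username": "$REGISTRY_USER",
    "token": "$REGISTRY_TOKEN",
    "organization": "$REGISTRY_ORG"
  },
  "dnanexus": {
    "token": "$DX_TOKEN"
  }
}
EOF

# The file holds two secrets, so restrict it to the owner.
chmod 600 "$CREDS_FILE"

# Optional check: report whether the result parses as JSON.
if command -v python3 >/dev/null 2>&1; then
    if python3 -m json.tool "$CREDS_FILE" >/dev/null; then
        echo "OK: $CREDS_FILE parses as JSON"
    else
        echo "ERROR: $CREDS_FILE is not valid JSON" >&2
    fi
fi
```

Setting REGISTRY=quay.io before running the sketch produces the Quay.io variant described in the first note.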

Once you have made your credentials file (e.g. cwic_creds.json) on your computer, make a new DNAnexus project to save your credentials using dx new project. Upload the credentials file to your project by running dx upload cwic_creds.json. It is recommended to save your credentials in a separate, private DNAnexus project to ensure that others do not have access to it. Make sure to set Copy Policy to “Allow” so you can use the credentials when running CWIC in other projects.

dx new project mycredentials
dx upload cwic_creds.json

Starting interactive CWIC

To start an interactive CWIC session, first select the project you would like to start it in:

dx select "project-alpha"

or just dx select to select from a list of your projects interactively. You will need CONTRIBUTE or ADMINISTER access to run the app in this project.

Starting an interactive terminal session

The following command runs the CWIC app as a DNAnexus job using the credentials you provided. Once the worker has booted, it logs you into the CWIC container on that worker.

dx run cwic -icredentials=mycredentials:cwic_creds.json --ssh -y

Replace mycredentials with the name of the DNAnexus project with your credentials file. If you have SSH issues while trying to connect to the job, make sure your SSH keys are configured properly.

Working in CWIC

Once the CWIC container starts running, you will be taken to the home directory (/home/cwic) in the CWIC container. The first time you run the CWIC app in a given project, your CWIC container will be initialized with an Ubuntu 18.04-based docker image provided by DNAnexus. You can install any applications and run commands in this environment.

There are two main directories to work with data:

  • /project/ - Your DNAnexus project (e.g. project-alpha) is mounted here, and its data appears as local folders and files. By default this directory is read-only: you can read the contents of the DNAnexus project files and list the files along with associated metadata such as file sizes.

  • /scratch/ - This directory is local to this instance of the CWIC app. You can use it to save any intermediate or temporary results. NOTE: You can run tools here, but all the data in this directory will be deleted once this instance of the CWIC app is terminated.

For example, if you've uploaded a BAM file called test.bam from your laptop to your DNAnexus project prior to running the CWIC app, you will be able to access it inside the CWIC environment as /project/<YOUR_DX_PROJECT_NAME>/test.bam.

Using byobu text-based window manager

Your interactive session runs inside byobu, a text-based terminal window manager (similar to tmux and screen). Byobu lets you create and navigate between multiple terminal windows within a byobu session. The session also preserves the state of your interactive work when you disconnect from and reconnect to the worker running the CWIC app.

The first time you press Ctrl-a, you will be prompted to choose between using the Ctrl-a key combination as an escape sequence to interact with the terminal (screen mode, option (1) in the prompt) or using it to go to the beginning of a line. In either mode, you can use F2 to create a new window, F3 to move to the previous window, and F4 to move to the next window. In screen mode, you can also use the Ctrl-a c, Ctrl-a n and Ctrl-a p keystrokes to create and navigate between terminal windows. Please consult the byobu and screen man pages for the other commands supported by these terminal window managers.

Customizing and saving your environment

You can install samtools by running

apt-get update && apt-get install samtools -y

After you have installed samtools, or any other tool, in the CWIC environment, you can save your environment to the Docker repository using the dx-save-cwic command. The new docker image is pushed to <registry>/<organization>/dx-cwic-<dnanexus-project-id-cwic-is-running-in>_<dnanexus-user-id-running-cwic> and is tagged with the epoch timestamp and the “latest” label.

NOTE: Don't install tools that you want to reuse into the /scratch directory. Files in the /scratch directory are not saved into the new docker image.

The next time you launch the CWIC app in this project, you will be placed into the latest saved environment for the project CWIC is running in. Therefore you will not need to reinstall samtools or any other tools you already installed in your CWIC environment.
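To make the repository naming pattern concrete, here is a small sketch that composes the image name from its four parts. The function name cwic_image_name is hypothetical, and the lowercasing step is an assumption inferred from the laptop example later in this document, where the project ID appears in lowercase; dx-save-cwic itself may build the name differently.

```shell
# Hypothetical helper: compose
#   <registry>/<organization>/dx-cwic-<project-id>_<user-id>
# Lowercasing is an assumption based on the laptop example in this document.
cwic_image_name() {
    registry="$1"; org="$2"; project_id="$3"; user_id="$4"
    printf '%s/%s/dx-cwic-%s_%s\n' "$registry" "$org" "$project_id" "$user_id" \
        | tr '[:upper:]' '[:lower:]'
}

# Example with illustrative IDs:
cwic_image_name docker.io dxdevdocker project-FzJPqZQ0FGk1PPXyBJq4zP0J user-dxdev
```

Appending :latest to the printed name refers to the most recently saved environment, since every save is tagged with both the epoch timestamp and “latest”.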

Running batch jobs

We can dispatch non-interactive batch jobs from the interactive CWIC environment to parallelize analyses similar to the bsub or qsub experience on the HPC.

In the example below, samtools is used to split the bam file by chromosome using parallel jobs (e.g. individual jobs for chr 1-22, X, Y and MT) running on separate DNAnexus workers (25 workers in this case) simultaneously. dx-cwic-sub is a convenient wrapper around dx run that launches CWIC app in batch mode to execute the specified command with the latest saved environment.

[email protected]:~# for chr in 1 2 3 4 5 6 7 8 9 10 \
11 12 13 14 15 16 17 18 19 20 21 22 X Y MT; do
dx-cwic-sub "samtools view -b /project/<YOUR_DX_PROJECT_NAME>/test.bam ${chr} -o /scratch/${chr}.bam; \
dx upload /scratch/${chr}.bam"
done

dx upload is used in every batch job to upload/save batch job outputs to the DNAnexus project.

NOTE: Always remember to use dx upload to upload/save job outputs to the DNAnexus project at the end of every batch job.
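Because each iteration launches a separate worker, it can help to rehearse the loop before submitting anything. The sketch below prints the 25 dx-cwic-sub commands instead of running them unless DRY_RUN is set to 0. The names DRY_RUN, PROJECT_NAME, BAM, CHROMS and split_cmd are hypothetical and introduced only for this illustration; dx-cwic-sub, samtools and dx upload are used as in the example above. As a design choice, the sketch joins the two steps with && rather than ; so that the upload is skipped when samtools fails.

```shell
# Hypothetical placeholders; adjust to your project and file.
PROJECT_NAME="${PROJECT_NAME:-proj_cwic}"
BAM="${BAM:-test.bam}"

# Chromosomes 1-22 plus X, Y and MT: 25 batch jobs in total.
CHROMS="1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y MT"

# Print the command string that one batch job will execute.
split_cmd() {
    chr="$1"
    printf 'samtools view -b /project/%s/%s %s -o /scratch/%s.bam && dx upload /scratch/%s.bam' \
        "$PROJECT_NAME" "$BAM" "$chr" "$chr" "$chr"
}

for chr in $CHROMS; do
    if [ "${DRY_RUN:-1}" = "1" ]; then
        # Rehearsal: show what would be submitted.
        printf 'dx-cwic-sub "%s"\n' "$(split_cmd "$chr")"
    else
        # Real submission; only works inside a CWIC session.
        dx-cwic-sub "$(split_cmd "$chr")"
    fi
done
```

Run it once as is to review the commands, then again with DRY_RUN=0 inside the CWIC session to submit them.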

List CWIC jobs

You can use dx-find-cwic-jobs to list all CWIC jobs invoked in the project CWIC is running in.

[email protected]:~# dx-find-cwic-jobs |tail -1
JOB_ID USER STATE INSTANCE_TYPE CMD_OR_JOB_NAME LAUNCH_TIME TAGS PROPERTIES
job-Fq65fgj025KxqqzbPYKPYyv0 user-dxdev done mem1_ssd1_v2_x4 samtools view -b /project/proj_cwic/SRR504516.bam... 2020-03-2 22:10:18 [] {}
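The listing can be filtered with standard text tools. The following is a sketch only: it assumes the output is whitespace-separated with the job ID in the first column and the state in the third, as in the sample above, and the function name cwic_jobs_in_state is hypothetical.

```shell
# Hypothetical helper: read a dx-find-cwic-jobs listing on stdin and
# print the IDs of jobs whose STATE column equals the given state.
# Assumes column 1 is JOB_ID and column 3 is STATE, as in the sample output.
cwic_jobs_in_state() {
    awk -v state="$1" '$1 ~ /^job-/ && $3 == state { print $1 }'
}

# Demonstration on the sample lines from this document:
printf '%s\n' \
    'JOB_ID USER STATE INSTANCE_TYPE CMD_OR_JOB_NAME LAUNCH_TIME TAGS PROPERTIES' \
    'job-Fq65fgj025KxqqzbPYKPYyv0 user-dxdev done mem1_ssd1_v2_x4 samtools view -b /project/proj_cwic/SRR504516.bam... 2020-03-2 22:10:18 [] {}' \
    | cwic_jobs_in_state done
```

Inside a CWIC session the same filter would be used as dx-find-cwic-jobs | cwic_jobs_in_state done.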

Reloading project directory

After your batch jobs have finished running, you can use dx-reload-project to refresh the /project directory and see the newly added chromosome slices. dx-reload-project reloads the /project directory in the CWIC environment with the latest view of the DNAnexus project.

Terminating interactive CWIC session

Save your work and environment, if needed, by running dx-save-cwic. Detach from the byobu session with F6 or Ctrl-a d, then exit from the worker with Ctrl-d. Type 'y' when prompted to terminate the job. Alternatively, you can terminate the job from the Monitor tab in the web interface or with the dx terminate command after saving your interactive work with dx-save-cwic.

NOTE: Always remember to run dx-save-cwic to save your CWIC environment before terminating the CWIC session.

Continue your work on your laptop

You can load your CWIC container on your laptop and continue working on your code inside the CWIC docker container. Note that the docker credentials file and scripts such as dx-save-cwic and dx-reload-project, which are used by the cloud-based instance of CWIC, are excluded from the docker image for security and maintainability reasons.

[laptop ~]$ docker run -it \
dxdevdocker/dx-cwic-project-fzjpqzq0fgk1ppxybjq4zp0j_user-dxdev
/home/cwic
# files with sensitive docker and dnanexus credentials as well as cwic helper scripts
# are only present when container runs inside cwic app on dnanexus
[email protected]:~# ls -lRa /home/cwic/.docker/ /home/cwic/.dnanexus_config/ $(which dx-save-cwic)
-rwxr-xr-x 1 root root 0 May 26 00:13 /usr/local/bin/dx-save-cwic
/home/cwic/.dnanexus_config/:
total 8
drwxr-xr-x 2 root root 4096 May 26 00:13 .
drwxr-xr-x 1 root root 4096 May 26 00:13 ..
/home/cwic/.docker/:
total 8
drwxr-xr-x 2 root root 4096 May 26 00:13 .
drwxr-xr-x 1 root root 4096 May 26 00:13 ..

Tips and Tricks

  • Use the --instance-type argument to dx-cwic-sub and dx run with an instance type from this list to select a virtual machine with an appropriate amount of memory, storage, and computational resources if the default instance size of mem2_ssd1_x1 (Azure) / mem1_ssd1_v2_x4 (AWS) is insufficient.

$ dx run cwic -icredentials=<DX_PROJECT_NAME_WITH_CREDS>:cwic_creds.json \
--instance-type mem1_ssd1_x16 --ssh -y

  • To avoid saving confidential files in the CWIC container (e.g. /root/.ssh) as a part of dx-save-cwic, store your confidential files on /scratch and symlink to them:

[email protected]:~# mkdir /scratch/root_ssh && ln -s /scratch/root_ssh /root/.ssh

  • dx-save-cwic updates only the last layer in the docker image, which includes all changes in the current CWIC session. If you change your docker environment significantly, your subsequent calls to dx-save-cwic may be slow. Restart the CWIC app to speed up subsequent dx-save-cwic calls.

  • A CWIC user can exit the running CWIC Docker container by typing exit or Ctrl+d and end up in the DNAnexus Application Execution Environment where the DNAnexus job is executed. To log back into the still-running CWIC Docker container, use the dx-load-cwic command.

Limitations

  • In order to present a file-based view of the DNAnexus project, we use dxfuse, a FUSE-based file system for DNAnexus. FUSE filesystems do not support reading from memory-mapped files that are not mapped privately. We’ve asked the htsjdk maintainers to update the file reader used in GATK to use privately memory-mapped files.

  • The size of the CWIC docker container will be limited to 8 GiB to make docker image updates faster and to avoid accidental inclusion of large data files in the docker image, but this limit is currently not enforced.

  • dx-reload-project must be called to get project updates such as additions and deletions of project files and folders. This limitation may be relaxed in the future. Please invoke dx-reload-project outside of the /project directory in order to avoid the "umount: /project: target is busy" message.
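Since the 8 GiB limit is not enforced, a manual check before dx-save-cwic can catch large files that slipped into the container. This is a rough sketch, not part of CWIC: the function names tree_size_kib and over_limit are hypothetical, and it assumes that /project and /scratch are separate mounts, so that du -x skips them. The figure is an approximation of the container's size, not the exact size of the pushed image.

```shell
LIMIT_KIB=$((8 * 1024 * 1024))    # 8 GiB expressed in KiB

# Print the size of a directory tree in KiB, staying on one filesystem
# (-x) so that other mounts under it are not counted.
tree_size_kib() {
    du -skx "$1" 2>/dev/null | cut -f1
}

# Succeed (exit status 0) when the given size in KiB exceeds the limit.
over_limit() {
    [ "$1" -gt "$LIMIT_KIB" ]
}

# Usage inside CWIC before saving (scanning / can take a while):
#   size="$(tree_size_kib /)"
#   if over_limit "$size"; then echo "container is above 8 GiB: ${size} KiB"; fi
```

If the check reports a size above the limit, moving large data files to /scratch or uploading them to the project keeps them out of the saved image.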

Experimental writable /project mode

You can enable the experimental writable /project mode with the -iproject_mount_options="-w" runtime CWIC option. In this mode, you can write to the /project/<project_name> directory and have those writes propagate to the DNAnexus project, with the limitations listed below.

  • Writing files to the DNAnexus platform via the dxfuse-based /project directory is 2 to 9 times slower than dx upload due to limitations in the FUSE / OS interface.

  • A CWIC user should use dx-save-project to avoid a 5 minute delay in pushing local file creation and modification changes to the DNAnexus project. Note that dx-save-project uploads all the changes as of the time it was invoked, even if it takes more than 5 minutes.

  • dx-save-project must also be called prior to shutting down an interactive CWIC session to propagate changes to /project from the CWIC environment to the DNAnexus platform.

  • dxfuse limits updates to existing project files to files under 16MB. This restriction may be relaxed in the future. This restriction does not apply to creating new files or updating files created inside the current dxfuse session.

  • dx-reload-project does not lock /project for writing prior to reloading information and may interfere with writes to the /project directory. The CWIC user is responsible for finishing all reads and writes to /project prior to invoking dx-reload-project.
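Combining the writable-mode option with the interactive launch command shown earlier gives the following sketch. It only prints the command so it can be reviewed first; the function name writable_cwic_cmd is hypothetical, and mycredentials:cwic_creds.json is the example credentials location used earlier in this document.

```shell
# Hypothetical helper: print the launch command for a writable /project
# session, given "<credentials project>:<credentials file>".
writable_cwic_cmd() {
    creds="$1"
    printf 'dx run cwic -icredentials=%s -iproject_mount_options="-w" --ssh -y\n' "$creds"
}

# Review the command, then paste it into your terminal to launch.
writable_cwic_cmd "mycredentials:cwic_creds.json"

# Inside the session, before terminating:
#   dx-save-project    # push pending /project writes to the platform
#   dx-save-cwic       # save the environment itself
```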