Using DXJupyterLab

Use Jupyter notebooks on the DNAnexus Platform to craft sophisticated custom analyses in your preferred coding language.

DXJupyterLab is accessible to all users of the UK Biobank Research Analysis Platform and the Our Future Health Trusted Research Environment.

A license is required to access DXJupyterLab on the DNAnexus Platform. Contact DNAnexus Sales for more information.

Jupyter notebooks are a popular way to track the work performed in computational experiments, much as a lab notebook tracks the work done in a wet lab. DXJupyterLab (also referred to here as JupyterLab) is an application provided by DNAnexus for running computational experiments on the DNAnexus Platform using Jupyter notebooks. It lets users on the DNAnexus Platform collaborate on notebooks, and it extends JupyterLab with options for directly accessing a DNAnexus project from the JupyterLab environment.

Why Use DXJupyterLab?

DXJupyterLab supports the use of Bioconductor and Bioconda, useful tools for bioinformatics analysis.

DXJupyterLab is a versatile application that can be used to:

Collaborate on exploratory analysis of data
Reproduce and fork work performed in computational analyses
Visualize and gain insights into data generated from biological experiments
Create figures and tables for scientific publications
Build and test algorithms directly in the cloud before creating DNAnexus apps and workflows
Test and train machine/deep learning models
Interactively run commands on a terminal

The DNAnexus Platform offers two different DXJupyterLab apps. One is a general-purpose JupyterLab application. The other is Spark cluster-enabled, and can be used within the DNAnexus Apollo framework.

Both apps instantiate a JupyterLab server that allows for data analyses to be interactively performed in Jupyter notebooks on a DNAnexus worker.

The DXJupyterLab Spark Cluster app contains all the features found in the general-purpose DXJupyterLab along with access to a fully-managed, on-demand Spark cluster for big data processing and translational informatics.

Current Version

DXJupyterLab 2.2 is the default version available on the DNAnexus Platform. Older versions are also available.

Creating Interactive Notebooks

A step-by-step guide on how to start with DXJupyterLab and create and edit Jupyter notebooks can be found in the Quickstart.

DXJupyterLab Environments

Creating a DXJupyterLab session requires the use of two different environments:

The DNAnexus project (accessible through the web platform and the CLI).
The worker execution environment.

The Project on the DNAnexus Platform

From the JupyterLab session, you have direct access to the project in which the application is running. The project file browser (which lists folders, notebooks, and other files in the project) can be accessed from the DNAnexus tab in the left sidebar or from the terminal.

The project is selected when the DXJupyterLab app is started and cannot be subsequently changed.

The project file browser displays all subfolders, Jupyter notebooks, and other files in the project (limited to the 1,000 most recently modified files). In the Spark-enabled app, databases in the project are also visible (limited to the 1,000 most recently modified databases). A list of all the objects in the project can be obtained programmatically or by using dx ls. The file listing is refreshed every 10 seconds, and you can force a refresh by clicking the circular arrow icon in the top right corner of the file browser.
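
Because the file browser shows at most 1,000 items, listing from the terminal can be useful for larger projects. The following is a minimal sketch that wraps dx ls and dx find data; the function names are hypothetical, and it assumes the dx command-line client is available in the session terminal.

```shell
# Hypothetical helpers; they assume the dx CLI is on PATH inside the session.

# List a project folder in long format. Any arguments (for example a folder
# path) are passed through to dx ls; with none, the current folder is listed.
list_project_folder() {
    dx ls -l "$@"
}

# Find notebook files in the current project by file name pattern.
find_project_notebooks() {
    dx find data --class file --name "*.ipynb"
}
```

For example, list_project_folder /notebooks lists one folder, while find_project_notebooks searches for notebooks by name.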

Worker Execution Environment

When you open and run a notebook from the DNAnexus file browser, the kernel corresponding to this notebook is started in the worker execution environment and is used to execute the notebook code. DNAnexus notebooks have [DX] prepended to their names in the tabs of all opened notebooks.

The execution environment file browser can be accessed from the left sidebar (the folder icon at the top) or from the terminal.

You can also create Jupyter notebooks in the worker execution environment through the File menu. These notebooks are stored on the local file system of the DXJupyterLab execution environment and must be saved to a DNAnexus project in order to persist. More information about saving can be found in the next section.

Local vs. DNAnexus Notebooks

DNAnexus Notebooks

You can create, edit, and save notebooks directly in the DNAnexus project as well as duplicate, delete, or download them to your local machine. Notebooks stored in your DNAnexus project, which are housed within the DNAnexus tab on the left sidebar, are fetched from and saved to the project on the DNAnexus Platform without being stored in the JupyterLab execution environment file system. These are referred to as "DNAnexus notebooks" and these notebooks persist in the DNAnexus project after the DXJupyterLab instance is terminated.

DNAnexus notebooks can be recognized by the [DX] that is prepended to their names in the tabs of all opened notebooks.

DNAnexus notebooks can be created by clicking the DNAnexus Notebook icon from the Launcher tab that appears on starting the JupyterLab session, or by clicking the DNAnexus tab on the upper menu and then clicking "New notebook". The Launcher tab can also be opened by clicking File and then selecting "New Launcher" from the upper menu.

Local Notebooks

To create a new local notebook, click the File tab in the upper menu and then select "New" and then "Notebook". These non-DNAnexus notebooks can be saved to DNAnexus by dragging and dropping them into the DNAnexus file viewer in the left panel.

Accessing Data

In JupyterLab, users can access input data that is located in a DNAnexus project in one of the following ways.

For reading the input file multiple times, or for reading a large fraction of the file in random order:
- Download the file from the DNAnexus project to the execution environment with dx download, then access the downloaded local file from the Jupyter notebook.

For scanning the content of the input file once, or for reading only a small fraction of the file's content:
- Read the file from the /mnt/project folder, where the project in which the app is running is mounted read-only. Reading files under /mnt/project fetches their content dynamically from the DNAnexus Platform, so this method uses minimal disk space in the JupyterLab execution environment but makes more API calls to fetch the content.
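
The two access patterns above can be sketched as follows. This is an illustration only: the function names and the MOUNT_ROOT override are hypothetical, and the file paths you pass in depend on your own project.

```shell
# Hypothetical helpers illustrating the two access patterns.

# Pattern 1: copy the file to local disk once, then read it as often as needed.
# $1: file path in the project, $2: local destination path
fetch_to_local() {
    dx download "$1" -o "$2"
}

# Pattern 2: read the first lines straight from the read-only project mount.
# $1: path relative to the project root, $2: number of lines (default 5).
# MOUNT_ROOT defaults to /mnt/project and exists only so the helper can be
# tried outside a session.
peek_mounted_file() {
    head -n "${2:-5}" "${MOUNT_ROOT:-/mnt/project}/$1"
}
```

The first helper trades disk space for fast repeated reads; the second uses no local disk but fetches content from the platform on each read.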

Uploading Data

Files, such as local notebooks, can be persisted in the DNAnexus project by using one of these options:

Run dx upload in a bash console.
Drag the file onto the DNAnexus tab that is in the column of icons on the left side of the screen. This will upload the file into the currently selected DNAnexus folder.
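
As a minimal sketch of the dx upload option, the hypothetical helper below uploads one local file into a project folder. It assumes the destination is written with a trailing slash so that it is treated as a folder.

```shell
# Hypothetical helper: persist a local file in the DNAnexus project.
# $1: local file to upload
# $2: destination folder in the project, ending in "/" (default: project root)
persist_to_project() {
    dx upload "$1" --path "${2:-/}"
}
```

For example, persist_to_project my_notebook.ipynb /notebooks/ would place the notebook in the /notebooks folder of the project.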

Exporting DNAnexus Notebooks

Exporting DNAnexus notebooks to formats such as HTML or PDF is not supported. However, you can dx download the DNAnexus notebook from the current DNAnexus project to the JupyterLab environment and export the downloaded notebook. Exporting a local notebook to certain formats may first require installing additional packages with the following command: apt-get update && apt-get install texlive-xetex texlive-fonts-recommended texlive-plain-generic.
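
The download-then-export workaround can be sketched as below. The helper name is hypothetical, and the sketch assumes that jupyter nbconvert is available in the session, which is typical for JupyterLab installations but should be confirmed in your environment.

```shell
# Hypothetical helper: download a DNAnexus notebook and export it to HTML.
# $1: notebook path in the project, e.g. /reports/analysis.ipynb
export_dx_notebook_html() {
    nb_name=$(basename "$1")
    # -f overwrites any stale local copy before converting
    dx download "$1" -o "$nb_name" -f && jupyter nbconvert --to html "$nb_name"
}
```

The exported file is written to the local execution environment, so it still needs to be uploaded if it should persist in the project.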

Non-Interactive Execution of Notebooks

A command can be executed in the DXJupyterLab worker execution environment without starting an interactive JupyterLab server. To do that, provide the cmd input, and supply any additional input files through the in input file array of the DXJupyterLab app. The provided command runs in the /opt/notebooks/ directory, and any output files generated in this directory are uploaded to the project and returned in the out output field of the job that ran the DXJupyterLab app.

The cmd input makes it possible to use the papermill command that is pre-installed in the DXJupyterLab environment to execute notebooks non-interactively. For example, to execute all the cells in a notebook and produce an output notebook:

my_cmd="papermill notebook.ipynb output_notebook.ipynb"
dx run dxjupyterlab -icmd="$my_cmd" -iin="notebook.ipynb"

Here, notebook.ipynb is the input notebook to the papermill command, passed to the dxjupyterlab app through the in input, and output_notebook.ipynb is the name of the output notebook, which contains the result of executing the input notebook and is uploaded to the project at the end of the app's execution. See the DXJupyterLab app page for details.

Collaboration in the Cloud

Collaborators can work on notebooks in the project without the risk of overwriting each other's changes.

Notebook Locking During Editing

If a user has opened a specific notebook in a JupyterLab session, other users cannot open or edit the notebook. This is indicated by a red lock icon next to the notebook's name.

It is still possible to create a duplicate to see what changes are being saved in the locked notebook or to continue work on this "forked" version of the notebook. To copy a notebook, right-click on its name and select Duplicate. After a few seconds, a notebook with the same name and a "copy" suffix should appear in the project.

Once the editing user closes the notebook, the lock will be released and anybody else with access to the project will be able to open it.

Notebook Versioning

Whenever a notebook is saved in the project, it is uploaded to the platform as a new file that will replace the previous version, that is, the file of the same name. The previous version is moved to the .Notebook_archive folder with a timestamp suffix added to its name and its ID is saved in the properties of the new file. Saving notebooks directly in the project ensures that your analyses won't be lost when the DXJupyterLab session ends.
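
To look up earlier versions of a notebook, one option is to list the archive folder and filter by the notebook's name. This is a rough sketch: the helper is hypothetical, the archive folder path is passed in explicitly because its exact location is not stated here, and matching on the name stem is only a heuristic.

```shell
# Hypothetical helper: list archived versions of a notebook.
# $1: path of the .Notebook_archive folder in the project
# $2: notebook file name, e.g. analysis.ipynb
list_notebook_versions() {
    # Strip the .ipynb extension and match archived names containing the stem
    dx ls "$1" | grep -F "${2%.ipynb}"
}
```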

Session Timeout Control

DXJupyterLab sessions begin with a set duration, after which they will shut down automatically. The timeout clock is displayed in the footer on the right side and it can also be adjusted there (using the Update duration button). Even if the DXJupyterLab webpage is closed, the termination will be executed at the set timestamp. Job lengths have an upper limit of 30 days, which cannot be extended.

A session can be terminated immediately from the top menu (DNAnexus > End Session).

Environment Snapshots

It is possible to save the current session environment and data and reload it later by creating a session snapshot (DNAnexus > Create Snapshot).

A DXJupyterLab session is run in a Docker container, and a session snapshot file is a tarball generated by saving the Docker container state (with the docker commit and docker save commands). Any installed packages and files created locally are saved to the snapshot file, except for the /home/dnanexus and /mnt/ directories, which are not included. This file is then uploaded to the .Notebook_snapshots folder in the project and can be passed as input the next time the app is started.

If many large files are created locally, the resulting snapshots will take a long time to save and load. In general, it is recommended not to snapshot more than 1 GB of locally saved data/packages and rely on downloading larger files as needed.
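
Before creating a snapshot, it can help to check how much local data would be captured. The sketch below compares a directory's size against the 1 GB guideline; the helper name is hypothetical, and which directories to check depends on where you saved files during the session.

```shell
# Hypothetical helper: report whether a directory exceeds the ~1 GB guideline.
# $1: directory whose contents would be captured in the snapshot
snapshot_size_check() {
    size_kb=$(du -sk "$1" | cut -f1)   # total size in KiB
    limit_kb=1048576                   # 1 GiB expressed in KiB
    if [ "$size_kb" -gt "$limit_kb" ]; then
        echo "over: ${size_kb} KiB in $1"
    else
        echo "ok: ${size_kb} KiB in $1"
    fi
}
```

If a directory is reported as over the limit, consider uploading its large files to the project and downloading them again in later sessions instead of snapshotting them.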

Snapshots Created in Older Versions of DXJupyterLab

Snapshots created with DXJupyterLab versions older than 2.0.0 (released mid-2023) are not compatible with the current version. These old snapshots contain tool versions that may conflict with the newer environment.

Using Older Snapshots in the Current Version of DXJupyterLab

If you want to use an older snapshot in the current version of DXJupyterLab, you'll need to recreate the snapshot as follows:

Create a tarball incorporating all the necessary data files and packages.
Save the tarball in a project.
Launch the current version of DXJupyterLab.
Import and unpack the tarball file.
Create a snapshot of the DXJupyterLab environment.
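
The tarball steps above can be sketched with standard tar and dx commands. The function names are hypothetical, and the paths to include are up to you; the final snapshot is still created from the DNAnexus menu in the new session.

```shell
# In the OLD session: bundle the needed paths and store the tarball in the project.
# $1: tarball name; remaining arguments: files or directories to include
pack_environment() {
    tarball=$1
    shift
    tar -czf "$tarball" "$@" && dx upload "$tarball"
}

# In the NEW session: fetch the tarball and unpack it.
# $1: tarball name in the project, $2: directory to extract into (default: current)
unpack_environment() {
    dx download "$1" -f && tar -xzf "$1" -C "${2:-.}"
}
```

For example, run pack_environment env.tar.gz data scripts in the old session, then unpack_environment env.tar.gz in the new one before creating the snapshot.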

Accessing an Older Snapshot in an Older Version of DXJupyterLab

If you don't want to have to recreate your older snapshot, you can run an older version of DXJupyterLab and access the snapshot there.

Viewing Other Files in the Project

Other file types in your project, such as CSV, JSON, and PDF files, images, and scripts, can be viewed directly, because JupyterLab renders each according to its type. For example, JSON files are collapsible and easy to navigate, and CSV files are presented in tabular format.

However, editing and saving any open files from the project other than IPython notebooks will result in an error.

Permissions in the JupyterLab Session

The JupyterLab apps are run in a specific project, defined at start time, and this project cannot be subsequently changed. The job associated with the JupyterLab app has CONTRIBUTE access to the project in which it is run.

When running the DXJupyterLab app, it is possible to view, but not update, other projects the user has access to. This enhanced scope is required to be able to read databases, which may be located in different projects and cannot be cloned.

Running Jobs in the JupyterLab Session

You can start new jobs using dx run from within a notebook or the terminal. If the billTo for the project where your JupyterLab session is running does not have a license for detached executions, any jobs you start will run as subjobs of your interactive JupyterLab session. In this situation, the --project argument for dx run is ignored, and the job will use the JupyterLab session's workspace instead of the specified project. If a subjob fails or is terminated on the DNAnexus Platform, the entire job tree, including your interactive JupyterLab session, will also be terminated.
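
A minimal sketch of launching a job from the session terminal and waiting for it to finish is shown below. The helper name is hypothetical; the app name and inputs you pass are your own.

```shell
# Hypothetical helper: start a job and block until it completes.
# $1: app or applet name; remaining arguments are passed to dx run
#     (for example -iname=value)
launch_and_wait() {
    app=$1
    shift
    # -y skips the confirmation prompt; --brief prints only the job ID
    job_id=$(dx run "$app" "$@" -y --brief) || return 1
    echo "Started $job_id"
    dx wait "$job_id"
}
```

Keep in mind that when the job runs as a subjob of the session, its failure also ends the session, so save notebooks to the project before launching long or experimental jobs.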

Jobs are limited to a runtime of 30 days. Jobs running longer than 30 days will be automatically terminated.

Environment and Feature Options

The DXJupyterLab app is a Docker-based app that runs the JupyterLab server instance in a Docker container. The server runs on port 443. Because it's an HTTPS app, you can bring up the JupyterLab environment in a web browser using the URL https://job-xxxx.dnanexus.cloud, where job-xxxx is the ID of the job that runs the app. Only the user who launched the JupyterLab job has access to the JupyterLab environment. Other users will see a "403 Permission Forbidden" message under the JupyterLab session's URL.
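
As a small illustration of the URL pattern, the hypothetical helper below builds the session address from a job ID.

```shell
# Hypothetical helper: build the JupyterLab URL for a running session.
# $1: ID of the job that runs the app, e.g. job-xxxx
session_url() {
    printf 'https://%s.dnanexus.cloud\n' "$1"
}
```

Calling session_url job-xxxx prints https://job-xxxx.dnanexus.cloud.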

On the DNAnexus Platform, the JupyterLab server runs in a Python 3.9.16 environment, in a container running Ubuntu 20.04 as its operating system.

Feature Options

When launching JupyterLab, the feature options available are PYTHON_R, ML, IMAGE_PROCESSING, and STATA.

Selecting the PYTHON_R feature (default option) loads the environment with Python3 and R kernel and interpreter.

Selecting the ML feature loads the environment with Python3 and machine learning packages, such as TensorFlow, PyTorch, and CNTK, as well as the image processing package NiPype, but it does not contain R.

Selecting the IMAGE_PROCESSING feature loads the environment with Python3 and image processing packages such as NiPype, FreeSurfer, and FSL, but it does not contain R. The FreeSurfer package requires a license to run. Details about license creation and usage can be found in the FreeSurfer in DXJupyterLab guide.

The STATA feature requires a license to run. See Stata in DXJupyterLab for more information about running Stata in JupyterLab.
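
When launching from the command line, the feature can be chosen with an app input. The sketch below assumes the input is named feature; that name is an assumption not confirmed on this page, so check the DXJupyterLab app page for the exact input names. The helper name is hypothetical.

```shell
# Hypothetical helper: launch DXJupyterLab with a chosen feature.
# $1: feature name, e.g. PYTHON_R or ML
# ASSUMPTION: the app input is called "feature"; confirm on the app page.
launch_with_feature() {
    dx run dxjupyterlab -ifeature="$1" -y --brief
}
```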

Full List of Pre-Installed Packages

See the in-product documentation for the full list of pre-installed packages. This list includes details on feature-specific packages available when running the PYTHON_R, ML, IMAGE_PROCESSING, and STATA features.

Installing Additional Packages

Additional packages can easily be installed during a JupyterLab session. By creating a Docker container snapshot, users can then start subsequent sessions with the new packages pre-installed by providing the snapshot as input.
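
As a minimal sketch, the hypothetical helper below installs a Python package from the session terminal only if it is not already present. It assumes python3 with pip is available in the selected feature environment.

```shell
# Hypothetical helper: install a Python package only when it is missing.
# $1: package name
ensure_python_package() {
    if python3 -m pip show "$1" >/dev/null 2>&1; then
        echo "$1 already installed"
    else
        python3 -m pip install "$1"
    fi
}
```

Packages installed this way last only for the current session unless a snapshot is created afterwards.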

JupyterLab Documentation

For more information on the features and benefits of JupyterLab, see the official JupyterLab documentation.

Next Steps

Create your first notebooks by following the instructions in the Quickstart guide.
See the DXJupyterLab Reference guide for tips and info on the most useful DXJupyterLab features.

