JupyterLab Quickstart

In this tutorial, you will learn how to create and run a notebook in JupyterLab on the platform, download data from the notebook, and upload results to the platform.

Note: JupyterLab is accessible to all users of the UK Biobank Research Analysis Platform and the Our Future Health Trusted Research Environment. A license is required to access JupyterLab on the DNAnexus Platform; contact DNAnexus Sales for more information.

Run a JupyterLab Session and Create Notebooks

1. Launch JupyterLab and View the Project

First, launch JupyterLab in the project of your choice, as described in the Running JupyterLab guide.

After starting your JupyterLab session, click on the DNAnexus tab on the left sidebar to see all the files and folders in the project.

2. Create an Empty Notebook

To create a new empty notebook in the DNAnexus project, select DNAnexus > New Notebook from the top menu.

This creates an untitled .ipynb file, viewable in the DNAnexus project browser, which refreshes every few seconds.

To rename your file, right-click on its name and select Rename.

3. Edit and Save the Notebook in the Project

You can open and edit the newly created notebook directly from the project (accessible from the DNAnexus tab in the left sidebar). To save your changes, press Ctrl+S (or Command+S on macOS), or click on the save icon in the Toolbar (an area below the tab bar at the top). A new notebook version lands in the project, and you should see in the "Last modified" column that the file was created recently.

Since DNAnexus files are immutable, each notebook save creates a new version in the project, replacing the file of the same name. The previous version moves to the .Notebook_archive with a timestamp suffix added to its name. Saving notebooks directly in the project as new files preserves your analyses beyond the JupyterLab session's end.

4. Download the Data to the Execution Environment

To process your data in the notebook, the data must be available in the execution environment (as is the case with any DNAnexus app).

You can download input data from a project for your notebook using dx download in a notebook cell:

%%bash
dx download input_data/reads.fastq

You can also use the terminal to execute the dx command.

5. Upload Data to the Project

For any data generated by your notebook that needs to be preserved, upload it to the project before the session ends and the JupyterLab worker terminates. Upload data directly in the notebook by running dx upload from a notebook cell or from the terminal:

%%bash
dx upload results.csv

Note: If you create a notebook from the Launcher or from the top menu (File > New > Notebook), the notebook is not created in the project but in the local execution environment. To move it to the project, you must upload it manually. Make sure you upload your local notebooks to the project before the session expires, or work on your notebooks directly from the project, so as not to lose your work.

Next Steps

  • Check the References guide for tips on the most useful operations and features in JupyterLab.

FreeSurfer in JupyterLab

Learn how to use FreeSurfer in JupyterLab.


About FreeSurfer

FreeSurfer is a software package for the analysis and visualization of structural and functional neuroimaging data from cross-sectional or longitudinal studies.

The FreeSurfer package comes pre-installed with the IMAGE_PROCESSING feature of JupyterLab.

FreeSurfer License Registration

To use FreeSurfer on the DNAnexus Platform, you need a valid FreeSurfer license. You can register for the FreeSurfer license at the FreeSurfer registration page.

Using the FreeSurfer License on DNAnexus

To use the FreeSurfer license, complete the following steps:

  1. Upload the license text file to your project on the DNAnexus Platform.

  2. Launch the JupyterLab app and specify the IMAGE_PROCESSING feature.

  3. Once JupyterLab is running, open your existing notebook (or a new notebook) and download the license file into the FREESURFER_HOME directory.

The commands to download the license file are as follows:

  • Python kernel: !dx download license.txt -o $FREESURFER_HOME

  • Bash kernel: dx download license.txt -o $FREESURFER_HOME
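After downloading, you can sanity-check that the license file landed where FreeSurfer expects it. A minimal sketch, assuming only that FREESURFER_HOME is set by the IMAGE_PROCESSING environment (the helper function itself is ours, not part of the platform):

```python
import os

def freesurfer_license_path(env=None):
    """Return the expected location of the FreeSurfer license file.

    Relies on the FREESURFER_HOME variable set by the IMAGE_PROCESSING
    feature environment; this helper is a hypothetical convenience.
    """
    env = os.environ if env is None else env
    home = env.get("FREESURFER_HOME")
    if home is None:
        raise RuntimeError("FREESURFER_HOME is not set; is the IMAGE_PROCESSING feature active?")
    return os.path.join(home, "license.txt")
```

Checking that the returned path exists (for example with os.path.exists) confirms the dx download step succeeded.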

MONAI in JupyterLab

Using MONAI Core, MONAI Label/3D Slicer (SlicerJupyter) via JupyterLab

Medical Open Network for AI (MONAI) is a framework built for deep learning in healthcare imaging. To use MONAI on the DNAnexus Platform, run JupyterLab with the MONAI_ML feature, which includes:

  • MONAI Core: PyTorch-based framework for deep learning in healthcare imaging.

  • MONAI Label: An intelligent image labeling and learning tool designed to create training datasets and build AI annotation models. It provides a server-client framework that integrates with imaging viewers.

  • 3D Slicer: An open-source software application designed for the visualization, processing, and analysis of medical, biomedical, and other 3D images. In a Jupyter environment, 3D Slicer is accessible through the SlicerJupyter kernel and acts as a client for the MONAI Label server.

MONAI Core, MONAI Label, and 3D Slicer (SlicerJupyter) come pre-installed with the JupyterLab MONAI_ML feature option.

Tip: For the full list of pre-installed packages, see the JupyterLab in-product documentation.

Using MONAI Core

For sample Jupyter notebooks and tutorials, see the official project MONAI tutorials. Technical documentation for MONAI Core is also available.

    Using MONAI Label with 3D Slicer

For examples showing how to use 3D Slicer with MONAI Label, see the following sample Jupyter notebooks in the DNAnexus OpenBio repository:

  • Radiology Auto-Segmentation and Training with MONAI Label and 3D Slicer (NIfTI/CT): Demonstrates auto-segmentation and model training on NIfTI CT spleen data using MONAI Label and 3D Slicer (SlicerJupyter).

  • Whole Brain Segmentation with MONAI Label and 3D Slicer (DICOM/MRI): Shows auto-segmentation and model training on DICOM MRI brain data, including DICOM-to-NIfTI conversion and interactive annotation in 3D Slicer.

For general examples and tutorials on using MONAI Label and 3D Slicer (SlicerJupyter), explore the following GitHub repositories:

  • MONAI Label tutorials: Project-MONAI/tutorials/monailabel

  • 3D Slicer (SlicerJupyter) example notebooks: Slicer/SlicerNotebooks


    Running JupyterLab

    Learn to launch a JupyterLab session on the DNAnexus Platform, via the JupyterLab app.


Running from the UI

1. In the main menu, navigate to Tools > JupyterLab. If you have used JupyterLab before, the page shows your previous sessions across different projects.

2. Click New JupyterLab.

3. Configure your JupyterLab session:

  • Specify the session name and select an instance type.

  • Choose the project where JupyterLab should run.

  • Set the session duration after which the environment automatically shuts down.

  • Optionally, provide a snapshot file to load a previously saved environment.

  • If needed, enable Spark Cluster and set the number of nodes.

  • Select a feature option based on your analysis needs:

    • PYTHON_R (default): Python3 and R kernel and interpreter

    • ML: Python3 with machine learning packages (TensorFlow, PyTorch, CNTK) and image processing (Nipype), but no R

    • IMAGE_PROCESSING: Python3 with image processing packages (Nipype, FreeSurfer, FSL), but no R. FreeSurfer requires a license. GUI viewers such as fsleyes and freeview cannot be launched in the headless environment.

    • STATA: Stata requires a license to run.

    • MONAI_ML: Extends the ML feature with specialized medical imaging frameworks, such as MONAI Core, MONAI Label, and 3D Slicer.

  • Review the pricing estimate (if you have billing access) based on your selected duration and instance type.

4. Click Start Environment to launch your session. The JupyterLab session shows an "Initializing" state while the worker spins up and the server starts.

5. Open your JupyterLab environment by clicking the session name link once the state changes to "Ready". You can also access it directly via https://job-xxxx.dnanexus.cloud, where job-xxxx is your job ID.

Note: Snapshots created using older versions of JupyterLab are incompatible with the current version. If you need to use an older JupyterLab snapshot, see the environment snapshot guidelines.

For a detailed list of libraries included in each feature option, see the in-product documentation.

Running JupyterLab from the CLI

You can start the JupyterLab environment directly from the command line by running the app:

dx run app-dxjupyterlab

Once the app starts, you can check whether the JupyterLab server is ready to serve connections, which is indicated by the job's property httpsAppState being set to running. Once it is running, you can open your browser and go to https://job-xxxx.dnanexus.cloud, where job-xxxx is the ID of the job running the app.

To run the Spark version of the app, use the command:

dx run app-dxjupyterlab_spark_cluster

You can check the optional input parameters for the JupyterLab App and the JupyterLab Spark Cluster App on the DNAnexus Platform (platform login required). From the CLI, you can learn more about dx run with the following command:

dx run -h APP_NAME

where APP_NAME is either app-dxjupyterlab or app-dxjupyterlab_spark_cluster.

Next Steps

See the Quickstart and References pages for more details on how to use JupyterLab.
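The httpsAppState readiness check described in the CLI instructions can be scripted as a simple poll. The sketch below takes the state-fetching step as a caller-supplied callable; on the platform you would read the job property (for example via dx describe), and the function name and defaults here are illustrative:

```python
import time

def wait_until_running(fetch_state, timeout_s=600, interval_s=5, sleep=time.sleep):
    """Poll fetch_state() until it returns 'running' or the timeout elapses.

    fetch_state is a caller-supplied callable returning the job's
    httpsAppState property (in practice, read via `dx describe`).
    """
    waited = 0
    while True:
        if fetch_state() == "running":
            return True
        if waited >= timeout_s:
            return False
        sleep(interval_s)
        waited += interval_s
```

Once the function returns True, the session URL https://job-xxxx.dnanexus.cloud should accept connections.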

    Using JupyterLab

    Use Jupyter notebooks on the DNAnexus Platform to craft sophisticated custom analyses in your preferred coding language.


Jupyter notebooks are a popular way to track the work performed in computational experiments, much as a lab notebook tracks the work done in a wet lab setting. JupyterLab is an application provided by DNAnexus that allows you to perform computational experiments on the DNAnexus Platform using Jupyter notebooks. It allows users on the DNAnexus Platform to collaborate on notebooks and extends standard JupyterLab with options for directly accessing a DNAnexus project from the JupyterLab environment.

    Why Use JupyterLab?

    JupyterLab supports the use of Bioconductor and Bioconda, useful tools for bioinformatics analysis.

JupyterLab is a versatile application that can be used to:

  • Collaborate on exploratory analysis of data

  • Reproduce and fork work performed in computational analyses

  • Visualize and gain insights into data generated from biological experiments

  • Create figures and tables for scientific publications

  • Build and test algorithms directly in the cloud before creating DNAnexus apps and workflows

  • Test and train machine/deep learning models

  • Interactively run commands on a terminal

The DNAnexus Platform offers two different JupyterLab apps. One is a general-purpose JupyterLab application. The other is Spark cluster-enabled and can be used within the DNAnexus Apollo framework.

    Both apps instantiate a JupyterLab server that allows for data analyses to be interactively performed in Jupyter notebooks on a DNAnexus worker.

The JupyterLab Spark Cluster app contains all the features found in the general-purpose JupyterLab app, along with access to a fully managed, on-demand Spark cluster for big data processing and translational informatics.

    Version Information

JupyterLab 2.2 is the default version on the DNAnexus Platform. Previous versions remain available.

    Creating Interactive Notebooks

A step-by-step guide on how to start with JupyterLab and create and edit Jupyter notebooks can be found in the Quickstart.

    JupyterLab Environments

    Creating a JupyterLab session requires the use of two different environments:

    1. The DNAnexus project (accessible through the web platform and the CLI).

    2. The worker execution environment.

    The Project on the DNAnexus Platform

You have direct access to the project in which the application is run from the JupyterLab session. The project file browser (which lists folders, notebooks, and other files in the project) can be accessed from the DNAnexus tab in the left sidebar or from the terminal.

    The project is selected when the JupyterLab app is started and cannot be subsequently changed.

    The DNAnexus file browser shows:

    • Up to 1,000 of your most recently modified files and folders

    • All Jupyter notebooks in the project

    • Databases (Spark-enabled app only, limited to 1,000 most recent)

    The file list refreshes automatically every 10 seconds. You can also refresh manually by clicking the circular arrow icon in the top right corner.

Tip: Need to see more files? Use dx ls in the terminal or access them programmatically through the API.

    Worker Execution Environment

When you open and run a notebook from the DNAnexus file browser, the kernel corresponding to this notebook is started in the worker execution environment and is used to execute the notebook code. DNAnexus notebooks have [DX] prepended to the notebook name in the tab bar of all opened notebooks.

    The execution environment file browser is accessible from the left sidebar (notice the folder icon at the top) or from the terminal:

To create Jupyter notebooks in the worker execution environment, use the File menu. These notebooks are stored on the local file system of the JupyterLab execution environment and must be persisted to a DNAnexus project. More information about saving appears in the following section.

    Local vs. DNAnexus Notebooks

    DNAnexus Notebooks

You can create, edit, and save notebooks directly in the DNAnexus project, as well as duplicate, delete, or download them to your local machine. Notebooks stored in your DNAnexus project, accessed through the DNAnexus tab on the left sidebar, are fetched from and saved to the project on the DNAnexus Platform without being stored in the JupyterLab execution environment file system. These are referred to as "DNAnexus notebooks", and they persist in the DNAnexus project after the JupyterLab instance is terminated.

DNAnexus notebooks can be recognized by the [DX] prepended to their names in the tab bar of all opened notebooks.

    DNAnexus notebooks can be created by clicking the DNAnexus Notebook icon from the Launcher tab that appears on starting the JupyterLab session, or by clicking the DNAnexus tab on the upper menu and then clicking "New notebook". The Launcher tab can also be opened by clicking File and then selecting "New Launcher" from the upper menu.

    Local Notebooks

    To create a new local notebook, click the File tab in the upper menu and then select "New" and then "Notebook". These non-DNAnexus notebooks can be saved to DNAnexus by dragging and dropping them in the DNAnexus file viewer in the left panel.

    Accessing Data

In JupyterLab, you can access input data located in a DNAnexus project in one of the following ways:

  • For reading the input file multiple times or for reading a large fraction of the file in random order: download the file from the DNAnexus project to the execution environment with dx download and access the downloaded local file from the Jupyter notebook.

  • For scanning the content of the input file once or for reading only a small fraction of the file's content: the project in which the app is running is mounted read-only at the /mnt/project folder. Reading the content of files in /mnt/project dynamically fetches the content from the DNAnexus Platform, so this method uses minimal disk space in the JupyterLab execution environment but uses more API calls to fetch the content.

    Uploading Data

    Files, such as local notebooks, can be persisted in the DNAnexus project by using one of these options:

  • dx upload in the bash console.

    • Drag the file onto the DNAnexus tab that is in the column of icons on the left side of the screen. This uploads the file into the selected DNAnexus folder.

    Exporting DNAnexus Notebooks

Exporting DNAnexus notebooks to formats such as HTML or PDF is not supported directly. However, you can dx download the DNAnexus notebook from the current DNAnexus project to the JupyterLab environment and export the downloaded notebook. To export a local notebook to certain formats, the following commands might be needed beforehand: apt-get update && apt-get install texlive-xetex texlive-fonts-recommended texlive-plain-generic.
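If all you need out of a downloaded notebook is its code, the notebook's JSON can also be processed directly. The sketch below is not a replacement for nbconvert; it just illustrates the standard notebook JSON layout by pulling the code cells into one plain script:

```python
import json

def notebook_to_script(nb_json: str) -> str:
    """Concatenate the code cells of a notebook's JSON text into one script."""
    nb = json.loads(nb_json)
    chunks = []
    for cell in nb.get("cells", []):
        if cell.get("cell_type") == "code":
            source = cell.get("source", [])
            # Cell source may be a list of lines or a single string
            chunks.append("".join(source) if isinstance(source, list) else source)
    return "\n\n".join(chunks)
```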

    Non-Interactive Execution of Notebooks

A command can be executed in the JupyterLab worker execution environment without starting an interactive JupyterLab server. To do that, provide the cmd input, and any additional input files using the in input file array, to the JupyterLab app. The provided command runs in the /opt/notebooks directory, and any output files generated in this directory are uploaded to the project and returned in the out output field of the job that ran the JupyterLab app.

The cmd input makes it possible to use the papermill command that is pre-installed in the JupyterLab environment to execute notebooks non-interactively. For example, to execute all the cells in a notebook and produce an output notebook:

my_cmd="papermill notebook.ipynb output_notebook.ipynb"
dx run dxjupyterlab -icmd="$my_cmd" -iin="notebook.ipynb"

Here notebook.ipynb is the input notebook to the papermill command, which is passed to the dxjupyterlab app using the in input, and output_notebook.ipynb is the name of the output notebook, which contains the result of executing the input notebook and is uploaded to the project at the end of the app's execution. See the JupyterLab app page for details.

    Collaboration in the Cloud

    Collaborators can work on notebooks in the project without the risk of overwriting each other's changes.

    Notebook Locking During Editing

    If a user has opened a specific notebook in a JupyterLab session, other users cannot open or edit the notebook. This is indicated by a red lock icon next to the notebook's name.

    It is still possible to create a duplicate to see what changes are being saved in the locked notebook or to continue work on this "forked" version of the notebook. To copy a notebook, right-click on its name and select Duplicate. After a few seconds, a notebook with the same name and a "copy" suffix should appear in the project.

    Once the editing user closes the notebook, the lock is released and anybody else with access to the project can open it.

    Notebook Versioning

    Whenever a notebook is saved in the project, it is uploaded to the platform as a new file that replaces the previous version, that is, the file of the same name. The previous version is moved to the .Notebook_archive folder with a timestamp suffix added to its name and its ID is saved in the properties of the new file. Saving notebooks directly in the project ensures that your analyses are not lost when the JupyterLab session ends.

Warning: If a notebook saved to the project exceeds 20 MB, it may no longer open in JupyterLab and could trigger a "JSON Parse Error." To recover your code, open an earlier version from the .Notebook_archive folder, or download the notebook to your local machine and clear the notebook's outputs using a local Jupyter editor before re-uploading.
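Bulky cell outputs are usually what pushes a notebook past the size limit, and since a notebook is plain JSON, stripping them can be scripted. A minimal sketch assuming the standard notebook JSON layout (a local Jupyter editor or jupyter nbconvert --clear-output does the same; this just shows the mechanics):

```python
import json

def clear_outputs(nb_json: str) -> str:
    """Return notebook JSON text with outputs and execution counts stripped."""
    nb = json.loads(nb_json)
    for cell in nb.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return json.dumps(nb)
```

Running this over the downloaded notebook and re-uploading the result typically shrinks the file well below the 20 MB threshold.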

    Session Timeout Control

    JupyterLab sessions begin with a set duration and shut down automatically at the end of this period. The timeout clock appears in the footer on the right side and can be adjusted using the Update duration button. The session terminates at the set timestamp even if the JupyterLab webpage is closed. Job lengths have an upper limit of 30 days, which cannot be extended.

    A session can be terminated immediately from the top menu (DNAnexus > End Session).

    Environment Snapshots

    It is possible to save the current session environment and data and reload it later by creating a session snapshot (DNAnexus > Create Snapshot).

A JupyterLab session is run in a Docker container, and a session snapshot file is a tarball generated by saving the Docker container state (with the docker commit and docker save commands). Any installed packages and files created locally are saved to the snapshot file, except for the directories /home/dnanexus and /mnt/, which are not included. This file is then uploaded to the project folder .Notebook_snapshots and can be passed as input the next time the app is started.

Note: If many large files are created locally, the resulting snapshots take a long time to save and load. In general, it is recommended not to snapshot more than 1 GB of locally saved data/packages, and to rely on downloading larger files as needed.
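A quick way to check against that guideline is to measure how much locally created data a snapshot would capture before creating it. A sketch (the directory you measure and the threshold you apply are your choice):

```python
import os

def tree_size_bytes(root: str) -> int:
    """Total size in bytes of regular files under root (symlinks skipped)."""
    total = 0
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total

# Example: warn before snapshotting more than ~1 GB of local data
# if tree_size_bytes("/opt/notebooks") > 1_000_000_000:
#     print("Consider dx download-ing large files on demand instead of snapshotting them")
```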

    Snapshots Created in Older Versions of JupyterLab

    Snapshots created with JupyterLab versions older than 2.0.0 (released mid-2023) are not compatible with the current version. These previous snapshots contain tool versions that may conflict with the newer environment, potentially causing problems.

    Using Previous Snapshots in the Current Version of JupyterLab

To use a snapshot from a previous version in the current version of JupyterLab, recreate the snapshot as follows:

1. Create a tarball incorporating all the necessary data files and packages.

2. Save the tarball in a project.

3. Launch the current version of JupyterLab.

4. Import and unpack the tarball file.

5. Create a snapshot of the JupyterLab environment.

    Accessing an Older Snapshot in an Older Version of JupyterLab

If you don't want to recreate your older snapshot, you can run an older version of JupyterLab and access the snapshot there.

    Viewing Other Files in the Project

Viewing other file types from your project, such as CSV, JSON, PDF files, images, or scripts, is convenient because JupyterLab displays them accordingly. For example, JSON files are collapsible and navigable, and CSV files are presented in tabular format.

    However, editing and saving any open files from the project other than IPython notebooks results in an error.

Warning: Files larger than 20 MB display only their metadata in the JupyterLab file viewer. To access the full contents of a large file, download it using dx download or the dxpy client library, or use the DNAnexus file browser on the platform.

    Permissions in the JupyterLab Session

    The JupyterLab apps are run in a specific project, defined at start time, and this project cannot be subsequently changed. The job associated with the JupyterLab app has CONTRIBUTE access to the project in which it is run.

    When running the JupyterLab app, it is possible to view, but not update, other projects the user has access to. This enhanced scope is required to be able to read databases which may be located in different projects and cannot be cloned.

    Running Jobs in the JupyterLab Session

    Use dx run to start new jobs from within a notebook or the terminal. If the billTo for the project where your JupyterLab session runs does not have a license for detached executions, any started jobs run as subjobs of your interactive JupyterLab session. In this situation, the --project argument for dx run is ignored, and the job uses the JupyterLab session's workspace instead of the specified project. If a subjob fails or terminates on the DNAnexus Platform, the entire job tree—including your interactive JupyterLab session—terminates as well.

Warning: Jobs are limited to a runtime of 30 days. The system automatically terminates jobs running longer than 30 days.

    Environment and Feature Options

    The JupyterLab app is a Docker-based app that runs the JupyterLab server instance in a Docker container. The server runs on port 443. Because it's an HTTPS app, you can bring up the JupyterLab environment in a web browser using the URL https://job-xxxx.dnanexus.cloud, where job-xxxx is the ID of the job that runs the app. Only the user who launched the JupyterLab job has access to the JupyterLab environment. Other users see a "403 Permission Forbidden" message under the JupyterLab session's URL.
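Because the URL is derived mechanically from the job ID, it can be constructed in code. A trivial sketch (the helper name is ours; only the URL scheme comes from the platform):

```python
def jupyterlab_url(job_id: str) -> str:
    """Build the session URL for a job, per the https://job-xxxx.dnanexus.cloud scheme."""
    if not job_id.startswith("job-"):
        raise ValueError("expected a DNAnexus job ID like 'job-xxxx'")
    return f"https://{job_id}.dnanexus.cloud"
```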

Note: On the DNAnexus Platform, the JupyterLab server runs in a Python 3.9.16 environment, in a container running Ubuntu 20.04 as its operating system.

    Feature Options

    When launching JupyterLab, the feature options available are PYTHON_R, ML, IMAGE_PROCESSING, STATA, and MONAI_ML.

    • PYTHON_R (default option): Loads the environment with Python3 and R kernel and interpreter.

    • ML: Loads the environment with Python3 and machine learning packages, such as TensorFlow, PyTorch, CNTK as well as the image processing package Nipype, but it does not contain R.

• IMAGE_PROCESSING: Loads the environment with Python3 and image processing packages such as Nipype, FreeSurfer, and FSL, but it does not contain R. The FreeSurfer package requires a license to run. Details about license creation and usage can be found in the FreeSurfer in JupyterLab guide.

Note: The JupyterLab environment is headless and command-line only. While FSL and FreeSurfer command-line tools are available for batch processing, GUI viewers such as fsleyes and freeview cannot be launched. To visualize results interactively, download the output files to your local machine.

• STATA: Requires a license to run. See Stata in JupyterLab for more information about running Stata in JupyterLab.

• MONAI_ML: Loads the environment with Python3 and extends the ML feature. This feature is ideal for medical imaging research involving machine learning model development and testing. It includes medical imaging frameworks designed for AI-powered analysis. For details, see MONAI in JupyterLab.

Tip: For the full list of pre-installed packages, see the JupyterLab in-product documentation. This list includes details on feature-specific packages available when running the PYTHON_R, ML, IMAGE_PROCESSING, STATA, and MONAI_ML features.

    Installing Additional Packages

Additional packages can be installed during a JupyterLab session. By creating a Docker container snapshot, users can then start subsequent sessions with the new packages pre-installed by providing the snapshot as input.

    JupyterLab Documentation

For more information on the features and benefits of JupyterLab, see the official JupyterLab documentation.

    Next Steps

• Create your first notebooks by following the instructions in the Quickstart guide.

• See the JupyterLab Reference guide for tips and info on the most useful JupyterLab features.

    Running Older Versions of JupyterLab

    Learn how to run an older version of JupyterLab via the user interface or command-line interface.

    Why Run an Older Version of JupyterLab?

    The primary reason to run an older version of JupyterLab is to access snapshots containing tools that cannot be run in the current version's execution environment.

    Launching an Older Version via the User Interface (UI)
    1. From the main Platform menu, select Tools, then Tools Library.

    2. Find and select, from the list of tools, either JupyterLab with Python, R, Stata, ML, Image Processing or JupyterLab with Spark Cluster.

    3. From the tool detail page, click on the Versions tab.

    4. Select the version you'd like to run. Click the Run button.

    Launching an Older Version via the Command-Line Interface (CLI)

    1. Select the project in which you want to run JupyterLab.

    2. Launch the version of JupyterLab you want to run, substituting the version number for x.y.z in the following commands:

      • For JupyterLab without the Spark cluster capability, run the command dx run app-dxjupyterlab/x.y.z --priority high.

      • For JupyterLab with the Spark cluster capability, run the command dx run app-dxjupyterlab_spark_cluster/x.y.z --priority high

Note: Running JupyterLab at "high" priority is not required. However, doing so ensures that your interactive session is not interrupted by spot instance termination.

    Accessing JupyterLab

    After launching JupyterLab, access the JupyterLab environment using your browser. To do this:

    1. Get the job ID for the job created when you launched JupyterLab. See the Monitoring Executions page for details on how to get the job ID, via either the UI or the CLI.

    2. Open the URL https://job-xxxx.dnanexus.cloud, substituting the job's ID for job-xxxx.

    3. You may see an error message "502 Bad Gateway" if JupyterLab is not yet accessible. If this happens, wait a few minutes, then try again.


    Spark Cluster-Enabled JupyterLab

    Learn to use the JupyterLab Spark Cluster app.


    Overview

The JupyterLab Spark Cluster app is a Spark-enabled DNAnexus app that runs a fully managed standalone Spark/Hadoop cluster. This cluster enables distributed data processing and analysis directly within the JupyterLab application. In the JupyterLab session, you can interactively create and query DNAnexus databases or run any analysis on the Spark cluster.

    Besides the core JupyterLab features, the Spark cluster-enabled JupyterLab app allows you to:

    • Explore the available databases and get an overview of the available datasets

    • Perform analyses and visualizations directly on data available in the database

    • Create databases

Check the general Using JupyterLab documentation for an introduction to JupyterLab.

    Running and Using JupyterLab Spark Cluster

    The Quickstart page contains information on how to start a JupyterLab session and create notebooks on the DNAnexus Platform. The JupyterLab Reference page has additional useful tips for using the environment.

    hashtag
    Instantiating the Spark Context

    Having created your notebook in the project, you can populate its first cells as shown below. It is good practice to instantiate the Spark context at the beginning of your analysis.

    hashtag
    Basic Operations on DNAnexus Databases

    hashtag
    Exploring Existing Databases

    To view the databases to which you have access in your current region and project context, run a cell with the following code:

    A sample output should be:

    You can inspect one of the returned databases by running:

    which should return an output similar to:

    To find a database in your current region that may be in a different project than your current context, run the following code:

    A sample output should be:

    To inspect one of the databases listed in the output, use the unique database name. If you use only the database name, results are limited to the current project. For example:

    hashtag
    Creating Databases

    Here's an example of how to create and populate your own database:

    You can separate each line of code into different cells to view the outputs iteratively.

    hashtag
    Using Hail

    Hail is an open-source, scalable framework for exploring and analyzing genomic data. It is designed to run primarily on a Spark cluster and is included in the JupyterLab Spark Cluster app, available when the app is run with the feature input set to HAIL (the default).

    Initialize the Hail context at the start of your work, passing the previously started Spark context sc as an argument:

    We recommend continuing your exploration of Hail with the GWAS using Hail tutorialarrow-up-right. For example:

    hashtag
    Using VEP with Hail

    To use VEParrow-up-right (Ensembl Variant Effect Predictor) with Hail, select the HAIL-VEP feature when launching Spark Cluster-Enabled JupyterLab.

    VEP can predict the functional effects of genomic variants on genes, transcripts, protein sequences, and regulatory regions. This includes the LOFTEE pluginarrow-up-right, which is activated when using the configuration file below.

    Add the following JSON configuration file to your DNAnexus project:

    Once the vep-GRCh38.json file is in your project, you can annotate the Hail MatrixTable (mt) using the following command:

    hashtag
    Behind the Scenes

    The Spark cluster app is a Docker-based app which runs the JupyterLab server in a Docker container.

    The JupyterLab instance runs on port 443. Because it is an HTTPS app, you can bring up the JupyterLab environment in a web browser using the URL https://job-xxxx.dnanexus.cloud, where job-xxxx is the ID of the job that runs the app.

    The script run at the instantiation of the container, /opt/start_jupyterlab.sh, configures the environment and starts the server needed to connect to the Spark cluster. The environment variables needed are set by sourcing two scripts, bind-mounted into the container:

    The default user in the container is root.

    The option --network host is used when starting Docker to remove the network isolation between the host and the Docker container, which allows the container to bind to the host's network and access Spark's master port directly.

    hashtag
    Accessing AWS S3 Buckets

    S3 buckets can have private or public access. Either the s3 or the s3a scheme can be used to access S3 buckets. The s3 scheme is automatically aliased to s3a in all Apollo Spark Clusters.

    hashtag
    Public Bucket Access

    To access public S3 buckets, you do not need S3 credentials. The example below shows how to access the public 1000 Genomes bucket in a JupyterLab notebook:

    When the above is run in a notebook, the following is displayed:

    hashtag
    Private Bucket Access

    To access private buckets, see the example code below. The example assumes that a Spark session has been created as shown above.

    Stata in JupyterLab

    Using Stata via JupyterLab, working with project files, and creating datasets with Spark.

    Stataarrow-up-right is a powerful statistics package for data science. On the DNAnexus Platform, Stata commands and functionality can be accessed in Jupyter notebooks via stata_kernelarrow-up-right.

    hashtag
    Before You Begin

    hashtag
    Project License Requirement

    On the DNAnexus Platform, use the JupyterLab app to create and edit Jupyter notebooks.

    circle-info

    You can only run this app within a project that's billed to an account with a license that allows the use of both JupyterLab and HTTPS apps. Contact DNAnexus Salesenvelope if you need to upgrade your license.

    JupyterLab is accessible to all users of the UK Biobank Research Analysis Platform and the Our Future Health Trusted Research Environment. A license is required to access JupyterLab on the DNAnexus Platform. Contact DNAnexus Salesenvelope for more information.

    hashtag
    Stata License Requirement

    To use Stata on the DNAnexus Platform, you need a valid Stata license. Before launching Stata in a project, you must save your license details in a plain text file with the extension .json, per the instructions below, then upload this file to the project's root directory. You only need to do this once per project.

    hashtag
    Creating a Stata License Details File

    Start by creating the file in a text editor, including all the fields shown here, where <user> is your DNAnexus username and <organization> is the org of which you're a member:

    Save the file according to the following format, where <username> is your DNAnexus username: .stataSettings.user-<username>.json

    circle-info

    Some operating systems may not support the naming of files with a "." as the first character. If this is the case, you can rename the .json file after uploading it to your project by hovering over the name of your file and clicking the pencil icon that appears.
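For convenience, the file can also be generated from Python. Below is a minimal sketch that assumes a placeholder username; replace the bracketed values with your actual Stata license details:

```python
# Sketch: write a Stata license details file named
# .stataSettings.user-<username>.json (all values are placeholders).
import json

username = "jsmith"  # your DNAnexus username (placeholder)
license_details = {
    "license": {
        "serialNumber": "<Serial number from Stata>",
        "code": "<Code from Stata>",
        "authorization": "<Authorization from Stata>",
        "user": "<Registered user line 1>",
        "organization": "<Registered user line 2>",
    }
}

filename = f".stataSettings.user-{username}.json"
with open(filename, "w") as f:
    json.dump(license_details, f, indent=2)

print(filename)  # → .stataSettings.user-jsmith.json
```

The resulting file can then be uploaded to the project's root directory as described below.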

    hashtag
    Uploading the Stata License Details File

    Open the project in which you want to use Stata. Upload the Stata license details file to the project's root directory by going to your project's Manage tab, clicking on the Add button on the upper right, and then selecting the Upload data option.

    hashtag
    Secure Indirect Format Option for Shared Projects

    When working in a shared project, you can take an additional step to avoid exposing your Stata license details to project collaborators.

    Create a private project. Then create and save a Stata license details file in that project's root directory, per the instructions above.

    Within the shared project, create and save a Stata license details file in this format, where project-yyyy is the name of the private project, and file-xxxx is the license details file ID, in that private project:

    circle-info

    When working on the Research Analysis Platform, you can only create a private credentials project from the Research Analysis Platform Projects pagearrow-up-right.

    hashtag
    Launching JupyterLab

    1. Open the project in which you want to use Stata. From within the project's Manage tab, click the Start Analysis button.

    2. Select the app JupyterLab with Python, R, Stata, ML, Image Processing.

    3. Click the Run Selected button. If you haven't run this app before, you are prompted to install it. Next, you are taken to the Run Analysis screen.

    circle-info

    The app can take some time to load and start running.

    Once the analysis starts, you see the notification "Running" appear under the name of the app.

    hashtag
    Opening JupyterLab

    Click the Monitor tab heading. This opens a list of running and past jobs. Jobs are shown in reverse chronological order, with the most recently launched at the top. The topmost row should show the job you launched. To open the job and enter the JupyterLab interface, click on the URL shown under Worker URL.

    circle-info

    If you do not see the worker URL, click on the name of the job in the Monitor page.

    hashtag
    Using Stata Within JupyterLab

    Within the JupyterLab interface, open the DNAnexus tab shown at the left edge of the screen.

    Open a new Stata notebook by clicking the Stata tile in the Notebooks section.

    hashtag
    Working with Project Files

    You can download DNAnexus data files to the JupyterLab container from a Stata notebook with:

    Data files in the current project can also be accessed through the /mnt/project folder from a Stata notebook. To load a DTA file:

    To load a CSV file:

    To write a DTA file to the JupyterLab container:

    To write a CSV file to the JupyterLab container:

    To upload a data file from the JupyterLab container to the project, use the following command in a Stata notebook:

    Alternatively, open a new Launcher tab, open Terminal, and run:

    The /mnt/project directory is read-only, so trying to write to it results in an error.

    hashtag
    Creating a Stata Dataset with Spark

    The JupyterLab Spark Cluster app can be used to query and filter DNAnexus datasets, returning a PySpark DataFrame. A PySpark DataFrame can be converted to a pandas DataFrame with:

    The pandas DataFrame can be exported to CSV or Stata DTA files in the JupyterLab container with:
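The conversion and export steps can be sketched end to end without Spark. This is a minimal example assuming pandas is available and using illustrative column names; in a Spark session you would start from spark_df.toPandas() instead of constructing the DataFrame by hand:

```python
# Sketch of the conversion step without Spark: a small pandas DataFrame
# stands in for spark_df.toPandas() (column names are illustrative).
import pandas as pd

pandas_df = pd.DataFrame({"sample_id": ["s1", "s2"], "value": [1.0, 2.0]})
pandas_df.to_stata("data_out.dta")             # Stata DTA format
pandas_df.to_csv("data_out.csv", index=False)  # CSV format
```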

    To upload a data file from the JupyterLab container to the DNAnexus project in the JupyterLab Spark Cluster app, use:

    Once saved to the project, data files can be used in a JupyterLab Stata session using the instructions above.

    JupyterLab Reference

    This page is a reference for most useful operations and features in the JupyterLab environment.

    circle-info

    JupyterLab is accessible to all users of the UK Biobank Research Analysis Platform and the Our Future Health Trusted Research Environment.

    A license is required to access JupyterLab on the DNAnexus Platform. Contact DNAnexus Salesenvelope for more information.

  • On the Run Analysis screen, open the Analysis Inputs tab and click the Stata settings file button.

  • Add your Stata settings file as an input. This is the .json file you created, containing your Stata license details.

  • In the Common section at the bottom of the Analysis Inputs pane, open the Feature dropdown menu and select Stata.

  • Click the Start Analysis button at the top right corner of the screen. This launches the JupyterLab app and takes you to the project's Monitor tab, where you can monitor the app's status as it loads.

    The location of the DNAnexus tab within the JupyterLab interface.
    import pyspark
    sc = pyspark.SparkContext()
    spark = pyspark.sql.SparkSession(sc)
    spark.sql("show databases").show(truncate=False)
    +------------------------------------------------------------+
    |namespace                                                   |
    +------------------------------------------------------------+
    |database_xxxx__brca_pheno                                   |
    |database_yyyy__gwas_vitamind_chr1                           |
    |database_zzzz__meta_data                                    |
    |database_tttt__genomics_180820                              |
    +------------------------------------------------------------+
    db = "database_xxxx__brca_pheno"
    spark.sql(f"SHOW TABLES FROM {db}").show(truncate=False)
    +------------------------------------+-----------+-----------+
    |namespace                           |tableName  |isTemporary|
    +------------------------------------+-----------+-----------+
    |database_xxxx__brca_pheno           |cna        |false      |
    |database_xxxx__brca_pheno           |methylation|false      |
    |database_xxxx__brca_pheno           |mrna       |false      |
    |database_xxxx__brca_pheno           |mutations  |false      |
    |database_xxxx__brca_pheno           |patient    |false      |
    |database_xxxx__brca_pheno           |sample     |false      |
    +------------------------------------+-----------+-----------+
    show databases like "<project_id_pattern>:<database_name_pattern>";
    show databases like "project-*:<database_name>";
    +------------------------------------------------------------+
    |namespace                                                   |
    +------------------------------------------------------------+
    |database_xxxx__brca_pheno                                   |
    |database_yyyy__gwas_vitamind_chr1                           |
    |database_zzzz__meta_data                                    |
    |database_tttt__genomics_180820                              |
    +------------------------------------------------------------+
    db = "database_xxxx__brca_pheno"
    spark.sql(f"SHOW TABLES FROM {db}").show(truncate=False)
    # Create a database
    my_database = "my_database"
    spark.sql("create database " + my_database + " location 'dnax://'")
    spark.sql("create table " + my_database + ".foo (k string, v string) using parquet")
    spark.sql("insert into table " + my_database + ".foo values ('1', '2')")
    spark.sql("select * from " + my_database + ".foo").show()
    import hail as hl
    hl.init(sc=sc)
    # Download example data from 1k genomes project and inspect the matrix table
    hl.utils.get_1kg('data/')
    hl.import_vcf('data/1kg.vcf.bgz').write('data/1kg.mt', overwrite=True)
    mt = hl.read_matrix_table('data/1kg.mt')
    mt.rows().select().show(5)
    vep-GRCh38.json
    {
      "command": [
        "docker",
        "run",
        "-i",
        "-v",
        "/cluster/vep:/root/.vep",
        "dnanexus/dxjupyterlab-vep",
        "./vep",
        "--format",
        "vcf",
        "__OUTPUT_FORMAT_FLAG__",
        "--everything",
        "--allele_number",
        "--no_stats",
        "--cache",
        "--offline",
        "--minimal",
        "--assembly",
        "GRCh38",
        "-o",
        "STDOUT",
        "--check_existing",
        "--dir_cache",
        "/root/.vep/",
        "--dir_plugins",
        "/root/.vep/Plugins/loftee",
        "--fasta",
        "/root/.vep/homo_sapiens/109_GRCh38/Homo_sapiens.GRCh38.dna.toplevel.fa.gz",
        "--plugin",
        "LoF,loftee_path:/root/.vep/Plugins/loftee,human_ancestor_fa:/root/.vep/human_ancestor.fa,conservation_file:/root/.vep/loftee.sql,gerp_bigwig:/root/.vep/gerp_conservation_scores.homo_sapiens.GRCh38.bw"
      ],
      "env": {
        "PERL5LIB": "/root/.vep/Plugins"
      },
      "vep_json_schema": "Struct{assembly_name:String,allele_string:String,ancestral:String,colocated_variants:Array[Struct{aa_allele:String,aa_maf:Float64,afr_allele:String,afr_maf:Float64,allele_string:String,amr_allele: String,amr_maf:Float64,clin_sig:Array[String],end:Int32,eas_allele:String,eas_maf:Float64,ea_allele:String,ea_maf:Float64,eur_allele:String,eur_maf:Float64,exac_adj_allele:String,exac_adj_maf:Float64,exac_allele:      String,exac_afr_allele:String,exac_afr_maf:Float64,exac_amr_allele:String,exac_amr_maf:Float64,exac_eas_allele:String,exac_eas_maf:Float64,exac_fin_allele:String,exac_fin_maf:Float64,exac_maf:Float64,exac_nfe_allele:  String,exac_nfe_maf:Float64,exac_oth_allele:String,exac_oth_maf:Float64,exac_sas_allele:String,exac_sas_maf:Float64,id:String,minor_allele:String,minor_allele_freq:Float64,phenotype_or_disease:Int32,pubmed:            Array[Int32],sas_allele:String,sas_maf:Float64,somatic:Int32,start:Int32,strand:Int32}],context:String,end:Int32,id:String,input:String,intergenic_consequences:Array[Struct{allele_num:Int32,consequence_terms:          Array[String],impact:String,minimised:Int32,variant_allele:String}],most_severe_consequence:String,motif_feature_consequences:Array[Struct{allele_num:Int32,consequence_terms:Array[String],high_inf_pos:String,impact:   String,minimised:Int32,motif_feature_id:String,motif_name:String,motif_pos:Int32,motif_score_change:Float64,strand:Int32,variant_allele:String}],regulatory_feature_consequences:Array[Struct{allele_num:Int32,biotype:   String,consequence_terms:Array[String],impact:String,minimised:Int32,regulatory_feature_id:String,variant_allele:String}],seq_region_name:String,start:Int32,strand:Int32,transcript_consequences:                        Array[Struct{allele_num:Int32,amino_acids:String,appris:String,biotype:String,canonical:Int32,ccds:String,cdna_start:Int32,cdna_end:Int32,cds_end:Int32,cds_start:Int32,codons:String,consequence_terms:Array[String],    
distance:Int32,domains:Array[Struct{db:String,name:String}],exon:String,gene_id:String,gene_pheno:Int32,gene_symbol:String,gene_symbol_source:String,hgnc_id:String,hgvsc:String,hgvsp:String,hgvs_offset:Int32,impact:   String,intron:String,lof:String,lof_flags:String,lof_filter:String,lof_info:String,minimised:Int32,polyphen_prediction:String,polyphen_score:Float64,protein_end:Int32,protein_start:Int32,protein_id:String,             sift_prediction:String,sift_score:Float64,strand:Int32,swissprot:String,transcript_id:String,trembl:String,tsl:Int32,uniparc:String,variant_allele:String}],variant_class:String}"
    }
    # Annotation process relies on "dnanexus/dxjupyterlab-vep" docker container
    # as well as VEP and LoF resources that are pre-installed on every Spark node when
    # HAIL-VEP feature is selected.
    annotated_mt = hl.vep(mt, "file:///mnt/project/vep-GRCh38.json")
    source /home/dnanexus/environment
    source /cluster/dx-cluster.environment
    # read CSV from a public bucket
    df = spark.read.options(delimiter='\t', header='True', inferSchema='True').csv("s3://1000genomes/20131219.populations.tsv")
    df.select(df.columns[:4]).show(10, False)
    # access private data in S3 by first unsetting the default credentials provider
    sc._jsc.hadoopConfiguration().set('fs.s3a.aws.credentials.provider', '')
    
    # replace "redacted" with your keys
    sc._jsc.hadoopConfiguration().set('fs.s3a.access.key', 'redacted')
    sc._jsc.hadoopConfiguration().set('fs.s3a.secret.key', 'redacted')
    df=spark.read.csv("s3a://your_private_bucket/your_path_to_csv")
    df.select(df.columns[:5]).show(10, False)
    {
      "license": {
        "serialNumber": "<Serial number from Stata>",
        "code": "<Code from Stata>",
        "authorization": "<Authorization from Stata>",
        "user": "<Registered user line 1>",
        "organization": "<Registered user line 2>"
      }
    }
    {
      "licenseFile": {
        "$dnanexus_link": {
          "id": "file-xxxx",
          "project": "project-yyyy"
        }
      }
    }
    !dx download project-xxxx:file-yyy
    use /mnt/project/<path>/data_in.dta
    import delimited /mnt/project/<path>/data_in.csv
    save data_out
    export delimited data_out.csv
    !dx upload <file> --destination=<destination>
    dx upload <file> --destination=<destination>
    pandas_df = spark_df.toPandas()
    pandas_df.to_stata("data_out.dta")
    pandas_df.to_csv("data_out.csv")
    %%bash
    dx upload <file>
    hashtag
    Download Files from the Project to the Local Execution Environment

    hashtag
    Bash

    You can download input data from a project using dx download in a notebook cell:

    The %%bash keyword converts the whole cell to a magic cell, which allows you to run bash code in that cell without exiting the Python kernel. See examples of magic commands in the IPython documentationarrow-up-right. The ! prefix achieves the same result:

    Alternatively, the dx command can be executed from the terminal.

    hashtag
    Python

    To download data with Python in the notebook, you can use the download_dxfilearrow-up-right function:

    Check the dxpy helper functionsarrow-up-right for details on how to download files and folders.

    hashtag
    Upload Data from the Session to the Project

    hashtag
    Bash

    Any files from the execution environment can be uploaded to the project using dx upload:

    hashtag
    Python

    To upload data using Python in the notebook, you can use the upload_local_filearrow-up-right function:

    Check the dxpy helper functionsarrow-up-right for details on how to upload files and folders.

    hashtag
    Download and Upload Data to Your Local Machine

    By selecting a notebook or any other file on your computer and dragging it into the DNAnexus project file browser, you can upload the files directly to the project. To download a file, right-click on it and click Download (to local computer).

    You may upload and download data to the local execution environment in a similar way, that is, by dragging and dropping files to the execution file browser or by right-clicking on the files there and clicking Download.

    hashtag
    Use the Terminal

    JupyterLab also provides a terminal, which uses the bash shell by default and lets you execute shell scripts or interact with the platform via the dx toolkit. For example, the following command confirms the current project context:

    Running pwd shows you that the working directory of the execution environment is /opt/notebooks. The JupyterLab server is launched from this directory, which is also the default location of the output files generated in the notebooks.

    To open a terminal window, go to File > New > Terminal or open it from the Launcher (using the "Terminal" box at the bottom). To open a Launcher, select File > New Launcher.

    hashtag
    Install Custom Packages in the Session Environment

    You can install packages with pip, conda, apt-get, and other package managers in the execution environment from the notebook:

    By creating a snapshot, you can start subsequent sessions with these packages pre-installed by providing the snapshot as input.

    hashtag
    Access Public and Private GitHub Repositories from the JupyterLab Terminal

    You can clone public GitHub repositories from the JupyterLab terminal using the git clone command. By placing a private ssh key that's registered with your GitHub account in /root/.ssh/id_rsa, you can also clone private GitHub repositories with git clone and push changes back to GitHub with git push from the JupyterLab terminal.

    Below is a screenshot of a JupyterLab session with a terminal displaying a script that:

    • sets up ssh key to access a private GitHub repository and clones it,

    • clones a public repository,

    • downloads a JSON file from the DNAnexus project,

    • modifies an open-source notebook to convert the JSON file to CSV format,

    • saves the modified notebook to the private GitHub repository,

    • and uploads the results of JSON to CSV conversion back to the DNAnexus project.

    This animation shows the first part of the script in action:
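The key-setup step of such a script can be sketched as follows. The paths and key source are illustrative (in the JupyterLab container the home directory is /root, so the default resolves to /root/.ssh/id_rsa):

```shell
# Sketch: prepare the ssh directory so git can authenticate to GitHub.
# The key itself must come from your own GitHub account.
SSH_DIR="${SSH_DIR:-$HOME/.ssh}"
mkdir -p "$SSH_DIR"
chmod 700 "$SSH_DIR"
# Place your private key here, e.g. after fetching it with `dx download`:
# cp my_github_key "$SSH_DIR/id_rsa"
touch "$SSH_DIR/id_rsa"
chmod 600 "$SSH_DIR/id_rsa"
# Then clone and push as usual:
# git clone git@github.com:<org>/<private-repo>.git
echo "key file ready at $SSH_DIR/id_rsa"
```

ssh refuses keys with permissive modes, hence the chmod 700/600 steps before the first clone.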

    hashtag
    Run Notebooks Non-Interactively

    A command can be run in the JupyterLab Docker container without starting an interactive JupyterLab server. To do that, provide the cmd input and additional input files using the in input file array. The command runs in the directory where the JupyterLab server is started and notebooks are run, that is, /opt/notebooks/. Any output files generated in this directory are uploaded to the project and returned in the out output.

    The cmd input makes it possible to use the papermill tool, pre-installed in the JupyterLab environment, which executes notebooks non-interactively. For example, to execute all the cells in a notebook and produce an output notebook:

    where notebook.ipynb is the input notebook to papermill, which needs to be passed in the in input, and output_notebook.ipynb is the name of the output notebook, which stores the result of the cells' execution. The output is uploaded to the project at the end of the app execution.

    If the snapshot parameter is specified, execution of cmd takes place in the specified Docker container. The duration argument is ignored when running the app with cmd. To limit the runtime, run the app from the command line with the --extra-args flag, for example, dx run dxjupyterlab --extra-args '{"timeoutPolicyByExecutable": {"app-xxxx": {"*": {"hours": 1}}}}'.

    If cmd is not specified, the in parameter is ignored and the app's output is an empty array.

    hashtag
    Use newer NVIDIA GPU-accelerated software

    If you are trying to use newer NVIDIA GPU-accelerated software, you may find that the kernel-mode NVIDIA GPU driver (nvidia.ko) installed outside the JupyterLab environment does not support the newer CUDA version required by your application. In that case, you can install NVIDIA Forward Compatibilityarrow-up-right packages to use the newer CUDA version by following the steps below in a JupyterLab terminal.

    hashtag
    Session Inactivity

    After 15 to 30 minutes of inactivity in the JupyterLab browser tabs, the system logs you out automatically from the JupyterLab session and displays a "Server Connection Error" message. To re-enter the JupyterLab session, reload the JupyterLab webpage and log into the platform to be redirected to the JupyterLab session.


    Exploring and Querying Datasets

    circle-info

    A license is required to access Spark functionality on the DNAnexus Platform. Contact DNAnexus Salesenvelope for more information.

    hashtag
    Extracting Data From a Dataset With Spark

    %%bash
    dx download input_data/reads.fastq
    ! dx download input_data/reads.fastq
    import dxpy
    dxpy.download_dxfile(dxid='file-xxxx',
                         filename='unique_name.txt')
    %%bash
    dx upload Readme.ipynb
    import dxpy
    dxpy.upload_local_file('variants.vcf')
    $ dx pwd
    MyProject:/
    %%bash
    pip install torch
    pip install torchvision
    conda install -c conda-forge opencv
    my_cmd="papermill notebook.ipynb output_notebook.ipynb"
    dx run dxjupyterlab -icmd="$my_cmd" -iin="notebook.ipynb"
    # nvidia-smi
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 470.129.06   Driver Version: 470.129.06   CUDA Version: 11.4     |
    |-------------------------------+----------------------+----------------------+
    // Let's upgrade CUDA 11.4 to 12.5
    # apt-get update
    # apt-get -y install cuda-toolkit-12-5 cuda-compat-12-5
    # echo /usr/local/cuda/compat > /etc/ld.so.conf.d/NVIDIA-compat.conf
    # ldconfig
    # nvidia-smi
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 470.256.02   Driver Version: 470.256.02   CUDA Version: 12.5     |
    |-------------------------------+----------------------+----------------------+
    // CUDA 12.5 is now usable from terminal and notebooks
    The dx commands extract_dataset and extract_assay germline let you either retrieve the data dictionary of a dataset or extract the underlying data described by that dictionary. You can also use these commands to get dataset metadata, such as the names and titles of entities and fields, or to list all relevant assays in a dataset.

    Often, you can retrieve data without using Spark, so extra compute resources are not required (see the example OpenBio notebooksarrow-up-right). However, if you need more compute power, such as when working with complex data models, large datasets, or large volumes of extracted data, you can use a private Spark resource and scale it as needed to avoid timeouts.

    If you use the --sql flag, the command returns a SQL statement (as a string) that you can use in a standalone Spark-enabled application, such as JupyterLab.

    hashtag
    Initiating a Spark Session

    The most common way to use Spark on the DNAnexus Platform is via a Spark-enabled JupyterLab notebook.

    After creating a Jupyter notebook within a project, enter the commands shown below to start a Spark session.

    Python:

    R:

    hashtag
    Executing SQL Queries

    Once you've initiated a Spark session, you can run SQL queries on the database within your notebook, with the results written to a Spark DataFrame:

    Python:

    R:

    hashtag
    Query to Extract Data From Database Using extract_dataset

    Python:

    Where dataset is the record-id or the path to the dataset or cohort, for example, "record-abc123" or "/mydirectory/mydataset.dataset."

    R:

    Where dataset is the record-id or the path to the dataset or cohort.

    hashtag
    Query to Filter and Extract Data from Database Using extract_assay germline

    Python:

    R:

    In the examples above, dataset is the record-id or the path to the dataset or cohort, for example, record-abc123 or /mydirectory/mydataset.dataset. allele_filter.json is a file containing a JSON object with filters for the --retrieve-allele command. For more information, refer to the notebooks in the DNAnexus OpenBio dx-toolkit examplesarrow-up-right.

    hashtag
    Run SQL Query to Extract Data

    Python:

    R:

    hashtag
    Best Practices

    • When querying large datasets, such as those containing genomic data, ensure that your Spark cluster is scaled up appropriately, with enough nodes to parallelize across.

    • Initialize your Spark session only once per Jupyter session. If you initialize the Spark session in multiple notebooks in the same Jupyter job (for example, running notebook 1 and then notebook 2, or running a notebook from start to finish multiple times), the Spark session becomes corrupted and you need to restart the affected notebook's kernel. As a best practice, shut down the kernel of any notebook you are not using before running a second notebook in the same session.

    • If you want to use a database outside your project's scope, you must refer to it using its unique database name (typically this looks something like database_fjf3y28066y5jxj2b0gz4g85__metabric_data) as opposed to the database name (metabric_data in this case).
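For illustration, referencing a table through a unique database name might be assembled like this. The database and table names below are hypothetical, and spark.sql would run the query inside a Spark-enabled notebook:

```python
# Hypothetical unique database name (the <uid>__<name> form) vs. the
# short name; only the unique form resolves outside the current project.
unique_db = "database_fjf3y28066y5jxj2b0gz4g85__metabric_data"
query = f"SELECT * FROM {unique_db}.patient LIMIT 10"
# df = spark.sql(query)  # inside a Spark-enabled JupyterLab session
print(query)
```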

    import pyspark
    sc = pyspark.SparkContext()
    spark = pyspark.sql.SparkSession(sc)
    install.packages("sparklyr")
    library(sparklyr)
    port <- Sys.getenv("SPARK_MASTER_PORT")
    master <- paste("spark://master:", port, sep = '')
    sc = spark_connect(master)
    retrieve_sql = 'select .... from .... '
    df = spark.sql(retrieve_sql)
    library(DBI)
    retrieve_sql <- 'select .... from .... '
    df = dbGetQuery(sc, retrieve_sql)
    import subprocess
    cmd = ["dx", "extract_dataset", dataset, "--fields", "entity1.field1, entity1.field2, entity2.field4", "--sql", "-o", "extracted_data.sql"]
    subprocess.check_call(cmd)
    cmd <- paste("dx extract_dataset", dataset, " --fields", "entity1.field1, entity1.field2, entity2.field4", "--sql", "-o extracted_data.sql")
    system(cmd)
    import subprocess
    cmd = ["dx", "extract_assay", "germline", dataset, "--retrieve-allele", "allele_filter.json", "--sql", "-o", "extract_allele.sql"]
    subprocess.check_call(cmd)
    cmd <- paste("dx extract_assay", "germline", dataset, "--retrieve-allele", "allele_filter.json", "--sql", "-o extracted_allele.sql")
    system(cmd)
    with open("extracted_data.sql", "r") as file:
        retrieve_sql = file.read().strip()
    df = spark.sql(retrieve_sql.strip(";"))
    install.packages("tidyverse")
    library(readr)
    retrieve_sql <-read_file("extracted_data.sql")
    retrieve_sql <- gsub("[;\n]", "", retrieve_sql)
    df <- dbGetQuery(sc, retrieve_sql)