Parallel by Region (py)

This applet tutorial performs a SAMtools count on a BAM file, parallelized by region across worker processes.

View full source code on GitHub

To take full advantage of the scalability that cloud computing offers, your scripts must be written to split work across the resources available on a worker. This applet tutorial will:

  1. Install SAMtools

  2. Download the BAM file

  3. Split the workload by region

  4. Count regions in parallel

How is the SAMtools dependency provided?

The SAMtools dependency is resolved by declaring an Apt-Get package in the dxapp.json file's runSpec.execDepends field:

{
  "runSpec": {
    ...
    "execDepends": [
      {"name": "samtools"}
    ]
  }
}

Download Inputs

This applet downloads all inputs at once using dxpy.download_all_inputs:
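This call typically appears near the top of the applet's main entry point. The sketch below assumes a file input named mappings_bam; the real applet's input names may differ.

import dxpy

@dxpy.entry_point("main")
def main(**job_inputs):
    # Fetch every file input to the worker's local disk in a single call.
    # The returned dict contains keys such as "<input name>_path",
    # "<input name>_name", and "<input name>_prefix".
    inputs = dxpy.download_all_inputs()

    # "mappings_bam" is a placeholder input name for this sketch; depending
    # on the input class, the value may be a single path or a list of paths.
    bam_path = inputs["mappings_bam_path"]

    # ... the workload is split and counted in the later steps ...
    return {}

dxpy.run()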

Split Workload

This tutorial processes data in parallel using the Python multiprocessing module with a straightforward pattern shown below:
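The applet's exact code is not reproduced here, but a minimal sketch of the Pool pattern, assuming a hypothetical view_region worker that wraps samtools view -c, looks like this:

import multiprocessing
import subprocess

def view_region(bam_path, region):
    # Each worker process counts the reads in one region of the BAM file.
    return subprocess.check_output(
        ["samtools", "view", "-c", bam_path, region],
        universal_newlines=True).strip()

def count_regions(bam_path, regions):
    # Start one worker process per available CPU core.
    pool = multiprocessing.Pool(processes=multiprocessing.cpu_count())
    results = [pool.apply_async(view_region, (bam_path, region))
               for region in regions]
    pool.close()
    pool.join()
    # get() returns each worker's count (or re-raises its exception).
    return [int(result.get()) for result in results]

Because apply_async dispatches work without blocking, every region starts counting as soon as a worker process is free.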

This convenient pattern allows you to quickly orchestrate jobs on a worker. For a more detailed overview of the multiprocessing module, see the Python docs.

The applet script includes helper functions to manage the workload. One helper is run_cmd, which manages subprocess calls:
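The helper is not shown in full here; a sketch consistent with the behavior described later in this tutorial, returning stdout, stderr, and the exit code, might look like:

import subprocess

def run_cmd(cmd):
    # Run a shell command, capturing its output so the caller can verify
    # the exit code and report any error message from the tool.
    proc = subprocess.Popen(
        cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE,
        universal_newlines=True)
    stdout, stderr = proc.communicate()
    return stdout, stderr, proc.returncode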

Before splitting the workload, determine which regions are present in the BAM input file. This initial parsing is handled in the parse_sam_header_for_region function:
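As an illustration of the idea (not the applet's exact implementation), the region names can be pulled from the SN: fields of the @SQ lines that samtools view -H prints, reusing the run_cmd sketch above:

def parse_sam_header_for_region(bam_path):
    # List the reference sequences declared in the BAM header and return
    # the region names found in the SN: field of each @SQ line.
    header, _err, _code = run_cmd("samtools view -H {0}".format(bam_path))
    regions = []
    for line in header.splitlines():
        if line.startswith("@SQ"):
            for field in line.split("\t"):
                if field.startswith("SN:"):
                    regions.append(field[len("SN:"):])
    return regions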

Once the workload is split and processing has started, wait and review the status of each Pool worker. Then, merge and output the results.

The run_cmd function returns a tuple containing the stdout, stderr, and exit code of the subprocess call. These outputs are collected from the workers and parsed to determine whether each run passed or failed.
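Assuming each region was dispatched to the Pool with run_cmd and therefore yields that (stdout, stderr, exit code) tuple, a sketch of the verification and merge step could look like the following; summing the per-region counts into a single total is an illustrative choice:

import dxpy

def merge_counts(async_results):
    # Block on each worker, check its exit code, and sum the region counts.
    total = 0
    for result in async_results:
        stdout, stderr, exit_code = result.get()
        if exit_code != 0:
            raise dxpy.AppError("samtools view failed: {0}".format(stderr))
        total += int(stdout)
    return total

The merged total could then, for example, be written to a file and uploaded with dxpy.upload_local_file before being returned as the applet's output.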
