Parallel by Region (py)
This applet tutorial performs a SAMtools count on each region of a BAM file, running the counts in parallel with Python worker processes.
To take full advantage of the scalability that cloud computing offers, your scripts must divide their work so that it can run in parallel. This applet tutorial covers the following steps:
Install SAMtools
Download BAM file
Split workload
Count regions in parallel
How is the SAMtools dependency provided?
The SAMtools dependency is resolved by declaring an APT package in the runSpec.execDepends field of dxapp.json:
{
  "runSpec": {
    ...
    "execDepends": [
      {"name": "samtools"}
    ]
  }
}
Download Inputs
This applet downloads all inputs at once using dxpy.download_all_inputs.
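A minimal sketch of the download step, not the applet's actual code. It assumes the applet declares a file input named mappings_bam (a hypothetical name; the real one comes from the inputSpec in dxapp.json) and that dxpy.download_all_inputs returns a dictionary with a "<input>_path" helper key. The exact shape of that value is treated as an assumption, so the helper accepts either a single path or a list of paths.

```python
def first_local_path(inputs, input_name):
    """Return the first local path recorded for `input_name`.

    Assumes `inputs` is the dictionary returned by dxpy.download_all_inputs(),
    with a "<input_name>_path" key holding a path or a list of paths.
    """
    value = inputs[input_name + "_path"]
    if isinstance(value, (list, tuple)):
        return value[0]
    return value


def download_bam(input_name="mappings_bam"):
    """Download every applet input and return the local BAM path.

    `mappings_bam` is a hypothetical input name used for illustration.
    """
    # Imported lazily so first_local_path can be used without dxpy installed.
    import dxpy

    inputs = dxpy.download_all_inputs()
    return first_local_path(inputs, input_name)
```

Inside a job, download_bam() would be called once at the start of the entry point, and the returned path passed on to the SAMtools commands.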
Split workload
This tutorial processes data in parallel using the Python multiprocessing module, following a straightforward pattern: a Pool of worker processes, each of which handles part of the workload.
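The pattern can be sketched as follows. This is an illustration under assumptions rather than the applet's code: the function name parallel_map is hypothetical, and the choice of one worker process per CPU is an assumed default.

```python
import multiprocessing


def parallel_map(func, items, processes=None):
    """Apply `func` to every item using a pool of worker processes.

    `func` must be picklable (a module-level function or a builtin).
    Results come back in the same order as `items`.
    """
    # Assumed default: one worker process per available CPU.
    if processes is None:
        processes = multiprocessing.cpu_count()

    pool = multiprocessing.Pool(processes=processes)
    try:
        results = pool.map(func, items)  # blocks until every item is done
    finally:
        pool.close()  # no more work will be submitted
        pool.join()   # wait for the worker processes to exit
    return results
```

On platforms where worker processes are started with spawn rather than fork, call parallel_map from under an `if __name__ == "__main__":` guard.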
This convenient pattern allows you to quickly orchestrate jobs on a worker. For a more detailed overview of the multiprocessing module, visit the Python docs.
The applet script includes helper functions to manage the workload. One helper is run_cmd, which manages subprocess calls.
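A helper of this kind might look like the sketch below. It is a reconstruction from the description in this tutorial (a function that returns stdout, stderr, and the exit code), not necessarily the applet's exact implementation. The usage line runs a Python one-liner as the child process so that the example does not depend on SAMtools.

```python
import subprocess
import sys


def run_cmd(cmd_arr):
    """Run a command given as a list of arguments.

    Returns a (stdout, stderr, exit_code) tuple; stdout and stderr are bytes.
    Raises OSError if the executable (cmd_arr[0]) cannot be started.
    """
    proc = subprocess.Popen(
        cmd_arr,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    # communicate() reads both pipes to completion and waits for the exit.
    stdout, stderr = proc.communicate()
    return stdout, stderr, proc.returncode


# Usage: run a small Python one-liner as the child process.
out, err, code = run_cmd([sys.executable, "-c", "print(21 * 2)"])
```

Because run_cmd is a module-level function that takes a single argument, it can be handed directly to a multiprocessing Pool along with a list of commands.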
Before splitting the workload, determine what regions are present in the BAM input file. This initial parsing is handled in the parse_sam_header_for_region function.
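One way to sketch this step is to read the header with samtools view -H and collect the SN field of every @SQ line. The helpers regions_from_header and build_count_cmds are hypothetical names introduced here for illustration, and the body of parse_sam_header_for_region is an assumption about how such a function could work.

```python
import subprocess


def regions_from_header(header_text):
    """Return the reference sequence names listed in a SAM header."""
    regions = []
    for line in header_text.splitlines():
        # Only @SQ lines describe reference sequences.
        if not line.startswith("@SQ"):
            continue
        for field in line.split("\t")[1:]:
            if field.startswith("SN:"):
                regions.append(field[3:])
    return regions


def parse_sam_header_for_region(bam_path):
    """Read the header of `bam_path` with SAMtools and list its regions."""
    header = subprocess.check_output(["samtools", "view", "-H", bam_path])
    return regions_from_header(header.decode("utf-8"))


def build_count_cmds(bam_path, regions):
    """Build one `samtools view -c` command per region.

    Region queries require the BAM file to be indexed.
    """
    return [["samtools", "view", "-c", bam_path, region] for region in regions]
```

Each command in the resulting list is an independent unit of work, which is what makes the counts suitable for a worker pool.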
Once the workload is split and processing has started, wait and review the status of each Pool worker. Then, merge and output the results.
The run_cmd function returns a tuple containing the stdout, stderr, and exit code of the subprocess call. These outputs are parsed from the workers to determine whether the run failed or passed.
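The check and merge could be sketched as below, assuming each worker's stdout holds a single integer (the output of a count command) and that results arrive in the same order as the regions. The function names are hypothetical, and a generic RuntimeError stands in for whatever error type the applet raises, which keeps the sketch free of dxpy.

```python
def verify_pool_status(proc_tuples):
    """Raise RuntimeError if any worker reported a non-zero exit code."""
    errors = [
        stderr.decode("utf-8", errors="replace").strip()
        for _stdout, stderr, exit_code in proc_tuples
        if exit_code != 0
    ]
    if errors:
        raise RuntimeError("worker failures:\n" + "\n".join(errors))


def merge_counts(regions, proc_tuples):
    """Pair each region with its count and return (per_region, total)."""
    per_region = {
        region: int(stdout.decode("utf-8").strip())
        for region, (stdout, _stderr, _code) in zip(regions, proc_tuples)
    }
    return per_region, sum(per_region.values())
```

Calling verify_pool_status before merge_counts ensures that a failed worker is reported as an error instead of surfacing later as a confusing integer-parsing failure.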