Parallel by Region (py)
This applet tutorial performs a SAMtools read count per region, running the counts in parallel worker processes.
View full source code on GitHub
To take full advantage of the scalability that cloud computing offers, your scripts must split work into pieces that can run concurrently. This applet tutorial shows you how to:
- Install SAMtools
- Download the BAM file
- Split the workload
- Count regions in parallel
How is the SAMtools dependency provided?
The SAMtools dependency is resolved by declaring it as an APT package in the `runSpec.execDepends` field of the applet's `dxapp.json`:
```json
{
  "runSpec": {
    ...
    "execDepends": [
      {"name": "samtools"}
    ]
  }
}
```

Download Inputs
This applet downloads all inputs at once using dxpy.download_all_inputs:
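As a minimal sketch of this step (the entry-point name and input handling are assumptions, not the applet's exact code; this only runs inside a DNAnexus execution environment):

```python
import dxpy

@dxpy.entry_point("main")
def main(**job_inputs):
    # Download every declared job input in one call; the returned dict
    # includes the local paths the files were written to on the worker.
    inputs = dxpy.download_all_inputs()
    ...
```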
Split workload
Using the Python multiprocessing module, you can split the workload into multiple processes for parallel execution:
With this pattern, you can quickly orchestrate jobs on a worker. For a more detailed overview of the multiprocessing module, visit the Python docs.
Specific helpers are created in the applet script to manage the workload. One helper you may have seen before is run_cmd. This function manages the subprocess calls:
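A plausible shape for such a helper is sketched below (the applet's actual `run_cmd` may differ in details):

```python
import subprocess

def run_cmd(cmd):
    """Run a shell command and return its (stdout, stderr, exit code)."""
    proc = subprocess.Popen(
        cmd,
        shell=True,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    stdout, stderr = proc.communicate()
    return stdout, stderr, proc.returncode

out, err, code = run_cmd("echo 42")
```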
Before the workload can be split, you need to identify the regions present in the BAM input file. This initial parsing is handled in the parse_sam_header_for_region function:
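In the applet the header comes from `samtools view -H <bam>`; the parsing itself can be sketched as follows (the function body is an assumption based on the SAM header format, not the applet's exact code):

```python
def parse_sam_header_for_region(header_text):
    """Return the reference (region) names listed in @SQ header lines."""
    regions = []
    for line in header_text.splitlines():
        if line.startswith("@SQ"):
            for field in line.split("\t"):
                if field.startswith("SN:"):
                    regions.append(field[3:])
    return regions

# Example header in the form produced by "samtools view -H":
header = (
    "@HD\tVN:1.6\tSO:coordinate\n"
    "@SQ\tSN:chr1\tLN:248956422\n"
    "@SQ\tSN:chr2\tLN:242193529\n"
)
```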
Once the workload is split and processing has started, wait for each Pool worker to finish and review its status, then merge and output the results.
The run_cmd function returns a tuple containing the stdout, stderr, and exit code of the subprocess call. These worker outputs are parsed to determine whether each run succeeded or failed.
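Assuming each worker returned a `(stdout, stderr, exit_code)` tuple from `run_cmd`, the check-and-merge step might look like this (the sample results are illustrative, not real applet output):

```python
# Illustrative worker results: (stdout, stderr, exit_code) per region,
# where stdout holds the read count printed by "samtools view -c".
results = [("1500\n", "", 0), ("2300\n", "", 0), ("700\n", "", 0)]

# Fail the job if any subprocess exited non-zero.
failures = [r for r in results if r[2] != 0]
if failures:
    raise RuntimeError("One or more region counts failed: {}".format(failures))

# Otherwise merge the per-region counts into a single total.
total_count = sum(int(out.strip()) for out, _, _ in results)
```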