Parallel by Chr (py)
This applet tutorial will perform a SAMtools count using parallel threads.
To take full advantage of the scalability that cloud computing offers, our scripts have to split their work into units that can run concurrently. This applet tutorial will:
Install SAMtools
Download BAM file
Count regions in parallel
How is the SAMtools dependency provided?
The SAMtools dependency is resolved by declaring an Apt-Get package in the dxapp.json runSpec.execDepends.
For additional information, please refer to the execDepends documentation.
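As a rough illustration of the declaration described above, the sketch below builds only the relevant dxapp.json fragment and prints it as JSON. The package name samtools is an assumption, and every other runSpec field that a real applet needs is omitted.

```python
import json

# Hypothetical fragment of dxapp.json: only the dependency declaration is shown.
# Assumption: the Apt package is named "samtools"; other runSpec fields are omitted.
run_spec_fragment = {
    "runSpec": {
        "execDepends": [
            {"name": "samtools"}
        ]
    }
}

# Print the fragment as it would appear inside dxapp.json.
print(json.dumps(run_spec_fragment, indent=2))
```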
Download BAM file
The dxpy.download_all_inputs() function downloads all input files into the /home/dnanexus/in directory. A folder is created for each input, and the file(s) are downloaded into that folder. For convenience, dxpy.download_all_inputs() returns a dictionary containing the following keys:
<var>_path (string): full absolute path to where the file was downloaded.
<var>_name (string): name of the file, including extension.
<var>_prefix (string): name of the file minus the longest matching pattern found in the dxapp.json I/O pattern field.
The path, name, and prefix key-value pattern is repeated for all applet file class inputs specified in the dxapp.json. In this example, our dictionary has the following key-value pairs:
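To make the shape of that dictionary concrete, the sketch below builds the three keys by hand for a single file input. It does not call dxpy: the input name mappings_bam, the file name sample.bam, the pattern *.bam, and the helper describe_input are all hypothetical, and the prefix logic is only an approximation of the behavior described above.

```python
import os

def describe_input(var, local_path, patterns):
    """Build the _path, _name and _prefix keys for one file input (illustration only)."""
    name = os.path.basename(local_path)
    # Treat each pattern as "*<suffix>" and keep the suffixes that match the file name.
    suffixes = [p.lstrip("*") for p in patterns if name.endswith(p.lstrip("*"))]
    # Strip the longest matching suffix, if any, to obtain the prefix.
    longest = max(suffixes, key=len, default="")
    prefix = name[: len(name) - len(longest)] if longest else name
    return {
        var + "_path": local_path,
        var + "_name": name,
        var + "_prefix": prefix,
    }

# Hypothetical input called "mappings_bam" with the I/O pattern "*.bam".
inputs = describe_input(
    "mappings_bam", "/home/dnanexus/in/mappings_bam/sample.bam", ["*.bam"]
)
for key in sorted(inputs):
    print(key, "=", inputs[key])
```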
Count Regions in Parallel
Before we can perform our parallel SAMtools count, we must determine the workload for each thread. We arbitrarily set the number of workers to 10 and set the workload per thread to 1 chromosome at a time. There are various ways to achieve multithreaded processing in Python. For the sake of simplicity, we use multiprocessing.dummy, a wrapper around Python’s threading module.
Each worker creates a string to be called in a subprocess.Popen call. We use the multiprocessing.dummy.Pool.map(<func>, <iterable>) function to call the helper function run_cmd for each string in the iterable of view commands. Because subprocess.Popen does not raise an exception when a command exits with a non-zero status, we will not be alerted to any failed processes. We therefore check the results returned by the closed pool in the verify_pool_status helper function.
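The steps above can be sketched roughly as follows. This is not the applet's actual source: the bodies of run_cmd and verify_pool_status are assumptions, run_all and count_regions are hypothetical helpers, and commands are passed as argument lists rather than shell strings. The region names are expected to come from the caller, and the BAM file is assumed to be indexed.

```python
import subprocess
from multiprocessing.dummy import Pool as ThreadPool

def run_cmd(cmd):
    """Run one command and return (stdout, stderr, exit_code) without raising."""
    proc = subprocess.Popen(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True
    )
    stdout, stderr = proc.communicate()
    return stdout, stderr, proc.returncode

def run_all(cmds, workers=10):
    """Map run_cmd over the commands with a thread pool, preserving input order."""
    pool = ThreadPool(workers)
    try:
        return pool.map(run_cmd, cmds)
    finally:
        pool.close()
        pool.join()

def verify_pool_status(results):
    """Raise if any command exited with a non-zero status."""
    failures = [stderr for _, stderr, code in results if code != 0]
    if failures:
        raise RuntimeError(
            "{} command(s) failed: {}".format(len(failures), failures)
        )

def count_regions(bam_path, regions, workers=10):
    """Count reads per region with one 'samtools view -c' command per region."""
    cmds = [["samtools", "view", "-c", bam_path, region] for region in regions]
    results = run_all(cmds, workers)
    verify_pool_status(results)
    return [int(stdout.strip()) for stdout, _, _ in results]
```

Threads are sufficient here because each worker spends its time waiting on an external samtools process rather than executing Python code.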
Important: In this example we use subprocess.Popen to process our commands and verify the results in verify_pool_status. In general, it is considered good practice to use Python’s built-in subprocess convenience functions. In this case, subprocess.check_call would achieve the same goal.
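A minimal sketch of that alternative is shown below. It uses subprocess.check_output, a sibling of subprocess.check_call that also captures stdout, since the count has to be read back from the command. Both raise subprocess.CalledProcessError on a non-zero exit status. The helper names run_cmd_checked and run_all_checked are hypothetical.

```python
import subprocess
from multiprocessing.dummy import Pool as ThreadPool

def run_cmd_checked(cmd):
    """Run one command and return its stdout; raise CalledProcessError on failure."""
    return subprocess.check_output(cmd, text=True)

def run_all_checked(cmds, workers=10):
    """Map run_cmd_checked over the commands; a worker's exception reaches the caller."""
    pool = ThreadPool(workers)
    try:
        # Pool.map re-raises an exception raised inside a worker.
        return pool.map(run_cmd_checked, cmds)
    finally:
        pool.close()
        pool.join()
```

With this approach a separate verification step is no longer needed, because a failed command surfaces as an exception in the calling thread.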
Gather Results
Each worker returns a read count for just one region in the BAM file. We sum these counts and return the total as the job output. We use the dx-toolkit Python SDK’s dxpy.upload_local_file function to upload our result file and generate a corresponding DXFile. For Python, job outputs have to be a dictionary of key-value pairs, with the keys being job output names as defined in the dxapp.json and the values being the output values for the corresponding output classes. For files, the output type is a DXLink. We use the dxpy.dxlink function to generate the appropriate DXLink value.
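The gathering step might look like the sketch below. The output name counts_txt, the file name read_count.txt, and the helpers sum_counts, write_result and build_output are hypothetical. dxpy is imported inside build_output so that the summing and writing helpers can be run without the SDK installed.

```python
def sum_counts(counts):
    """Sum per-region counts given as integers or numeric strings."""
    return sum(int(c) for c in counts)

def write_result(total, filename="read_count.txt"):
    """Write the total to a local text file and return the file name."""
    with open(filename, "w") as handle:
        handle.write("Total reads: {}\n".format(total))
    return filename

def build_output(filename, output_name="counts_txt"):
    """Upload the result file and return the job output dictionary."""
    import dxpy  # requires the dx-toolkit Python SDK and a DNAnexus job context

    dxfile = dxpy.upload_local_file(filename)
    # Keys must match the output names declared in dxapp.json (assumed here).
    return {output_name: dxpy.dxlink(dxfile)}
```

Inside the applet's entry point, the dictionary returned by build_output would be returned as the job output.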