# Parallel by Chr (py)

[View full source code on GitHub](https://github.com/dnanexus/dnanexus-example-applets/tree/master/Tutorials/python/samtools_count_para_chr_subprocess_py)

To take full advantage of the scalability that cloud computing offers, your scripts must be written to split their workload across parallel workers. This applet tutorial shows you how to:

1. Install SAMtools
2. Download a BAM file
3. Count regions in parallel

## How is the SAMtools dependency provided?

The SAMtools dependency is resolved by declaring an [Apt-Get](https://manpages.ubuntu.com/manpages/xenial/man8/apt-get.8.html) package in the `dxapp.json` `runSpec.execDepends`.

```json
  "runSpec": {
    ...
    "execDepends": [
      {"name": "samtools"}
    ]
  }
```

For additional information, refer to the [`execDepends` documentation](https://documentation.dnanexus.com/getting-started/bash/git-dependency#how-is-samtools-called-in-the-src-script).

## Download BAM file

The `dxpy.download_all_inputs()` function downloads all input files into the `/home/dnanexus/in` directory. A folder is created for each input and the files are downloaded to that directory. For convenience, the `dxpy.download_all_inputs` function returns a dictionary containing the following keys:

* `<var>_path` (**string**): full absolute path to where the file was downloaded.
* `<var>_name` (**string**): name of the file, including extension.
* `<var>_prefix` (**string**): name of the file minus the longest matching pattern found in the `dxapp.json` I/O pattern field.

The path, name, and prefix key-value pattern is repeated for all applet file class inputs specified in the `dxapp.json`. In this example, the dictionary has the following key-value pairs:

```
{
    mappings_bam_path: [u'/home/dnanexus/in/mappings_bam/SRR504516.bam']
    mappings_bam_name: [u'SRR504516.bam']
    mappings_bam_prefix: [u'SRR504516']
    index_file_path: [u'/home/dnanexus/in/index_file/SRR504516.bam.bai']
    index_file_name: [u'SRR504516.bam.bai']
    index_file_prefix: [u'SRR504516']
}
```

```python
inputs = dxpy.download_all_inputs()
shutil.move(inputs['mappings_bam_path'][0], os.getcwd())
```

## Count Regions in Parallel

Before performing the parallel SAMtools count, determine the workload for each thread. The number of workers is arbitrarily set to `10` and the workload per thread is set to `1` chromosome at a time. Python offers multiple ways to achieve multithreaded processing. For the sake of simplicity, use [`multiprocessing.dummy`](https://docs.python.org/3/library/multiprocessing.html), a wrapper around Python's threading module.

```python
input_bam = inputs['mappings_bam_name'][0]

bam_to_use = create_index_file(input_bam)
print("Dir info:")
print(os.listdir(os.getcwd()))

regions = parseSAM_header_for_region(bam_to_use)

view_cmds = [
    create_region_view_cmd(bam_to_use, region)
    for region
    in regions]

print('Parallel counts')
t_pools = ThreadPool(10)
results = t_pools.map(run_cmd, view_cmds)
t_pools.close()
t_pools.join()

verify_pool_status(results)
```
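The helper functions called above are defined elsewhere in the applet source. As an illustrative sketch of what they might do (the implementations below are assumptions for this tutorial, not the applet's actual code), the regions can be read from the `@SQ` lines of the BAM header, and each view command is just a `samtools view -c` string:

```python
import subprocess


def parseSAM_header_for_region(bam_name):
    """Return the reference (chromosome) names listed in the BAM header."""
    header = subprocess.check_output(
        ["samtools", "view", "-H", bam_name]).decode()
    regions = []
    for line in header.splitlines():
        if line.startswith("@SQ"):          # one @SQ line per reference
            for field in line.split("\t"):
                if field.startswith("SN:"):  # SN: holds the reference name
                    regions.append(field[3:])
    return regions


def create_region_view_cmd(bam_name, region):
    """Build the command string that counts reads in a single region."""
    return "samtools view -c {0} {1}".format(bam_name, region)
```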

Each element of `view_cmds` is a command *string* to be executed in a `subprocess.Popen` call. The `multiprocessing.dummy.Pool.map(<func>, <iterable>)` function applies the helper function `run_cmd` to each *string* in the iterable of *view commands*. Because `subprocess.Popen` does not raise an exception when a command exits with a non-zero status, failed workers are detected afterwards in the `verify_pool_status` helper function.
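A sketch of what `run_cmd` might look like, assuming (as `verify_pool_status` expects) that it returns a `(stdout, stderr, returncode)` tuple for each command:

```python
import shlex
import subprocess


def run_cmd(cmd):
    """Run a command string; return (stdout, stderr, returncode)."""
    proc = subprocess.Popen(
        shlex.split(cmd),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True)
    stdout, stderr = proc.communicate()  # blocks until the command exits
    return (stdout, stderr, proc.returncode)
```

Returning the tuple rather than raising lets `Pool.map` finish every region before any failures are reported.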

```python
def verify_pool_status(proc_tuples):
    """Raise if any worker's subprocess exited with a non-zero status."""
    err_msgs = []
    for proc in proc_tuples:
        if proc[2] != 0:  # proc = (stdout, stderr, returncode)
            err_msgs.append(proc[1])
    if err_msgs:
        raise dxpy.exceptions.AppInternalError("\n".join(err_msgs))
```

**Important:** In this example, `subprocess.Popen` is used so that each worker's output and exit status can be collected and checked later in `verify_pool_status`. In general, it is considered good practice to use Python's built-in subprocess convenience functions; here, `subprocess.check_call` or `subprocess.check_output` would achieve the same goal and raise an exception on failure automatically.
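For instance, a replacement worker built on `subprocess.check_output` (the name `run_view_cmd` is hypothetical, not from the applet) could be passed to `ThreadPool.map` in place of `run_cmd`:

```python
import subprocess


def run_view_cmd(cmd):
    """Run a command string and return its stdout.

    check_output raises CalledProcessError on a non-zero exit status,
    so no separate verify_pool_status pass is needed.
    """
    return subprocess.check_output(cmd, shell=True, universal_newlines=True)
```

The trade-off is that the first failure aborts the pool's map call, whereas the tuple-returning `run_cmd` lets all regions finish before errors are reported.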

## Gather Results

Each worker returns the read count for a single region of the BAM file. Sum the counts, write them to a result file, and return the file as the job output. The dx-toolkit Python SDK function [`dxpy.upload_local_file`](http://autodoc.dnanexus.com/bindings/python/current/dxpy_dxfile.html?highlight=upload_local_file#dxpy.bindings.dxfile_functions.upload_local_file) uploads the result file and returns the corresponding `DXFile` object. In Python, job output must be a dictionary whose keys are the output names defined in the `dxapp.json` and whose values match the corresponding output classes. For files, the output type is a [`DXLink`](https://documentation.dnanexus.com/user/helpstrings-of-sdk-command-line-utilities#dx-jobutil-dxlink); the [`dxpy.dxlink`](http://autodoc.dnanexus.com/bindings/python/current/dxpy_functions.html?highlight=dxlink#dxpy.bindings.dxdataobject_functions.dxlink) function generates the appropriate `DXLink` value.

```python
resultfn = bam_to_use[:-4] + '_count.txt'
with open(resultfn, 'w') as f:
    sum_reads = 0
    for res, reg in zip(results, regions):
        read_count = int(res[0])
        sum_reads += read_count
        f.write("Region {0}: {1}\n".format(reg, read_count))
    f.write("Total reads: {0}".format(sum_reads))

count_file = dxpy.upload_local_file(resultfn)
output = {}
output["count_file"] = dxpy.dxlink(count_file)

return output
```
