# Parallel by Chr (py)

[View full source code on GitHub](https://github.com/dnanexus/dnanexus-example-applets/tree/master/Tutorials/python/samtools_count_para_chr_subprocess_py)

To take full advantage of the scalability that cloud computing offers, your scripts have to implement the correct methodologies. This applet tutorial shows you how to:

1. Install SAMtools
2. Download BAM file
3. Count regions in parallel

## How is the SAMtools dependency provided?

The SAMtools dependency is resolved by declaring an [Apt-Get](https://manpages.ubuntu.com/manpages/xenial/man8/apt-get.8.html) package in the `dxapp.json` `runSpec.execDepends`.

```json
  "runSpec": {
    ...
    "execDepends": [
      {"name": "samtools"}
    ]
  }
```

For additional information, refer to the [`execDepends` documentation](/getting-started/developer-tutorials/bash/git-dependency.md#how-is-samtools-called-in-the-src-script).

## Download BAM file

The `dxpy.download_all_inputs()` function downloads all input files into the `/home/dnanexus/in` directory. A folder is created for each input and the files are downloaded to that directory. For convenience, the `dxpy.download_all_inputs` function returns a dictionary containing the following keys:

* `<var>_path` (**string**): full absolute path to where the file was downloaded.
* `<var>_name` (**string**): name of the file, including extension.
* `<var>_prefix` (**string**): name of the file minus the longest matching pattern found in the `dxapp.json` I/O pattern field.

The path, name, and prefix key-value pattern is repeated for all applet file class inputs specified in the `dxapp.json`. In this example, the dictionary has the following key-value pairs:

```
{
    mappings_bam_path: ['/home/dnanexus/in/mappings_bam/SRR504516.bam']
    mappings_bam_name: ['SRR504516.bam']
    mappings_bam_prefix: ['SRR504516']
    index_file_path: ['/home/dnanexus/in/index_file/SRR504516.bam.bai']
    index_file_name: ['SRR504516.bam.bai']
    index_file_prefix: ['SRR504516']
}
```

```python
inputs = dxpy.download_all_inputs()
shutil.move(inputs['mappings_bam_path'][0], os.getcwd())
```

## Count Regions in Parallel

Before performing the parallel SAMtools count, determine the workload for each thread. The number of workers is arbitrarily set to `10` and the workload per thread is set to `1` chromosome at a time. Python offers multiple ways to achieve multithreaded processing. For the sake of simplicity, use [`multiprocessing.dummy`](https://docs.python.org/3/library/multiprocessing.html), a wrapper around Python's threading module.

```python
input_bam = inputs['mappings_bam_name'][0]

bam_to_use = create_index_file(input_bam)
print("Dir info:")
print(os.listdir(os.getcwd()))

regions = parseSAM_header_for_region(bam_to_use)

view_cmds = [
    create_region_view_cmd(bam_to_use, region)
    for region
    in regions]

print('Parallel counts')
t_pools = ThreadPool(10)
results = t_pools.map(run_cmd, view_cmds)
t_pools.close()
t_pools.join()

verify_pool_status(results)
```

Each worker creates a *string* to be called in a `subprocess.Popen` call. The `multiprocessing.dummy.Pool.map(<func>, <iterable>)` function is used to call the helper function `run_cmd` for each *string* in the iterable of *view commands*. Because multithreaded processing is performed using `subprocess.Popen`, the process does not alert you to any failed processes. Closed workers are verified in the `verify_pool_status` helper function.

```python
def verify_pool_status(proc_tuples):
    err_msgs = []
    for proc in proc_tuples:
        if proc[2] != 0:
            err_msgs.append(proc[1])
    if err_msgs:
        raise dxpy.exceptions.AppInternalError(b"\n".join(err_msgs))
```

**Important:** In this example, `subprocess.Popen` is used to process and verify results in `verify_pool_status`. In general, it is considered good practice to use Python's built-in subprocess convenience functions. In this case, `subprocess.check_call` would achieve the same goal.

## Gather Results

Each worker returns a read count of only one region in the BAM file. Sum and output the results as the job output. The dx-toolkit Python SDK function [`dxpy.upload_local_file`](http://autodoc.dnanexus.com/bindings/python/current/dxpy_dxfile.html?highlight=upload_local_file#dxpy.bindings.dxfile_functions.upload_local_file) is used to upload and generate a `DXFile` corresponding to the result file. For Python, job outputs have to be a dictionary of key-value pairs, with the keys being job output names as defined in the `dxapp.json` and the values being the output values for the corresponding output classes. For files, the output type is a [`DXLink`](/user/helpstrings-of-sdk-command-line-utilities.md#dx-jobutil-dxlink). The [`dxpy.dxlink`](http://autodoc.dnanexus.com/bindings/python/current/dxpy_functions.html?highlight=dxlink#dxpy.bindings.dxdataobject_functions.dxlink) function is used to generate the appropriate `DXLink` value.

```python
resultfn = bam_to_use[:-4] + '_count.txt'
with open(resultfn, 'w') as f:
    sum_reads = 0
    for res, reg in zip(results, regions):
        read_count = int(res[0])
        sum_reads += read_count
        f.write("Region {0}: {1}\n".format(reg, read_count))
    f.write("Total reads: {0}".format(sum_reads))

count_file = dxpy.upload_local_file(resultfn)
output = {}
output["count_file"] = dxpy.dxlink(count_file)

return output
```


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://documentation.dnanexus.com/getting-started/developer-tutorials/concurrent-computing-tutorials/parallel/parallel-by-chr-py.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
