# Pysam

[View full source code on GitHub](https://github.com/dnanexus/dnanexus-example-applets/tree/master/Tutorials/python/pysam_count)

## How is Pysam provided?

Pysam is provided through a `pip3 install` using the pip3 package manager in the `dxapp.json`'s `runSpec.execDepends` property:

```json
{
 "runSpec": {
    ...
    "execDepends": [
      {"name": "pysam",
         "package_manager": "pip3",
         "version": "0.15.4"
      }
    ]
    ...
 }
```

The `execDepends` value is a JSON array of dependencies to resolve before the applet source code is run. In this applet, `pip3` is specified as the package manager and `pysam version 0.15.4` as the dependency to resolve.

## Downloading Input

The fields `mappings_sorted_bam` and `mappings_sorted_bai` are passed to the main function as parameters for the job. These parameters are dictionary objects with key-value pair `{"$dnanexus_link": "<file>-<xxxx>"}`. File objects from the platform are handled through [`DXFile`](http://autodoc.dnanexus.com/bindings/python/current/dxpy_dxfile.html?highlight=dxfile#module-dxpy.bindings.dxfile) handles. If an index file is not supplied, then a `*.bai` index is created.

```python
print(mappings_sorted_bai)
print(mappings_sorted_bam)

mappings_sorted_bam = dxpy.DXFile(mappings_sorted_bam)
sorted_bam_name = mappings_sorted_bam.name
dxpy.download_dxfile(mappings_sorted_bam.get_id(),
                        sorted_bam_name)
ascii_bam_name = unicodedata.normalize(  # Pysam requires ASCII not Unicode string.
    'NFKD', sorted_bam_name).encode('ascii', 'ignore')

if mappings_sorted_bai is not None:
    mappings_sorted_bai = dxpy.DXFile(mappings_sorted_bai)
    dxpy.download_dxfile(mappings_sorted_bai.get_id(),
                            mappings_sorted_bai.name)
else:
    pysam.index(ascii_bam_name)
```

## Working with Pysam

Pysam provides key methods that mimic SAMtools commands. In this applet example, the focus is only on canonical chromosomes. The Pysam object representation of a BAM file is `pysam.AlignmentFile`.

```python
mappings_obj = pysam.AlignmentFile(ascii_bam_name, "rb")
regions = get_chr(mappings_obj, canonical_chr)
```

The helper function `get_chr`

```python
def get_chr(bam_alignment, canonical=False):
    """Helper function to return canonical chromosomes from SAM/BAM header

    Arguments:
        bam_alignment (pysam.AlignmentFile): SAM/BAM pysam object
        canonical (boolean): Return only canonical chromosomes
    Returns:
        regions (list[str]): Region strings
    """
    regions = []
    headers = bam_alignment.header
    seq_dict = headers['SQ']

    if canonical:
        re_canonical_chr = re.compile(r'^chr[0-9XYM]+$|^[0-9XYM]')
        for seq_elem in seq_dict:
            if re_canonical_chr.match(seq_elem['SN']):
                regions.append(seq_elem['SN'])
    else:
        regions = [''] * len(seq_dict)
        for i, seq_elem in enumerate(seq_dict):
            regions[i] = seq_elem['SN']

    return regions
```

Once a list of canonical chromosomes is established, you can iterate over them and perform the Pysam version of `samtools view -c`, `pysam.AlignmentFile.count`.

```python
total_count = 0
count_filename = "{bam_prefix}_counts.txt".format(
    bam_prefix=ascii_bam_name[:-4])

with open(count_filename, "w") as f:
    for region in regions:
        temp_count = mappings_obj.count(region=region)
        f.write("{region_name}: {counts}\n".format(
            region_name=region, counts=temp_count))
        total_count += temp_count

    f.write("Total reads: {sum_counts}".format(sum_counts=total_count))
```

## Uploading Outputs

The summarized counts are returned as the job output. The `dx-toolkit` Python SDK function [`dxpy.upload_local_file`](http://autodoc.dnanexus.com/bindings/python/current/dxpy_dxfile.html#dxpy.bindings.dxfile_functions.upload_local_file) uploads and generates a `DXFile` corresponding to the tabulated result file.

```python
counts_txt = dxpy.upload_local_file(count_filename)
output = {}
output["counts_txt"] = dxpy.dxlink(counts_txt)

return output
```

Python job outputs have to be a dictionary of key-value pairs, with the keys being job output names as defined in the `dxapp.json` file and the values being the output values for corresponding output classes. For files, the output type is a `DXLink`. The [`dxpy.dxlink`](http://autodoc.dnanexus.com/bindings/python/current/dxpy_functions.html?highlight=dxlink#dxpy.bindings.dxdataobject_functions.dxlink) function generates the appropriate `DXLink` value.
