Pysam
This applet performs a SAMtools count on an input BAM using Pysam, a python wrapper for SAMtools.

How is Pysam provided?

Pysam is provided through a pip3 install using the pip3 package manager in the dxapp.json’s runSpec.execDepends property:
1
{
2
"runSpec": {
3
...
4
"execDepends": [
5
{"name": "pysam",
6
"package_manager": "pip3",
7
"version": "0.15.4"
8
}
9
]
10
...
11
}
Copied!
The execDepends value is a JSON array of dependencies to resolve before the applet source code is run. In this applet, we specify pip3 as our package manager and pysam version 0.15.4 as the dependency to resolve.

Downloading Input

The fields mappings_sorted_bam and mappings_sorted_bai are passed to the main function as parameters for our job. These parameters are dictionary objects with key-value pair {"$dnanexus_link": "<file>-<xxxx>"}. We handle file objects from the platform through DXFile handles. If an index file is not supplied, then a *.bai index will be created.
1
print(mappings_sorted_bai)
2
print(mappings_sorted_bam)
3
4
mappings_sorted_bam = dxpy.DXFile(mappings_sorted_bam)
5
sorted_bam_name = mappings_sorted_bam.name
6
dxpy.download_dxfile(mappings_sorted_bam.get_id(),
7
sorted_bam_name)
8
ascii_bam_name = unicodedata.normalize( # Pysam requires ASCII not Unicode string.
9
'NFKD', sorted_bam_name).encode('ascii', 'ignore')
10
11
if mappings_sorted_bai is not None:
12
mappings_sorted_bai = dxpy.DXFile(mappings_sorted_bai)
13
dxpy.download_dxfile(mappings_sorted_bai.get_id(),
14
mappings_sorted_bai.name)
15
else:
16
pysam.index(ascii_bam_name)
Copied!

Working with Pysam

Pysam provides several methods that mimic SAMtools commands. In our applet example, we want to focus only on canonical chromosomes. Pysam’s object representation of a BAM file is pysam.AlignmentFile.
1
mappings_obj = pysam.AlignmentFile(ascii_bam_name, "rb")
2
regions = get_chr(mappings_obj, canonical_chr)
Copied!
The helper function get_chr
1
def get_chr(bam_alignment, canonical=False):
2
"""Helper function to return canonical chromosomes from SAM/BAM header
3
4
Arguments:
5
bam_alignment (pysam.AlignmentFile): SAM/BAM pysam object
6
canonical (boolean): Return only canonical chromosomes
7
Returns:
8
regions (list[str]): Region strings
9
"""
10
regions = []
11
headers = bam_alignment.header
12
seq_dict = headers['SQ']
13
14
if canonical:
15
re_canonical_chr = re.compile(r'^chr[0-9XYM]+$|^[0-9XYM]')
16
for seq_elem in seq_dict:
17
if re_canonical_chr.match(seq_elem['SN']):
18
regions.append(seq_elem['SN'])
19
else:
20
regions = [''] * len(seq_dict)
21
for i, seq_elem in enumerate(seq_dict):
22
regions[i] = seq_elem['SN']
23
24
return regions
Copied!
Once we establish a list of canonical chromosomes, we can iterate over them and perform Pysam’s version of samtools view -c, pysam.AlignmentFile.count.
1
total_count = 0
2
count_filename = "{bam_prefix}_counts.txt".format(
3
bam_prefix=ascii_bam_name[:-4])
4
5
with open(count_filename, "w") as f:
6
for region in regions:
7
temp_count = mappings_obj.count(region=region)
8
f.write("{region_name}: {counts}\n".format(
9
region_name=region, counts=temp_count))
10
total_count += temp_count
11
12
f.write("Total reads: {sum_counts}".format(sum_counts=total_count))
Copied!

Uploading Outputs

Our summarized counts are returned as the job output. We use the dx-toolkit python SDK’s dxpy.upload_local_file function to upload and generate a DXFile corresponding to our tabulated result file.
1
counts_txt = dxpy.upload_local_file(count_filename)
2
output = {}
3
output["counts_txt"] = dxpy.dxlink(counts_txt)
4
5
return output
Copied!
Python job outputs have to be a dictionary of key-value pairs, with the keys being job output names as defined in the dxapp.json file and the values being the output values for corresponding output classes. For files, the output type is a DXLink. We use the dxpy.dxlink function to generate the appropriate DXLink value.
Last modified 1yr ago