Distributed by Region (py)
This applet counts the reads in a BAM-format file, distributing the work across workers by genomic region.

How is the SAMtools dependency provided?

The SAMtools dependency is resolved by declaring an APT package in the dxapp.json runSpec.execDepends field:
```json
"runSpec": {
  ...
  "execDepends": [
    {"name": "samtools"}
  ]
}
```
For additional information, refer to the execDepends documentation.

Entry Points

Distributed Python-interpreter apps use Python decorators on functions to declare entry points. This applet has the following entry points as decorated functions:
    main
    samtoolscount_bam
    combine_files
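The decorator mechanism can be sketched with a plain name-to-function registry. This is illustrative only, not dxpy's real internals: dxpy's @dxpy.entry_point decorator registers each function under a name so the job dispatcher can invoke the right one on each worker.

```python
# Illustrative registry mimicking the entry-point decorator pattern.
ENTRY_POINTS = {}

def entry_point(name):
    """Register the decorated function under `name` for later dispatch."""
    def register(fn):
        ENTRY_POINTS[name] = fn
        return fn
    return register

@entry_point("main")
def main(region_size=3):
    # A real entry point would scatter work here; this stub just
    # echoes its input so dispatch can be demonstrated.
    return {"region_size": region_size}

# Dispatch by name, as a job runner would:
result = ENTRY_POINTS["main"](region_size=5)
```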
Each entry point is executed on a new worker with its own system requirements. In this example, we split and merge our files on lightweight mem1_ssd1_x2 instances and perform the more intensive processing step on a mem1_ssd1_x4 instance. Instance types are set per entry point in the dxapp.json runSpec.systemRequirements:
```json
"runSpec": {
  ...
  "systemRequirements": {
    "main": {
      "instanceType": "mem1_ssd1_x2"
    },
    "samtoolscount_bam": {
      "instanceType": "mem1_ssd1_x4"
    },
    "combine_files": {
      "instanceType": "mem1_ssd1_x2"
    }
  },
  ...
}
```

main

The main function splits the regions from the BAM header into bins whose size is set by user input. If no *.bai index file is provided, the applet generates one:
```python
regions = parseSAM_header_for_region(filename)
split_regions = [regions[i:i + region_size]
                 for i in range(0, len(regions), region_size)]

if not index_file:
    mappings_bam, index_file = create_index_file(filename, mappings_bam)
```
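The binning step above is a standard fixed-size chunking of a list. A standalone sketch, using made-up region names:

```python
# Split a list of region names into chunks of at most region_size items.
regions = ["chr1", "chr2", "chr3", "chr4", "chr5"]
region_size = 2
split_regions = [regions[i:i + region_size]
                 for i in range(0, len(regions), region_size)]
# The final chunk may be shorter than region_size.
```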
Region bins are passed to the samtoolscount_bam entry point using the dxpy.new_dxjob function:
```python
print('creating subjobs')
subjobs = [dxpy.new_dxjob(
               fn_input={"region_list": split,
                         "mappings_bam": mappings_bam,
                         "index_file": index_file},
               fn_name="samtoolscount_bam")
           for split in split_regions]

fileDXLinks = [subjob.get_output_ref("readcount_fileDX")
               for subjob in subjobs]
```
Outputs from the samtoolscount_bam entry points are used as inputs for the combine_files entry point, whose output in turn becomes the output of the main entry point:
```python
print('combining outputs')
postprocess_job = dxpy.new_dxjob(
    fn_input={"countDXlinks": fileDXLinks, "resultfn": filename},
    fn_name="combine_files")

countDXLink = postprocess_job.get_output_ref("countDXLink")

output = {}
output["count_file"] = countDXLink

return output
```
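The links being passed around are plain dicts in DNAnexus link format. As an illustration only (the file ID below is hypothetical, and dxpy.dxlink is the real helper for producing these):

```python
# The dict shape of a DNAnexus link (DXLink) wrapping a data object ID.
def make_dxlink(object_id):
    # Mirrors the structure dxpy.dxlink returns for a plain object ID.
    return {"$dnanexus_link": object_id}

link = make_dxlink("file-B00000000000000000000000")  # hypothetical ID
```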

samtoolscount_bam

This entry point downloads its inputs and runs a samtools view -c command for each region in its bin. The dictionary returned by dxpy.download_all_inputs() is used to reference input names and paths.
```python
def samtoolscount_bam(region_list, mappings_bam, index_file):
    """Processing function.

    Arguments:
        region_list (list[str]): Regions to count in BAM
        mappings_bam (dict): dxlink to input BAM
        index_file (dict): dxlink to input BAM index

    Returns:
        Dictionary containing dxlinks to the uploaded read counts file
    """
    #
    # Download inputs
    # -------------------------------------------------------------------
    # dxpy.download_all_inputs will download all input files into
    # the /home/dnanexus/in directory. A folder will be created for each
    # input and the file(s) will be downloaded to that directory.
    #
    # In this example our inputs dictionary has the following key, value
    # pairs. Note that the values are all lists:
    # mappings_bam_path: [u'/home/dnanexus/in/mappings_bam/<bam filename>.bam']
    # mappings_bam_name: [u'<bam filename>.bam']
    # mappings_bam_prefix: [u'<bam filename>']
    # index_file_path: [u'/home/dnanexus/in/index_file/<bam filename>.bam.bai']
    # index_file_name: [u'<bam filename>.bam.bai']
    # index_file_prefix: [u'<bam filename>']
    #

    inputs = dxpy.download_all_inputs()

    # The SAMtools view command requires the BAM and index file to be in
    # the same directory
    shutil.move(inputs['mappings_bam_path'][0], os.getcwd())
    shutil.move(inputs['index_file_path'][0], os.getcwd())
    input_bam = inputs['mappings_bam_name'][0]

    #
    # Per region perform SAMtools count.
    # --------------------------------------------------------------
    # Output count for regions and return DXLink as job output to
    # allow other entry points to download job output.
    #

    with open('read_count_regions.txt', 'w') as f:
        for region in region_list:
            view_cmd = create_region_view_cmd(input_bam, region)
            region_proc_result = run_cmd(view_cmd)
            region_count = int(region_proc_result[0])
            f.write("Region {0}: {1}\n".format(region, region_count))
    readcountDXFile = dxpy.upload_local_file("read_count_regions.txt")
    readCountDXlink = dxpy.dxlink(readcountDXFile.get_id())

    return {"readcount_fileDX": readCountDXlink}
```
This entry point returns {"readcount_fileDX": readCountDXlink}, a job-based object reference (JBOR) to an uploaded text file. This scatter-gather implementation stores intermediate results in files and uploads/downloads them as needed, which exaggerates the pattern for tutorial purposes; entry points can also pass and return non-file types such as int.
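The helpers create_region_view_cmd and run_cmd referenced above are applet-local functions whose bodies are not shown in this tutorial. A minimal sketch, assuming they simply build and run a samtools view -c command:

```python
import subprocess

# Hedged sketch: one plausible implementation of the unshown helpers.
def create_region_view_cmd(input_bam, region):
    # `samtools view -c <bam> <region>` prints the alignment count for
    # that region (requires the .bai index next to the BAM).
    return ["samtools", "view", "-c", input_bam, region]

def run_cmd(cmd):
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE,
                            universal_newlines=True)
    return proc.communicate()  # (stdout, stderr)
```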

combine_files

The main entry point triggers this subjob, providing the outputs of the samtoolscount_bam jobs as input. This entry point gathers all the files those jobs generated and sums the read counts.
```python
def combine_files(countDXlinks, resultfn):
    """The 'gather' subjob of the applet.

    Arguments:
        countDXlinks (list[dict]): list of DXlinks to process job output files.
        resultfn (str): Filename to use for job output file.

    Returns:
        DXLink for the main function to return as the job output.

    Note: Only the DXLinks are passed as parameters.
    Subjobs work on a fresh instance so files must be downloaded to the machine.
    """
    if resultfn.endswith(".bam"):
        resultfn = resultfn[:-4] + '.txt'

    sum_reads = 0
    with open(resultfn, 'w') as f:
        for i, dxlink in enumerate(countDXlinks):
            dxfile = dxpy.DXFile(dxlink)
            filename = "countfile{0}".format(i)
            dxpy.download_dxfile(dxfile, filename)
            with open(filename, 'r') as fsub:
                for line in fsub:
                    sum_reads += parse_line_for_readcount(line)
                    f.write(line)
        f.write('Total Reads: {0}'.format(sum_reads))

    countDXFile = dxpy.upload_local_file(resultfn)
    countDXlink = dxpy.dxlink(countDXFile.get_id())

    return {"countDXLink": countDXlink}
```
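parse_line_for_readcount is another applet-local helper whose body is not shown. A hedged sketch, assuming it extracts the integer from the "Region <name>: <count>" lines written by the scatter jobs:

```python
# One plausible implementation of the unshown parse_line_for_readcount
# helper: pull the trailing integer off a "Region <name>: <count>" line.
def parse_line_for_readcount(line):
    if not line.startswith("Region"):
        return 0  # assumption: ignore lines that are not region counts
    return int(line.strip().rsplit(":", 1)[1])
```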
Important: While the main entry point triggers the processing and gathering entry points, it does no heavy lifting itself. Notice in the runSpec JSON above that we start with a lightweight instance, scale up for the processing entry point, then scale back down for the gathering step.