Distributed by Region (py)
This applet counts the reads in a BAM file.
How is the SAMtools dependency provided?
The SAMtools dependency is resolved by declaring an APT package in the runSpec.execDepends field of dxapp.json.
"runSpec": {
  ...
  "execDepends": [
    {"name": "samtools"}
  ]
}
For additional information, refer to the execDepends documentation.
Entry Points
Distributed Python-interpreter apps use Python decorators on functions to declare entry points. This app has the following entry points as decorated functions:
main, samtoolscount_bam, combine_files
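The decorator pattern described above can be sketched roughly as follows. In the applet itself the decorator would be dxpy.entry_point; the local entry_point function and the ENTRY_POINTS registry below are hypothetical stand-ins so that the sketch runs without the DNAnexus SDK, and the function bodies are placeholders rather than the applet's code.

```python
# Hypothetical stand-in for dxpy.entry_point, so this sketch runs
# without the DNAnexus SDK installed.
ENTRY_POINTS = {}


def entry_point(name):
    """Register the decorated function under the given entry point name."""
    def decorator(func):
        ENTRY_POINTS[name] = func
        return func
    return decorator


@entry_point("main")
def main(**job_inputs):
    # Placeholder: the real entry point scatters work by region bin.
    return {}


@entry_point("samtoolscount_bam")
def samtoolscount_bam(**job_inputs):
    # Placeholder: the real entry point counts reads per region.
    return {}


@entry_point("combine_files")
def combine_files(**job_inputs):
    # Placeholder: the real entry point sums the per-bin counts.
    return {}


print(list(ENTRY_POINTS))
```

The registry maps each entry point name to its function, which is the lookup a job runner needs when it is told to execute a named entry point.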
Each entry point runs on a new worker with its own system requirements. In this example, the applet splits and merges files on basic mem1_ssd1_x2 instances and performs the more intensive processing step on a mem1_ssd1_x4 instance. The instance type for each entry point can be set in runSpec.systemRequirements in dxapp.json.
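As an illustration, the per-entry-point instance types could be expressed as the mapping below, printed here as a JSON fragment. The assignment of instance types to entry points is an assumption inferred from the description above (splitting and merging on mem1_ssd1_x2, processing on mem1_ssd1_x4), not a copy of the applet's dxapp.json.

```python
import json

# Assumed mapping, inferred from the text: main (split) and combine_files
# (merge) run on the smaller instance, samtoolscount_bam on the larger one.
system_requirements = {
    "main": {"instanceType": "mem1_ssd1_x2"},
    "samtoolscount_bam": {"instanceType": "mem1_ssd1_x4"},
    "combine_files": {"instanceType": "mem1_ssd1_x2"},
}

# This fragment would sit inside the "runSpec" object of dxapp.json.
run_spec_fragment = {"systemRequirements": system_requirements}
print(json.dumps(run_spec_fragment, indent=2))
```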
main
The main function scatters by region bins based on user input. If no *.bai index file is provided, the applet generates one.
Region bins are passed to the samtoolscount_bam entry point through the dxpy.new_dxjob function.
Outputs from the samtoolscount_bam entry points are used as inputs for the combine_files entry point. The output of the combine_files entry point is used as the output of the main entry point.
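The scatter step above can be sketched as a simple binning function. This is a minimal illustration under stated assumptions: the helper bin_regions, the bin_size parameter, and the input field name region_list are hypothetical and not taken from the applet's source.

```python
def bin_regions(regions, bin_size):
    """Split a list of region names into consecutive bins of at most bin_size."""
    if bin_size < 1:
        raise ValueError("bin_size must be at least 1")
    return [regions[i:i + bin_size] for i in range(0, len(regions), bin_size)]


regions = ["chr1", "chr2", "chr3", "chr4", "chr5"]
bins = bin_regions(regions, bin_size=2)

# One subjob input dictionary per bin. In the applet, each dictionary would
# be passed as fn_input to dxpy.new_dxjob with fn_name="samtoolscount_bam".
# The field name "region_list" is hypothetical.
subjob_inputs = [{"region_list": region_bin} for region_bin in bins]

for job_input in subjob_inputs:
    print(job_input)
```

In dxpy, the object returned by dxpy.new_dxjob provides get_output_ref, which yields a reference to a named output before the subjob has finished. That is how the scattered outputs can be handed to the gather step without main waiting on each job.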
samtoolscount_bam
This entry point downloads its inputs and builds a samtools view -c command for each region in the input bin. The dictionary returned from dxpy.download_all_inputs() is used to reference input names and paths.
This entry point returns {"readcount_fileDX": readCountDXlink}, a JBOR referencing an uploaded text file. This approach to scatter-gather stores the results in files and uploads and downloads them as needed, which exaggerates the scatter-gather pattern for tutorial purposes. Entry points can also pass types other than file, such as int.
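The per-region samtools view -c commands described above might be assembled as in the sketch below. The helper name build_count_commands and the file name mappings.bam are hypothetical. The sketch only builds and prints the commands; running them requires SAMtools and a BAM index next to the BAM file.

```python
import shlex


def build_count_commands(bam_path, regions):
    """Return one `samtools view -c` argument list per region."""
    return [["samtools", "view", "-c", bam_path, region] for region in regions]


# Hypothetical local path and regions, for illustration only.
commands = build_count_commands("mappings.bam", ["chr1", "chr2:1-1000"])

for cmd in commands:
    # Each list could be passed to subprocess.check_output to get the count.
    print(shlex.join(cmd))
```

Keeping each command as an argument list rather than a single string avoids shell quoting problems if a path or region name contains unusual characters.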
combine_files
The main entry point triggers this subjob, providing the output of samtoolscount_bam as an input. This entry point gathers all the files generated by the samtoolscount_bam jobs and sums the read counts they contain.
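The gather step can be sketched as a function that sums counts across downloaded files. The file layout is an assumption: each file is taken to hold one integer count per line, and the helper name sum_read_counts is hypothetical. Temporary files stand in for the downloaded subjob outputs.

```python
import os
import tempfile


def sum_read_counts(paths):
    """Sum integer counts, one per line, across all given text files."""
    total = 0
    for path in paths:
        with open(path) as handle:
            for line in handle:
                line = line.strip()
                if line:  # skip blank lines
                    total += int(line)
    return total


# Temporary files stand in for the per-bin count files.
with tempfile.TemporaryDirectory() as tmpdir:
    paths = []
    for i, counts in enumerate([[10, 5], [7]]):
        path = os.path.join(tmpdir, f"counts_{i}.txt")
        with open(path, "w") as handle:
            handle.write("\n".join(str(c) for c in counts) + "\n")
        paths.append(path)
    total = sum_read_counts(paths)

print(total)
```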
Important: Although the main entry point triggers the processing and gathering entry points, it does not do any heavy lifting or processing itself. The runSpec JSON describes a workflow that starts on a lightweight instance, scales up for the processing entry point, and then scales down for the gathering step.