Distributed by Region (py)

This applet counts the reads in a BAM file.

View full source code on GitHub

How is the SAMtools dependency provided?

The SAMtools dependency is resolved by declaring an APT package in the runSpec.execDepends field of dxapp.json.

"runSpec": {
  ...
  "execDepends": [
    {"name": "samtools"}
  ]
}

For additional information, refer to the execDepends documentation.

Entry Points

Distributed Python-interpreter apps use Python decorators on functions to declare entry points. This app has the following entry points as decorated functions:

  • main

  • samtoolscount_bam

  • combine_files

Each entry point runs on a new worker with its own system requirements. In this example, the files are split and merged on basic mem1_ssd1_x2 instances, and the more intensive processing step runs on a mem1_ssd1_x4 instance. Instance types are set per entry point in the runSpec.systemRequirements field of dxapp.json.
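As a rough sketch of what those settings might look like, the snippet below builds the systemRequirements mapping as a Python dictionary and prints it as JSON. The instance types and entry point names come from the text above; which entry point gets which instance type is inferred from that description, not copied from the applet's actual dxapp.json.

```python
import json

# Assumed mapping: main and combine_files split and merge on the basic
# instance, samtoolscount_bam does the heavier processing.
system_requirements = {
    "main": {"instanceType": "mem1_ssd1_x2"},
    "samtoolscount_bam": {"instanceType": "mem1_ssd1_x4"},
    "combine_files": {"instanceType": "mem1_ssd1_x2"},
}

# Print the fragment as it would sit under runSpec in dxapp.json.
run_spec_fragment = {"systemRequirements": system_requirements}
print(json.dumps(run_spec_fragment, indent=2))
```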

main

The main function scatters the work by region bins based on user input. If no *.bai index file is present, the applet generates one.
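One way to picture the binning step is the minimal sketch below, which groups region names into fixed-size bins. The helper name bin_regions and the bin_size parameter are hypothetical; the applet's own binning logic may differ.

```python
def bin_regions(regions, bin_size):
    """Group region names into consecutive bins of at most bin_size entries."""
    if bin_size < 1:
        raise ValueError("bin_size must be at least 1")
    return [regions[i:i + bin_size] for i in range(0, len(regions), bin_size)]


# Example: five regions, two per bin; the last bin holds the remainder.
bins = bin_regions(["chr1", "chr2", "chr3", "chr4", "chr5"], 2)
print(bins)
```

Each bin would then become the input of one samtoolscount_bam subjob.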

Region bins are passed to the samtoolscount_bam entry point using the dxpy.new_dxjob function.

Outputs from the samtoolscount_bam entry points are used as inputs for the combine_files entry point. The output of the combine_files entry point is used as the output of the main entry point.
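The wiring described above could be sketched as follows. To keep the example runnable outside the platform, the job launcher is passed in as new_job; inside the applet this would presumably be dxpy.new_dxjob, whose returned job handle offers get_output_ref. The function name scatter_gather and the field names region_list, mappings_bam, index_file, countDXlinks, countDXLink, and count_file are illustrative assumptions; only readcount_fileDX appears in this document.

```python
def scatter_gather(bam_link, bai_link, region_bins, new_job):
    """Launch one counting subjob per bin, then one gathering subjob.

    new_job must accept fn_input and fn_name and return an object with
    a get_output_ref(field) method, as dxpy.new_dxjob does.
    """
    # Scatter: one samtoolscount_bam subjob per region bin.
    count_jobs = [
        new_job(
            fn_input={
                "region_list": region_bin,   # hypothetical field name
                "mappings_bam": bam_link,    # hypothetical field name
                "index_file": bai_link,      # hypothetical field name
            },
            fn_name="samtoolscount_bam",
        )
        for region_bin in region_bins
    ]

    # Reference each subjob's future output without waiting for it.
    count_refs = [job.get_output_ref("readcount_fileDX") for job in count_jobs]

    # Gather: combine_files receives all the references as its input.
    gather_job = new_job(
        fn_input={"countDXlinks": count_refs},  # hypothetical field name
        fn_name="combine_files",
    )

    # The output of main is a reference to the output of combine_files.
    return {"count_file": gather_job.get_output_ref("countDXLink")}
```

Because the outputs are passed as references, main can return immediately and the platform resolves the values once the subjobs finish.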

samtoolscount_bam

This entry point downloads its inputs and builds a samtools view -c command for each region in the input bin. The dictionary returned from dxpy.download_all_inputs() is used to reference input names and paths.
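A minimal sketch of the command-building step is shown below. The helper names build_count_commands and count_reads are hypothetical, and the BAM path is passed in directly here; in the applet it would come from the dictionary returned by dxpy.download_all_inputs(). Running count_reads requires samtools on the PATH and an indexed BAM file.

```python
import subprocess


def build_count_commands(bam_path, regions):
    """Return one 'samtools view -c' argument list per region."""
    return [["samtools", "view", "-c", bam_path, region] for region in regions]


def count_reads(bam_path, regions):
    """Run each command and return a {region: read count} mapping."""
    counts = {}
    for region, cmd in zip(regions, build_count_commands(bam_path, regions)):
        # samtools view -c prints a single integer on stdout.
        counts[region] = int(subprocess.check_output(cmd, text=True).strip())
    return counts


# Building the commands needs no samtools installation.
commands = build_count_commands("input.bam", ["chr1", "chr2:1-1000"])
print(commands)
```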

This entry point returns {"readcount_fileDX": readCountDXlink}, a JBOR referencing an uploaded text file. This scatter-gather approach stores the results in files and uploads and downloads them as needed, which deliberately exaggerates the pattern for tutorial purposes. You can also pass types other than file, such as int.

combine_files

The main entry point triggers this subjob, providing the outputs of the samtoolscount_bam jobs as its input. This entry point gathers all the files generated by the samtoolscount_bam jobs and sums the counts they contain.
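The summing step might look like the sketch below. It assumes each downloaded count file holds one count per non-empty line, with the integer as the last whitespace-separated field; that layout and the helper name sum_count_files are assumptions, not the applet's documented format.

```python
def sum_count_files(paths):
    """Add up the read counts stored in a list of local text files."""
    total = 0
    for path in paths:
        with open(path) as handle:
            for line in handle:
                line = line.strip()
                if line:
                    # Assumed layout: the count is the last field on the line.
                    total += int(line.split()[-1])
    return total
```

In the applet, the resulting total would be written to a file and uploaded as the final output.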

Important: The main entry point triggers the processing and gathering entry points but does no heavy lifting itself. As set in runSpec.systemRequirements, the process starts on a lightweight instance, scales up for the processing entry point, and then scales back down for the gathering step.
