Distributed by Region (py)

This applet creates a count of reads from a BAM format file.

View full source code on GitHub

How is the SAMtools dependency provided?

The SAMtools dependency is resolved by declaring an Apt-Get package in the dxapp.json runSpec.execDepends.

  "runSpec": {
    ...
    "execDepends": [
      {"name": "samtools"}
    ]
  }

For additional information, refer to the execDepends documentation.

Entry Points

Distributed Python-interpreter apps use Python decorators on functions to declare entry points. This app has the following entry points as decorated functions:

  • main

  • samtoolscount_bam

  • combine_files

Entry points are executed on a new worker with their own system requirements. In this example, the applet splits and merges files on basic mem1_ssd1_x2 instances and performs a more intensive processing step on a mem1_ssd1_x4 instance. Instance type can be set in the dxapp.json's runSpec.systemRequirements:

main

The main function scatters by region bins based on user input. If no *.bai file is present, the applet generates an index *.bai.

Regions bins are passed to the samtoolscount_bam entry points in the dxpy.new_dxjob function.

Outputs from the samtoolscount_bam entry points are used as inputs for the combine_files entry point. The output of the combine_files entry point is used as the output of the main entry point.

samtoolscount_bam

This entry point downloads and creates a samtools view -c command for each region in the input bin. The dictionary returned from dxpy.download_all_inputs() is used to reference input names and paths.

This entry point returns {"readcount_fileDX": readCountDXlink}, a JBOR referencing an uploaded text file. This approach to scatter-gather stores the results in files and uploads/downloads the information as needed. This approach exaggerates a scatter-gather for tutorial purposes. You're able to pass types other than file such as int.

combine_files

The main entry point triggers this subjob, providing the output of samtoolscount_bam as an input. This entry point gathers all the files generated by the samtoolscount_bam jobs and sums them.

Important: While the main entry point triggers the processing and gathering entry points, remember that the main entry point doesn't do any heavy lifting or processing. The .runSpec JSON shows a workflow that starts with a lightweight instance, scales up for the processing entry point, and then scales down for the gathering step.

Last updated

Was this helpful?