Distributed by Region (sh)

Entry Points

Distributed bash-interpreter apps use bash functions to declare entry points. Each entry point runs as a subjob on a new worker with its own system requirements. This app has the following entry points, specified as bash functions:

  • main

  • count_func

  • sum_reads
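
The layout this describes can be sketched as a single script containing three functions. What follows is a local simulation only, not the platform's dispatch mechanism: the `run_entry_point` helper and the echoed messages are hypothetical, and stand in for the platform calling the function named by each job's entry point.

```bash
#!/usr/bin/env bash
# Local sketch: each entry point is an ordinary bash function.
main()       { echo "main: split work and launch subjobs"; }
count_func() { echo "count_func: count reads in a batch of regions"; }
sum_reads()  { echo "sum_reads: add up the per-batch counts"; }

# Hypothetical stand-in for the platform invoking a job's entry point by name.
run_entry_point() {
  local name="$1"
  if declare -F "${name}" > /dev/null; then
    "${name}"
  else
    echo "unknown entry point: ${name}" >&2
    return 1
  fi
}

run_entry_point count_func
```

Because every entry point lives in the same script, each subjob starts from the same code but executes only the function it was launched for.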


The main function takes the initial *.bam, generates an index *.bai if needed, and obtains the list of regions from the *.bam file. The regions are sent in batches of 10 as input to the count_func entry point using the dx-jobutil-new-job command.

  regions=$(samtools view -H "${mappings_sorted_bam_name}" | grep "^@SQ" | sed 's/.*SN:\(\S*\)\s.*/\1/')

  echo "Segmenting into regions"
  count_jobs=()
  counter=0
  temparray=()
  for r in $(echo $regions); do
    if [[ "${counter}" -ge 10 ]]; then # Batch is full: launch a count_func subjob
      echo "${temparray[@]}"
      count_jobs+=($(dx-jobutil-new-job -ibam_file="${mappings_sorted_bam}" -ibambai_file="${mappings_sorted_bai}" "${temparray[@]}" count_func))
      temparray=() # Start a new batch
      counter=0
    fi
    temparray+=("-iregions=${r}") # Here we add to an array of -i<parameter>'s
    counter=$((counter + 1))
  done

  if [[ "${counter}" -gt 0 ]]; then # Launch the final batch, which the loop above never sends
    echo "${temparray[@]}"
    count_jobs+=($(dx-jobutil-new-job -ibam_file="${mappings_sorted_bam}" -ibambai_file="${mappings_sorted_bai}" "${temparray[@]}" count_func))
  fi
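
The batching logic can be tried locally without the platform. The sketch below is illustrative only: the SAM header is made up, `launch_job` is a hypothetical stub that records its arguments instead of calling dx-jobutil-new-job, the sed expression is a POSIX-class equivalent of the one in main, and the batch size is 2 rather than 10 to keep the output short. It also tracks the batch size through the array length instead of a separate counter.

```bash
#!/usr/bin/env bash
# Sample SAM header lines (made up for illustration).
header=$'@HD\tVN:1.6\tSO:coordinate\n@SQ\tSN:chr1\tLN:1000\n@SQ\tSN:chr2\tLN:900\n@SQ\tSN:chr3\tLN:800\n@SQ\tSN:chr4\tLN:700\n@SQ\tSN:chrM\tLN:100'

# Keep only the SN: value of each @SQ line.
regions=$(printf '%s\n' "${header}" | grep '^@SQ' \
  | sed 's/.*SN:\([^[:space:]]*\)[[:space:]].*/\1/')

# Hypothetical stub: records its arguments instead of launching a subjob.
launched=()
launch_job() { launched+=("$*"); }

batch_size=2   # the app uses 10
batch=()
for r in ${regions}; do
  if [[ "${#batch[@]}" -ge "${batch_size}" ]]; then
    launch_job "${batch[@]}"   # batch is full: send it and start a new one
    batch=()
  fi
  batch+=("-iregions=${r}")
done
# Flush the final, possibly smaller, batch.
if [[ "${#batch[@]}" -gt 0 ]]; then
  launch_job "${batch[@]}"
fi

printf '%s\n' "${launched[@]}"
```

With five regions and a batch size of 2, the stub is called three times; the last call carries the single leftover region.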

Job outputs from the count_func entry point are passed as job-based object references (JBORs) to the sum_reads entry point, which uses them as inputs.

  echo "Merge count files, jobs:"
  echo "${count_jobs[@]}"
  readfiles=()
  for count_job in "${count_jobs[@]}"; do
    readfiles+=("-ireadfiles=${count_job}:counts_txt") # JBOR: <job ID>:<output field>
  done
  echo "file name: ${sorted_bamfile_name}"
  echo "Set file, readfile variables:"
  echo "${readfiles[@]}"
  countsfile_job=$(dx-jobutil-new-job -ifilename="${mappings_sorted_bam_prefix}" "${readfiles[@]}" sum_reads)
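
To make the shape of these references concrete, the sketch below builds the argument list from a few hypothetical job IDs. The IDs are invented placeholders for the values that dx-jobutil-new-job would print, and nothing here contacts the platform.

```bash
#!/usr/bin/env bash
# Hypothetical job IDs, standing in for the values dx-jobutil-new-job prints.
count_jobs=("job-AAAA" "job-BBBB" "job-CCCC")

# Each "<job ID>:<output field>" string names an output of a job that may
# not have finished yet; the receiving job waits until it is available.
readfiles=()
for count_job in "${count_jobs[@]}"; do
  readfiles+=("-ireadfiles=${count_job}:counts_txt")
done

printf '%s\n' "${readfiles[@]}"
```

Repeating the same input name (`readfiles`) once per job is what turns the individual references into a single array input for sum_reads.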

The job output of the sum_reads entry point is used as the output of the main entry point via a JBOR in the dx-jobutil-add-output command.

  echo "Specifying output file"
  dx-jobutil-add-output counts_txt "${countsfile_job}:read_sum" --class=jobref


The count_func entry point performs a SAMtools count of the regions passed as input (up to 10 per subjob). It runs on a new worker, so variables from other functions (e.g. main()) are not accessible here.

Once the output file with counts is created, it is uploaded to the platform and assigned as the entry point’s job output counts_txt via the command dx-jobutil-add-output.

count_func() {

  set -e -x -o pipefail

  echo "Value of bam_file: '${bam_file}'"
  echo "Value of bambai_file: '${bambai_file}'"
  echo "Regions being counted '${regions[@]}'"

  # Fetch the input files to the worker so the *_path variables point at real files
  dx-download-all-inputs

  mkdir workspace
  cd workspace || exit
  mv "${bam_file_path}" .
  mv "${bambai_file_path}" .
  outputdir="./out" # Assumed location; the source does not show where outputdir is set
  mkdir -p "${outputdir}"
  samtools view -c "${bam_file_name}" "${regions[@]}" >> "${outputdir}/readcounts.txt"

  counts_txt_id=$(dx upload "${outputdir}/readcounts.txt" --brief)
  dx-jobutil-add-output counts_txt "${counts_txt_id}" --class=file
}
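
The point that variables from main are not visible in count_func can be illustrated locally by analogy. In this sketch a separate bash process stands in for the new worker; this is an assumption made for illustration, since a real subjob runs on a different machine, but the consequence is the same: only what is passed explicitly arrives.

```bash
#!/usr/bin/env bash
# Set in the "parent" (think: main); deliberately not exported.
main_only="set in main"

# A separate bash process stands in for the new worker. It starts from a
# fresh shell and receives only what is passed explicitly, here as $1.
child_output=$(bash -c 'echo "main_only=[${main_only:-}] passed=[$1]"' _ "explicit input")

echo "${child_output}"
# prints: main_only=[] passed=[explicit input]
```

This is why count_func receives the BAM file, its index, and the regions as job inputs rather than reading them from variables set in main.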


The main entry point triggers the sum_reads subjob, providing the outputs of the count_func jobs as input JBORs. This entry point gathers all the readcounts.txt files generated by the count_func jobs and sums the totals.

This entry point outputs read_sum as a file, which main then references through a JBOR as its own job output.

sum_reads() {

  set -e -x -o pipefail
  echo "$filename"

  echo "Value of read file array '${readfiles[@]}'"
  # Fetch the count files to the worker so readfiles_path points at real files
  dx-download-all-inputs
  echo "Value of read file path array '${readfiles_path[@]}'"

  echo "Summing values in files"
  readsum=0
  for read_f in "${readfiles_path[@]}"; do
    temp=$(cat "$read_f")
    readsum=$((readsum + temp))
  done

  echo "Total reads: ${readsum}" > "${filename}_counts.txt"

  read_sum_id=$(dx upload "${filename}_counts.txt" --brief)
  dx-jobutil-add-output read_sum "${read_sum_id}" --class=file
}
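
The summing step can be checked on its own with local files. The sketch below is a stand-alone illustration under assumed inputs: the temporary directory, the file names, and the counts are made up, and each file holds a single integer, as one `samtools view -c` call would write.

```bash
#!/usr/bin/env bash
# Create three small count files in a temporary directory (made-up values).
workdir=$(mktemp -d)
echo 120 > "${workdir}/readcounts_1.txt"
echo 45  > "${workdir}/readcounts_2.txt"
echo 0   > "${workdir}/readcounts_3.txt"
readfiles_path=("${workdir}"/readcounts_*.txt)

# Same accumulation as in sum_reads: read each file and add its value.
readsum=0
for read_f in "${readfiles_path[@]}"; do
  temp=$(cat "${read_f}")
  readsum=$((readsum + temp))
done

echo "Total reads: ${readsum}"
# prints: Total reads: 165
rm -r "${workdir}"
```

Initializing `readsum` to 0 before the loop keeps the total well defined even when no count files are supplied.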

In the main function, this output is referenced as a JBOR:

  echo "Specifying output file"
  dx-jobutil-add-output counts_txt "${countsfile_job}:read_sum" --class=jobref

Copyright 2024 DNAnexus