Parallel by Region (sh)

This applet performs a basic SAMtools count on a series of sliced (by canonical chromosome) BAM files in parallel using wait (Ubuntu 14.04+).

View full source code on GitHub

How is the SAMtools dependency provided?

The SAMtools dependency is resolved by declaring an Apt-Get package in the dxapp.json runSpec.execDepends.

  "runSpec": {
    ...
    "execDepends": [
      {"name": "samtools"}
    ]
  }

Debugging

The command set -e -x -o pipefail will assist you in debugging this applet:

  • -e causes the shell to immediately exit if a command returns a non-zero exit code.

  • -x prints commands as they are executed, which is very useful for tracking the job’s status or pinpointing the exact execution failure.

  • -o pipefail makes the return code the first non-zero exit code. (Typically, the return code of pipes is the exit code of the last command, which can create difficult to debug problems.)

    set -e -x -o pipefail
    echo "Value of mappings_sorted_bam: '${mappings_sorted_bam}'"
    echo "Value of mappings_sorted_bai: '${mappings_sorted_bai}'"
    
    mkdir workspace
    cd workspace
    dx download "${mappings_sorted_bam}"
    
    if [ -z "$mappings_sorted_bai" ]; then
      samtools index "$mappings_sorted_bam_name"
    else
      dx download "${mappings_sorted_bai}"
    fi

    The *.bai file was an optional job input. You can check for a empty or unset var using the bash built-in test [[ - z ${var}} ]]. You can then download or create a *.bai index as needed.

Parallel Run

Bash’s job control system allows for easy management of multiple processes. In this example, bash commands are run in the background as the maximum job executions are controlled in the foreground. You can place processes in the background using the character & after a command.

  chromosomes=$(samtools view -H "${mappings_sorted_bam_name}" | grep "\@SQ" | awk -F '\t' '{print $2}' | awk -F ':' '{if ($2 ~ /^chr[0-9XYM]+$|^[0-9XYM]/) {print $2}}')

  for chr in $chromosomes; do
    samtools view -b "${mappings_sorted_bam_name}" "${chr}" -o "bam_${chr}.bam"
    echo "bam_${chr}.bam"
  done > bamfiles.txt

  busyproc=0
  while read -r b_file; do
    echo "${b_file}"
    if [[ "${busyproc}" -ge "$(nproc)" ]]; then
      echo Processes hit max
      while [[ "${busyproc}" -gt  0 ]]; do
        wait -n # p_id
        busyproc=$((busyproc-1))
      done
    fi
    samtools view -c "${b_file}"> "count_${b_file%.bam}" &
    busyproc=$((busyproc+1))
  done <bamfiles.txt
  while [[ "${busyproc}" -gt  0 ]]; do
    wait -n # p_id
    busyproc=$((busyproc-1))
  done

Job Output

Once the input bam has been sliced, counted, and summed, the output counts_txt is uploaded using the command dx-upload-all-outputs. The following directory structure required for dx-upload-all-outputs is below:

├── $HOME
│   ├── out
│       ├── < output name in dxapp.json >
│           ├── output file

In your applet, upload all outputs by:

  outputdir="${HOME}/out/counts_txt"
  mkdir -p "${outputdir}"
  cat count* | awk '{sum+=$1} END{print "Total reads = ",sum}' > "${outputdir}/${mappings_sorted_bam_prefix}_count.txt"

  dx-upload-all-outputs

Last updated

Copyright 2024 DNAnexus