# Parallel by Region (sh)

[View full source code on GitHub](https://github.com/dnanexus/dnanexus-example-applets/tree/master/Tutorials/bash/samtools_count_para_chr_busyproc_sh)

## How is the SAMtools dependency provided?

The SAMtools dependency is resolved by declaring an [Apt-Get](https://help.ubuntu.com/community/AptGet/Howto/) package in the `runSpec.execDepends` field of `dxapp.json`.

```json
  "runSpec": {
    ...
    "execDepends": [
      {"name": "samtools"}
    ]
  }
```

## Debugging

The command `set -e -x -o pipefail` assists you in debugging this applet:

* `-e` causes the shell to immediately exit if a command returns a non-zero exit code.
* `-x` prints commands as they are executed, which is useful for tracking the job's status or pinpointing the exact execution failure.
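* `-o pipefail` makes a pipeline return the exit code of the last (rightmost) command that exited with a non-zero status, or zero if every command succeeded. (By default, a pipeline returns only the exit code of its final command, which can hide upstream failures and create difficult-to-debug problems.)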
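
As a standalone illustration of the `pipefail` option (not part of the applet), the sketch below runs the same failing pipeline with and without it. The variable names are made up for this example.

```shell
#!/usr/bin/env bash
# Sketch only: shows how pipefail changes a pipeline's exit status.

# Without pipefail, the pipeline reports the status of its last command (cat).
set +o pipefail
false | cat
status_default=$?

# With pipefail, a failure anywhere in the pipeline is reported.
set -o pipefail
status_pipefail=0
false | cat || status_pipefail=$?

# When several commands fail, the rightmost non-zero status is reported.
status_rightmost=0
(exit 2) | (exit 3) | cat || status_rightmost=$?

echo "default: ${status_default}, pipefail: ${status_pipefail}, rightmost: ${status_rightmost}"
# prints "default: 0, pipefail: 1, rightmost: 3"
```

The applet's own script begins with these options enabled: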

```shell
set -e -x -o pipefail
echo "Value of mappings_sorted_bam: '${mappings_sorted_bam}'"
echo "Value of mappings_sorted_bai: '${mappings_sorted_bai}'"

mkdir workspace
cd workspace
dx download "${mappings_sorted_bam}"

if [ -z "$mappings_sorted_bai" ]; then
  samtools index "$mappings_sorted_bam_name"
else
  dx download "${mappings_sorted_bai}"
fi
```

The `*.bai` file is an optional job input. You can check for an empty or unset `var` using the bash built-in test `[[ -z "${var}" ]]`. You can then download or create a `*.bai` index as needed.
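
The branch on an optional input can be tried outside a job with a small sketch. The function `index_action` and the placeholder value `some-index-reference` are hypothetical and exist only for this illustration; the real applet calls `samtools index` or `dx download` at the marked points.

```shell
#!/usr/bin/env bash
# Sketch only: index_action is a hypothetical helper, not part of the applet.
index_action() {
  local bai="${1:-}"   # treat a missing argument the same as an empty one
  if [[ -z "${bai}" ]]; then
    echo "create"      # no index supplied: the applet would run samtools index
  else
    echo "download"    # index supplied: the applet would run dx download
  fi
}

index_action ""                        # prints "create"
index_action "some-index-reference"    # prints "download"
```

Quoting the variable inside the test keeps the check well-behaved even when the value contains spaces.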

## Parallel Run

Bash's [job control](https://tldp.org/LDP/abs/html/x9644.html) system allows for convenient management of multiple processes. In this example, bash commands run in the background while the foreground loop limits how many of them run at once. You can place a process in the background by appending the character `&` to a command.
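
A minimal sketch of the two building blocks, `&` and `wait`, using `sleep` and `echo` as placeholder workloads in a temporary directory (none of these names come from the applet):

```shell
#!/usr/bin/env bash
# Sketch only: two placeholder background jobs, then wait for both.
workdir="$(mktemp -d)"

# The trailing & starts each subshell in the background; $! holds its PID.
( sleep 0.2; echo "first"  > "${workdir}/a.txt" ) &
pid_a=$!
( sleep 0.1; echo "second" > "${workdir}/b.txt" ) &
pid_b=$!
echo "started background jobs ${pid_a} and ${pid_b}"

# With no arguments, wait blocks until all background jobs have exited.
wait

cat "${workdir}/a.txt" "${workdir}/b.txt"
```

After `wait` returns, both output files exist regardless of which job finished first.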

```shell
# Extract valid chromosome names from BAM header
chromosomes=$(
  samtools view -H "${mappings_sorted_bam_name}" | \
  grep "@SQ" | \
  awk -F '\t' '{print $2}' | \
  awk -F ':' '{
    if ($2 ~ /^chr[0-9XYM]+$|^[0-9XYM]/) {
      print $2
    }
  }'
)

# Split BAM by chromosome and record output file names
for chr in $chromosomes; do
  samtools view -b "${mappings_sorted_bam_name}" "${chr}" -o "bam_${chr}.bam"
  echo "bam_${chr}.bam"
done > bamfiles.txt

# Parallel counting of reads per chromosome BAM
busyproc=0

while read -r b_file; do
  echo "${b_file}"

  # If busy processes hit the limit, wait for the whole batch to finish
  if [[ "${busyproc}" -ge "$(nproc)" ]]; then
    echo "Processes hit max"
    while [[ "${busyproc}" -gt 0 ]]; do
      wait -n
      busyproc=$((busyproc - 1))
    done
  fi

  # Count reads in background
  samtools view -c "${b_file}" > "count_${b_file%.bam}" &
  busyproc=$((busyproc + 1))

done < bamfiles.txt
```

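After the loop has read the last file name, some counting jobs may still be running. Wait for the remaining background processes before reading their results:

```shell
while [[ "${busyproc}" -gt 0 ]]; do
  wait -n  # returns as soon as any one background job exits
  busyproc=$((busyproc - 1))
done
```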
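
The applet drains every running job once the limit is reached. A possible variation, shown here only as a sketch and not as the applet's method, waits for a single job to exit and then starts the next one, which keeps the slots full. It assumes bash 4.3 or later for `wait -n`; `max_jobs`, `workdir`, and the `sleep` workload are placeholders.

```shell
#!/usr/bin/env bash
# Sketch only: a sliding-window variant of the throttle (needs bash 4.3+).
max_jobs=2                 # assumption: fixed limit; the applet uses $(nproc)
busyproc=0
workdir="$(mktemp -d)"

for i in 1 2 3 4 5; do
  # At the limit, wait for one job to exit before starting another.
  if [[ "${busyproc}" -ge "${max_jobs}" ]]; then
    wait -n
    busyproc=$((busyproc - 1))
  fi

  # Placeholder workload standing in for `samtools view -c`.
  ( sleep 0.1; echo "${i}" > "${workdir}/count_${i}" ) &
  busyproc=$((busyproc + 1))
done

# Drain whatever is still running.
wait

files=("${workdir}"/count_*)
echo "${#files[@]} count files written"   # prints "5 count files written"
```

Note that `wait -n` returns the exit status of the job it reaped, so under `set -e` a failed background job stops the script at that point.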

## Job Output

Once the input BAM has been sliced, counted, and summed, the output `counts_txt` is uploaded using the command [`dx-upload-all-outputs`](https://documentation.dnanexus.com/user/helpstrings-of-sdk-command-line-utilities#dx-upload-all-outputs). The command expects the following directory structure:

```
$HOME
└── out
    └── < output name in dxapp.json >
        └── output file
```
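
To see the layout concretely, the sketch below builds it under a temporary directory rather than the real `$HOME`, so it can be tried outside a job. `fake_home` and `example_count.txt` are placeholders; only the output name `counts_txt` comes from this applet.

```shell
#!/usr/bin/env bash
# Sketch only: recreate the expected layout under a temporary directory.
fake_home="$(mktemp -d)"
output_name="counts_txt"    # must match an output name in dxapp.json
outputdir="${fake_home}/out/${output_name}"

mkdir -p "${outputdir}"
echo "placeholder" > "${outputdir}/example_count.txt"

# List the resulting files relative to the fake home directory.
( cd "${fake_home}" && find out -type f )
# prints "out/counts_txt/example_count.txt"
```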

In your applet, upload all outputs by creating the output directory and then using `dx-upload-all-outputs` to upload the output files.

```shell
outputdir="${HOME}/out/counts_txt"
mkdir -p "${outputdir}"
cat count* \
  | awk '{sum += $1} END {print "Total reads = ", sum}' \
  > "${outputdir}/${mappings_sorted_bam_prefix}_count.txt"

dx-upload-all-outputs
```
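
The summing step can be checked in isolation. This sketch writes three made-up count files into a temporary directory and adds them with the same `awk` pattern; the file names and numbers are invented for the example.

```shell
#!/usr/bin/env bash
# Sketch only: sum per-chromosome counts from made-up files.
workdir="$(mktemp -d)"
printf '10\n' > "${workdir}/count_bam_chr1"
printf '25\n' > "${workdir}/count_bam_chr2"
printf '7\n'  > "${workdir}/count_bam_chrX"

# Each file holds one number; awk accumulates them and prints the total.
total="$(cat "${workdir}"/count_* | awk '{sum += $1} END {print sum}')"
echo "Total reads = ${total}"   # prints "Total reads = 42"
```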
