Parallel by Region (sh)
This applet performs a basic SAMtools count on a series of sliced (by canonical chromosome) BAM files in parallel using wait (Ubuntu 14.04+).
The SAMtools dependency is resolved by declaring an Apt-Get package in the
dxapp.json
runSpec.execDepends
. "runSpec": {
...
"execDepends": [
{"name": "samtools"}
]
}
The command
set -e -x -o pipefail
will assist you in debugging this applet:-e
causes the shell to immediately exit if a command returns a non-zero exit code.-x
prints commands as they are executed, which is very useful for tracking the job’s status or pinpointing the exact execution failure.-o pipefail
makes the return code the first non-zero exit code. (Typically, the return code of pipes is the exit code of the last command, which can create difficult to debug problems.)set -e -x -o pipefailecho "Value of mappings_sorted_bam: '${mappings_sorted_bam}'"echo "Value of mappings_sorted_bai: '${mappings_sorted_bai}'"mkdir workspacecd workspacedx download "${mappings_sorted_bam}"if [ -z "$mappings_sorted_bai" ]; thensamtools index "$mappings_sorted_bam_name"elsedx download "${mappings_sorted_bai}"fiThe*.bai
file was an optional job input. You can check for an empty or unsetvar
using the bash built-in test[[ - z ${var}} ]]
. Then, you can download or create a*.bai
index as needed.
Bash’s job control system allows for easy management of multiple processes. In this example, you can run bash commands in the background as you control maximum job executions in the foreground. Place processes in the background using the character
&
after a command. chromosomes=$(samtools view -H "${mappings_sorted_bam_name}" | grep "\@SQ" | awk -F '\t' '{print $2}' | awk -F ':' '{if ($2 ~ /^chr[0-9XYM]+$|^[0-9XYM]/) {print $2}}')
for chr in $chromosomes; do
samtools view -b "${mappings_sorted_bam_name}" "${chr}" -o "bam_${chr}.bam"
echo "bam_${chr}.bam"
done > bamfiles.txt
busyproc=0
while read -r b_file; do
echo "${b_file}"
if [[ "${busyproc}" -ge "$(nproc)" ]]; then
echo Processes hit max
while [[ "${busyproc}" -gt 0 ]]; do
wait -n # p_id
busyproc=$((busyproc-1))
done
fi
samtools view -c "${b_file}"> "count_${b_file%.bam}" &
busyproc=$((busyproc+1))
done <bamfiles.txt
while [[ "${busyproc}" -gt 0 ]]; do
wait -n # p_id
busyproc=$((busyproc-1))
done
Once the input bam has been sliced, counted, and summed, the output
counts_txt
is uploaded using the command dx-upload-all-outputs
. The following directory structure required for dx-upload-all-outputs is below:├── $HOME
│ ├── out
│ ├── < output name in dxapp.json >
│ ├── output file
In your applet, upload all outputs by:
outputdir="${HOME}/out/counts_txt"
mkdir -p "${outputdir}"
cat count* | awk '{sum+=$1} END{print "Total reads = ",sum}' > "${outputdir}/${mappings_sorted_bam_prefix}_count.txt"
dx-upload-all-outputs
Last modified 1yr ago