Distributed bash-interpreter apps use bash functions to declare entry points. Entry points are executed as subjobs on new workers with their own respective system requirements. This app has the following entry points specified as bash functions:
main
count_func
sum_reads
main
The main function takes the initial *.bam, generates an index *.bai if needed, and obtains the list of regions from the *.bam file. Each batch of 10 regions is sent as input to the count_func entry point using the dx-jobutil-new-job command.
regions=$(samtools view -H "${mappings_sorted_bam_name}" | grep "^@SQ" | sed 's/.*SN:\(\S*\)\s.*/\1/')
echo "Segmenting into regions"
count_jobs=()
counter=0
temparray=()
for r in ${regions}; do
    if [[ "${counter}" -ge 10 ]]; then
        echo "${temparray[@]}"
        count_jobs+=($(dx-jobutil-new-job -ibam_file="${mappings_sorted_bam}" -ibambai_file="${mappings_sorted_bai}" "${temparray[@]}" count_func))
        temparray=()
        counter=0
    fi
    temparray+=("-iregions=${r}") # Add to an array of -i<parameter> arguments
    counter=$((counter+1))
done
if [[ "${counter}" -gt 0 ]]; then # The loop only submits full batches of 10; submit any leftover regions here
    echo "${temparray[@]}"
    count_jobs+=($(dx-jobutil-new-job -ibam_file="${mappings_sorted_bam}" -ibambai_file="${mappings_sorted_bai}" "${temparray[@]}" count_func))
fi
echo "Merge count files, jobs:"
echo "${count_jobs[@]}"
readfiles=()
for count_job in "${count_jobs[@]}"; do
    readfiles+=("-ireadfiles=${count_job}:counts_txt")
done
echo "file name: ${mappings_sorted_bam_prefix}"
echo "Set file, readfile variables:"
echo "${readfiles[@]}"
countsfile_job=$(dx-jobutil-new-job -ifilename="${mappings_sorted_bam_prefix}" "${readfiles[@]}" sum_reads)
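The region extraction and batching performed by main can be tried locally without the platform. The following is a minimal sketch under stated assumptions, not the app's code: the SAM header is made up, the sed expression is a portable variant of the one above, the batch size is 2 instead of 10, and each batch is stored in a hypothetical batches array instead of being passed to dx-jobutil-new-job.

```bash
#!/usr/bin/env bash

# Made-up SAM header (tab-separated); contig names and lengths are illustrative only.
header=$(printf '@HD\tVN:1.6\tSO:coordinate\n@SQ\tSN:chr1\tLN:1000\n@SQ\tSN:chr2\tLN:900\n@SQ\tSN:chrM\tLN:16\n')

# Print the SN value of every @SQ line (portable variant of the grep | sed step).
regions=$(printf '%s\n' "$header" | sed -n 's/^@SQ.*SN:\([^[:space:]]*\).*/\1/p')

batch_size=2     # the app uses 10; 2 keeps the demo short
batches=()       # hypothetical: one element per would-be dx-jobutil-new-job call
temparray=()

for r in ${regions}; do
    # A full batch is flushed before the next region is added.
    if [[ "${#temparray[@]}" -ge "${batch_size}" ]]; then
        batches+=("${temparray[*]}")
        temparray=()
    fi
    temparray+=("-iregions=${r}")
done

# Flush the final, possibly partial, batch.
if [[ "${#temparray[@]}" -gt 0 ]]; then
    batches+=("${temparray[*]}")
fi

printf 'batch: %s\n' "${batches[@]}"
# batch: -iregions=chr1 -iregions=chr2
# batch: -iregions=chrM
```

The sketch tracks the batch size with the array length rather than a separate counter; the behavior is the same, including the final flush for a partial batch.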
count_func
This entry point performs a SAMtools count of the (up to) 10 regions passed as input. It runs on a new worker, so variables set in other functions (e.g. main()) are not accessible here.
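A local analogy may help with the isolation point: an unexported shell variable is invisible to a child bash process, much as main's variables are invisible on a subjob's worker. The sketch below is only an analogy, not how the platform launches subjobs; main_only_var is a hypothetical name, and passing it as a positional argument stands in for passing a value with -i<parameter>.

```bash
#!/usr/bin/env bash

# Set in the parent shell only (not exported), like a variable local to main.
main_only_var="chr1"

# The child process cannot see the parent's unexported variable.
seen_without_passing=$(bash -c 'echo "${main_only_var:-unset}"')

# Passing the value explicitly makes it available, as an input would.
seen_when_passed=$(bash -c 'echo "$1"' _ "${main_only_var}")

echo "not passed: ${seen_without_passing}"   # not passed: unset
echo "passed: ${seen_when_passed}"           # passed: chr1
```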
sum_reads
The main entry point triggers this subjob, providing the output of each count_func job as an input JBOR (job-based object reference). This entry point gathers all the readcount.txt files generated by the count_func jobs and sums the totals.
This entry point returns read_sum as a JBOR, which is then referenced as the job output.
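The summing step can be exercised locally with throwaway files. This is a minimal sketch, assuming each count file holds a single integer; the file names and counts are made up, and readfiles_path is filled by hand here rather than by dx-download-all-inputs.

```bash
#!/usr/bin/env bash

# Throwaway count files standing in for the downloaded count_func outputs.
workdir=$(mktemp -d)
printf '120\n' > "${workdir}/counts_0.txt"
printf '45\n'  > "${workdir}/counts_1.txt"
printf '7\n'   > "${workdir}/counts_2.txt"
readfiles_path=("${workdir}"/counts_*.txt)

readsum=0
for read_f in "${readfiles_path[@]}"; do
    temp=$(<"${read_f}")            # read the single integer in the file
    readsum=$((readsum + temp))
done

echo "Total reads: ${readsum}"      # Total reads: 172
rm -rf "${workdir}"
```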
sum_reads() {
    set -e -x -o pipefail
    echo "$filename"
    echo "Value of read file array '${readfiles[@]}'"
    dx-download-all-inputs
    echo "Value of read file path array '${readfiles_path[@]}'"
    echo "Summing values in files"
    readsum=0
    for read_f in "${readfiles_path[@]}"; do
        temp=$(cat "$read_f")
        readsum=$((readsum + temp))
    done
    echo "Total reads: ${readsum}" > "${filename}_counts.txt"
    read_sum_id=$(dx upload "${filename}_counts.txt" --brief)
    dx-jobutil-add-output read_sum "${read_sum_id}" --class=file