Distributed by Chr (sh)
View full source code on GitHub
How is the SAMtools dependency provided?
The SAMtools dependency is resolved by declaring an apt package in the dxapp.json file's runSpec.execDepends:
{
  ...
  "runSpec": {
    ...
    "execDepends": [
      {
        "name": "samtools"
      }
    ]
  }
  ...
}
For additional information, see execDepends.
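Beyond the package name, execDepends entries also accept optional fields such as version and stages (the latter restricts installation to the named entry points). A sketch, with the version pin chosen purely for illustration:

"execDepends": [
  {
    "name": "samtools",
    "version": "1.9",
    "stages": ["main", "count_func"]
  }
]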
Entry Points
Distributed bash-interpreter apps use bash functions to declare entry points. This app has the following entry points specified as bash functions:
main
count_func
sum_reads
Each entry point is executed on a new worker with its own system requirements. The instance type for each entry point can be set in the dxapp.json file's runSpec.systemRequirements:
{
  "runSpec": {
    ...
    "systemRequirements": {
      "main": {
        "instanceType": "mem1_ssd1_x4"
      },
      "count_func": {
        "instanceType": "mem1_ssd1_x2"
      },
      "sum_reads": {
        "instanceType": "mem1_ssd1_x4"
      }
    },
    ...
  }
}
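If a particular subjob needs a different instance type at runtime, dx-jobutil-new-job also accepts an --instance-type flag. A minimal sketch (the instance choice here is illustrative, not taken from this app):

# override the systemRequirements default for one subjob
dx-jobutil-new-job \
  -isegmentedbam_file="${bam_seg_file}" \
  -ichr="${chr}" \
  --instance-type mem2_ssd1_x4 \
  count_func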
main
The main function slices the initial *.bam file, generating a *.bai index if one was not provided. The input *.bam is sliced into smaller *.bam files containing only reads from canonical chromosomes. First, the main function downloads the BAM file and parses the header for chromosome names.
dx download "${mappings_sorted_bam}"

chromosomes=$( \
  samtools view -H "${mappings_sorted_bam_name}" \
    | grep "@SQ" \
    | awk -F '\t' '{print $2}' \
    | awk -F ':' '{if ($2 ~ /^chr[0-9XYM]+$|^[0-9XYM]/) {print $2}}')
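To see what the pipeline extracts, consider a hypothetical @SQ header line and the value each stage produces:

# hypothetical header line printed by `samtools view -H`
@SQ	SN:chr1	LN:248956422
# after awk -F '\t' '{print $2}'          ->  SN:chr1
# after the awk -F ':' canonical filter   ->  chr1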
Sliced *.bam files are uploaded, and their file IDs are passed to the count_func entry point using the dx-jobutil-new-job command.
if [ -z "${mappings_sorted_bai}" ]; then
  samtools index "${mappings_sorted_bam_name}"
else
  dx download "${mappings_sorted_bai}" -o "${mappings_sorted_bam_name}.bai"
fi

count_jobs=()
for chr in $chromosomes; do
  seg_name="${mappings_sorted_bam_prefix}_${chr}.bam"
  samtools view -b "${mappings_sorted_bam_name}" "${chr}" > "${seg_name}"
  bam_seg_file=$(dx upload "${seg_name}" --brief)
  count_jobs+=($(dx-jobutil-new-job \
    -isegmentedbam_file="${bam_seg_file}" \
    -ichr="${chr}" \
    count_func))
done
Outputs from the count_func entry points are referenced as job-based object references (JBORs) and used as inputs for the sum_reads entry point.
readfiles=()
for job in "${count_jobs[@]}"; do
  readfiles+=("-ireadfiles=${job}:counts_txt")
done

sum_reads_job=$(
  dx-jobutil-new-job \
    "${readfiles[@]}" \
    -ifilename="${mappings_sorted_bam_prefix}" \
    sum_reads
)
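Each element of readfiles is therefore a JBOR input string tying an input field to a prior job's output, of the form below (job ID shown as a placeholder):

-ireadfiles=job-xxxx:counts_txt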
The output of the sum_reads entry point is used as the output of the main entry point via a JBOR reference created with the dx-jobutil-add-output command.
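A minimal sketch of that final step in main, assuming main's output field is named counts_txt (an assumption; the excerpt does not show the field name):

# output field name "counts_txt" is assumed for illustration
dx-jobutil-add-output counts_txt "${sum_reads_job}:read_sum_file" --class=jobref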
count_func
This entry point downloads the sliced *.bam and runs samtools view -c on it. The generated counts_txt output file is uploaded as the entry point's job output via the dx-jobutil-add-output command.
count_func () {
  echo "Value of segmentedbam_file: '${segmentedbam_file}'"
  echo "Chromosome being counted: '${chr}'"

  dx download "${segmentedbam_file}"
  readcount=$(samtools view -c "${segmentedbam_file_name}")
  # keep the chromosome name out of the printf format string
  printf "%s:\t%s\n" "${chr}" "${readcount}" > "${segmentedbam_file_prefix}.txt"
  readcount_file=$(dx upload "${segmentedbam_file_prefix}.txt" --brief)
  dx-jobutil-add-output counts_txt "${readcount_file}" --class=file
}
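Each counts_txt file holds a single tab-separated line; for chr1 it would look like this (the count value is a placeholder):

chr1:	52534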
sum_reads
The main entry point triggers this subjob, providing the output of count_func as input. This entry point gathers all the files generated by the count_func jobs and sums the counts, returning read_sum_file as the entry point output.
sum_reads () {
  set -e -x -o pipefail
  printf "Value of read file array: %s\n" "${readfiles[@]}"
  echo "Filename: ${filename}"

  echo "Summing values in files and creating output read file"
  for read_f in "${readfiles[@]}"; do
    echo "${read_f}"
    dx download "${read_f}" -o - >> chromosome_result.txt
  done

  count_file="${filename}_chromosome_count.txt"
  total=$(awk '{s+=$2} END {print s}' chromosome_result.txt)
  echo "Total reads: ${total}" >> "${count_file}"

  readfile_name=$(dx upload "${count_file}" --brief)
  dx-jobutil-add-output read_sum_file "${readfile_name}" --class=file
}
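Finally, a hedged example of launching the applet from the command line, using the input name implied by the bash variables above (the applet name and file ID are placeholders):

# applet name and file ID are placeholders
dx run distributed-by-chr-sh \
  -imappings_sorted_bam=file-xxxx \
  --watch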