Parallel xargs by Chr
This applet slices a BAM file by canonical chromosome and performs a parallelized SAMtools view.
How is the SAMtools dependency provided?
The compiled SAMtools binary is placed directly in the <applet dir>/resources directory. Any files found in the resources/ directory are uploaded so that they are present in the root directory of the worker. In this case:
├── Applet dir
│   ├── src
│   ├── dxapp.json
│   ├── resources
│       ├── usr
│           ├── bin
│               ├── < samtools binary >

When this applet is run on a worker, the resources/ folder is placed in the worker's root directory /:

/
├── usr
│   ├── bin
│       ├── < samtools binary >
├── home
│   ├── dnanexus

/usr/bin is part of the $PATH variable, so in the script, you can reference the samtools command directly, as in samtools view -c ....
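Because the script relies on samtools being found through $PATH, a guard near the top of the script can make a missing or misplaced binary fail early. The following is a minimal sketch and not part of the original applet: require_tool is a hypothetical helper name, and sh appears in the demo line only because it exists on any system.

```shell
# require_tool NAME: succeed if NAME resolves through $PATH, otherwise
# print a diagnostic to stderr and return 1.
# (Hypothetical helper for illustration; not part of the applet source.)
require_tool() {
  if ! command -v "$1" >/dev/null 2>&1; then
    echo "error: '$1' not found on PATH: $PATH" >&2
    return 1
  fi
}

# Demo with a tool that is always present. In the applet you might call
# `require_tool samtools || exit 1` before the first samtools command.
require_tool sh && echo "sh found at $(command -v sh)"
```

If the binary was staged somewhere other than resources/usr/bin, a check like this would report it before any BAM file is downloaded.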
Parallel Run
Slice BAM
First, download the BAM file and slice it by canonical chromosome, writing the names of the resulting *.bam files to bamfiles.txt.
To split a BAM by region, you need a *.bai index. You can either create an app(let) that takes the *.bai as an input or generate the *.bai in the applet. In this tutorial, the *.bai is generated in the applet, sorting the BAM first if necessary.
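To make the "canonical chromosome" selection concrete, the sketch below runs the header filter on a small hand-written SAM header instead of a real BAM. The awk test is the same one the applet code uses; the function name canonical_chromosomes and the sample reference names are made up for illustration. Note that the second alternative in the pattern is anchored only at the start, so it keeps any name beginning with a digit, X, Y, or M (for example MT).

```shell
# canonical_chromosomes: read SAM header text on stdin and print the SN
# value of each @SQ line whose name passes the canonical-chromosome test.
# (Hypothetical helper; in the applet the input comes from `samtools view -H`.)
canonical_chromosomes() {
  grep '^@SQ' \
    | awk -F '\t' '{print $2}' \
    | awk -F ':' '{if ($2 ~ /^chr[0-9XYM]+$|^[0-9XYM]/) {print $2}}'
}

# Made-up header: two canonical contigs, one unplaced contig, one decoy.
printf '@HD\tVN:1.6\tSO:coordinate\n@SQ\tSN:chr1\tLN:248956422\n@SQ\tSN:chrX\tLN:156040895\n@SQ\tSN:chrUn_KI270302v1\tLN:2274\n@SQ\tSN:chrEBV\tLN:171823\n' \
  | canonical_chromosomes
```

On this sample header only chr1 and chrX are printed; the unplaced and decoy contigs are dropped, so no slice would be written for them.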
dx download "${mappings_bam}"

# Try to index the BAM as downloaded; indexing fails if it is not coordinate-sorted.
bam_filename="${mappings_bam_name}"
if ! samtools index "${bam_filename}"; then
  # Sort into a new file rather than over the input, then index the sorted copy.
  bam_filename="${mappings_bam_prefix}.sorted.bam"
  samtools sort -o "${bam_filename}" "${mappings_bam_name}"
  samtools index "${bam_filename}"
fi

# Read reference names from the @SQ header lines. Keep names of the form
# chr<digits/X/Y/M>, plus names that start with a digit, X, Y, or M.
chromosomes=$( \
  samtools view -H "${bam_filename}" \
  | grep "^@SQ" \
  | awk -F '\t' '{print $2}' \
  | awk -F ':' '{if ($2 ~ /^chr[0-9XYM]+$|^[0-9XYM]/) {print $2}}')

# Write one BAM per chromosome and record each file name.
for chr in $chromosomes; do
  samtools view -b -o "bam_${chr}.bam" "${bam_filename}" "${chr}"
  echo "bam_${chr}.bam"
done > bamfiles.txt

Xargs SAMtools view
In the previous section, the name of each sliced BAM file was written to bamfiles.txt. Next, run samtools view -c on each slice, using that file as the input to xargs.
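The fan-out-and-sum pattern can be tried without SAMtools. In the sketch below, wc -l stands in for samtools view -c, and sum_line_counts is a hypothetical helper name. The -P option (a GNU/BSD xargs extension, not POSIX) caps how many processes run at once; without it, xargs runs the commands one after another.

```shell
# sum_line_counts LIST [JOBS]: run one counting command per file named in
# LIST, up to JOBS at a time, and print the sum of the per-file counts.
# `wc -l` stands in for `samtools view -c` so the sketch runs anywhere.
# (Hypothetical helper; not part of the applet source.)
sum_line_counts() {
  local list_file=$1
  local max_procs=${2:-$(nproc 2>/dev/null || echo 1)}
  # -I {} passes one file name per invocation; -P caps concurrent processes.
  # Completion order varies between runs, which is fine because we only sum.
  <"$list_file" xargs -I {} -P "$max_procs" sh -c 'wc -l < "$1"' _ {} \
    | awk '{s += $1} END {print s + 0}'
}

# Demo on two throwaway files (2 lines + 1 line).
demo_dir=$(mktemp -d)
printf 'a\nb\n' > "$demo_dir/one.txt"
printf 'c\n' > "$demo_dir/two.txt"
printf '%s\n' "$demo_dir/one.txt" "$demo_dir/two.txt" > "$demo_dir/list.txt"
sum_line_counts "$demo_dir/list.txt" 2
rm -rf "$demo_dir"
```

Because the per-file results are only added together, the order in which the parallel processes finish does not affect the total.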
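counts_txt_name="${mappings_bam_prefix}_count.txt"

# Count reads in each slice and sum the counts. -P runs up to $(nproc) samtools
# processes at once; without it, xargs runs them one after another.
# $view_options is intentionally unquoted so that it splits into separate arguments.
sum_reads=$( \
  <bamfiles.txt xargs -I {} -P "$(nproc)" samtools view -c $view_options '{}' \
  | awk '{s+=$1} END {print s+0}')
echo "Total Count: ${sum_reads}" > "${counts_txt_name}"

Upload results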
The results file is uploaded using the standard bash process:
1. Upload a file to the job execution's container.
2. Provide the DNAnexus link as a job's output using the script dx-jobutil-add-output <output name>.

counts_txt_id=$(dx upload "${counts_txt_name}" --brief)
dx-jobutil-add-output counts_txt "${counts_txt_id}" --class=file