Learn important terminology before using parallel and distributed computing paradigms on the DNAnexus Platform.
There are many definitions of, and approaches to, parallelizing and distributing workloads in the cloud (here’s a particularly helpful Stack Exchange post on the subject). To keep our documentation easy to follow, when discussing concurrent computing paradigms we’ll refer to:
Parallel: Using multiple threads or logical cores to concurrently process a workload.
Distributed: Using multiple machines (in our case instances in the cloud) that communicate to concurrently process a workload.
Keep these formal definitions in mind as you read through the tutorials and learn how to compute concurrently on the DNAnexus Platform.
Distributed bash-interpreter apps use bash functions to declare entry points. Entry points are executed as subjobs on new workers with their own respective system requirements. This app has the following entry points specified as bash functions:
main
count_func
sum_reads
The main function takes the initial *.bam file, generates an index *.bai if needed, and obtains the list of regions from the *.bam file. The regions are then sent, in batches of 10, as input to the count_func entry point using the dx-jobutil-new-job command.
Job outputs from the count_func entry point are referenced as Job Based Object References (JBOR) and used as inputs for the sum_reads entry point.
The job output of the sum_reads entry point is used as the output of the main entry point via a JBOR reference in the dx-jobutil-add-output command.
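A minimal sketch of how this scatter and gather might be wired together in the main function is shown below. The input names (mappings_bam, regions), the batching with xargs, and the exact output names are illustrative assumptions, not the applet’s exact code.

```bash
# Scatter: launch one count_func subjob per batch of 10 regions
# ("$regions" is assumed to hold one region name per line)
count_jobs=()
while read -r batch; do
  job_id=$(dx-jobutil-new-job count_func \
    -imappings_bam="$mappings_bam" \
    -iregions="$batch")
  count_jobs+=("$job_id")
done < <(echo "$regions" | xargs -n 10)

# Gather: pass each subjob's counts_txt output to sum_reads as a JBOR array
sum_args=()
for job_id in "${count_jobs[@]}"; do
  sum_args+=("-icounts=${job_id}:counts_txt")
done
sum_job=$(dx-jobutil-new-job sum_reads "${sum_args[@]}")

# Forward the sum_reads output as this applet's own read_sum output
dx-jobutil-add-output read_sum "${sum_job}:read_sum" --class=jobref
```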
This entry point performs a SAMtools count of the 10 regions passed as input. This execution runs on a new worker, so variables from other functions (e.g. main()) are not accessible here.
Once the output file with counts is created, it is uploaded to the platform and assigned as the entry point’s job output counts_txt via the command dx-jobutil-add-output.
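A sketch of what count_func might look like, assuming the hypothetical inputs mappings_bam (a file) and regions (a space-separated list of region names):

```bash
# count_func runs on a fresh worker; variables set in main() are not visible here.
dx download "$mappings_bam" -o input.bam
samtools index input.bam

# $regions is deliberately unquoted so each region becomes its own argument
samtools view -c input.bam $regions > readcount.txt

# Upload the result and register it as this entry point's counts_txt output
counts_id=$(dx upload readcount.txt --brief)
dx-jobutil-add-output counts_txt "$counts_id" --class=file
```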
The main entry point triggers this subjob, providing the output of count_func as an input JBOR. This entry point gathers all the readcount.txt files generated by the count_func jobs and sums the totals.
This entry point returns read_sum as a JBOR, which the main function then references as its own job output.
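A sketch of the gather step, assuming counts is an array-of-files input where each file contains a single integer:

```bash
# sum_reads gathers the per-batch counts and totals them
total=0
for i in "${!counts[@]}"; do
  dx download "${counts[$i]}" -o "count_${i}.txt"
  total=$(( total + $(cat "count_${i}.txt") ))
done

echo "$total" > read_sum.txt
sum_id=$(dx upload read_sum.txt --brief)
dx-jobutil-add-output read_sum "$sum_id" --class=file
```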
View full source code on GitHub
The SAMtools dependency is resolved by declaring an Apt-Get package in the dxapp.json file’s runSpec.execDepends field. For additional information, see the execDepends documentation.
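For reference, an execDepends declaration for SAMtools typically looks something like the following (abbreviated to the relevant field):

```json
{
  "runSpec": {
    "execDepends": [
      {"name": "samtools"}
    ]
  }
}
```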
Distributed bash-interpreter apps use bash functions to declare entry points. This app has the following entry points specified as bash functions:
main
count_func
sum_reads
Each entry point is executed on a new worker with its own system requirements. The instance type can be set in the dxapp.json file’s runSpec.systemRequirements:
The main function slices the initial *.bam file, generating an index *.bai if needed. The input *.bam is then sliced into smaller *.bam files containing only reads from canonical chromosomes. First, the main function downloads the BAM file and gets the headers.
Sliced *.bam files are uploaded and their file IDs are passed to the count_func entry point using the dx-jobutil-new-job command.
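A rough sketch of the slicing step, assuming the canonical chromosomes are taken from the BAM header’s @SQ lines and that the sliced file IDs are handed to count_func; the exact input wiring may differ from the applet’s source:

```bash
# Download the input BAM and index it so it can be sliced by region
dx download "$mappings_bam" -o input.bam
samtools index input.bam

# Slice out each canonical chromosome, upload the slice, and launch a subjob
count_jobs=()
for chr in $(samtools view -H input.bam | grep '^@SQ' | cut -f 2 | sed 's/^SN://' \
             | grep -E '^(chr)?([0-9]+|X|Y)$'); do
  samtools view -b input.bam "$chr" > "${chr}.bam"
  sliced_id=$(dx upload "${chr}.bam" --brief)
  count_jobs+=("$(dx-jobutil-new-job count_func -imappings_bam="$sliced_id")")
done
```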
Outputs from the count_func entry points are referenced as Job Based Object References (JBOR) and used as inputs for the sum_reads entry point.
The output of the sum_reads entry point is used as the output of the main entry point via a JBOR reference using the command dx-jobutil-add-output.
This entry point downloads the sliced *.bam and runs the command samtools view -c on it. The generated counts_txt output file is uploaded as the entry point’s job output via the command dx-jobutil-add-output.
The main entry point triggers this subjob, providing the output of count_func as an input. This entry point gathers all the files generated by the count_func jobs and sums them. This function returns read_sum_file as the entry point output.
This applet creates a count of reads from a BAM format file.
View full source code on GitHub
The SAMtools dependency is resolved by declaring an Apt-Get package in the dxapp.json file’s runSpec.execDepends field. For additional information, please refer to the execDepends documentation.
Distributed python-interpreter apps use python decorators on functions to declare entry points. This app has the following entry points as decorated functions:
main
samtoolscount_bam
combine_files
Each entry point is executed on a new worker with its own system requirements. In this example, we split and merge our files on basic mem1_ssd1_x2 instances and perform our more intensive processing step on a mem1_ssd1_x4 instance. The instance type can be set in the dxapp.json file’s runSpec.systemRequirements:
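A representative systemRequirements block for the per-entry-point instance types described above might look like this (a sketch, not necessarily the tutorial’s exact file):

```json
{
  "runSpec": {
    "systemRequirements": {
      "main": {"instanceType": "mem1_ssd1_x2"},
      "samtoolscount_bam": {"instanceType": "mem1_ssd1_x4"},
      "combine_files": {"instanceType": "mem1_ssd1_x2"}
    }
  }
}
```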
The main function scatters by region bins based on user input. If no *.bai file is present, the applet generates an index *.bai.
Region bins are passed to the samtoolscount_bam entry point using the dxpy.new_dxjob function.
Outputs from the samtoolscount_bam entry points are used as inputs for the combine_files entry point. The output of the combine_files entry point is used as the output of the main entry point.
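As a rough sketch of how this wiring might look inside the main entry point: the input names mappings_bam and regions, the bin size, and the count_file output name are illustrative assumptions, while readcount_fileDX matches the subjob output described below.

```python
import dxpy

@dxpy.entry_point("main")
def main(mappings_bam, regions, regions_per_job=10):
    # Split the region list into bins of regions_per_job regions each
    region_bins = [regions[i:i + regions_per_job]
                   for i in range(0, len(regions), regions_per_job)]

    # Scatter: launch one samtoolscount_bam subjob per region bin
    subjobs = [
        dxpy.new_dxjob(
            fn_input={"mappings_bam": mappings_bam, "regions": region_bin},
            fn_name="samtoolscount_bam")
        for region_bin in region_bins]

    # Gather: hand each subjob's readcount_fileDX output to combine_files as a JBOR
    gather_job = dxpy.new_dxjob(
        fn_input={"counts": [j.get_output_ref("readcount_fileDX") for j in subjobs]},
        fn_name="combine_files")

    # The gather job's output is passed straight through as the output of main
    return {"count_file": gather_job.get_output_ref("count_file")}
```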
This entry point downloads its inputs and builds a samtools view -c command for each region in the input bin. The dictionary returned from dxpy.download_all_inputs() is used to reference input names and paths.
This entry point returns {"readcount_fileDX": readCountDXlink}, a JBOR referencing an uploaded text file. This approach to scatter-gather stores the results in files and uploads/downloads the information as needed, which exaggerates the pattern for tutorial purposes. You are able to pass types other than file, such as int.
The main entry point triggers this subjob, providing the output of samtoolscount_bam as an input. This entry point gathers all the files generated by the samtoolscount_bam jobs and sums them.
Important: While the main entry point triggers the processing and gathering entry points, keep in mind that the main entry point doesn’t do any heavy lifting or processing. Notice in the runSpec JSON above that we start with a lightweight instance, scale up for the processing entry point, then finally scale down for the gathering step.
This applet tutorial will perform a SAMtools count using parallel threads.
View full source code on GitHub
In order to take full advantage of the scalability that cloud computing offers, our scripts have to implement the correct methodologies. This applet tutorial will:
Install SAMtools
Download BAM file
Count regions in parallel
The SAMtools dependency is resolved by declaring an Apt-Get package in the dxapp.json file’s runSpec.execDepends field. For additional information, please refer to the execDepends documentation.
The dxpy.download_all_inputs() function downloads all input files into the /home/dnanexus/in directory. A folder is created for each input, and the file(s) are downloaded to that directory. For convenience, the dxpy.download_all_inputs function returns a dictionary containing the following keys:
<var>_path (string): full absolute path to where the file was downloaded.
<var>_name (string): name of the file, including extension.
<var>_prefix (string): name of the file minus the longest matching pattern found in the dxapp.json I/O pattern field.
The path, name, and prefix key-value pattern is repeated for all applet file class inputs specified in the dxapp.json. In this example, our dictionary has the following key-value pairs:
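A sketch of how this looks in practice, assuming a single file input named mappings_bam (a hypothetical input name) whose uploaded file is called sample.bam:

```python
import dxpy

# Download every file input declared in dxapp.json into /home/dnanexus/in/<input name>/
inputs = dxpy.download_all_inputs()

# For a file input named "mappings_bam", the dictionary would contain entries
# along these lines (each value is a list, one element per file):
#   inputs["mappings_bam_path"]   -> ["/home/dnanexus/in/mappings_bam/sample.bam"]
#   inputs["mappings_bam_name"]   -> ["sample.bam"]
#   inputs["mappings_bam_prefix"] -> ["sample"]
bam_path = inputs["mappings_bam_path"][0]
```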
Before we can perform our parallel SAMtools count, we must determine the workload for each thread. We arbitrarily set our number of workers to 10 and set the workload per thread to 1 chromosome at a time. There are various ways to achieve multithreaded processing in Python. For the sake of simplicity, we use multiprocessing.dummy, a wrapper around Python’s threading module.
Each worker creates a string to be called in a subprocess.Popen call. We use the multiprocessing.dummy.Pool.map(<func>, <iterable>) function to call the helper function run_cmd for each string in the iterable of view commands. Because we perform our multithreaded processing using subprocess.Popen, we will not be alerted to any failed processes. We verify our closed workers in the verify_pool_status helper function.
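The helper names run_cmd and verify_pool_status come from the tutorial; the sketch below shows one way the pattern could fit together, with the BAM path and chromosome list assumed for illustration.

```python
import subprocess
from multiprocessing.dummy import Pool  # thread-backed Pool

def run_cmd(cmd):
    """Run one shell command; return (stdout, stderr, exit code)."""
    proc = subprocess.Popen(
        cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    stdout, stderr = proc.communicate()
    return stdout, stderr, proc.returncode

def verify_pool_status(results):
    """Fail loudly if any worker's command exited non-zero."""
    failures = [stderr for _, stderr, code in results if code != 0]
    if failures:
        raise RuntimeError("{} worker(s) failed".format(len(failures)))

bam_path = "input.bam"                                     # assumed local path
chromosomes = ["chr{}".format(i) for i in range(1, 11)]    # assumed: 1 region per worker
view_cmds = ["samtools view -c {} {}".format(bam_path, c) for c in chromosomes]

pool = Pool(10)                                            # 10 worker threads
results = pool.map(run_cmd, view_cmds)
pool.close()
pool.join()
verify_pool_status(results)
```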
Important: In this example we use subprocess.Popen to process and verify our results in verify_pool_status. In general, it is considered good practice to use Python’s built-in subprocess convenience functions; in this case, subprocess.check_call would achieve the same goal.
Each worker returns a read count of just one region in the BAM file. We sum the results and return the total as the job output. We use the dx-toolkit Python SDK’s dxpy.upload_local_file function to upload and generate a DXFile corresponding to our result file. For Python, job outputs have to be a dictionary of key-value pairs, with the keys being job output names as defined in the dxapp.json and the values being the output values for the corresponding output classes. For files, the output value is a DXLink; we use the dxpy.dxlink function to generate the appropriate DXLink value.
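Continuing the sketch above, the summing and upload step might look like this; the output name counts_txt is an assumption for illustration.

```python
import dxpy

# Sum the per-region counts produced by the worker threads above
total = sum(int(stdout.decode().strip()) for stdout, _, _ in results)

with open("read_count.txt", "w") as fh:
    fh.write("Total reads: {}\n".format(total))

# Upload the result file and return it as the job output declared in dxapp.json
counts_file = dxpy.upload_local_file("read_count.txt")
output = {"counts_txt": dxpy.dxlink(counts_file)}
```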
This applet tutorial will perform a SAMtools count using parallel threads.
View full source code on GitHub
In order to take full advantage of the scalability that cloud computing offers, our scripts have to implement the correct methodologies. This applet tutorial will:
Install SAMtools
Download BAM file
Split workload
Count regions in parallel
The SAMtools dependency is resolved by declaring an Apt-Get package in the dxapp.json file’s runSpec.execDepends field.
This applet downloads all inputs at once using dxpy.download_all_inputs:
We process in parallel using the Python multiprocessing module, following the rather simple pattern shown below:
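The general shape of that pattern is sketched here; the per-region work and the region list are placeholders.

```python
from multiprocessing import Pool

def process_region(region):
    # per-region work goes here (e.g. a samtools view -c call)
    return region

regions = ["chr1", "chr2", "chr3"]   # illustrative workload
pool = Pool()                        # defaults to one worker per available core
results = pool.map(process_region, regions)
pool.close()
pool.join()
```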
This convenient pattern allows you to quickly orchestrate jobs on a worker. For a more detailed overview of the multiprocessing module, visit the Python docs.
We create several helpers in our applet script to manage our workload. One helper you may have seen before is run_cmd; we use this function to manage our subprocess calls:
Before we can split our workload, we need to know what regions are present in our BAM input file. We handle this initial parsing in the parse_sam_header_for_region function:
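A sketch of what such a helper could look like, parsing the @SQ lines of the BAM header; the tutorial’s actual implementation may differ in details.

```python
import subprocess

def parse_sam_header_for_region(bam_path):
    """Return the reference sequence (region) names listed in the BAM header."""
    header = subprocess.check_output(
        ["samtools", "view", "-H", bam_path]).decode()
    regions = []
    for line in header.splitlines():
        if line.startswith("@SQ"):
            for field in line.split("\t"):
                if field.startswith("SN:"):
                    regions.append(field[3:])
    return regions
```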
Once our workload is split and we’ve started processing, we wait for and review the status of each Pool worker. Then, we merge and output our results.
Note: The run_cmd function returns a tuple containing the stdout, stderr, and exit code of the subprocess call. We parse these outputs from our workers to determine whether the run failed or passed.
This applet performs a basic SAMtools count on a series of sliced (by canonical chromosome) BAM files in parallel using wait.
View full source code on GitHub
The SAMtools dependency is resolved by declaring an Apt-Get package in the dxapp.json file’s runSpec.execDepends field.
The command set -e -x -o pipefail will assist you in debugging this applet:
-e causes the shell to exit immediately if a command returns a non-zero exit code.
-x prints commands as they are executed, which is very useful for tracking the job’s status or pinpointing the exact point of failure.
-o pipefail makes a pipeline’s return code the first non-zero exit code of any command in it. (Typically, the return code of a pipeline is the exit code of its last command, which can create difficult-to-debug problems.)
The *.bai file is an optional job input. You can check for an empty or unset var using the bash test [[ -z "${var}" ]]. Then, you can download or create a *.bai index as needed.
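For example, a sketch of that check, assuming the optional input is named mappings_bai and the BAM has already been downloaded as input.bam:

```bash
# If no index was provided, build one; otherwise download it next to the BAM
if [[ -z "${mappings_bai}" ]]; then
  samtools index input.bam              # creates input.bam.bai
else
  dx download "$mappings_bai" -o input.bam.bai
fi
```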
Bash’s job control system allows for easy management of multiple processes. In this example, you can run bash commands in the background while controlling the maximum number of concurrent executions in the foreground. Place a process in the background by appending the character & to a command.
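A minimal sketch of this background-and-wait pattern, assuming the sliced BAMs are named chr*.bam:

```bash
# Launch one samtools count per sliced BAM in the background...
for sliced_bam in chr*.bam; do
  samtools view -c "$sliced_bam" > "${sliced_bam%.bam}.count" &
done

# ...then wait for every background job to finish before summing
wait
cat ./*.count | awk '{sum += $1} END {print sum}' > read_count.txt
```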
Once the input BAM has been sliced, counted, and summed, the output counts_txt is uploaded using the command dx-upload-all-outputs. The directory structure required by dx-upload-all-outputs is shown below:
In your applet, upload all outputs by:
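A sketch of both the required layout and the upload step, assuming the result file is read_count.txt and the output is named counts_txt:

```bash
# dx-upload-all-outputs expects each output under $HOME/out/<output name>/:
#
#   $HOME/out/
#   └── counts_txt/
#       └── read_count.txt
#
mkdir -p "$HOME/out/counts_txt"
mv read_count.txt "$HOME/out/counts_txt/"
dx-upload-all-outputs
```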
This applet slices a BAM file by canonical chromosome then performs a parallelized samtools view -c using xargs. Type man xargs for general usage information.
View full source code on GitHub
The SAMtools compiled binary is placed directly in the <applet dir>/resources directory. Any files found in the resources/ directory will be uploaded so that they are present in the root directory of the worker. In our case:
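Assuming the binary is placed at resources/usr/bin/samtools (the src/code.sh name is illustrative), the applet directory would look something like this:

```
<applet dir>/
├── dxapp.json
├── src/
│   └── code.sh
└── resources/
    └── usr/
        └── bin/
            └── samtools
```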
When this applet is run on a worker, the contents of the resources/ folder will be placed in the worker’s root directory /:
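Under that assumption, the files then appear under the worker’s filesystem root:

```
/
└── usr/
    └── bin/
        └── samtools
```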
/usr/bin is part of the $PATH variable, so in our script, we can reference the samtools command directly, as in samtools view -c ...
First, we download our BAM file and slice it by canonical chromosome, writing the *.bam file names to a record file.
In order to split a BAM by regions, we need to have a *.bai index. You can either create an app(let) that takes the *.bai as an input or generate the *.bai in the applet. In this tutorial, we generate the *.bai in the applet, sorting the BAM first if necessary.
In the previous section, we recorded the name of each sliced BAM file in a record file. Now we will perform a samtools view -c on each slice, using the record file as input.
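A sketch of the xargs step, assuming the record file is named sliced_bams.txt with one BAM path per line:

```bash
# Run samtools view -c on each recorded slice, several slices at a time
xargs -P "$(nproc)" -I {} sh -c 'samtools view -c "$1" > "$1.count"' _ {} \
  < sliced_bams.txt

# Merge the per-slice counts into the final result
cat ./*.count | awk '{sum += $1} END {print sum}' > read_count.txt
```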
The results file is uploaded using the standard bash process:
Upload the file to the job execution’s container.
Provide the resulting DNAnexus link as the job’s output using the command dx-jobutil-add-output <output name>.
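For example (the output name counts_txt is an assumption for illustration):

```bash
# Upload the result to the job's container, then register it as the job output
counts_id=$(dx upload read_count.txt --brief)
dx-jobutil-add-output counts_txt "$counts_id" --class=file
```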