Creating Multi-Assay Datasets

Learn how to create multi-assay datasets that combine different data types for comprehensive analysis.

An Apollo license is required to use Apollo Datasets on the DNAnexus Platform. Org approval may also be required. Contact DNAnexus Sales for more information.

Multi-assay datasets integrate data from diverse sources into a single, comprehensive dataset record on the DNAnexus Platform. This integration can be achieved in two ways:

  1. Combining different data types: This involves merging distinct datasets, such as a germline variants dataset, somatic variants dataset, and gene expression dataset, into a unified structure. Each of these is a complete collection of data for a specific assay.

  2. Combining multiple instances of the same assay type: This involves consolidating data from different experimental methods that produce the same type of information. For example, a single germline dataset might include WES and WGS data instances, both of which are specific datasets containing germline variants.

This integrated approach enables a more holistic analysis across different data modalities and supports complex research queries that span multiple molecular data layers.

Why Create Multi-Assay Datasets?

Creating multi-assay datasets allows you to:

  • Integrate diverse data types - Combine clinical, genomic (germline and somatic), and molecular expression data for comprehensive patient profiles.

  • Compare methodologies - Include multiple assays of the same type, such as WES and WGS for germline variant datasets, to assess methodological differences.

  • Enable cross-modal analysis - Discover relationships between genomic variants and gene expression patterns, for example.

  • Streamline workflows - Access all relevant data through a single dataset interface.

Dataset Merging Process

With the Assay Dataset Merger app, you can create multi-assay datasets by sequentially merging one assay at a time. This ensures data integrity throughout the process, and provides granular control and flexibility in combining diverse datasets.

Key Principles of Merging

  1. Sequential merging - You can merge only two datasets at a time. This methodical approach gives you precise control as you build your final dataset.

  2. Target-Source relationship - Each merge involves a Target dataset and a Source dataset. The Target dataset provides the main entity structure, such as patient IDs. The Source dataset contributes a new assay. The entities from your Target dataset are always preserved in the merged result, so your core relationships remain consistent.

  3. Preserved entity relationships - The main entity from the Target dataset becomes the primary entity in the merged result.

  4. Flexible order - You can merge your assays in any order, though we recommend starting with clinical data as it provides a solid foundation for your patient profiles.

Sample Merging Workflow

The following diagram shows a multi-step data processing workflow. First, a "Clinical Dataset" and "Germline Variants Dataset" merge to create a combined dataset. Second, this new dataset merges with a "Somatic Variants Dataset." Third, the resulting dataset merges with a "Gene Expression Dataset" to produce the "Final Multi-Assay Dataset."

Step-by-Step Merging Process

1. Plan Your Strategy

Before you begin, plan how to combine your data.

  • Identify Your Core Dataset: Start with your primary dataset, which should include your main entities (patients or samples) along with clinical and phenotypic data.

  • Map Your Assays: Understand the relationships between your different assays and how they link back to your core entities.

  • Choose a Merging Approach:

    • Clinical-First (Recommended): Start with your clinical dataset as the target and add other assays sequentially.

    • Assay-First: Begin with a comprehensive assay dataset and then merge clinical data.

    • Modular: Create smaller, purpose-built multi-assay datasets for specific research questions.

Consider the relationships between your different assays:
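To see how an assay links back to your core entities, it can help to compare ID lists before merging. The sketch below is a minimal, hypothetical helper (the function name `id_overlap_report` and the one-ID-per-line input files are assumptions, not part of any DNAnexus tool). It reports how many IDs the Target and Source share, and how many appear in only one of them.

```shell
#!/bin/sh
# Hypothetical helper: compare entity IDs between a Target and a Source export.
# Each input file is assumed to hold one ID per line, with no header row.
id_overlap_report() {
  target_ids=$1
  source_ids=$2
  tmp_dir=$(mktemp -d) || return 1

  # comm requires sorted input, so sort and de-duplicate both lists first.
  sort -u "$target_ids" > "$tmp_dir/target.sorted"
  sort -u "$source_ids" > "$tmp_dir/source.sorted"

  # -12 keeps only the lines common to both files.
  shared=$(comm -12 "$tmp_dir/target.sorted" "$tmp_dir/source.sorted" | wc -l | tr -d ' ')
  # -23 keeps lines unique to the first file (Target entities without assay data).
  target_only=$(comm -23 "$tmp_dir/target.sorted" "$tmp_dir/source.sorted" | wc -l | tr -d ' ')
  # -13 keeps lines unique to the second file (Source IDs absent from the Target).
  source_only=$(comm -13 "$tmp_dir/target.sorted" "$tmp_dir/source.sorted" | wc -l | tr -d ' ')

  rm -rf "$tmp_dir"
  echo "shared=$shared target_only=$target_only source_only=$source_only"
}
```

For example, with Target IDs P1, P2, P3 and Source IDs P2, P3, P4, the function prints `shared=2 target_only=1 source_only=1`. A non-zero `source_only` count is worth reviewing before you merge, since this page states only that Target entities are preserved.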

2. Prepare Your Datasets

Ensure each dataset is properly formatted and loaded onto the platform using the appropriate tools.

  • Clinical data: Use the Data Model Loader.

  • Germline variants: Use the VCF ETL Orchestrator.

  • Somatic variants: Use the Somatic Variant Assay Loader.

  • Gene expression: Use the Molecular Expression Assay Loader.

3. Execute the Merges

Using the Assay Dataset Merger app, you can perform sequential merges. Use descriptive names for each output to keep track of your progress.

# Step 1: Merge clinical data with first germline assay
dx run assay_dataset_merger \
  -isource_dataset=germline_variants_wes \
  -itarget_dataset=clinical_dataset \
  -ioutput_dataset_name=clinical_plus_germline \
  -ilinking_database_name=linkage_db_01

# Step 2: Add somatic variants to the merged dataset
dx run assay_dataset_merger \
  -isource_dataset=somatic_variants_primary_tumor \
  -itarget_dataset=clinical_plus_germline \
  -ioutput_dataset_name=clinical_germline_somatic \
  -ilinking_database_name=linkage_db_02

# Step 3: Add gene expression data
dx run assay_dataset_merger \
  -isource_dataset=gene_expression_dataset \
  -itarget_dataset=clinical_germline_somatic \
  -ioutput_dataset_name=complete_multi_assay_dataset \
  -ilinking_database_name=linkage_db_03
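
Each step above uses the previous step's output as its target, so the commands form a chain. The sketch below only prints such a chain for review and does not submit any jobs. It reuses the input names from the commands above; the helper name `print_merge_chain` and its `SOURCE:OUTPUT` argument format are made up for illustration. Adding `-y` (skip the confirmation prompt) and `--wait` (block until the job finishes) is an assumption about how you might sequence the jobs so that each output exists before the next merge starts.

```shell
#!/bin/sh
# Print the dx commands for a chain of merges (dry run; nothing is submitted).
# Usage: print_merge_chain TARGET SOURCE:OUTPUT [SOURCE:OUTPUT ...]
print_merge_chain() {
  target=$1
  shift
  step=1
  for pair in "$@"; do
    source_ds=${pair%%:*}   # text before the first colon
    output_ds=${pair#*:}    # text after the first colon
    printf 'dx run assay_dataset_merger -y --wait -isource_dataset=%s -itarget_dataset=%s -ioutput_dataset_name=%s -ilinking_database_name=linkage_db_%02d\n' \
      "$source_ds" "$target" "$output_ds" "$step"
    target=$output_ds       # the merged output becomes the next target
    step=$((step + 1))
  done
}

# Reproduce the three-step plan from this section.
print_merge_chain clinical_dataset \
  germline_variants_wes:clinical_plus_germline \
  somatic_variants_primary_tumor:clinical_germline_somatic \
  gene_expression_dataset:complete_multi_assay_dataset
```

Printing the plan first lets you check the target of every step before running anything. You can then paste the printed lines into a terminal one at a time.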

4. Validate Your Merged Dataset

After each merge, it's crucial to validate the result.

  • Check Entity Counts: Make sure you have the expected number of patients or samples.

  • Confirm Data Accessibility: Verify that all assays are visible and accessible through the Cohort Browser.

  • Validate Linkages: Ensure proper relationships between entities and assays.

  • Review Metadata: Ensure assay names and descriptions are clear.
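
Some of these checks can also be run from the command line. The following is a sketch, assuming a recent dx-toolkit that provides `dx extract_dataset` and `dx extract_assay`. The function names are hypothetical, and `patient.patient_id` is an assumed field name, so substitute one that `--list-fields` reports for your dataset. Listing an assay type that the dataset does not contain may print an error, so read the output with that in mind.

```shell
#!/bin/sh
# Hypothetical helpers for inspecting a merged dataset with dx-toolkit.
# Nothing runs until you call a function from a logged-in dx session.

# List the entities, fields, and assays recorded in a dataset.
list_dataset_contents() {
  dataset=$1
  dx extract_dataset "$dataset" --list-entities
  dx extract_dataset "$dataset" --list-fields
  # One call per molecular assay type covered on this page.
  for assay_type in germline somatic expression; do
    dx extract_assay "$assay_type" "$dataset" --list-assays
  done
}

# Count distinct values of one ID field, for example "patient.patient_id"
# (an assumed name; use a field reported by --list-fields).
count_unique_ids() {
  dataset=$1
  id_field=$2
  # "-o -" writes the CSV to stdout; tail drops the header row.
  dx extract_dataset "$dataset" --fields "$id_field" -o - \
    | tail -n +2 | sort -u | wc -l | tr -d ' '
}

# Example calls (uncomment to run against your own dataset):
# list_dataset_contents complete_multi_assay_dataset
# count_unique_ids complete_multi_assay_dataset patient.patient_id
```

Running `count_unique_ids` on the Target dataset before the merge and on the output afterwards gives a quick way to confirm that the entity count did not change.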

Using Your Multi-Assay Dataset

Once you have created your multi-assay dataset, you can explore, define, and analyze it using the Cohort Browser, and discover new relationships across your different data types.
