Using Dataset Extender

Common usage patterns for the Dataset Extender app.

An Apollo license is required to use Dataset Extender on the DNAnexus Platform. Org approval may also be required. Contact DNAnexus Sales for more information.

Adding Derived Phenotypes to an Existing Entity

  1. Identify the dataset you want to extend. If you are using the command line, ensure that you retrieve the record id.

  2. To add data to an existing entity, ensure the following conditions are met:

    1. The data is related to the entity in a one-to-one relationship.

    2. The data has the unique keys for the entity you are extending, preferably in the first column.

    3. Your column names do not overlap with any of the column names in the entity you are extending (the key column is the exception and may overlap).

  3. Save the data as a file in your project. Saving as comma delimited is recommended, but tab delimited is also supported with an extra input configuration.

  4. Run the Dataset Extender application with the following inputs (a worked command-line example follows these steps):

    1. Source Data - This should be set to your data file.

    2. Target Dataset - This is the dataset you want to extend.

    3. Target Entity Name - Only specify this if you are extending an entity that is not the main entity.

    4. Source Data Delimiter - Select tab ("\t") if you are using a TSV. The default is comma (",").

    5. When running through dx-toolkit, you can use a pattern as follows:

      dx run dataset-extender -isource_data=<file path> -itarget_dataset=<record id>

    6. For additional configuration guidance, refer to the Dataset Extender page.

  5. This process generates:

    1. A new dataset with the original data plus your new data

    2. A new database if the original database cannot be written to
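
For example, a minimal end-to-end sketch via dx-toolkit (the dataset name, file name, and record ID below are placeholders):

    # Locate the record id of the dataset to extend.
    dx find data --class record --name "my_dataset"

    # Upload the derived phenotypes file (entity keys in the first column).
    dx upload derived_phenotypes.csv

    # Extend the main entity of the dataset with the new columns.
    dx run dataset-extender \
      -isource_data=derived_phenotypes.csv \
      -itarget_dataset=record-XXXXXXXXXXXXXXXXXXXXXXXX \
      -y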

Supplementing a Dataset by Adding a New, Related Entity

  1. Identify the dataset you want to extend. If you are using the command line, ensure that you retrieve the record id.

  2. To add data as a new entity, ensure the following conditions are met:

    1. The data is related to the entity in a one-to-one or many-to-one relationship.

    2. The data has a column with values that correspond to the keys for the entity you are extending, preferably in the first column.

  3. Save the data as a file in your project. Saving as comma delimited is recommended, but tab delimited is also supported with an extra input configuration.

  4. Run the Dataset Extender application with the following inputs (a worked command-line example follows these steps):

    1. Source Data - This should be set to your data file

    2. Target Dataset - This is the dataset you want to extend

    3. Build New Entity - This needs to be changed to true

    4. New Entity Name - The name of the new entity you are creating. This cannot overlap with any other entity title in the Target Dataset

    5. Target Entity Name - Only specify this if you are extending an entity that is not the main entity

    6. Source Data Delimiter - Select tab ("\t") if you are using a TSV. The default is comma (",").

    7. When running through dx-toolkit, you can use a pattern as follows:

      dx run dataset-extender -isource_data=<file path> -itarget_dataset=<record id> -ibuild_new_entity=true -inew_entity_name=<entity name> -itarget_entity_name=<entity title the data relates to>

    8. For additional configuration guidance, refer to the Dataset Extender page.

  5. This process generates:

    1. A new dataset with the original data plus your new data

    2. A new database if the original database cannot be written to
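
For example, a sketch of adding a new entity via dx-toolkit ("lab_results" and "participant" are placeholder names; the target entity title must match one in your dataset):

    # Upload the file holding the new entity's data.
    dx upload lab_results.csv

    # Ingest the file as a new entity linked to an existing entity.
    dx run dataset-extender \
      -isource_data=lab_results.csv \
      -itarget_dataset=record-XXXXXXXXXXXXXXXXXXXXXXXX \
      -ibuild_new_entity=true \
      -inew_entity_name=lab_results \
      -itarget_entity_name=participant \
      -y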

    Dataset Extender

    Learn to use Dataset Extender, which allows you to expand a core Apollo dataset, then access the newly added data.

    An Apollo license is required to use Dataset Extender on the DNAnexus Platform. Org approval may also be required. Contact DNAnexus Sales for more information.

    Overview

    The Dataset Extender application is meant to help expand your core dataset so that you and your entire team can access newly generated data. It is a lightweight app focused on quickly extending core datasets with newly generated or acquired data that is to be shared with collaborators.
    • The primary target of this application is the ingestion of analysis results or newly derived phenotypes.

    • The application allows the user to ingest raw data and have the system automatically type cast the data, build categorical codings, and link the data with the core data, even across multiple datasets.

    • The results of the application are:

      • The new data is ingested into the Apollo database.

      • A new dataset is created with access to previously ingested data and the newly extended datasets.

    • The application is not meant for permanent expansion of core data, given that it offers limited configuration over how data is ingested; it is intended to help grow a team's dataset with minimal effort.

      • Expansion of a core, controlled dataset is meant to be performed using the Data Model Loader to allow for greater control over system interpretation of data.

      • Each run of the Dataset Extender app results in a new raw table being created. Heavy use on the same growing dataset can lead to degraded performance.

    Using the Dataset Extender App

    Launching the App

    To launch the Dataset Extender app, enter this command via the command line:

    dx run dataset-extender
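
    If the required inputs are not supplied on the command line, dx run prompts for them interactively. To review the app's full input and output spec first:

    # List the app's inputs, defaults, and outputs.
    dx run dataset-extender -h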

    Inputs

    [Image: Overview of all file inputs for the Dataset Extender app]

    The Dataset Extender app requires the following inputs:

    • Source Data - a delimited (CSV, TSV) or gzipped delimited file that contains the data with which to extend the dataset. The file must be no larger than 400 columns by 10,000,000 rows and must have a header.

    • Target Dataset Record - the dataset that is to be extended.

    • Instance Type - while the default should be sufficient for most small to medium datasets, if input files are large, increase the instance size to help complete the process efficiently.

    Additional Optional Inputs are:

    • Output Dataset Name - the name of the dataset that is created, which includes the newly ingested Source Data.

    • Database name - the name of the database to use (or create) if the data is to be written to a database that's different than the main database used in the Target Dataset Record. If this is left blank and the main database used in the Target Dataset Record is not writeable, a new database is automatically created and named db_<epoch>_<uuid>.

    • Table Name - the table name to which the Source Data is written. If left blank, the name of the Source Data file is used as the table name. If provided, ensure that the table name is unique for the database.

    • Target Entity Name - the entity to which the Source Data is linked. If left blank, the data is linked to the main entity of the dataset. The entity that is to be joined to must contain a local (or global) unique key.

    • Join Relationship - how the Source Data joins to the Target Entity. By default this is automatic, but a relationship of one-to-one or many-to-one can be forced.

    • Build New Entity - when set to False (default), the Target Entity is extended. If the Source Data does not have a one-to-one relationship with the Target Entity, this leads to an error. When set to True, a new entity is added to the dataset.

    • New Entity Name - the logical entity name of the Source Data. If left empty, the entity name is <target_entity_name>_extended_<epoch>. The entity name must be unique for the dataset.

    • Folder Path - the folder path shown in the Cohort Browser field explorer. If left empty, the folder path is the New Entity Name. You can nest the new fields being created in an existing folder using the same notation used for data loading. For example, you can use a path like "Main Folder>Nested Folder".

    • Source Data Delimiter - how the Source Data is delimited. By default this is set to comma (",") but this can be adjusted to tab ("\t").

    • Infer Categorical Code - a setting that converts a string field to a string categorical. This is the maximum number of distinct codes to treat as categorical. If this value is set to 0, no coding is inferred.

    • See app documentation for further granular configurations.
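
    As a sketch of combining several options in one run (the file name, record ID, and entity title are placeholders; output_dataset_name is an assumed input name inferred from the option label above, so confirm the exact spelling with dx run dataset-extender -h):

    # source_data, target_dataset, and target_entity_name are confirmed
    # input names; output_dataset_name below is an assumption based on
    # the option label above.
    dx run dataset-extender \
      -isource_data=lab_results.csv \
      -itarget_dataset=record-XXXXXXXXXXXXXXXXXXXXXXXX \
      -itarget_entity_name=participant \
      -ioutput_dataset_name=my_dataset_extended \
      -y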

    Process

    1. The Dataset Extender app loads the source data into a Spark table.

    2. The app configurations are used to automatically generate dictionary information and the coding information (if Infer Categorical Code is > 0).

    3. From the input target dataset, the app joins the dataset with the newly ingested data and generates a novel dataset with the combined information.
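
    These stages can be followed in real time in the job log; for example (the job ID is a placeholder):

    # Stream the job log to watch the load, inference, and join stages.
    dx watch job-XXXXXXXXXXXXXXXXXXXXXXXX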

    Outputs

    • Database - the ID of the database to which the Source Data was written.

    • Dataset - the dataset record created.

    • Logs - available under Project: .csv-loader/<job-id>-clusterlogs.tar.gz.

      • Spark cluster logs - for advanced troubleshooting.
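
      For example, the cluster logs for a finished job can be fetched and unpacked locally (substitute your own job ID):

      # Download the Spark cluster log archive from the project.
      dx download ".csv-loader/<job-id>-clusterlogs.tar.gz"

      # Unpack it for inspection.
      tar -xzf "<job-id>-clusterlogs.tar.gz"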

    Automatically Detected Data Types

    • Integer

    • Float

    • Date

      • yyyy-[m]m-[d]d

    • DateTime

      • yyyy-[m]m-[d]d

      • yyyy-[m]m-[d]d [h]h:[m]m:[s]s.[ms][ms][ms][us][us][us]

      • yyyy-[m]m-[d]dT[h]h:[m]m:[s]s.[ms][ms][ms][us][us][us]

      • yyyy-[m]m-[d]dT[h]h:[m]m:[s]s.[ms][ms][ms][us][us][us]Z

      • yyyy-[m]m-[d]dT[h]h:[m]m:[s]s.[ms][ms][ms][us][us][us]-[h]h:[m]m

      • yyyy-[m]m-[d]dT[h]h:[m]m:[s]s.[ms][ms][ms][us][us][us]+[h]h:[m]m

    • String Categorical

      • This is detected based on the Infer Categorical Code input.
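
    As an illustration, a file like the following (column names are invented for the example) would be detected as integer, float, Date, DateTime, and string categorical, respectively, assuming Infer Categorical Code is at least the number of distinct codes in the last column:

      # example.csv
      participant_id,age,bmi,visit_date,sample_taken,smoking_status
      P0001,54,27.3,2021-03-05,2021-03-05T09:15:00.000000,never
      P0002,61,31.8,2021-03-06,2021-03-06T10:02:30.000000,former
      P0003,47,24.1,2021-03-07,2021-03-07T08:45:12.000000,current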

    Best Practices

    1. Ensure that your column headers are database friendly and do not include special characters or spaces.

    2. When extending an entity, ensure that your column names are unique.

    3. Adjust the Infer Categorical Code based on the number of rows you are adding in your extension. If your extension adds 500,000 rows, you likely want the number higher.

    4. For best performance, aggregate your planned extensions together so that you are adding multiple columns at a time versus running the app multiple times and adding only 1 column at a time.

    5. Ensure that the delimiter setting matches your file; the default is comma.

    6. When specifying a target entity, ensure you are using the entity title and not the display name available in the Cohort Browser.
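
    For instance, headers can be checked for spaces or special characters and cleaned before upload (the file name is a placeholder):

      # Inspect the header row.
      head -1 derived_phenotypes.csv

      # Replace spaces and hyphens in the header row with underscores
      # (writes a .bak backup of the original file).
      sed -i.bak '1s/[ -]/_/g' derived_phenotypes.csv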
