Using Dataset Extender

Common usage patterns for the Dataset Extender app.

An Apollo license is required to use Dataset Extender on the DNAnexus Platform. Org approval may also be required. Contact DNAnexus Sales for more information.

Adding Derived Phenotypes to an Existing Entity

  1. Identify the dataset you want to extend. If you are using the command line, ensure that you retrieve the record id.

  2. To add data to an existing entity, ensure the following conditions are met:

    1. The data is related to the entity in a one-to-one relationship.

    2. The data has the unique keys for the entity you are extending, preferably in the first column.

    3. Your column names do not overlap with any of the column names in the entity you are extending (the key column is the exception and may overlap).

  3. Save the data as a file in your project. Saving as comma delimited is recommended, but tab delimited is also supported with an extra input configuration.

  4. Run the Dataset Extender application with the following inputs (a worked command-line example follows these steps):

    1. Source Data - This should be set to your data file.

    2. Target Dataset - This is the dataset you want to extend.

    3. Target Entity Name - Only specify this if you are extending an entity that is not the main entity.

    4. Source Data Delimiter - Select tab ("\t") if you are using a TSV. The default is comma (",").

    5. When running through dx-toolkit, you can use a pattern as follows:

      dx run dataset-extender -isource_data=<file path> -itarget_dataset=<record id>

    6. For additional configuration guidance, refer to the Dataset Extender page.

  5. This process generates:

    1. A new dataset with the original data plus your new data

    2. A new database if the original database cannot be written to
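
For example, a minimal end-to-end sketch via dx-toolkit (the dataset name, file name, and record ID below are placeholders):

    # Locate the record id of the dataset to extend.
    dx find data --class record --name "my_dataset"

    # Upload the derived phenotypes file (entity keys in the first column).
    dx upload derived_phenotypes.csv

    # Extend the main entity of the dataset with the new columns.
    dx run dataset-extender \
      -isource_data=derived_phenotypes.csv \
      -itarget_dataset=record-XXXXXXXXXXXXXXXXXXXXXXXX \
      -y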

Supplementing a Dataset by Adding a New, Related Entity

  1. Identify the dataset you want to extend. If you are using the command line, ensure that you retrieve the record id.

  2. To add data as a new entity, ensure the following conditions are met:

    1. The data is related to the entity in a one-to-one or many-to-one relationship.

    2. The data has a column with values that correspond to the keys for the entity you are extending, preferably in the first column.

  3. Save the data as a file in your project. Saving as comma delimited is recommended, but tab delimited is also supported with an extra input configuration.

  4. Run the Dataset Extender application with the following inputs (a worked command-line example follows these steps):

    1. Source Data - This should be set to your data file

    2. Target Dataset - This is the dataset you want to extend

    3. Build New Entity - This needs to be changed to true

    4. New Entity Name - The name of the new entity you are creating. This cannot overlap with any other entity title in the Target Dataset

    5. Target Entity Name - Only specify this if you are extending an entity that is not the main entity

    6. Source Data Delimiter - Select tab ("\t") if you are using a TSV. The default is comma (",").

    7. When running through dx-toolkit, you can use a pattern as follows:

      dx run dataset-extender -isource_data=<file path> -itarget_dataset=<record id> -ibuild_new_entity=true -inew_entity_name=<entity name> -itarget_entity_name=<entity title the data relates to>

    8. For additional configuration guidance, refer to the Dataset Extender page.

  5. This process generates:

    1. A new dataset with the original data plus your new data

    2. A new database if the original database cannot be written to
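
For example, a sketch of adding a new entity via dx-toolkit ("lab_results" and "participant" are placeholder names; the target entity title must match one in your dataset):

    # Upload the file holding the new entity's data.
    dx upload lab_results.csv

    # Ingest the file as a new entity linked to an existing entity.
    dx run dataset-extender \
      -isource_data=lab_results.csv \
      -itarget_dataset=record-XXXXXXXXXXXXXXXXXXXXXXXX \
      -ibuild_new_entity=true \
      -inew_entity_name=lab_results \
      -itarget_entity_name=participant \
      -y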

    Dataset Extender

    Learn to use Dataset Extender, which allows you to expand a core Apollo dataset, then access the newly added data.

    An Apollo license is required to use Dataset Extender on the DNAnexus Platform. Org approval may also be required. Contact DNAnexus Sales for more information.

    Overview

    The Dataset Extender application is meant to help expand your core dataset so that you and your entire team can access newly generated data. It is a lightweight app focused on quickly extending core datasets with newly generated or acquired data that is to be shared with collaborators.
    • The primary target of this application is the ingestion of analysis results or newly derived phenotypes.

    • The application allows the user to ingest raw data and have the system automatically type cast the data, build categorical codings, and link the data with the core data, even across multiple datasets.

    • The results of the application are:

      • The new data is ingested into the Apollo database.

      • A new dataset is created with access to previously ingested data and the newly extended datasets.

    • The application is not meant for permanent expansion of core data, given that it offers limited configuration over how data is ingested; it is intended to help grow a team's dataset with minimal effort.

      • Expansion of a core, controlled dataset is meant to be performed using the Data Model Loader to allow for greater control over system interpretation of data.

      • Each run of the Dataset Extender app results in a new raw table being created. Heavy use on the same growing dataset can lead to degraded performance.

    Using the Dataset Extender App

    Launching the App

    To launch the Dataset Extender app, enter this command via the command line:

    dx run dataset-extender
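
    If the required inputs are not supplied on the command line, dx run prompts for them interactively. To review the app's full input and output spec first:

    # List the app's inputs, defaults, and outputs.
    dx run dataset-extender -h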

    Inputs

    [Image: Overview of all file inputs for the Dataset Extender app]

    The Dataset Extender app requires the following inputs:

    • Source Data - a delimited (CSV, TSV) or gzipped delimited file that contains the data with which to extend the dataset. The file must be no larger than 400 columns by 10,000,000 rows and must have a header.

    • Target Dataset Record - the dataset that is to be extended.

    • Instance Type - while the default should be sufficient for most small to medium datasets, if input files are large, increase the instance size to help complete the process efficiently.

    Additional Optional Inputs are:

    • Output Dataset Name - the name of the dataset that is created, which includes the newly ingested Source Data.

    • Database name - the name of the database to use (or create) if the data is to be written to a database that's different than the main database used in the Target Dataset Record. If this is left blank and the main database used in the Target Dataset Record is not writeable, a new database is automatically created and named db_<epoch>_<uuid>.

    • Table Name - the table name to which the Source Data is written. If left blank, the name of the Source Data file is used as the table name. If provided, ensure that the table name is unique for the database.

    • Target Entity Name - the entity to which the Source Data is linked. If left blank, the data is linked to the main entity of the dataset. The entity that is to be joined to must contain a local (or global) unique key.

    • Join Relationship - how the Source Data joins to the Target Entity. By default this is automatic, but a relationship of one-to-one or many-to-one can be forced.

    • Build New Entity - when set to False (default), the Target Entity is extended. If the Source Data does not have a one-to-one relationship with the Target Entity, this leads to an error. When set to True, a new entity is added to the dataset.

    • New Entity Name - the logical entity name of the Source Data. If left empty, the entity name is <target_entity_name>_extended_<epoch>. The entity name must be unique for the dataset.

    • Folder Path - the folder path shown in the Cohort Browser field explorer. If left empty, the folder path is the New Entity Name. You can nest the new fields being created in an existing folder using the same notation used for data loading. For example, you can use a path like "Main Folder>Nested Folder".

    • Source Data Delimiter - how the Source Data is delimited. By default this is set to comma (",") but this can be adjusted to tab ("\t").

    • Infer Categorical Code - a setting that converts a string field to a string categorical. This is the maximum number of distinct codes to treat as categorical. If this value is set to 0, no coding is inferred.

    • See app documentation for further granular configurations.
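
    As a sketch of combining several options in one run (the file name, record ID, and entity title are placeholders; output_dataset_name is an assumed input name inferred from the option label above, so confirm the exact spelling with dx run dataset-extender -h):

    # source_data, target_dataset, and target_entity_name are confirmed
    # input names; output_dataset_name below is an assumption based on
    # the option label above.
    dx run dataset-extender \
      -isource_data=lab_results.csv \
      -itarget_dataset=record-XXXXXXXXXXXXXXXXXXXXXXXX \
      -itarget_entity_name=participant \
      -ioutput_dataset_name=my_dataset_extended \
      -y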

    Process

    1. The Dataset Extender app loads the source data into a Spark table.

    2. The app configurations are used to automatically generate dictionary information and the coding information (if Infer Categorical Code is > 0).

    3. From the input target dataset, the app joins the dataset with the newly ingested data and generates a novel dataset with the combined information.
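
    These stages can be followed in real time in the job log; for example (the job ID is a placeholder):

    # Stream the job log to watch the load, inference, and join stages.
    dx watch job-XXXXXXXXXXXXXXXXXXXXXXXX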

    Outputs

    • Database - the ID of the database to which the Source Data was written.

    • Dataset - the dataset record created.

    • Logs - available under Project: .csv-loader/<job-id>-clusterlogs.tar.gz.

      • Spark cluster logs - for advanced troubleshooting.
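
      For example, the cluster logs for a finished job can be fetched and unpacked locally (substitute your own job ID):

      # Download the Spark cluster log archive from the project.
      dx download ".csv-loader/<job-id>-clusterlogs.tar.gz"

      # Unpack it for inspection.
      tar -xzf "<job-id>-clusterlogs.tar.gz"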

    Automatically Detected Data Types

    • Integer

    • Float

    • Date

      • yyyy-[m]m-[d]d

    • DateTime

      • yyyy-[m]m-[d]d

      • yyyy-[m]m-[d]d [h]h:[m]m:[s]s.[ms][ms][ms][us][us][us]

      • yyyy-[m]m-[d]dT[h]h:[m]m:[s]s.[ms][ms][ms][us][us][us]

      • yyyy-[m]m-[d]dT[h]h:[m]m:[s]s.[ms][ms][ms][us][us][us]Z

      • yyyy-[m]m-[d]dT[h]h:[m]m:[s]s.[ms][ms][ms][us][us][us]-[h]h:[m]m

      • yyyy-[m]m-[d]dT[h]h:[m]m:[s]s.[ms][ms][ms][us][us][us]+[h]h:[m]m

    • String Categorical

      • This is detected based on the Infer Categorical Code input.
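
    As an illustration, a file like the following (column names are invented for the example) would be detected as integer, float, Date, DateTime, and string categorical, respectively, assuming Infer Categorical Code is at least the number of distinct codes in the last column:

      # example.csv
      participant_id,age,bmi,visit_date,sample_taken,smoking_status
      P0001,54,27.3,2021-03-05,2021-03-05T09:15:00.000000,never
      P0002,61,31.8,2021-03-06,2021-03-06T10:02:30.000000,former
      P0003,47,24.1,2021-03-07,2021-03-07T08:45:12.000000,current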

    Best Practices

    1. Ensure that your column headers are database friendly and do not include special characters or spaces.

    2. When extending an entity, ensure that your column names are unique.

    3. Adjust the Infer Categorical Code based on the number of rows you are adding in your extension. If your extension adds 500,000 rows, you likely want the number higher.

    4. For best performance, aggregate your planned extensions together so that you are adding multiple columns at a time versus running the app multiple times and adding only 1 column at a time.

    5. Ensure that the delimiter setting matches your file; the default is comma.

    6. When specifying a target entity, ensure you are using the entity title and not the display name available in the Cohort Browser.
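
    For instance, headers can be checked for spaces or special characters and cleaned before upload (the file name is a placeholder):

      # Inspect the header row.
      head -1 derived_phenotypes.csv

      # Replace spaces and hyphens in the header row with underscores
      # (writes a .bak backup of the original file).
      sed -i.bak '1s/[ -]/_/g' derived_phenotypes.csv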
