Dataset Extender

Note: This is intended only for customers with a DNAnexus Apollo license and Org approval (if applicable). Contact [email protected] for more information.

dx run dataset-extender (use -h for help)

The Dataset Extender app helps you expand a core dataset so that you and your entire team can access newly generated data. It is a lightweight app focused on quickly extending core datasets with newly generated or acquired data that will be shared with collaborators.

A representation of the dataset extender's goal: to add new data to existing datasets.
  • The application is primarily intended for ingesting analysis results or newly derived phenotypes.

  • The application lets the user ingest raw data and have the system automatically cast types, build categorical codings, and link the data with the core data, even across multiple datasets.

  • The result of the application is:

    • The new data is ingested into the Spark database.

    • A new dataset is created with access to previously ingested data and the newly extended datasets.

  • The application is not meant as a permanent expansion of core data, because it offers limited configuration of how data is ingested. It is intended to help grow a team's dataset with minimal effort.

    • Expansion of a core, controlled dataset is meant to be performed using the Data Model Loader to allow for greater control over system interpretation of data.

    • Note that each run of the Dataset Extender app creates a new raw table, so heavy use on the same growing dataset can lead to degraded performance.

Overview

Overview of all file inputs for the Dataset Extender app

Inputs

The Dataset Extender app requires the following inputs:

  • Source Data - a delimited (CSV, TSV) or gzip-compressed delimited file that contains the data to add to the dataset. The file can be no larger than 400 columns by 10,000,000 rows and must have a header.

  • Target Dataset Record - the dataset to be extended.

  • Instance Type - the default should be sufficient for most small to medium datasets. If the input files are large, choose a larger instance so that the process completes efficiently.
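The Source Data limits above can be checked locally before a run. The following is a minimal sketch, not part of the app: the function names are hypothetical, and it assumes the 10,000,000-row limit counts data rows (excluding the header) and that a file ending in ".gz" is gzip-compressed.

```python
import csv
import gzip

MAX_COLUMNS = 400        # column limit stated for Source Data
MAX_ROWS = 10_000_000    # row limit; assumed here to exclude the header row


def open_text(path):
    """Open a plain or gzip-compressed delimited file as text (hypothetical helper)."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt", newline="", encoding="utf-8")
    return open(path, "r", newline="", encoding="utf-8")


def check_source_data(path, delimiter=","):
    """Return (n_columns, n_rows); raise ValueError if a stated limit is exceeded."""
    with open_text(path) as handle:
        reader = csv.reader(handle, delimiter=delimiter)
        header = next(reader, None)
        # A header row is required, and every column needs a name.
        if not header or any(not name.strip() for name in header):
            raise ValueError("file needs a header row naming every column")
        if len(header) > MAX_COLUMNS:
            raise ValueError(f"{len(header)} columns exceeds the limit of {MAX_COLUMNS}")
        n_rows = 0
        for n_rows, _ in enumerate(reader, start=1):
            if n_rows > MAX_ROWS:
                raise ValueError(f"more than {MAX_ROWS} data rows")
    return len(header), n_rows
```

Calling check_source_data("results.tsv.gz", delimiter="\t") on a hypothetical file returns the column and row counts, which can also help when deciding whether the default instance type is likely to be enough.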

Additional Optional Inputs are:

  • Output Dataset Name - name of the dataset that will be created that includes the newly ingested Source Data.

  • Database Name - the name of the database to use (or create) if the data is to be written to a database that's different from the main database used in the Target Dataset Record. Note that if this is left blank and the main database used in the Target Dataset Record is not writeable, a new database will be automatically created and named db_<epoch>_<uuid>.

  • Table Name - the table name to which the Source Data will be written. If left blank, the basename of the Source Data file is used as the table name. If provided, ensure that the table name is unique for the database.

  • Target Entity Name - the entity which the source data will be linked to. If left blank, the data will be linked to the main entity of the dataset. The entity that is to be joined to must contain a local (or global) unique key.

  • Join Relationship - how the Source Data joins to the Target Entity. By default this is automatic, but a relationship of one-to-one or many-to-one can be forced.

  • Build New Entity - when set to False (default), the Target Entity is extended. If the Source Data does not have a one-to-one relationship with the Target Entity, this will lead to an error. When set to True, a new entity will be added to the dataset.

  • New Entity Name - the logical entity name of the Source Data. If left empty, the entity name will be <target_entity_name>_extended_<epoch>. Note that the entity name must be unique for the dataset.

  • Folder Path - the folder path shown in the Cohort Browser field explorer. If left empty, the folder path will be the New Entity Name. Note that you can nest the new fields being created in an existing folder using the same notation used for data loading (e.g. "Main Folder>Nested Folder").

  • Source Data Delimiter - how the Source Data is delimited. By default this is set to comma (",") but this can be adjusted to tab ("\t").

  • Infer Categorical Code - the maximum number of distinct codes a string field can have and still be converted to a string categorical. If this value is set to 0, no coding will be inferred.

  • See app documentation for further granular configurations.
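To make the Infer Categorical Code setting concrete, here is a minimal sketch of a threshold rule of the kind described above. It is an illustration only, not the app's implementation: the function name is hypothetical, and it assumes that a field qualifies when its distinct count is less than or equal to the threshold and that empty cells are treated as nulls rather than codes.

```python
def infer_categorical(values, max_codes):
    """Return the sorted distinct codes if the column qualifies, else None."""
    if max_codes <= 0:
        return None  # a value of 0 disables inference
    distinct = set()
    for value in values:
        if value == "":
            continue  # assumption: empty cells are nulls, not codes
        distinct.add(value)
        if len(distinct) > max_codes:
            return None  # too many distinct values: leave as a plain string
    return sorted(distinct) if distinct else None
```

With a threshold of 2, a column holding "yes", "no", and "yes" would be coded, while a column holding three different free-text values would stay a string.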

Process

  1. The Dataset Extender app loads the source data into a Spark table.

  2. The app configurations are used to automatically generate dictionary information and the coding information (if Infer Categorical Code is > 0).

  3. From the input target dataset, the app joins the dataset with the newly ingested data and generates a novel dataset with the combined information.
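The join in step 3 depends on how Source Data keys line up with the Target Entity's unique key. The sketch below shows one way to preview that relationship locally. It is not the app's logic: the function name is hypothetical, and it assumes "one-to-one" means each key appears at most once in the Source Data and "many-to-one" means at least one key repeats.

```python
from collections import Counter


def classify_join(source_keys, target_keys):
    """Return (relationship, unmatched_source_keys) for a proposed join."""
    target_counts = Counter(target_keys)
    # The entity being joined to must have a unique key.
    if any(count > 1 for count in target_counts.values()):
        raise ValueError("target key is not unique")
    source_counts = Counter(source_keys)
    # Source keys with no matching target row, in order of first appearance.
    unmatched = [key for key in source_counts if key not in target_counts]
    repeated = any(count > 1 for count in source_counts.values())
    return ("many-to-one" if repeated else "one-to-one"), unmatched
```

A "many-to-one" result suggests that extending the Target Entity in place would fail and that Build New Entity would be needed, as described under Inputs.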

Outputs

  • Database - the ID of the database to which the Source Data was written.

  • Dataset - the dataset record created.

  • Logs - available under Project: .csv-loader/<job-id>-clusterlogs.tar.gz.

    • Spark cluster logs - for advanced troubleshooting.

Data Types Auto Detected

  • Integer

  • Float

  • Date

    • yyyy-[m]m-[d]d

  • DateTime

    • yyyy-[m]m-[d]d

    • yyyy-[m]m-[d]d [h]h:[m]m:[s]s.[ms][ms][ms][us][us][us]

    • yyyy-[m]m-[d]dT[h]h:[m]m:[s]s.[ms][ms][ms][us][us][us]

    • yyyy-[m]m-[d]dT[h]h:[m]m:[s]s.[ms][ms][ms][us][us][us]Z

    • yyyy-[m]m-[d]dT[h]h:[m]m:[s]s.[ms][ms][ms][us][us][us]-[h]h:[m]m

    • yyyy-[m]m-[d]dT[h]h:[m]m:[s]s.[ms][ms][ms][us][us][us]+[h]h:[m]m

  • String Categorical

    • This is detected based on the Infer Categorical Code input.
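The formats listed above can be approximated with regular expressions to preview how a column might be typed. This is a rough sketch under stated assumptions, not the app's detector: the names are hypothetical, the fractional-seconds part is treated as optional with up to six digits, and a column of bare dates is reported as Date rather than DateTime because Date is tried first.

```python
import re

DATE = r"\d{4}-\d{1,2}-\d{1,2}"
TIME = r"\d{1,2}:\d{1,2}:\d{1,2}(?:\.\d{1,6})?"   # fraction assumed optional
OFFSET = r"(?:Z|[+-]\d{1,2}:\d{1,2})"

# Tried in order; the first pattern that matches every value wins.
PATTERNS = [
    ("Integer", re.compile(r"[+-]?\d+")),
    ("Float", re.compile(r"[+-]?(?:\d+\.\d*|\.\d+|\d+)(?:[eE][+-]?\d+)?")),
    ("Date", re.compile(DATE)),
    # Space-separated form has no offset; the "T" form may carry Z or an offset.
    ("DateTime", re.compile(rf"{DATE}(?: {TIME}|T{TIME}{OFFSET}?)?")),
]


def detect_type(values):
    """Return the first type whose pattern fully matches every non-empty value."""
    present = [value for value in values if value != ""]
    if not present:
        return "String"
    for name, pattern in PATTERNS:
        if all(pattern.fullmatch(value) for value in present):
            return name
    return "String"
```

Columns that fall through to "String" would then be candidates for String Categorical, depending on the Infer Categorical Code input.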

Best Practices

  1. Ensure that your column headers are database friendly and do not include special characters or spaces.

  2. When extending an entity, ensure that your column names are unique.

  3. Adjust the Infer Categorical Code based on the number of rows you are adding in your extension. For a large extension (for example, 500,000 rows), you will likely want a higher value than for a small one.

  4. For best performance, aggregate your planned extensions together so that you are adding multiple columns at a time versus running the app multiple times and adding just 1 column at a time.

  5. Ensure that the delimiter setting matches your file; the default is comma.

  6. When specifying a target entity, ensure you are using the entity title and not the display name available in the Cohort Browser.
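Practices 1 and 2 lend themselves to a quick check before upload. The sketch below is one possible interpretation, not a rule taken from the app: the function name is hypothetical, "database friendly" is assumed to mean letters, digits, and underscores with a non-digit first character, and duplicates are compared case-insensitively.

```python
import re

# Assumed definition of a database-friendly column name.
SAFE_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def check_headers(headers):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    seen = set()
    for name in headers:
        if not SAFE_NAME.fullmatch(name):
            problems.append(f"{name!r}: use only letters, digits, and underscores")
        key = name.lower()  # assumption: names differing only by case collide
        if key in seen:
            problems.append(f"{name!r}: duplicate column name")
        seen.add(key)
    return problems
```

When extending an existing entity, the entity's current field names could be appended to the list being checked so that clashes with existing columns are caught as well.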