Dataset Extender

Note: This is intended only for customers with a DNAnexus Apollo license and Org approval (if applicable). Contact [email protected] for more information.

dx run dataset-extender (use -h for help)

The Dataset Extender is a lightweight application meant to help expand your core dataset so that the entire team can access newly generated data. It is focused on quickly extending core datasets with newly generated or acquired data that is to be shared with collaborators.

A representation of the dataset extender's goal: to add new data to existing datasets.

The primary use of this application is the ingestion of analysis results or newly derived phenotypes. Note that the application is not meant for permanent expansion of core data, given that it offers limited configuration over how data is ingested; rather, it is meant to help grow a team's dataset with minimal effort. The user supplies raw data, and the system automatically type casts it, builds categorical codings, and links the data with the core data, even across multiple datasets. The result is the data ingested into the Spark database, plus a new dataset with access to both the previously ingested data and the newly added data. Data can be extended via one-to-one or many-to-one relationships.

Expansion of a core, controlled dataset should instead be performed using the Data Model Loader, which allows greater control over how the system interprets the data.

Note that each run of the Dataset Extender app results in a new raw table being created, so heavy use on the same growing dataset can lead to degraded performance.

Overview

Overview of all file inputs for the Dataset Extender app

Inputs

The Dataset Extender app requires the following inputs (a minimal example invocation follows this list):

  • Source Data - a delimited (CSV, TSV) or gzipped delimited file that contains the data to extend the dataset with. Note that this file can be no larger than 400 columns by 10,000,000 rows, and the file must have a header.

  • Target Dataset Record - the dataset that will act as the dataset to be extended.

  • Instance Type - while the default should be sufficient for most small to medium datasets, if the input files are large, increase the instance type to help the process complete efficiently.
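
As a minimal sketch, a run with only the required inputs might look like the following. The input field names (source_data, dataset) and the IDs are illustrative assumptions, not the app's confirmed interface; run dx run dataset-extender -h to confirm the exact names for your version of the app.

  # Launch with only the required inputs (input names and IDs are placeholders)
  dx run dataset-extender \
    -isource_data=file-xxxx \
    -idataset=record-yyyy \
    --instance-type mem2_ssd1_v2_x16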

Additional Optional Inputs (see the fuller example invocation after this list) are:

  • Input Configurations

    • Output Dataset Name - name of the dataset that will be created that includes the newly ingested Source Data

    • Database Name - the name of the database to use (or create) if the data is to be written to a database different from the main database used in the Target Dataset Record. Note that if this is left blank and the main database used in the Target Dataset Record is not writable, a new database will be automatically created and named 'db_<epoch>_<uuid>'

    • Table Name - the table name to which the Source Data will be written. If left blank, the basename of the Source Data file is used as the table name. If provided, ensure that the table name is unique within the database.

    • Target Entity Name - the entity to which the Source Data will be linked. If left blank, the data will be linked to the main entity of the dataset. The entity being joined to must contain a locally (or globally) unique key.

    • Join Relationship - how the Source Data joins to the Target Entity. By default this is determined automatically, but a relationship of one-to-one or many-to-one can be forced.

    • New Entity Name - the logical entity name of the Source Data. If left empty, the entity name will be '<target_entity_name>_extended_<epoch>'. Note that the entity name must be unique within the dataset.

    • Folder Path - the folder path shown in the Cohort Browser field explorer. If left empty, the folder path will be the New Entity Name. Note that you can nest in an existing folder using the same notation used for data loading (e.g. "Main Folder>Nested Folder")

    • Source Data Delimiter - how the Source Data is delimited. By default this is set to comma (",") but this can be adjusted to tab ("\t")

    • Infer Categorical Code - a setting that converts a string field to a string categorical. This is the maximum number of distinct values for a field to be treated as categorical. For example, a field with three distinct values such as "Never"/"Former"/"Current" would be coded as a categorical (assuming the threshold is at least 3), while a free-text field with thousands of distinct values would remain a plain string. If this value is set to 0, no coding will be inferred.

    • See app documentation for further granular configurations.
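
A fuller invocation that overrides several optional configurations might look like the sketch below. The input field names (output_dataset_name, table_name, join_relationship, delimiter) are again assumptions for illustration; consult the app's -h output for the actual names.

  # Override selected optional configurations (all input names are assumed)
  dx run dataset-extender \
    -isource_data=file-xxxx \
    -idataset=record-yyyy \
    -ioutput_dataset_name=core_dataset_extended \
    -itable_name=new_phenotypes \
    -ijoin_relationship=one-to-one \
    -idelimiter='\t'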

Process

The Dataset Extender app first loads the Source Data into a Spark table. Once the data is loaded, the app configurations are used to automatically generate the dictionary information and the coding information (if Infer Categorical Code is > 0). The app then reads the input Target Dataset Record, joins the two sets of data, and generates a new dataset with the combined information.
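
A launched run can be followed, and its resulting record inspected, with standard dx commands (the job and record IDs below are placeholders):

  # Stream the job log while the data is loaded and joined
  dx watch job-zzzz

  # Inspect the newly created dataset record once the job completes
  dx describe record-yyyy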

Outputs

  • Database - the id of the database to which the Source Data was written.

  • Dataset - the dataset record created.

  • Logs - available under Project: .csv-loader/<job-id>-clusterlogs.tar.gz.

    • Spark cluster logs - for advanced troubleshooting.
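
For advanced troubleshooting, the cluster log archive can be retrieved with dx download; for example, assuming your dx session is set to the project root:

  dx download '.csv-loader/<job-id>-clusterlogs.tar.gz'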