Data Catalog Loader

Note: A license is required to use the Omics Data Catalog on the DNAnexus Platform. Contact DNAnexus Sales for more information.

The Data Catalog Loader app ingests, updates, and manages structured metadata in the Omics Data Catalog. This app transforms CSV files and Illumina Sample Sheets into searchable metadata records that integrate with the DNAnexus Platform.

You can also use the Omics Data Catalog API directly to programmatically manage metadata records.

Overview

The Data Catalog Loader enables organizations to create structured, searchable metadata catalogs that span multiple research projects. Unlike traditional data ingestion tools that create isolated datasets, this app builds interconnected metadata networks that support organization-wide discovery while maintaining project-based access controls.

[Figure: Overview of all required and optional file inputs for the Data Catalog Loader app]

When to Use This App

Use the Data Catalog Loader when you need to:

  • Create consistent metadata standards across research projects to harmonize data from multiple studies.

  • Make data findable across your organization for cross-project discovery, without compromising access controls.

  • Connect entities, such as samples, assays, or analysis results, in meaningful relationships through research workflows.

Relationship to Other Ingestion Tools

| Tool | Purpose | Use Case |
| --- | --- | --- |
| Data Catalog Loader | Metadata management and discovery | Cross-project metadata catalogs |
| Data Model Loader | Phenotypic data ingestion | Creating Apollo Datasets for analysis workflows |
| Molecular Expression Assay Loader | Omics data ingestion | Ingesting expression matrices for computational analysis |

How to Use the App

Using the UI

To use the Data Catalog Loader within the DNAnexus Platform:

  1. In the DNAnexus Platform, go to Tools > Tools Library.

  2. For the Data Catalog Loader app, click Run Latest Version.

  3. In Output to, select a project and output location for the app's outputs.

    Note: The project specified in Output to is where the job runs and is billed. You can specify a different project for storing the ingested metadata by using the Project ID optional input.

  4. Click Next.

  5. In the Inputs tab, specify the inputs.

  6. Click Start Analysis.

Using the CLI

To use the Data Catalog Loader from the command-line interface, install the DNAnexus Platform SDK.

Use the following command format, customizing the input parameters for your specific data:
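The invocation below is a hedged sketch, not the app's documented interface: the executable name `app-data_catalog_loader` and the input field names `data_csvs` and `dry_run` are assumptions for illustration. The `dx upload`, `dx run`, `-i`, `--destination`, and `--yes` commands and flags are part of the real DNAnexus SDK; check `dx run <app-name> --help` for the app's actual input spec.

```shell
# Prepare a minimal metadata CSV for a hypothetical "sample" entity.
# Column names here are invented for the example.
cat > sample.csv <<'EOF'
sample_id,subject_id,tissue,collection_date
S-001,P-1001,blood,2024-01-15
S-002,P-1002,liver,2024-02-03
EOF

# Upload the CSV and launch the loader. Requires the DNAnexus SDK and
# an authenticated session, so the commands are guarded here.
if command -v dx >/dev/null 2>&1; then
  # Uploads to the currently selected project and folder.
  dx upload sample.csv

  # --destination sets where the job runs, bills, and writes its outputs.
  # dry_run=true (assumed input name) would validate without modifying
  # the catalog.
  dx run app-data_catalog_loader \
      -i data_csvs=sample.csv \
      -i dry_run=true \
      --destination "project-XXXX:/loader_output/" \
      --yes
fi
```

Running with a dry-run flag first, then repeating the same command without it, mirrors the validate-then-commit workflow recommended under Best Practices below.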

Using the Data Catalog Loader App

To ingest metadata using the Data Catalog Loader app, you need CONTRIBUTE permissions or higher in the target project. For protected projects, ADMINISTER permissions are required to modify metadata.

For detailed technical information about file formats, advanced configuration options, and troubleshooting, refer to the Data Catalog Loader app documentation available in the DNAnexus Platform.

Metadata Updates

When updating existing metadata, the Data Catalog Loader follows these rules:

  • If multiple CSVs with the same entity ID contain records with the same record ID, values from later files override those from earlier files; only the final occurrence is stored in the catalog (unless the job fails during processing).

  • To update metadata already in the catalog, upload a CSV with the same entity ID and the same record IDs. New values automatically overwrite existing metadata without requiring deletion first.
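As an illustration, suppose a hypothetical sample entity was originally ingested with record S-001 having tissue value "blood" (the entity and column names here are invented for the example). Re-uploading a CSV that reuses that record ID replaces the stored values in place:

```csv
sample_id,subject_id,tissue,collection_date
S-001,P-1001,plasma,2024-03-10
```

In this sketch, only record S-001 would be updated with the new tissue and collection date.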

Inputs

The app requires these key inputs to create or modify metadata in a data catalog:

  • Data CSVs - CSV files or Illumina Sample Sheets containing structured metadata. For details on formatting the input metadata files and handling values such as nulls, empty cells, and missing fields, refer to the Data Catalog Loader app documentation.

  • File template - Specifies the format of the files provided in Data CSVs. Can be either Metadata CSV (default) or Illumina Sample Sheet V2.
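For the default Metadata CSV template, an input file for a hypothetical sample entity might look like the sketch below. The column names are invented for this example, and files are matched to entities by file name, so the real headers must follow your organization's schema; refer to the app documentation for the authoritative format.

```csv
sample_id,subject_id,tissue,collection_date
S-001,P-1001,blood,2024-01-15
S-002,P-1002,liver,2024-02-03
```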

Optional Configuration

Choose from these processing modes based on your workflow needs:

  • Project ID - Specify a different project for metadata ingestion than where the job runs. For example, you can run jobs in one project for organizational billing while storing metadata in another project to control visibility through that project's access permissions.

  • Mapping JSON file - Defines the translation between your data and the organizational schema

  • Delete specified records? - Remove specified records from the catalog

  • Should this app run in strict mode? - Enforce complete schema compliance during validation

  • Execute the job as a Dry Run? - Validate metadata structure and content without making changes

  • Should the app provide verbose output? - Generate detailed processing logs for troubleshooting
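A mapping JSON file might resemble the sketch below, which renames source CSV headers to schema field names for a hypothetical sample entity. The structure shown (entity name as the top-level key, source-to-target column pairs beneath it) is purely illustrative, not the app's documented format; consult the Data Catalog Loader app documentation for the real mapping syntax.

```json
{
  "sample": {
    "Sample ID": "sample_id",
    "Tissue Type": "tissue"
  }
}
```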

Process

The app processes your metadata through these conceptual steps:

  1. Schema Alignment: Data structure is validated against the defined schema.

    • Files are mapped to entities based on file names (if using Metadata CSV) or file structure (if using Illumina Sample Sheet V2).

    • File content is mapped to fields of the corresponding entities. The user-provided mapping specified in Mapping JSON file can help with this step.

  2. Validation: Expected data types and relationships are determined and validated.

  3. Data Catalog Modification: The changes are applied to the data catalog through the data catalog API.

Outputs

  • Processing Reports - Detailed status information and error diagnostics

  • Searchable Metadata - Records become discoverable through the Omics Data Catalog

Best Practices

Organizational Coordination

  • Collaborate with data administrators to align metadata ingestion with organizational schema and governance policies.

  • Work with DNAnexus Support for schema changes and data migration strategies when planning for schema evolution.

  • Develop consistent terminology and formatting conventions across research teams to establish metadata standards.

Workflow Integration

  • Use the dry run mode to validate new metadata sources and organizational standards before committing changes.

  • Coordinate project access to ensure appropriate permissions for metadata visibility and collaboration.

  • Review processing reports and establish ongoing quality assurance processes to monitor data quality.

Platform Integration

  • Design metadata structure to support downstream analysis and reporting needs for seamless workflow integration.

  • Structure metadata to enable effective cross-project search and filtering, optimizing discovery across the catalog.

  • Use entity relationships to track data lineage and research workflow connections for comprehensive data provenance.
