# Data Catalog Loader

{% hint style="info" %}
A license is required to use the Omics Data Catalog on the DNAnexus Platform. [Contact DNAnexus Sales](mailto:sales@dnanexus.com) for more information.
{% endhint %}

The [Data Catalog Loader](https://platform.dnanexus.com/app/data_catalog_loader) app ingests, updates, and manages structured metadata in the [Omics Data Catalog](https://documentation.dnanexus.com/user/omics-data-catalog). This app transforms CSV files and Illumina Sample Sheets into searchable metadata records that integrate with the DNAnexus Platform.

You can also use the [Omics Data Catalog API](https://documentation.dnanexus.com/developer/api/omics-data-catalog) directly to programmatically manage metadata records.

## Overview

The Data Catalog Loader enables organizations to create structured, searchable metadata catalogs that span multiple research projects. Unlike traditional data ingestion tools that create isolated datasets, this app builds interconnected metadata networks that support organization-wide discovery while maintaining project-based access controls.

![Overview of all required and optional file inputs for the Data Catalog Loader app](https://1612471957-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-L_EsL_ie8XyZlLe_yf9%2Fuploads%2Fgit-blob-0bcf1be6781d484b9bee56a8a9ed203745a3722c%2Fomics-data-catalog-loader-io.png?alt=media)

### When to Use This App

Use the Data Catalog Loader when you need to:

* Create consistent metadata standards across research projects to harmonize data from multiple studies.
* Make data discoverable across your organization while preserving project-based access controls.
* Connect related entities, such as samples, assays, and analysis results, to capture the relationships in your research workflows.

### Relationship to Other Ingestion Tools

| Tool                                  | Purpose                           | Use Case                                                 |
| ------------------------------------- | --------------------------------- | -------------------------------------------------------- |
| **Data Catalog Loader**               | Metadata management and discovery | Cross-project metadata catalogs                          |
| **Data Model Loader**                 | Phenotypic data ingestion         | Creating Apollo Datasets for analysis workflows          |
| **Molecular Expression Assay Loader** | Omics data ingestion              | Ingesting expression matrices for computational analysis |

## How to Use the App

### Using the UI

To use the Data Catalog Loader within the DNAnexus Platform:

1. In the DNAnexus Platform, go to **Tools** > [**Tools Library**](https://platform.dnanexus.com/panx/tools).
2. For the [Data Catalog Loader](https://platform.dnanexus.com/app/data_catalog_loader) app, click **Run Latest Version**.
3. In **Output to**, select a project and output location for the app's outputs.

   {% hint style="info" %}
   The project specified in **Output to** is where the job runs and is billed. You can specify a different project for storing the ingested metadata by using the **Project ID** [optional input](#optional-configuration).
   {% endhint %}
4. Click **Next**.
5. In the **Inputs** tab, specify the [inputs](#inputs).
6. Click **Start Analysis**.

### Using the CLI

To use the Data Catalog Loader from the command-line interface, install the [DNAnexus Platform SDK](https://documentation.dnanexus.com/downloads#dnanexus-platform-sdk).

{% hint style="success" %}
When using the DNAnexus Platform through Cloud Workspace or JupyterLab, the DNAnexus SDK is preinstalled. You can use the `dx` command right away.
{% endhint %}

Use the following command format, customizing the input parameters for your specific data:

```shell
dx run data_catalog_loader \
  -i data_csvs='["new_samples_sample.csv", "new_files_data_object.csv"]' \
  -i file_template="metadata_csv" \
  -i project_id="project-xxxx"
```

## Using the Data Catalog Loader App

To ingest metadata using the Data Catalog Loader app, you need CONTRIBUTE permissions or higher in the target project. For protected projects, ADMINISTER permissions are required to modify metadata.

For detailed technical information about file formats, advanced configuration options, and troubleshooting, refer to the [Data Catalog Loader app documentation](https://platform.dnanexus.com/app/data_catalog_loader) available in the DNAnexus Platform.

### Metadata Updates

When updating existing metadata, the Data Catalog Loader follows these rules:

* If multiple CSVs for the same entity contain records with the same record ID, values from later files override values from earlier files, and only the final occurrence is stored in the catalog (provided the job completes successfully).
* To update metadata already in the catalog, upload a CSV with the same entity ID and the same record IDs. New values automatically overwrite existing metadata without requiring deletion first.
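As a sketch, updating existing records is the same command as the initial ingestion, pointed at a CSV that reuses the existing record IDs (file and project names here are illustrative):

```shell
# Re-ingest a CSV whose record IDs already exist in the catalog;
# the new values overwrite the stored metadata for those records.
dx run data_catalog_loader \
  -i data_csvs='["updated_samples_sample.csv"]' \
  -i file_template="metadata_csv" \
  -i project_id="project-xxxx"
```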

### Inputs

The app requires these key inputs to create or modify metadata in a data catalog:

* **Data CSVs** - CSV files or Illumina Sample Sheets containing structured metadata. For details on formatting the input metadata files and handling values such as nulls, empty cells, and missing fields, refer to the [Data Catalog Loader app documentation](https://platform.dnanexus.com/app/data_catalog_loader).
* **File template** - Specifies the format of the files provided in Data CSVs. Can be either Metadata CSV (default) or Illumina Sample Sheet V2.
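As a purely illustrative sketch, a minimal metadata CSV might be created as follows. The column names below are hypothetical; the real columns and required fields are defined by your organizational schema (see the app documentation above). The file name mirrors the `..._sample.csv` naming used in the CLI example, since files are mapped to entities based on their names.

```shell
# Illustrative only: column names are hypothetical; the actual columns
# and required fields come from your organizational schema.
# The "_sample" suffix in the file name targets the "sample" entity.
cat > new_samples_sample.csv <<'EOF'
sample_id,subject_id,tissue
S-001,SUBJ-01,liver
S-002,SUBJ-02,kidney
EOF
```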

#### Optional Configuration

The following optional inputs adjust how the app processes metadata:

* **Project ID** - Specify a different project for metadata ingestion than the one where the job runs. For example, you can run jobs in one project for organizational billing while storing metadata in another project to control visibility through that project's access permissions.
* **Mapping JSON file** - A JSON file that maps fields in your input files to fields in your organizational schema.
* **Delete specified records?** - Remove the specified records from the catalog.
* **Should this app run in strict mode?** - Enforce complete schema compliance during validation.
* **Execute the job as a Dry Run?** - Validate metadata structure and content without making changes to the catalog.
* **Should the app provide verbose output?** - Generate detailed processing logs for troubleshooting.
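For example, a validation-only run might combine the dry-run and strict-mode options. The input parameter names below are assumptions; run `dx run data_catalog_loader -h` to see the exact input names the app defines.

```shell
# Validate metadata structure without modifying the catalog.
# Parameter names after data_csvs/file_template are illustrative;
# check `dx run data_catalog_loader -h` for the exact input names.
dx run data_catalog_loader \
  -i data_csvs='["new_samples_sample.csv"]' \
  -i file_template="metadata_csv" \
  -i dry_run=true \
  -i strict_mode=true
```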

### Process

The app processes your metadata through these conceptual steps:

1. **Schema Alignment**: The structure of the input data is aligned with the defined schema.
   * Files are mapped to entities based on file names (if using *Metadata CSV*) or file structure (if using *Illumina Sample Sheet V2*).
   * File content is mapped to fields of the corresponding entities. The user-provided mapping specified in *Mapping JSON file* can help with this step.
2. **Validation**: Expected data types and relationships are determined and validated.
3. **Data Catalog Modification**: The changes are applied to the data catalog through the [data catalog API](https://documentation.dnanexus.com/developer/api/omics-data-catalog).

### Outputs

* **Processing Reports** - Detailed status information and error diagnostics
* **Searchable Metadata** - Records become discoverable through the [Omics Data Catalog](https://documentation.dnanexus.com/user/omics-data-catalog)

## Best Practices

### Organizational Coordination

* Collaborate with data administrators to align metadata ingestion with organizational schema and governance policies.
* Work with DNAnexus Support for schema changes and data migration strategies when planning for schema evolution.
* Develop consistent terminology and formatting conventions across research teams to establish metadata standards.

### Workflow Integration

* Use the dry run mode to validate new metadata sources and organizational standards before committing changes.
* Coordinate project access so that permissions match your metadata visibility and collaboration requirements.
* Review processing reports and establish ongoing quality assurance processes to monitor data quality.

### Platform Integration

* Design metadata structure to support downstream analysis and reporting needs for seamless workflow integration.
* Structure metadata to enable effective cross-project search and filtering, optimizing how users discover data.
* Use entity relationships to track data lineage and research workflow connections for comprehensive data provenance.
