Data Catalog Loader
A license is required to use the Omics Data Catalog on the DNAnexus Platform. Contact DNAnexus Sales for more information.
The Data Catalog Loader app ingests, updates, and manages structured metadata in the Omics Data Catalog. This app transforms CSV files and Illumina Sample Sheets into searchable metadata records that integrate with the DNAnexus Platform.
You can also use the Omics Data Catalog API directly to programmatically manage metadata records.
Overview
The Data Catalog Loader enables organizations to create structured, searchable metadata catalogs that span multiple research projects. Unlike traditional data ingestion tools that create isolated datasets, this app builds interconnected metadata networks that support organization-wide discovery while maintaining project-based access controls.

When to Use This App
Use the Data Catalog Loader when you need to:
Create consistent metadata standards across research projects to harmonize data from multiple studies.
Make data findable across your organization without compromising access controls for cross-project discovery.
Connect entities, such as samples, assays, or analysis results, in meaningful relationships through research workflows.
Relationship to Other Ingestion Tools

| Tool | Primary Purpose | Best For |
| --- | --- | --- |
| Data Catalog Loader | Metadata management and discovery | Cross-project metadata catalogs |
| Data Model Loader | Phenotypic data ingestion | Creating Apollo Datasets for analysis workflows |
| Molecular Expression Assay Loader | Omics data ingestion | Ingesting expression matrices for computational analysis |
How to Use the App
Using the UI
To use the Data Catalog Loader within the DNAnexus Platform:
In the DNAnexus Platform, go to Tools > Tools Library.
For the Data Catalog Loader app, click Run Latest Version.
In Output to, select the project and folder where the app's outputs will be stored.
The project specified in Output to is where the job runs and is billed. You can specify a different project for storing the ingested metadata by using the Project ID optional input.
Click Next.
In the Inputs tab, specify the inputs.
Click Start Analysis.
Using the CLI
To use the Data Catalog Loader from the command-line interface, install the DNAnexus Platform SDK.
When using the DNAnexus Platform through Cloud Workspace or JupyterLab, the DNAnexus SDK is preinstalled. You can use the dx command right away.
Use the following command format, customizing the input parameters for your specific data:
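A typical invocation looks like the following sketch. The app identifier and input parameter names here are illustrative assumptions, not the app's confirmed spec; list the actual parameters with `dx run <app-name> --help` before running.

```shell
# Hypothetical invocation -- the app name and input names (data_csvs,
# file_template) are assumptions; verify them with `dx run <app-name> --help`.
dx run data-catalog-loader \
    -i data_csvs=project-xxxx:file-yyyy \
    -i file_template="Metadata CSV" \
    --destination project-xxxx:/catalog-output
```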
Using the Data Catalog Loader App
To ingest metadata using the Data Catalog Loader app, you need CONTRIBUTE permissions or higher in the target project. For protected projects, ADMINISTER permissions are required to modify metadata.
For detailed technical information about file formats, advanced configuration options, and troubleshooting, refer to the Data Catalog Loader app documentation available in the DNAnexus Platform.
Metadata Updates
When updating existing metadata, the Data Catalog Loader follows these rules:
If multiple CSVs for the same entity contain records with the same record ID, values from later files override values from earlier files, and only the final occurrence is stored in the catalog (unless the job fails during processing).
To update metadata already in the catalog, upload a CSV with the same entity ID and the same record IDs. New values automatically overwrite existing metadata without requiring deletion first.
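The "later file wins" rule above can be sketched as follows. This is a simplified model, assuming whole-record replacement keyed by a `record_id` column; it is not the app's actual implementation.

```python
import csv
import io

def merge_records(csv_texts):
    """Later files win: each record ID keeps only its final occurrence."""
    catalog = {}
    for text in csv_texts:  # files are processed in order
        for row in csv.DictReader(io.StringIO(text)):
            catalog[row["record_id"]] = row  # overwrite any earlier value
    return catalog

earlier = "record_id,tissue\nS1,liver\nS2,kidney\n"
later = "record_id,tissue\nS1,spleen\n"
merged = merge_records([earlier, later])
print(merged["S1"]["tissue"], merged["S2"]["tissue"])  # spleen kidney
```

Uploading a new CSV with the same record IDs behaves the same way: the new values simply replace the old ones, with no deletion step required.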
Inputs
The app requires these key inputs to create or modify metadata in a data catalog:
Data CSVs - CSV files or Illumina Sample Sheets containing structured metadata. For details on formatting the input metadata files and handling values such as nulls, empty cells, and missing fields, refer to the Data Catalog Loader app documentation.
File template - Specifies the format of the files provided in Data CSVs. Can be either Metadata CSV (default) or Illumina Sample Sheet V2.
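As an illustration, a metadata CSV for a hypothetical `sample` entity might look like the fragment below. The column names and the empty cell are illustrative only; consult the Data Catalog Loader app documentation for the required ID column, entity naming rules, and null-value handling.

```csv
record_id,tissue,collection_date
S1,liver,2024-01-15
S2,kidney,
```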
Optional Configuration
Configure these optional inputs based on your workflow needs:
Project ID - Specify a different project for metadata ingestion than where the job runs. For example, you can run jobs in one project for organizational billing while storing metadata in another project to control visibility through that project's access permissions.
Mapping JSON file - A JSON file that maps field names in your input data to the organizational schema
Delete specified records? - Remove specified records from the catalog
Should this app run in strict mode? - Enforce complete schema compliance during validation
Execute the job as a Dry Run? - Validate metadata structure and content without making changes
Should the app provide verbose output? - Generate detailed processing logs for troubleshooting
Process
The app processes your metadata through these conceptual steps:
Schema Alignment: Data structure is validated against the defined schema.
Files are mapped to entities based on file names (if using Metadata CSV) or file structure (if using Illumina Sample Sheet V2).
File content is mapped to fields of the corresponding entities. The user-provided mapping specified in Mapping JSON file can help with this step.
Validation: Expected data types and relationships are determined and validated.
Data Catalog Modification: The changes are applied to the data catalog through the data catalog API.
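The three conceptual steps above can be sketched in miniature. The schema shape, function names, and field mapping here are assumptions for illustration; the real app validates against your organizational schema and applies changes through the Omics Data Catalog API.

```python
# Toy schema: entity name -> allowed fields. Illustrative only.
SCHEMA = {"sample": {"record_id", "tissue"}}

def align(rows, mapping=None):
    """Step 1: rename source columns to schema fields (the Mapping JSON role)."""
    mapping = mapping or {}
    return [{mapping.get(k, k): v for k, v in row.items()} for row in rows]

def validate(entity, rows):
    """Step 2: reject rows that use fields the schema does not define."""
    for row in rows:
        unknown = set(row) - SCHEMA[entity]
        if unknown:
            raise ValueError(f"unknown fields for {entity}: {unknown}")
    return rows

def apply_changes(catalog, entity, rows):
    """Step 3: upsert validated rows into the catalog, keyed by record ID."""
    for row in rows:
        catalog.setdefault(entity, {})[row["record_id"]] = row
    return catalog

rows = align([{"record_id": "S1", "organ": "liver"}], mapping={"organ": "tissue"})
catalog = apply_changes({}, "sample", validate("sample", rows))
print(catalog["sample"]["S1"]["tissue"])  # liver
```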
Outputs
Processing Reports - Detailed status information and error diagnostics
Searchable Metadata - Records become discoverable through the Omics Data Catalog
Best Practices
Organizational Coordination
Collaborate with data administrators to align metadata ingestion with organizational schema and governance policies.
Work with DNAnexus Support for schema changes and data migration strategies when planning for schema evolution.
Develop consistent terminology and formatting conventions across research teams to establish metadata standards.
Workflow Integration
Use the dry run mode to validate new metadata sources and organizational standards before committing changes.
Ensure appropriate permissions for metadata visibility and collaboration requirements by coordinating project access.
Review processing reports and establish ongoing quality assurance processes to monitor data quality.
Platform Integration
Design metadata structure to support downstream analysis and reporting needs for seamless workflow integration.
Structure metadata to enable effective cross-project search and filtering to optimize discovery patterns.
Use entity relationships to track data lineage and research workflow connections for comprehensive data provenance.