# Data Catalog Loader

{% hint style="info" %}
A license is required to use the Omics Data Catalog on the DNAnexus Platform. [Contact DNAnexus Sales](mailto:sales@dnanexus.com) for more information.
{% endhint %}

The [Data Catalog Loader](https://platform.dnanexus.com/app/data_catalog_loader) app ingests, updates, and manages structured metadata in the [Omics Data Catalog](https://documentation.dnanexus.com/user/omics-data-catalog). This app transforms CSV files and Illumina Sample Sheets into searchable metadata records that integrate with the DNAnexus Platform.

You can also use the [Omics Data Catalog API](https://documentation.dnanexus.com/developer/api/omics-data-catalog) directly to programmatically manage metadata records.

## Overview

The Data Catalog Loader enables organizations to create structured, searchable metadata catalogs that span multiple research projects. Unlike traditional data ingestion tools that create isolated datasets, this app builds interconnected metadata networks that support organization-wide discovery while maintaining project-based access controls.

![Overview of all required and optional file inputs for the Data Catalog Loader app](https://1612471957-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-L_EsL_ie8XyZlLe_yf9%2Fuploads%2Fgit-blob-0bcf1be6781d484b9bee56a8a9ed203745a3722c%2Fomics-data-catalog-loader-io.png?alt=media)

### When to Use This App

Use the Data Catalog Loader when you need to:

* Create consistent metadata standards across research projects to harmonize data from multiple studies.
* Make data discoverable across your organization while preserving project-based access controls.
* Connect related entities, such as samples, assays, and analysis results, to capture the relationships in your research workflows.

### Relationship to Other Ingestion Tools

| Tool                                  | Purpose                           | Use Case                                                 |
| ------------------------------------- | --------------------------------- | -------------------------------------------------------- |
| **Data Catalog Loader**               | Metadata management and discovery | Cross-project metadata catalogs                          |
| **Data Model Loader**                 | Phenotypic data ingestion         | Creating Apollo Datasets for analysis workflows          |
| **Molecular Expression Assay Loader** | Omics data ingestion              | Ingesting expression matrices for computational analysis |

## How to Use the App

### Using the UI

To use the Data Catalog Loader within the DNAnexus Platform:

1. In the DNAnexus Platform, go to **Tools** > [**Tools Library**](https://platform.dnanexus.com/panx/tools).
2. For the [Data Catalog Loader](https://platform.dnanexus.com/app/data_catalog_loader) app, click **Run Latest Version**.
3. In **Output to**, select a project and output location for the app's outputs.

   {% hint style="info" %}
   The project specified in **Output to** is where the job runs and is billed. You can specify a different project for storing the ingested metadata by using the **Project ID** [optional input](#optional-configuration).
   {% endhint %}
4. Click **Next**.
5. In the **Inputs** tab, specify the [inputs](#inputs).
6. Click **Start Analysis**.

### Using the CLI

To use the Data Catalog Loader from the command-line interface, install the [DNAnexus Platform SDK](https://documentation.dnanexus.com/downloads#dnanexus-platform-sdk).

{% hint style="success" %}
When using the DNAnexus Platform through Cloud Workspace or JupyterLab, the DNAnexus SDK is preinstalled. You can use the `dx` command right away.
{% endhint %}

Use the following command format, customizing the input parameters for your specific data:

```shell
dx run data_catalog_loader \
  -i data_csvs='["new_samples_sample.csv", "new_files_data_object.csv"]' \
  -i file_template="metadata_csv" \
  -i project_id="project-xxxx"
```

## Using the Data Catalog Loader App

To ingest metadata using the Data Catalog Loader app, you need CONTRIBUTE permissions or higher in the target project. For protected projects, ADMINISTER permissions are required to modify metadata.

For detailed technical information about file formats, advanced configuration options, and troubleshooting, refer to the [Data Catalog Loader app documentation](https://platform.dnanexus.com/app/data_catalog_loader) available in the DNAnexus Platform.

### Metadata Updates

When updating existing metadata, the Data Catalog Loader follows these rules:

* If multiple CSVs for the same entity contain records with the same record ID, values from later files override values from earlier files, and only the final occurrence is stored in the catalog (provided the job completes successfully).
* To update metadata already in the catalog, upload a CSV with the same entity ID and the same record IDs. New values automatically overwrite existing metadata without requiring deletion first.
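As a sketch, updating existing records is the same command as the initial ingestion, pointed at a CSV that reuses the existing record IDs (file and project names here are illustrative):

```shell
# Re-ingest a CSV whose record IDs already exist in the catalog;
# the new values overwrite the stored metadata for those records.
dx run data_catalog_loader \
  -i data_csvs='["updated_samples_sample.csv"]' \
  -i file_template="metadata_csv" \
  -i project_id="project-xxxx"
```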

### Inputs

The app requires these key inputs to create or modify metadata in a data catalog:

* **Data CSVs** - CSV files or Illumina Sample Sheets containing structured metadata. For details on formatting the input metadata files and handling values such as nulls, empty cells, and missing fields, refer to the [Data Catalog Loader app documentation](https://platform.dnanexus.com/app/data_catalog_loader).
* **File template** - Specifies the format of the files provided in Data CSVs. Can be either Metadata CSV (default) or Illumina Sample Sheet V2.
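As a purely illustrative sketch, a minimal metadata CSV might be created as follows. The column names below are hypothetical; the real columns and required fields are defined by your organizational schema (see the app documentation above). The file name mirrors the `..._sample.csv` naming used in the CLI example, since files are mapped to entities based on their names.

```shell
# Illustrative only: column names are hypothetical; the actual columns
# and required fields come from your organizational schema.
# The "_sample" suffix in the file name targets the "sample" entity.
cat > new_samples_sample.csv <<'EOF'
sample_id,subject_id,tissue
S-001,SUBJ-01,liver
S-002,SUBJ-02,kidney
EOF
```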

#### Optional Configuration

The following optional inputs adjust how the app processes metadata:

* **Project ID** - Specify a different project for metadata ingestion than the one where the job runs. For example, you can run jobs in one project for organizational billing while storing metadata in another project to control visibility through that project's access permissions.
* **Mapping JSON file** - A JSON file that maps fields in your input files to fields in your organizational schema.
* **Delete specified records?** - Remove the specified records from the catalog.
* **Should this app run in strict mode?** - Enforce complete schema compliance during validation.
* **Execute the job as a Dry Run?** - Validate metadata structure and content without making changes to the catalog.
* **Should the app provide verbose output?** - Generate detailed processing logs for troubleshooting.
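For example, a validation-only run might combine the dry-run and strict-mode options. The input parameter names below are assumptions; run `dx run data_catalog_loader -h` to see the exact input names the app defines.

```shell
# Validate metadata structure without modifying the catalog.
# Parameter names after data_csvs/file_template are illustrative;
# check `dx run data_catalog_loader -h` for the exact input names.
dx run data_catalog_loader \
  -i data_csvs='["new_samples_sample.csv"]' \
  -i file_template="metadata_csv" \
  -i dry_run=true \
  -i strict_mode=true
```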

### Process

The app processes your metadata through these conceptual steps:

1. **Schema Alignment**: The structure of the input data is aligned with the defined schema.
   * Files are mapped to entities based on file names (if using *Metadata CSV*) or file structure (if using *Illumina Sample Sheet V2*).
   * File content is mapped to fields of the corresponding entities. The user-provided mapping specified in *Mapping JSON file* can help with this step.
2. **Validation**: Expected data types and relationships are determined and validated.
3. **Data Catalog Modification**: The changes are applied to the data catalog through the [data catalog API](https://documentation.dnanexus.com/developer/api/omics-data-catalog).

### Outputs

* **Processing Reports** - Detailed status information and error diagnostics
* **Searchable Metadata** - Records become discoverable through the [Omics Data Catalog](https://documentation.dnanexus.com/user/omics-data-catalog)

## Best Practices

### Organizational Coordination

* Collaborate with data administrators to align metadata ingestion with organizational schema and governance policies.
* Work with DNAnexus Support for schema changes and data migration strategies when planning for schema evolution.
* Develop consistent terminology and formatting conventions across research teams to establish metadata standards.

### Workflow Integration

* Use the dry run mode to validate new metadata sources and organizational standards before committing changes.
* Coordinate project access so that permissions match your metadata visibility and collaboration requirements.
* Review processing reports and establish ongoing quality assurance processes to monitor data quality.

### Platform Integration

* Design metadata structure to support downstream analysis and reporting needs for seamless workflow integration.
* Structure metadata to enable effective cross-project search and filtering, optimizing how users discover data.
* Use entity relationships to track data lineage and research workflow connections for comprehensive data provenance.
