# Concepts and Architecture

{% hint style="info" %}
A license is required to use the Omics Data Catalog on the DNAnexus Platform. [Contact DNAnexus Sales](mailto:sales@dnanexus.com) for more information.
{% endhint %}

The Omics Data Catalog provides metadata management with organization-wide search and project-based access controls.

## Core Architecture

The Omics Data Catalog operates with one catalog instance that can be shared across multiple organizations. Data catalogs are billed to a specific organization (the `billTo` organization), which maintains administrative control. Additional organizations can be invited to access the data catalog, enabling cross-organizational collaboration while maintaining project-based access controls.

The following diagram shows how a data catalog can be structured. Projects from multiple organizations can connect to a single catalog instance, with access controlled by both organization membership and project permissions:

{% @mermaid/diagram content="flowchart TD
subgraph "Data Catalog"
OC(Data Catalog)
SC(Schema Configuration)
end

subgraph "Organization A (billTo)"
    P1[Project 1]
    P2[Project 2]
end

subgraph "Organization B (invited)"
    P3[Project 3]
    P4[Project 4]
end

subgraph "Organization C (invited)"
    P5[Project 5]
end

%% Define connections
SC --> OC
OC --- P1
OC --- P2
OC --- P3
OC --- P4
OC --- P5

%% Styling
classDef catalogNode fill:#1F519D,stroke:#1F519D,stroke-width:2px,color:#fff
classDef schemaNode fill:#66C9F9,stroke:#1F519D,stroke-width:2px,color:#000
classDef projectNode fill:#60F299,stroke:#1F519D,stroke-width:2px,color:#000,stroke-dasharray: 5 5

class OC catalogNode
class SC schemaNode
class P1,P2,P3,P4,P5 projectNode" %}

## Project Relationships and Metadata Management

The Omics Data Catalog maintains relationships between projects, data objects such as files, and metadata in the following ways:

* A data object cloned across *N* projects can have up to *N* metadata records, one per project, each linked to the data object entity.
* Records that should be linked but are ingested with mismatched project associations can be ingested without the relationship between them being established.

You cannot share metadata with users who don't have access to the project associated with that metadata. To view and search records, users must belong to an organization with access to the data catalog (either the `billTo` organization or an invited organization) and have appropriate project permissions.

### How Metadata Relates to Projects and Files

Files copied between projects don't automatically carry their metadata with them. The metadata remains tied to the original project context. This means manual re-ingestion is required to associate metadata with files in new project locations. The project context determines metadata association, so the same file may have different metadata records in different projects.
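The project-scoped association described above can be pictured with a small in-memory model. This is a minimal sketch, not the catalog's implementation: the `MetadataStore` class, its methods, and all IDs are hypothetical, and the model only illustrates that a record is keyed by project and data object together.

```python
from typing import Dict, Optional, Tuple

# Hypothetical model: one metadata record per (project, data object) pair.
RecordKey = Tuple[str, str]  # (project_id, data_object_id)


class MetadataStore:
    def __init__(self) -> None:
        self._records: Dict[RecordKey, dict] = {}

    def ingest(self, project_id: str, object_id: str, metadata: dict) -> None:
        """Associate metadata with a data object in one project context."""
        self._records[(project_id, object_id)] = dict(metadata)

    def lookup(self, project_id: str, object_id: str) -> Optional[dict]:
        """Return the record for this project context, or None if none was ingested."""
        return self._records.get((project_id, object_id))


store = MetadataStore()
store.ingest("project-A", "file-123", {"tissue": "liver"})

# The same file copied into project-B has no metadata until it is re-ingested there.
before = store.lookup("project-B", "file-123")
store.ingest("project-B", "file-123", {"tissue": "liver", "batch": "2"})
after = store.lookup("project-B", "file-123")
```

Because the key includes the project, the two records for `file-123` are independent and can diverge, which matches the behavior described above.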

The catalog automatically removes associated metadata when data objects or projects are deleted to maintain data integrity. Data object deletions trigger immediate metadata removal, while project deletions or organizational transfers remove all associated metadata after a small delay. This cleanup occurs regardless of whether [project synchronization](#controlling-project-synchronization) is enabled or disabled, ensuring data residency requirements are maintained.

## Access Control and Permissions

Omics Data Catalog permissions determine which metadata you can see and how collaboration affects data visibility.

### Permission-Based Data Visibility

You can only see metadata from projects where you have at least VIEW access. The catalog searches across the entire metadata collection in data catalogs available to your organization but automatically filters results based on your project permissions.

This permission-based behavior means your search results may be different from a colleague's, even with identical filters. For example, when you [share search URLs](https://documentation.dnanexus.com/user/omics-data-catalog/..#share-and-export-results), recipients see results filtered according to their own project permissions, enabling secure collaboration without exposing unauthorized data.
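The effect of permission-based filtering can be sketched in a few lines. This is an illustration only: `filter_results`, the record layout, and the project IDs are hypothetical, and the real filtering happens inside the catalog service.

```python
from typing import Dict, List

# Project access levels that include VIEW access.
LEVELS_WITH_VIEW = {"VIEW", "UPLOAD", "CONTRIBUTE", "ADMINISTER"}


def filter_results(records: List[dict], user_access: Dict[str, str]) -> List[dict]:
    """Keep only records whose project the user can at least VIEW."""
    return [
        record
        for record in records
        if user_access.get(record["project"], "NONE") in LEVELS_WITH_VIEW
    ]


# The same search matches two records across the whole catalog.
matches = [
    {"id": "sample-1", "project": "project-A"},
    {"id": "sample-2", "project": "project-B"},
]

# Two users run the identical search but hold different project permissions.
alice = filter_results(matches, {"project-A": "VIEW", "project-B": "CONTRIBUTE"})
bob = filter_results(matches, {"project-A": "VIEW"})
```

Here `alice` receives both records while `bob` receives only `sample-1`, which is why a shared search URL can show different results to different recipients.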

### Permission Levels and Capabilities

The catalog inherits your existing project permissions from the DNAnexus Platform:

| Project access level | Capabilities                                                                                                                                                                                                                                                            |
| -------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **None**             | No access to metadata from the project, cannot search or view any records, no export capabilities, no relationship navigation                                                                                                                                           |
| **VIEW**             | Search and view metadata from the project, export search results to CSV, follow relationship links to linked entities, and share search URLs                                                                                                                            |
| **UPLOAD**           | All VIEW permissions; no additional catalog-specific capabilities beyond standard project file upload permissions                                                                                                                                                       |
| **CONTRIBUTE**       | All VIEW permissions plus the ability to create, edit, and delete files, add new metadata records, modify existing metadata, and use the [Data Catalog Loader app](https://documentation.dnanexus.com/developer/ingesting-data/data-catalog-loader) to ingest metadata  |
| **ADMINISTER**       | All CONTRIBUTE permissions plus the ability to modify metadata in [protected projects](https://documentation.dnanexus.com/getting-started/key-concepts/projects#project-data-access-controls)                                                                           |

Specific actions may be restricted based on [project data access controls](https://documentation.dnanexus.com/getting-started/key-concepts/projects#project-data-access-controls).
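Because each level builds on the one before it, the table can be modeled as an ordered enumeration. The following is a simplified sketch: `AccessLevel`, `can_search`, and `can_edit_metadata` are hypothetical names, and project data access controls are reduced to a single `project_is_protected` flag for illustration.

```python
from enum import IntEnum


class AccessLevel(IntEnum):
    """Project access levels, ordered from least to most capable."""
    NONE = 0
    VIEW = 1
    UPLOAD = 2
    CONTRIBUTE = 3
    ADMINISTER = 4


def can_search(level: AccessLevel) -> bool:
    """Searching and viewing metadata requires at least VIEW."""
    return level >= AccessLevel.VIEW


def can_edit_metadata(level: AccessLevel, project_is_protected: bool = False) -> bool:
    """Editing requires CONTRIBUTE, or ADMINISTER when the project is protected."""
    required = AccessLevel.ADMINISTER if project_is_protected else AccessLevel.CONTRIBUTE
    return level >= required
```

For example, `can_edit_metadata(AccessLevel.UPLOAD)` is `False`, since UPLOAD adds no catalog-specific capabilities beyond VIEW.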

{% hint style="info" %}
**Automatic Permission Updates**

When project permissions change, catalog access updates automatically. Adding access to new projects immediately includes the project's metadata in your searches, while removing access immediately excludes that metadata from your results.
{% endhint %}

## Public Entities

By default, you can only view metadata from projects where you have at least VIEW access. However, you can override this behavior by marking some entities as public using the `isPublicInDataCatalog` flag. Public entities make their metadata visible to all users with access to the data catalog, regardless of project permissions. This enables cross-project data discovery and linking.

All catalog users can view and search the metadata of all records for public entities. For example, if "analysis" is designated as a public entity, any catalog user can view all analysis record metadata and link to them from any project. However, users without appropriate project permissions cannot access the underlying data objects referenced by these records. Creating, updating, and deleting public entity records also requires appropriate [project-level permissions](#permission-based-data-visibility).
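The read and write rules for public entities can be sketched as follows. This is an illustration under stated assumptions, not the catalog's implementation: `PUBLIC_ENTITIES`, `can_view_record`, and `can_modify_record` are hypothetical names, and the caller is assumed to already have access to the data catalog.

```python
from typing import Dict, Set

# Hypothetical: entity types flagged with isPublicInDataCatalog in the schema.
PUBLIC_ENTITIES: Set[str] = {"analysis"}

VIEW_LEVELS = {"VIEW", "UPLOAD", "CONTRIBUTE", "ADMINISTER"}
WRITE_LEVELS = {"CONTRIBUTE", "ADMINISTER"}


def can_view_record(entity: str, project: str, access: Dict[str, str]) -> bool:
    """Public entity metadata is visible regardless of project permissions."""
    if entity in PUBLIC_ENTITIES:
        return True
    return access.get(project, "NONE") in VIEW_LEVELS


def can_modify_record(entity: str, project: str, access: Dict[str, str]) -> bool:
    """Writes always require project-level permissions, even for public entities."""
    return access.get(project, "NONE") in WRITE_LEVELS


access = {"project-A": "VIEW"}
sees_public = can_view_record("analysis", "project-B", access)
sees_private = can_view_record("sample", "project-B", access)
edits_public = can_modify_record("analysis", "project-B", access)
```

The public flag widens read access to metadata only. It does not grant access to the underlying data objects or to write operations.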

You can use public entities to reference data that multiple projects need to access. Public entities work well for many-to-one relationships where projects reference shared data, or for cross-study entities that link data across organizational boundaries.

{% hint style="warning" %}
Avoid using public entities for sensitive or regulated data, participant information, or project-specific records. Public entity metadata is visible to all catalog users, so use them only for non-sensitive, broadly shareable metadata.
{% endhint %}

Because public entity records are visible catalog-wide, their IDs must be unique across the entire data catalog. Establish ID naming conventions before adding records to prevent conflicts when multiple projects contribute records to the same public entity.
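A pre-ingestion check for conflicting IDs might look like the sketch below. Both helpers are hypothetical, and the project-code prefix is only one possible naming convention, not one prescribed by the catalog.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def find_id_collisions(records: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Map each record ID used by more than one project to the sorted project list.

    `records` is an iterable of (project_id, record_id) pairs.
    """
    projects_by_id: Dict[str, Set[str]] = defaultdict(set)
    for project_id, record_id in records:
        projects_by_id[record_id].add(project_id)
    return {
        record_id: sorted(projects)
        for record_id, projects in projects_by_id.items()
        if len(projects) > 1
    }


def prefixed_id(project_code: str, local_id: str) -> str:
    """One possible convention: prefix each ID with a short project code."""
    return f"{project_code}-{local_id}"


planned = [
    ("project-A", "analysis-001"),
    ("project-B", "analysis-001"),
    ("project-B", "analysis-002"),
]
collisions = find_id_collisions(planned)
```

Any convention that adds a prefix should keep the resulting ID within the length limit for the ID data type.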

## Schema and Data Types

The schema defines the structure and validation rules for metadata across your organization. It specifies entity types, fields, data types, and relationships between entities.

Schema development typically involves collaboration between researchers, data managers, and DNAnexus. Any subsequent schema changes require DNAnexus involvement and coordination. If you'd like to change your schema, contact [DNAnexus Support](mailto:support@dnanexus.com).

### Entities and Relationships

Entities represent the building blocks of your data model, such as participants, samples, assays, and data objects. You can define relationships between these entities to enable queries that span multiple entity types.

For example, you can find all RNA sequencing data from subjects with specific demographics by following subject ← sample ← data object relationships. The arrow direction indicates the linking field direction: a data object points to (belongs to) a sample, meaning each data object can belong to zero or one sample, but one sample may have many data objects.
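The traversal can be sketched with plain dictionaries. The records, field names (`sample`, `subject`, `assay`, `sex`), and the `files_for` helper are all hypothetical; a real schema defines its own entities and linking fields.

```python
from typing import List

# Linking fields point "upward": data object -> sample -> subject.
subjects = {
    "subj-1": {"sex": "female"},
    "subj-2": {"sex": "male"},
}
samples = {
    "samp-1": {"subject": "subj-1", "assay": "RNA-seq"},
    "samp-2": {"subject": "subj-2", "assay": "RNA-seq"},
}
data_objects = {
    "file-1": {"sample": "samp-1"},
    "file-2": {"sample": "samp-1"},  # one sample may have many data objects
    "file-3": {"sample": "samp-2"},
    "file-4": {"sample": None},      # a data object may belong to no sample
}


def files_for(sex: str, assay: str) -> List[str]:
    """Follow data object -> sample -> subject and keep matching files."""
    matches = []
    for file_id, obj in data_objects.items():
        sample = samples.get(obj["sample"])
        if sample is None or sample["assay"] != assay:
            continue
        if subjects[sample["subject"]]["sex"] == sex:
            matches.append(file_id)
    return sorted(matches)


female_rna_files = files_for("female", "RNA-seq")
```

Each data object carries at most one `sample` link, so the many-to-one direction is stored on the "many" side of the relationship.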

The following diagram shows an example of entity relationships in the Omics Data Catalog. Entities can have complex relationships, not just hierarchical structures:

{% @mermaid/diagram content="flowchart TD
DO[**Data Object**<br/>files, analysis results<br/>genomic data]
SA[**Sample**<br/>tissue type<br/>collection details]
SU[**Subject**<br/>demographics<br/>clinical data]
ST[**Study**<br/>research protocol<br/>study metadata]
PA[**Participation**<br/>enrollment details<br/>consent status]
ME[**Medication**<br/>drug information<br/>dosage records]

%% Core data hierarchy: subject <- sample <- data object
DO --> SA
SA --> SU

%% Supporting data
ME --> SU

%% Study relationships
PA --> ST
PA --> SU

%% Styling with logical grouping
classDef coreDataFlow fill:#1F519D,stroke:#1F519D,stroke-width:2px,color:#fff
classDef studyContext fill:#66C9F9,stroke:#1F519D,stroke-width:2px,color:#000
classDef supportingData fill:#60F299,stroke:#1F519D,stroke-width:2px,color:#000

class DO,SA,SU coreDataFlow
class ST,PA studyContext
class ME supportingData" %}

### Supported Data Types

The Omics Data Catalog supports specific data types with defined validation rules and limitations:

| Data Type      | Description                             | Limitations                                                                                              | Examples                                                                 |
| -------------- | --------------------------------------- | -------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------ |
| **String**     | Short text values for general metadata  | Maximum 255 characters                                                                                   | `"tissue"`, `"mus musculus (mouse)"`, `"RiboFree Total RNA Library Kit"` |
| **LongString** | Extended text for detailed descriptions | Maximum 10,000 characters. Values longer than 255 characters are truncated in exports and API responses. | Protocol descriptions, detailed study summaries                          |
| **ID**         | Identifier values for linking entities  | Minimum 1, maximum 40 characters. Required for entity relationships.                                     | `"Iv3-78"`, `"sample-id-234536"`, `"A1B2C3D4E5F6G7H8I9J0K1L"`            |
| **Integer**    | Whole number values                     | Range: -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807                                           | `"2"`, `"4712384"`, `"-840000090000"`                                    |
| **Decimal**    | Floating-point numbers                  | Maximum precision of 20 significant digits                                                               | `"-7198.8"`, `"0.0000000012"`, `"5.34E-2"`, `"-0.1e4"`                   |
| **Date**       | Date values without timezone            | Must be valid YYYY-MM-DD format, year 0001 or later                                                      | `"2024-01-01"`, `"1999-12-12"`                                           |
| **DateTime**   | Date and time with timezone             | [RFC 3339](https://datatracker.ietf.org/doc/html/rfc3339) format, year 0001 or later                     | `"2020-01-01T00:00:00+02:00"`, `"2020-01-01T00:00:00.123Z"`              |

{% hint style="info" %}
Most data types can accept null values. System-generated metadata, primary ID fields, and required fields cannot be null.
{% endhint %}
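A client-side pre-check against some of the limits in the table could look like the sketch below. The `validate` function is hypothetical and covers only String, LongString, ID, Integer, and Date. The catalog performs its own validation, and details such as whether a leading `+` is accepted for integers are assumptions here.

```python
import re
from datetime import date

INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1


def validate(data_type: str, value: str) -> bool:
    """Return True if the string value fits the limits listed for the data type."""
    if data_type == "String":
        return len(value) <= 255
    if data_type == "LongString":
        return len(value) <= 10_000
    if data_type == "ID":
        return 1 <= len(value) <= 40
    if data_type == "Integer":
        # Assumption: optional leading minus followed by ASCII digits only.
        if re.fullmatch(r"-?[0-9]+", value) is None:
            return False
        return INT64_MIN <= int(value) <= INT64_MAX
    if data_type == "Date":
        # Require the exact YYYY-MM-DD shape, then check it is a real calendar date.
        if re.fullmatch(r"[0-9]{4}-[0-9]{2}-[0-9]{2}", value) is None:
            return False
        try:
            date.fromisoformat(value)  # rejects year 0000 and impossible dates
        except ValueError:
            return False
        return True
    raise ValueError(f"unsupported data type: {data_type}")
```

Running a check like this before ingestion can surface out-of-range values early, but it does not replace the catalog's schema validation.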

### Date and Time Formats

Date and time values must follow the [RFC 3339](https://www.rfc-editor.org/rfc/rfc3339#section-5.6) format (a subset of ISO 8601):

* `2016-12-31T23:50:00Z` – UTC time (Z suffix)
* `2016-12-31T23:50:00+00:00` – UTC time with explicit zero offset
* `2017-01-01T01:50:00+01:00` – Time with +1 hour offset from UTC

The earliest supported date is `0001-01-01`.
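A strict RFC 3339 check can be written with the Python standard library alone. This is a sketch with known simplifications, not the catalog's validator: `parse_rfc3339` is a hypothetical helper, it accepts only uppercase `T` and `Z`, it truncates fractions to microseconds, and it rejects leap seconds because `datetime` does not support them.

```python
import re
from datetime import datetime, timedelta, timezone

# Full date, "T", time with optional fraction, then "Z" or a +hh:mm / -hh:mm offset.
RFC3339 = re.compile(
    r"(\d{4})-(\d{2})-(\d{2})T(\d{2}):(\d{2}):(\d{2})(\.\d+)?(Z|[+-]\d{2}:\d{2})",
    re.ASCII,
)


def parse_rfc3339(text: str) -> datetime:
    """Parse an RFC 3339 date-time into a timezone-aware datetime."""
    match = RFC3339.fullmatch(text)
    if match is None:
        raise ValueError(f"not an RFC 3339 date-time: {text!r}")
    year, month, day, hour, minute, second = (int(g) for g in match.groups()[:6])
    fraction, offset = match.group(7), match.group(8)
    # Keep at most six fractional digits and right-pad to microseconds.
    microsecond = int((fraction or ".0")[1:7].ljust(6, "0"))
    if offset == "Z":
        tz = timezone.utc
    else:
        sign = 1 if offset[0] == "+" else -1
        tz = timezone(sign * timedelta(hours=int(offset[1:3]), minutes=int(offset[4:6])))
    # datetime() itself rejects year 0000 and impossible dates or times.
    return datetime(year, month, day, hour, minute, second, microsecond, tzinfo=tz)


utc_z = parse_rfc3339("2016-12-31T23:50:00Z")
utc_offset = parse_rfc3339("2016-12-31T23:50:00+00:00")
```

Both parsed values refer to the same instant, so `utc_z == utc_offset` holds even though the two strings spell the UTC offset differently.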

### System-Generated Metadata

For data object entities representing platform data objects (files), the following system-generated metadata is included automatically and cannot be edited manually:

| Field            | Description                                                                                                                        | Example value                                 |
| ---------------- | ---------------------------------------------------------------------------------------------------------------------------------- | --------------------------------------------- |
| `archival_state` | Archival state of the file                                                                                                         | `live`, `archival`, `archived`, `unarchiving` |
| `asset_class`    | [Data object class](https://documentation.dnanexus.com/developer/api/data-object-lifecycle) determined by file type classification | `file`, `record`                              |
| `created_at`     | File upload timestamp                                                                                                              | `2024-01-15T14:30:00Z`                        |
| `created_by`     | User handle of the file uploader                                                                                                   | `user-jsmith`                                 |
| `file_format`    | Detected file [MIME type](https://developer.mozilla.org/en-US/docs/Web/HTTP/Guides/MIME_types)                                     | `text/plain`                                  |
| `file_name`      | Name of the file                                                                                                                   | `sample_001.fastq.gz`                         |
| `hidden`         | Whether the file is hidden                                                                                                         | `true`, `false`                               |
| `job_id`         | Associated analysis job identifier                                                                                                 | `job-GqjqBK00jy1PZ0qJ9k2K000Q`                |
| `size`           | File size in bytes                                                                                                                 | `1048576`                                     |

## Data Ingestion

To become discoverable, metadata must be explicitly ingested using the [Data Catalog Loader app](https://documentation.dnanexus.com/developer/ingesting-data/data-catalog-loader) or the [Omics Data Catalog API](https://documentation.dnanexus.com/developer/api/omics-data-catalog) directly. You can also [enable project synchronization](#controlling-project-synchronization) to automatically sync system-generated metadata for data objects in curated projects.

## Platform Integration and Data Synchronization

The Omics Data Catalog automatically synchronizes with platform changes to keep metadata aligned with file changes, project updates, and permission modifications.

Metadata synchronization only applies to [system-generated metadata](#system-generated-metadata) for data objects. Custom metadata ingested through the [Data Catalog Loader app](https://documentation.dnanexus.com/developer/ingesting-data/data-catalog-loader) is not automatically updated and requires manual re-ingestion when changes are needed.

### Controlling Project Synchronization

Metadata synchronization is disabled by default for all projects. We recommend enabling synchronization only on curated projects intended for catalog visibility, not on sandbox or test projects.

Project admins can control synchronization through **Project Settings > Metadata Synchronization** to:

* Trigger immediate synchronization of data object metadata on demand.
* Enable automatic synchronization to keep metadata continuously updated.
* View the current synchronization state and time of last sync.

Automatic synchronization runs every 6 hours. Changes may take up to 24 hours to be reflected in the catalog, depending on when the next sync runs and how long processing takes.

{% hint style="info" %}
**Cleanup options**

If a project was synced unintentionally and added data objects you do not want in the catalog, you can clean up in two ways:

* Use the [`/dataCatalog-xxxx/removeRecords`](https://documentation.dnanexus.com/developer/api/omics-data-catalog#api-method-datacatalog-xxxx-removerecords) API method to remove data object records by project and IDs.
* Use the [Data Catalog Loader](https://documentation.dnanexus.com/developer/ingesting-data/data-catalog-loader#optional-configuration) with **Delete specified records?** and a CSV containing the data object IDs and the target project ID.
{% endhint %}

### Data Consistency and Timing

The system operates on eventual consistency. Metadata aligns with platform data over time, but temporary discrepancies may occur during updates.

File deletions remove associated data object records from the catalog regardless of whether project synchronization is enabled or disabled. File deletions are processed within an hour, while complex operations involving multiple entity relationships may take up to 24 hours.
