Concepts and Architecture

A license is required to use the Omics Data Catalog on the DNAnexus Platform. Contact DNAnexus Sales for more information.

The Omics Data Catalog provides metadata management with organization-wide search and project-based access controls.

Core Architecture

The Omics Data Catalog operates with one catalog instance that can be shared across multiple organizations. Data catalogs are billed to a specific organization (the billTo organization), which maintains administrative control. Additional organizations can be invited to access the data catalog, enabling cross-organizational collaboration while maintaining project-based access controls.

The following diagram shows how a data catalog can be structured. Projects from multiple organizations can connect to a single catalog instance, with access controlled by both organization membership and project permissions:

[Diagram: projects from multiple organizations connected to a single catalog instance]

Project Relationships and Metadata Management

The Omics Data Catalog maintains relationships between projects, data objects such as files, and metadata in the following ways:

  • A data object cloned into N projects can have up to N metadata records linked to the data object entity, one for each project.

  • Mismatches between project associations can result in records being ingested without establishing relationships between them.

You cannot share metadata with users who don't have access to the project associated with that metadata. To view and search records, users must belong to an organization with access to the data catalog (either the billTo organization or an invited organization) and have appropriate project permissions.

How Metadata Relates to Projects and Files

Files copied between projects don't automatically carry their metadata with them. The metadata remains tied to the original project context. This means manual re-ingestion is required to associate metadata with files in new project locations. The project context determines metadata association, so the same file may have different metadata records in different projects.
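One way to picture this project-scoped association is as a lookup keyed by the project and the data object together. The following is a minimal sketch using an in-memory dictionary. The `MetadataStore` class, its methods, and all IDs are hypothetical illustrations and are not part of the Omics Data Catalog API.

```python
from typing import Dict, Optional, Tuple


class MetadataStore:
    """Toy model: at most one metadata record per (project, data object) pair."""

    def __init__(self) -> None:
        self._records: Dict[Tuple[str, str], dict] = {}

    def ingest(self, project_id: str, object_id: str, metadata: dict) -> None:
        # Ingestion always names the project context explicitly.
        self._records[(project_id, object_id)] = dict(metadata)

    def lookup(self, project_id: str, object_id: str) -> Optional[dict]:
        return self._records.get((project_id, object_id))

    def records_for_object(self, object_id: str) -> int:
        # A file cloned into N projects can have at most N records.
        return sum(1 for (_, oid) in self._records if oid == object_id)


store = MetadataStore()
store.ingest("project-A", "file-123", {"tissue": "liver"})

# Cloning file-123 into project-B does not copy its metadata.
print(store.lookup("project-B", "file-123"))  # None

# Re-ingesting in the new project context creates a second, independent record.
store.ingest("project-B", "file-123", {"tissue": "liver", "batch": "2"})
print(store.records_for_object("file-123"))  # 2
```

Because the key includes the project, the two records can diverge freely, which matches the statement that the same file may have different metadata in different projects.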

To maintain data integrity, the catalog automatically removes associated metadata when data objects or projects are deleted. Data object deletions trigger immediate metadata removal, while project deletions or organizational transfers remove all associated metadata after a short delay. This cleanup occurs whether project synchronization is enabled or disabled, ensuring data residency requirements are maintained.

Access Control and Permissions

Understanding how Omics Data Catalog permissions work helps you know what data is available to you and how collaboration affects data visibility.

Permission-Based Data Visibility

You can only see metadata from projects where you have at least VIEW access. The catalog searches across the entire metadata collection in data catalogs available to your organization but automatically filters results based on your project permissions.

This permission-based behavior means your search results may be different from a colleague's, even with identical filters. For example, when you share search URLs, recipients see results filtered according to their own project permissions, enabling secure collaboration without exposing unauthorized data.
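The effect of permission-based filtering can be sketched in a few lines. This is only a conceptual model: the records, user handles, and the `search` function are hypothetical, and the real catalog applies the filter server-side.

```python
from typing import Dict, List, Set

# Hypothetical catalog content: each record is tied to one project.
RECORDS: List[Dict[str, str]] = [
    {"id": "sample-1", "project": "project-A", "tissue": "liver"},
    {"id": "sample-2", "project": "project-B", "tissue": "liver"},
    {"id": "sample-3", "project": "project-C", "tissue": "lung"},
]

# Hypothetical users and the projects where each has at least VIEW access.
VIEWABLE_PROJECTS: Dict[str, Set[str]] = {
    "user-alice": {"project-A", "project-B"},
    "user-bob": {"project-B"},
}


def search(user: str, tissue: str) -> List[str]:
    """Apply the search filter, then drop records from projects the user cannot view."""
    allowed = VIEWABLE_PROJECTS.get(user, set())
    return [
        r["id"]
        for r in RECORDS
        if r["tissue"] == tissue and r["project"] in allowed
    ]


# The same filter yields different results for different users.
print(search("user-alice", "liver"))  # ['sample-1', 'sample-2']
print(search("user-bob", "liver"))    # ['sample-2']
```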

Permission Levels and Capabilities

The catalog inherits your existing project permissions from the DNAnexus Platform:

| Project access level | Capabilities |
| --- | --- |
| None | No access to metadata from the project: cannot search or view any records, no export capabilities, no relationship navigation |
| VIEW | Search and view metadata from the project, export search results to CSV, follow relationship links to linked entities, and share search URLs |
| UPLOAD | All VIEW permissions; no additional catalog-specific capabilities beyond standard project file upload permissions |
| CONTRIBUTE | All VIEW permissions plus create, edit, and delete permissions on files. Add new metadata records, modify existing metadata, and use the Data Catalog Loader app to ingest metadata. |
| ADMINISTER | All CONTRIBUTE permissions plus modify metadata in protected projects |

Specific actions may be restricted based on project data access controls.
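Since each level includes the capabilities of the levels below it, the permission levels can be modeled as an ordered enumeration. The sketch below assumes that ordering; the `Access` enum, the action names, and the `can` helper are hypothetical labels chosen for illustration, not platform identifiers.

```python
from enum import IntEnum


class Access(IntEnum):
    """Project access levels, ordered from least to most privileged."""
    NONE = 0
    VIEW = 1
    UPLOAD = 2
    CONTRIBUTE = 3
    ADMINISTER = 4


# Minimum level assumed for each catalog action (hypothetical action names).
REQUIRED = {
    "search_and_view": Access.VIEW,
    "export_csv": Access.VIEW,
    "edit_metadata": Access.CONTRIBUTE,
    "edit_metadata_in_protected_project": Access.ADMINISTER,
}


def can(level: Access, action: str) -> bool:
    """Return True if the access level is high enough for the action."""
    return level >= REQUIRED[action]


print(can(Access.UPLOAD, "export_csv"))     # True
print(can(Access.UPLOAD, "edit_metadata"))  # False
```

The sketch ignores project data access controls, which may further restrict specific actions.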

Automatic Permission Updates

When project permissions change, catalog access updates automatically. Adding access to new projects immediately includes the project's metadata in your searches, while removing access immediately excludes that metadata from your results.

Public Entities

By default, you can only view metadata from projects where you have at least VIEW access. However, you can override this behavior by marking some entities as public using the isPublicInDataCatalog flag. Public entities make their metadata visible to all users with access to the data catalog, regardless of project permissions. This enables cross-project data discovery and linking.

All catalog users can view and search the metadata of all records for public entities. For example, if "analysis" is designated as a public entity, any catalog user can view all analysis record metadata and link to them from any project. However, users without appropriate project permissions cannot access the underlying data objects referenced by these records. Creating, updating, and deleting public entity records also requires appropriate project-level permissions.

You can use public entities to reference data that multiple projects need to access. Public entities work well for many-to-one relationships where projects reference shared data, or for cross-study entities that link data across organizational boundaries.
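The difference between seeing a public record's metadata and accessing its underlying data can be expressed as two separate checks. This is a minimal sketch that assumes the caller already has access to the data catalog; the `Record` class, the `PUBLIC_ENTITIES` set, and both functions are hypothetical.

```python
from dataclasses import dataclass
from typing import Set


@dataclass(frozen=True)
class Record:
    record_id: str
    entity: str
    project: str


# Hypothetical set of entities flagged with isPublicInDataCatalog.
PUBLIC_ENTITIES: Set[str] = {"analysis"}


def can_view_metadata(record: Record, viewable_projects: Set[str]) -> bool:
    """Metadata is visible if the entity is public or the user can view the project."""
    return record.entity in PUBLIC_ENTITIES or record.project in viewable_projects


def can_access_data(record: Record, viewable_projects: Set[str]) -> bool:
    """Underlying data objects always require project access."""
    return record.project in viewable_projects


analysis = Record("analysis-7", "analysis", "project-X")
no_projects: Set[str] = set()
print(can_view_metadata(analysis, no_projects))  # True
print(can_access_data(analysis, no_projects))    # False
```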

Because public entity records are visible catalog-wide, their IDs must be unique across the entire data catalog. Establish ID naming conventions before adding records to prevent conflicts when multiple projects contribute records to the same public entity.
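One possible naming convention is to prefix each project-local ID with a short study or project code. The sketch below illustrates that idea and checks for collisions before ingestion. The prefix scheme and both helper functions are assumptions made for illustration; the 40-character limit comes from the ID data type described under Supported Data Types.

```python
from typing import Iterable, List

MAX_ID_LENGTH = 40  # limit for the ID data type


def make_public_id(prefix: str, local_id: str) -> str:
    """Build a catalog-wide ID by prefixing a project-local ID (hypothetical convention)."""
    candidate = f"{prefix}-{local_id}"
    if not 1 <= len(candidate) <= MAX_ID_LENGTH:
        raise ValueError(f"ID must be 1-{MAX_ID_LENGTH} characters: {candidate!r}")
    return candidate


def find_collisions(ids: Iterable[str]) -> List[str]:
    """Return the IDs that occur more than once, sorted."""
    seen, duplicates = set(), set()
    for record_id in ids:
        if record_id in seen:
            duplicates.add(record_id)
        seen.add(record_id)
    return sorted(duplicates)


# Two projects that both use the local ID "run-1" collide without a prefix...
print(find_collisions(["run-1", "run-2", "run-1"]))  # ['run-1']

# ...but not once each project applies its own prefix.
prefixed = [make_public_id("studyA", "run-1"), make_public_id("studyB", "run-1")]
print(find_collisions(prefixed))  # []
```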

Schema and Data Types

The schema defines the structure and validation rules for metadata across your organization. It specifies entity types, fields, data types, and relationships between entities.

Schema development typically involves collaboration between researchers, data managers, and DNAnexus. Any subsequent schema changes require DNAnexus involvement and coordination. If you'd like to change your schema, contact DNAnexus Support.

Entities and Relationships

Entities represent the building blocks of your data model, such as participants, samples, assays, and data objects. You can define relationships between these entities to enable queries that span multiple entity types.

For example, you can find all RNA sequencing data from subjects with specific demographics by following subject ← sample ← data object relationships. The arrow direction indicates the linking field direction: a data object points to (belongs to) a sample, meaning each data object can belong to zero or one sample, but one sample may have many data objects.
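The traversal in this example can be sketched with plain dictionaries, where each child record holds the linking field that points to its parent. All record IDs, field names such as `sample_id` and `subject_id`, and the `rna_seq_files_for` function are hypothetical; an actual schema defines its own entities and linking fields.

```python
from typing import Dict, List

# Hypothetical records. Each child holds the linking field that points to its parent.
SUBJECTS: Dict[str, dict] = {
    "subj-1": {"sex": "female", "age": 54},
    "subj-2": {"sex": "male", "age": 61},
}
SAMPLES: Dict[str, dict] = {
    "samp-1": {"subject_id": "subj-1"},
    "samp-2": {"subject_id": "subj-2"},
}
DATA_OBJECTS: Dict[str, dict] = {
    "file-a": {"sample_id": "samp-1", "assay": "RNA-seq"},
    "file-b": {"sample_id": "samp-1", "assay": "WGS"},
    "file-c": {"sample_id": "samp-2", "assay": "RNA-seq"},
    "file-d": {"sample_id": None, "assay": "RNA-seq"},  # a data object may have no sample
}


def rna_seq_files_for(sex: str) -> List[str]:
    """Follow data object -> sample -> subject and filter on subject demographics."""
    hits = []
    for file_id, obj in DATA_OBJECTS.items():
        if obj["assay"] != "RNA-seq" or obj["sample_id"] is None:
            continue
        sample = SAMPLES[obj["sample_id"]]
        subject = SUBJECTS[sample["subject_id"]]
        if subject["sex"] == sex:
            hits.append(file_id)
    return hits


print(rna_seq_files_for("female"))  # ['file-a']
```

Note that `samp-1` has two data objects while each data object names at most one sample, which is the many-to-one shape the arrows describe.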

The following diagram shows an example of entity relationships in Omics Data Catalog. This illustrates how entities can have complex relationships, not just hierarchical structures:

[Diagram: example entity relationships in the Omics Data Catalog]

Supported Data Types

The Omics Data Catalog supports specific data types with defined validation rules and limitations:

| Data Type | Description | Limitations | Examples |
| --- | --- | --- | --- |
| String | Short text values for general metadata | Maximum 255 characters | "tissue", "mus musculus (mouse)", "RiboFree Total RNA Library Kit" |
| LongString | Extended text for detailed descriptions | Maximum 10,000 characters. Values longer than 255 characters are truncated in exports and API responses. | Protocol descriptions, detailed study summaries |
| ID | Identifier values for linking entities | Minimum 1, maximum 40 characters. Required for entity relationships. | "Iv3-78", "sample-id-234536", "A1B2C3D4E5F6G7H8I9J0K1L" |
| Integer | Whole number values | Range: -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807 | "2", "4712384", "-840000090000" |
| Decimal | Floating-point numbers | Maximum precision of 20 significant digits | "-7198.8", "0.0000000012", "5.34E-2", "-0.1e4" |
| Date | Date values without timezone | Must be valid YYYY-MM-DD format, year 0001 or later | "2024-01-01", "1999-12-12" |
| DateTime | Date and time with timezone | RFC 3339 format, year 0001 or later | "2020-01-01T00:00:00+02:00", "2020-01-01T00:00:00.123Z" |

Most data types can accept null values. System-generated metadata, primary ID fields, and required fields cannot be null.
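Checking values against these limits before ingestion can catch problems early. The following sketch covers every type except DateTime and is a local approximation only: the `validate` function is hypothetical, the catalog's own validation may differ in edge cases, and counting all coefficient digits (including trailing zeros) as "significant digits" is an assumption.

```python
import re
from datetime import date
from decimal import Decimal, InvalidOperation

INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1
DATE_PATTERN = re.compile(r"^(\d{4})-(\d{2})-(\d{2})$")


def validate(data_type: str, value: str) -> bool:
    """Return True if a raw string value satisfies the documented limits."""
    if data_type == "String":
        return len(value) <= 255
    if data_type == "LongString":
        return len(value) <= 10_000
    if data_type == "ID":
        return 1 <= len(value) <= 40
    if data_type == "Integer":
        try:
            return INT64_MIN <= int(value) <= INT64_MAX
        except ValueError:
            return False
    if data_type == "Decimal":
        try:
            number = Decimal(value)
        except InvalidOperation:
            return False
        # Assumption: every digit of the coefficient counts toward the 20-digit limit.
        return number.is_finite() and len(number.as_tuple().digits) <= 20
    if data_type == "Date":
        match = DATE_PATTERN.match(value)
        if not match:
            return False
        try:
            # date() rejects impossible dates and years below 1.
            date(*(int(part) for part in match.groups()))
            return True
        except ValueError:
            return False
    raise ValueError(f"unsupported data type in this sketch: {data_type}")


print(validate("ID", "sample-id-234536"))  # True
print(validate("Date", "2024-02-30"))      # False
```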

Date and Time Formats

Date and time values must follow the RFC 3339 format (a subset of ISO 8601):

  • 2016-12-31T23:50:00Z – UTC time (Z suffix)

  • 2016-12-31T23:50:00+00:00 – UTC time with explicit zero offset

  • 2017-01-01T01:50:00+01:00 – Time with +1 hour offset from UTC

The earliest supported date is 0001-01-01.
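To check timestamps locally before ingestion, a strict RFC 3339 parser can be sketched with the Python standard library. The `parse_rfc3339` helper is hypothetical, and the pattern is deliberately narrow: it requires an uppercase `T` and `Z` and a colon in the numeric offset. On Python versions before 3.11, `datetime.fromisoformat` only accepts 3 or 6 fractional digits.

```python
import re
from datetime import datetime, timezone

# Full date, "T", full time, optional fraction, then "Z" or a +HH:MM / -HH:MM offset.
RFC3339 = re.compile(
    r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?(Z|[+-]\d{2}:\d{2})$"
)


def parse_rfc3339(text: str) -> datetime:
    """Parse an RFC 3339 timestamp into a timezone-aware datetime."""
    if not RFC3339.match(text):
        raise ValueError(f"not an RFC 3339 date-time: {text!r}")
    # fromisoformat() only accepts a trailing "Z" from Python 3.11 on,
    # so normalize it to an explicit zero offset first.
    return datetime.fromisoformat(text.replace("Z", "+00:00"))


utc_z = parse_rfc3339("2016-12-31T23:50:00Z")
utc_offset = parse_rfc3339("2016-12-31T23:50:00+00:00")
plus_one = parse_rfc3339("2017-01-01T00:50:00+01:00")

# All three strings denote the same instant.
print(utc_z == utc_offset == plus_one)                 # True
print(plus_one.astimezone(timezone.utc).isoformat())   # 2016-12-31T23:50:00+00:00
```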

System-Generated Metadata

For data object entities representing platform data objects (files), the following system-generated metadata is automatically included and cannot be manually edited.

| Field | Description | Example value |
| --- | --- | --- |
| archival_state | Archival state of the file | live, archival, archived, unarchiving |
| asset_class | Data object class determined by file type classification | file, record |
| created_at | File upload timestamp | 2024-01-15T14:30:00Z |
| created_by | User handle of the file uploader | user-jsmith |
| file_format | | text/plain |
| file_name | Name of the file | sample_001.fastq.gz |
| hidden | Whether the file is hidden | true, false |
| job_id | Associated analysis job identifier | job-GqjqBK00jy1PZ0qJ9k2K000Q |
| size | File size in bytes | 1048576 |
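For client code that handles these fields, a read-only record type mirrors the fact that they cannot be edited manually. The sketch below is an illustrative shape only: the `SystemMetadata` class is hypothetical, the actual API response structure is not described here, and the values are simply the example values listed for each field.

```python
from dataclasses import FrozenInstanceError, asdict, dataclass


@dataclass(frozen=True)
class SystemMetadata:
    """Read-only view of the system-generated fields (illustrative shape)."""
    archival_state: str
    asset_class: str
    created_at: str
    created_by: str
    file_format: str
    file_name: str
    hidden: bool
    job_id: str
    size: int


example = SystemMetadata(
    archival_state="live",
    asset_class="file",
    created_at="2024-01-15T14:30:00Z",
    created_by="user-jsmith",
    file_format="text/plain",
    file_name="sample_001.fastq.gz",
    hidden=False,
    job_id="job-GqjqBK00jy1PZ0qJ9k2K000Q",
    size=1048576,
)

print(asdict(example)["size"] // 1024, "KiB")  # 1024 KiB

# Attempting to edit a field fails, as it would for system-generated metadata.
try:
    example.size = 0
except FrozenInstanceError:
    print("system-generated fields are read-only")
```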

Data Ingestion

To become discoverable, metadata must be explicitly ingested, either with the Data Catalog Loader app or directly through the Omics Data Catalog API. You can also enable project synchronization to automatically sync system-generated metadata for data objects in curated projects.

Platform Integration and Data Synchronization

The Omics Data Catalog automatically synchronizes with platform changes to keep metadata aligned with file changes, project updates, and permission modifications.

Metadata synchronization only applies to system-generated metadata for data objects. Custom metadata ingested through the Data Catalog Loader app is not automatically updated and requires manual re-ingestion when changes are needed.

Controlling Project Synchronization

Metadata synchronization is disabled by default for all projects. We recommend enabling synchronization only on curated projects intended for catalog visibility, not on sandbox or test projects.

Project admins can control synchronization through Project Settings > Metadata Synchronization to:

  • Trigger immediate synchronization of data object metadata on demand.

  • Enable automatic synchronization to keep metadata continuously updated.

  • View the current synchronization state and time of last sync.

Automatic synchronization runs every 6 hours. Changes may take up to 24 hours to be reflected in the catalog, depending on when the next sync runs and how long processing takes.
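As a rough planning aid, the two figures above can be turned into a time window for when a change should appear. This sketch assumes that automatic runs start at fixed 6-hour intervals counted from the last run; the actual scheduling is not documented here, and the `expected_window` function is hypothetical.

```python
from datetime import datetime, timedelta, timezone

SYNC_INTERVAL = timedelta(hours=6)        # automatic synchronization cadence
MAX_VISIBILITY_LAG = timedelta(hours=24)  # stated upper bound for changes to appear


def expected_window(changed_at: datetime, last_sync_started: datetime):
    """Return (next scheduled sync start, latest time the change should be visible)."""
    elapsed = changed_at - last_sync_started
    # Whole intervals already elapsed since the last run, plus one for the next run.
    intervals = elapsed // SYNC_INTERVAL + 1
    next_sync = last_sync_started + intervals * SYNC_INTERVAL
    return next_sync, changed_at + MAX_VISIBILITY_LAG


last_sync = datetime(2024, 1, 15, 0, 0, tzinfo=timezone.utc)
change = datetime(2024, 1, 15, 14, 30, tzinfo=timezone.utc)
next_sync, visible_by = expected_window(change, last_sync)
print(next_sync.isoformat())   # 2024-01-15T18:00:00+00:00
print(visible_by.isoformat())  # 2024-01-16T14:30:00+00:00
```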

Cleanup options

If a project was synced unintentionally and added data objects you do not want in the catalog, you can clean up in two ways:

Data Consistency and Timing

The system operates on eventual consistency. Metadata aligns with platform data over time, but temporary discrepancies may occur during updates.

File deletions remove associated data object records from the catalog regardless of whether project synchronization is enabled or disabled. File deletions are processed within an hour, while complex operations involving multiple entity relationships may take up to 24 hours.
