Concepts and Architecture
A license is required to use the Omics Data Catalog on the DNAnexus Platform. Contact DNAnexus Sales for more information.
The Omics Data Catalog provides metadata management with organization-wide search and project-based access controls.
Core Architecture
The Omics Data Catalog operates with one catalog instance that can be shared across multiple organizations. Data catalogs are billed to a specific organization (the billTo organization), which maintains administrative control. Additional organizations can be invited to access the data catalog, enabling cross-organizational collaboration while maintaining project-based access controls.
The following diagram shows how a data catalog can be structured. Projects from multiple organizations can connect to a single catalog instance, with access controlled by both organization membership and project permissions:
Project Relationships and Metadata Management
The Omics Data Catalog maintains relationships between projects, data objects such as files, and metadata in the following ways:
A data object cloned across multiple projects (N) can have up to N metadata records linked to the data object entity, each associated with its respective project.
Mismatches between project associations can result in records being ingested without establishing relationships between them.
You cannot share metadata with users who don't have access to the project associated with that metadata. To view and search records, users must belong to an organization with access to the data catalog (either the billTo organization or an invited organization) and have appropriate project permissions.
How Metadata Relates to Projects and Files
Files copied between projects don't automatically carry their metadata with them. The metadata remains tied to the original project context. This means manual re-ingestion is required to associate metadata with files in new project locations. The project context determines metadata association, so the same file may have different metadata records in different projects.
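A minimal sketch of this behavior, assuming a toy in-memory model in which each metadata record is keyed by the pair (project, data object). The class and method names are hypothetical and are not part of the Omics Data Catalog API.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class CatalogSketch:
    """Toy model: metadata records keyed by (project ID, data object ID)."""

    records: dict[tuple[str, str], dict] = field(default_factory=dict)

    def ingest(self, project_id: str, file_id: str, metadata: dict) -> None:
        # Ingestion always names the project context explicitly.
        self.records[(project_id, file_id)] = dict(metadata)

    def lookup(self, project_id: str, file_id: str) -> dict | None:
        # Returns None when the file has no record in this project.
        return self.records.get((project_id, file_id))

    def projects_with_metadata(self, file_id: str) -> list[str]:
        return sorted(p for (p, f) in self.records if f == file_id)


catalog = CatalogSketch()
catalog.ingest("project-A", "file-1", {"tissue": "liver"})

# Cloning file-1 into project-B does not carry its metadata along.
before_reingest = catalog.lookup("project-B", "file-1")

# Re-ingesting in the new project context creates a second, independent record.
catalog.ingest("project-B", "file-1", {"tissue": "liver", "batch": "2"})
after_reingest = catalog.lookup("project-B", "file-1")
```

Because the key includes the project, the two records for file-1 can diverge, which matches the statement that the same file may have different metadata in different projects.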
The catalog automatically removes associated metadata when data objects or projects are deleted to maintain data integrity. Data object deletions trigger immediate metadata removal, while project deletions or organizational transfers remove all associated metadata after a short delay. This cleanup occurs regardless of whether project synchronization is enabled or disabled, ensuring data residency requirements are maintained.
Access Control and Permissions
Understanding how Omics Data Catalog permissions work helps you know what data is available to you and how collaboration affects data visibility.
Permission-Based Data Visibility
You can only see metadata from projects where you have at least VIEW access. The catalog searches across the entire metadata collection in data catalogs available to your organization but automatically filters results based on your project permissions.
This permission-based behavior means your search results may be different from a colleague's, even with identical filters. For example, when you share search URLs, recipients see results filtered according to their own project permissions, enabling secure collaboration without exposing unauthorized data.
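The filtering described above can be illustrated with a short sketch. It assumes records are plain dictionaries with a `project` key and that each user's viewable projects are available as a set. The function name and data are hypothetical, and the real catalog applies this filtering server-side.

```python
def search(records, filters, viewable_projects):
    """Apply the filters, then keep only records from projects the caller can view."""
    matches = [
        r for r in records
        if all(r.get(key) == value for key, value in filters.items())
    ]
    return [r for r in matches if r["project"] in viewable_projects]


records = [
    {"id": "sample-1", "project": "project-A", "tissue": "liver"},
    {"id": "sample-2", "project": "project-B", "tissue": "liver"},
    {"id": "sample-3", "project": "project-B", "tissue": "lung"},
]
filters = {"tissue": "liver"}

# Identical filters, different project access, different results.
alice_hits = search(records, filters, viewable_projects={"project-A", "project-B"})
bob_hits = search(records, filters, viewable_projects={"project-A"})
```

Here the first user sees sample-1 and sample-2, while the second sees only sample-1, even though both ran the same search.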
Permission Levels and Capabilities
The catalog inherits your existing project permissions from the DNAnexus Platform:
| Permission level | Catalog capabilities |
| --- | --- |
| None | No access to metadata from the project, cannot search or view any records, no export capabilities, no relationship navigation |
| VIEW | Search and view metadata from the project, export search results to CSV, follow relationship links to linked entities, and share search URLs |
| UPLOAD | All VIEW permissions, no additional catalog-specific capabilities beyond standard project file upload permissions |
| CONTRIBUTE | All VIEW permissions plus create, edit, and delete permissions on files. Add new metadata records, modify existing metadata, and use the Data Catalog Loader app to ingest metadata. |
| ADMINISTER | All CONTRIBUTE permissions plus modify metadata in protected projects |
Specific actions may be restricted based on project data access controls.
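Since each level includes the capabilities of the levels below it, the hierarchy can be sketched as an ordered enumeration. This is an illustration only: the function names are hypothetical, the rule for protected projects is one reading of the ADMINISTER description, and project data access controls are not modeled.

```python
from enum import IntEnum


class Access(IntEnum):
    """Project permission levels, ordered from least to most privileged."""

    NONE = 0
    VIEW = 1
    UPLOAD = 2
    CONTRIBUTE = 3
    ADMINISTER = 4


def can_search(level: Access) -> bool:
    # Searching, viewing, exporting, and link navigation start at VIEW.
    return level >= Access.VIEW


def can_modify_metadata(level: Access, protected_project: bool = False) -> bool:
    # Assumption: protected projects require ADMINISTER, others CONTRIBUTE.
    required = Access.ADMINISTER if protected_project else Access.CONTRIBUTE
    return level >= required
```

For example, `can_modify_metadata(Access.UPLOAD)` is false because UPLOAD adds no catalog-specific capabilities beyond VIEW.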
Automatic Permission Updates
When project permissions change, catalog access updates automatically. Adding access to new projects immediately includes the project's metadata in your searches, while removing access immediately excludes that metadata from your results.
Public Entities
By default, you can only view metadata from projects where you have at least VIEW access. However, you can override this behavior by marking some entities as public using the isPublicInDataCatalog flag. Public entities make their metadata visible to all users with access to the data catalog, regardless of project permissions. This enables cross-project data discovery and linking.
All catalog users can view and search the metadata of all records for public entities. For example, if "analysis" is designated as a public entity, any catalog user can view all analysis record metadata and link to them from any project. However, users without appropriate project permissions cannot access the underlying data objects referenced by these records. Creating, updating, and deleting public entity records also requires appropriate project-level permissions.
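The split between metadata visibility and data object access can be sketched as two separate checks. Only the `isPublicInDataCatalog` flag name comes from this page. The schema layout, function names, and record shape below are assumptions made for illustration.

```python
# Hypothetical schema fragment: where the flag lives in a real schema may differ.
schema = {
    "analysis": {"isPublicInDataCatalog": True},
    "sample": {"isPublicInDataCatalog": False},
}


def can_view_metadata(record, viewable_projects):
    """Metadata is visible if the entity is public or the project is viewable."""
    entity = schema[record["entity"]]
    return entity["isPublicInDataCatalog"] or record["project"] in viewable_projects


def can_access_data_object(record, viewable_projects):
    """Public visibility covers metadata only; files still need project access."""
    return record["project"] in viewable_projects


analysis = {"entity": "analysis", "id": "analysis-001", "project": "project-B"}
sample = {"entity": "sample", "id": "sample-9", "project": "project-B"}
viewable = {"project-A"}  # this user has no access to project-B
```

With these inputs, the analysis record's metadata is visible, the sample record's metadata is not, and neither record's underlying data object is accessible.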
You can use public entities to reference data that multiple projects need to access. Public entities work well for many-to-one relationships where projects reference shared data, or for cross-study entities that link data across organizational boundaries.
Avoid using public entities for sensitive or regulated data, participant information, or project-specific records. Public entity metadata is visible to all catalog users, so use public entities only for non-sensitive, broadly shareable metadata.
Because public entity records are visible catalog-wide, their IDs must be unique across the entire data catalog. Establish ID naming conventions before adding records to prevent conflicts when multiple projects contribute records to the same public entity.
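Before ingesting records for a public entity from several projects, it can help to check the planned IDs for collisions. The following is a minimal sketch that assumes each planned record is a dictionary with `id` and `project` keys. It is a local pre-check and not a catalog feature.

```python
from collections import defaultdict


def find_id_conflicts(records):
    """Return {id: sorted project list} for IDs contributed by more than one project."""
    projects_by_id = defaultdict(set)
    for record in records:
        projects_by_id[record["id"]].add(record["project"])
    return {
        record_id: sorted(projects)
        for record_id, projects in projects_by_id.items()
        if len(projects) > 1
    }


planned = [
    {"id": "analysis-001", "project": "project-A"},
    {"id": "analysis-001", "project": "project-B"},  # collides with project-A
    {"id": "analysis-002", "project": "project-B"},
]
conflicts = find_id_conflicts(planned)
```

In this example, `conflicts` reports that analysis-001 is claimed by both project-A and project-B.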
Schema and Data Types
The schema defines the structure and validation rules for metadata across your organization. It specifies entity types, fields, data types, and relationships between entities.
Schema development typically involves collaboration between researchers, data managers, and DNAnexus. Any subsequent schema changes require DNAnexus involvement and coordination. If you'd like to change your schema, contact DNAnexus Support.
Entities and Relationships
Entities represent the building blocks of your data model, such as participants, samples, assays, and data objects. You can define relationships between these entities to enable queries that span multiple entity types.
For example, you can find all RNA sequencing data from subjects with specific demographics by following subject ← sample ← data object relationships. The arrow direction indicates the linking field direction: a data object points to (belongs to) a sample, meaning each data object can belong to zero or one sample, but one sample may have many data objects.
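The traversal in this example can be sketched with plain dictionaries, where each data object holds a link to a sample and each sample holds a link to a subject. All entity names, field names, and values are hypothetical, and a real query would run against the catalog and not in local code.

```python
subjects = {
    "subj-1": {"sex": "female", "age": 54},
    "subj-2": {"sex": "male", "age": 61},
}
samples = {
    "samp-1": {"subject": "subj-1"},
    "samp-2": {"subject": "subj-2"},
    "samp-3": {"subject": "subj-1"},
}
data_objects = {
    "file-1": {"sample": "samp-1", "assay": "RNA-seq"},
    "file-2": {"sample": "samp-2", "assay": "RNA-seq"},
    "file-3": {"sample": "samp-3", "assay": "WGS"},
    "file-4": {"sample": None, "assay": "RNA-seq"},  # belongs to zero samples
}


def rna_seq_files_for(subject_filter):
    """Follow data object -> sample -> subject and keep matching RNA-seq files."""
    hits = []
    for file_id, obj in data_objects.items():
        sample = samples.get(obj["sample"])
        if obj["assay"] != "RNA-seq" or sample is None:
            continue
        if subject_filter(subjects[sample["subject"]]):
            hits.append(file_id)
    return hits


female_rna = rna_seq_files_for(lambda subject: subject["sex"] == "female")
```

The linking field sits on the "many" side of each relationship, which is why the traversal starts from data objects and walks toward subjects.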
The following diagram shows an example of entity relationships in Omics Data Catalog. This illustrates how entities can have complex relationships, not just hierarchical structures:
Supported Data Types
The Omics Data Catalog supports specific data types with defined validation rules and limitations:
| Type | Description | Constraints | Examples |
| --- | --- | --- | --- |
| String | Short text values for general metadata | Maximum 255 characters | "tissue", "mus musculus (mouse)", "RiboFree Total RNA Library Kit" |
| LongString | Extended text for detailed descriptions | Maximum 10,000 characters. Values longer than 255 characters are truncated in exports and API responses. | Protocol descriptions, detailed study summaries |
| ID | Identifier values for linking entities | Minimum 1, maximum 40 characters. Required for entity relationships. | "Iv3-78", "sample-id-234536", "A1B2C3D4E5F6G7H8I9J0K1L" |
| Integer | Whole number values | Range: -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807 | "2", "4712384", "-840000090000" |
| Decimal | Floating-point numbers | Maximum precision of 20 significant digits | "-7198.8", "0.0000000012", "5.34E-2", "-0.1e4" |
| Date | Date values without timezone | Must be valid YYYY-MM-DD format, year 0001 or later | "2024-01-01", "1999-12-12" |
| DateTime | Date and time with timezone | RFC 3339 format, year 0001 or later | "2020-01-01T00:00:00+02:00", "2020-01-01T00:00:00.123Z" |
Most data types can accept null values. System-generated metadata, primary ID fields, and required fields cannot be null.
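To catch problems before ingestion, the limits above can be approximated with local checks. This sketch assumes values arrive as strings (for example, from a CSV) and covers every type except DateTime. The function name and the `required` parameter are hypothetical, and the catalog's own validation may be stricter or differ in detail.

```python
import re
from datetime import date
from decimal import Decimal, InvalidOperation

INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1
DATE_SHAPE = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")


def is_valid(type_name, value, required=False):
    """Approximate pre-ingestion check for a single string value."""
    if value is None:
        return not required  # nulls are allowed unless the field is required
    if type_name == "String":
        return len(value) <= 255
    if type_name == "LongString":
        return len(value) <= 10_000
    if type_name == "ID":
        return 1 <= len(value) <= 40
    if type_name == "Integer":
        try:
            return INT64_MIN <= int(value) <= INT64_MAX
        except ValueError:
            return False
    if type_name == "Decimal":
        try:
            number = Decimal(value)
        except InvalidOperation:
            return False
        # Counts coefficient digits, so trailing zeros count as significant.
        return number.is_finite() and len(number.as_tuple().digits) <= 20
    if type_name == "Date":
        if not DATE_SHAPE.fullmatch(value):
            return False
        try:
            date.fromisoformat(value)  # rejects year 0000 and impossible dates
        except ValueError:
            return False
        return True
    raise ValueError(f"unsupported type: {type_name}")
```

For example, `is_valid("Integer", "9223372036854775808")` is false because the value is one past the upper bound, and `is_valid("ID", "")` is false because IDs need at least one character.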
Date and Time Formats
Date and time values must follow the RFC 3339 format (a subset of ISO 8601):
2016-12-31T23:50:00Z – UTC time (Z suffix)
2016-12-31T23:50:00+00:00 – UTC time with explicit zero offset
2017-01-01T01:50:00+01:00 – Time with +1 hour offset from UTC
The earliest supported date is 0001-01-01.
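A small sketch of parsing such values with the Python standard library, assuming uppercase `T` and `Z` and a numeric offset written with a colon. The helper name is hypothetical. The trailing `Z` is rewritten because `datetime.fromisoformat` only accepts it from Python 3.11 onward.

```python
import re
from datetime import datetime, timezone

RFC3339_SHAPE = re.compile(
    r"[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}"
    r"(\.[0-9]+)?(Z|[+-][0-9]{2}:[0-9]{2})"
)


def parse_rfc3339(text):
    """Parse a timestamp into a timezone-aware datetime, or raise ValueError."""
    if not RFC3339_SHAPE.fullmatch(text):
        raise ValueError(f"not an RFC 3339 timestamp: {text!r}")
    normalized = text[:-1] + "+00:00" if text.endswith("Z") else text
    # Note: Python versions before 3.11 accept only 3 or 6 fractional digits.
    return datetime.fromisoformat(normalized)


utc_z = parse_rfc3339("2016-12-31T23:50:00Z")
utc_offset = parse_rfc3339("2016-12-31T23:50:00+00:00")
plus_one = parse_rfc3339("2017-01-01T01:50:00+01:00")

# Convert to UTC to compare instants written with different offsets.
plus_one_utc = plus_one.astimezone(timezone.utc)
```

The first two values denote the same instant, and the third corresponds to 00:50 UTC on 2017-01-01.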
System-Generated Metadata
For data object entities representing platform data objects (files), the following system-generated metadata is automatically included and cannot be manually edited.
| Field | Description | Values or example |
| --- | --- | --- |
| archival_state | Archival state of the file | live, archival, archived, unarchiving |
| created_at | File upload timestamp | 2024-01-15T14:30:00Z |
| created_by | User handle of the file uploader | user-jsmith |
| file_name | Name of the file | sample_001.fastq.gz |
| hidden | Whether the file is hidden | true, false |
| job_id | Associated analysis job identifier | job-GqjqBK00jy1PZ0qJ9k2K000Q |
| size | File size in bytes | 1048576 |
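If you handle these fields in client code, a read-only structure mirrors the fact that they cannot be manually edited. The following is a sketch using a frozen dataclass. The class name and the Python types chosen for each field are assumptions, and only the field names and sample values come from the list above.

```python
from dataclasses import FrozenInstanceError, dataclass

ARCHIVAL_STATES = {"live", "archival", "archived", "unarchiving"}


@dataclass(frozen=True)
class SystemMetadata:
    """Read-only view of the system-generated fields for one file."""

    archival_state: str
    created_at: str  # RFC 3339 timestamp, kept as text here
    created_by: str
    file_name: str
    hidden: bool
    job_id: str
    size: int  # bytes

    def __post_init__(self):
        if self.archival_state not in ARCHIVAL_STATES:
            raise ValueError(f"unknown archival state: {self.archival_state!r}")


meta = SystemMetadata(
    archival_state="live",
    created_at="2024-01-15T14:30:00Z",
    created_by="user-jsmith",
    file_name="sample_001.fastq.gz",
    hidden=False,
    job_id="job-GqjqBK00jy1PZ0qJ9k2K000Q",
    size=1048576,
)

try:
    meta.size = 0  # frozen dataclasses reject attribute assignment
    edit_allowed = True
except FrozenInstanceError:
    edit_allowed = False
```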
Data Ingestion
To become discoverable, metadata must be explicitly ingested, either with the Data Catalog Loader app or directly through the Omics Data Catalog API. You can also enable project synchronization to automatically sync system-generated metadata for data objects in curated projects.
Platform Integration and Data Synchronization
The Omics Data Catalog automatically synchronizes with platform changes to keep metadata aligned with file changes, project updates, and permission modifications.
Metadata synchronization only applies to system-generated metadata for data objects. Custom metadata ingested through the Data Catalog Loader app is not automatically updated and requires manual re-ingestion when changes are needed.
Controlling Project Synchronization
Metadata synchronization is disabled by default for all projects. We recommend enabling synchronization only on curated projects intended for catalog visibility, not on sandbox or test projects.
Project admins can control synchronization through Project Settings > Metadata Synchronization to:
Trigger immediate synchronization of data object metadata on-demand.
Enable automatic synchronization to keep metadata continuously updated.
View the current synchronization state and time of last sync.
Automatic synchronization runs every 6 hours. Changes may take up to 24 hours to be reflected in the catalog, depending on when the next sync runs and how long processing takes.
Cleanup Options
If a project was synced unintentionally and added data objects you do not want in the catalog, you can clean up in two ways:
Use the /dataCatalog-xxxx/removeRecords API method to remove data object records by project and IDs.
Use the Data Catalog Loader with Delete specified records? and a CSV containing the data object IDs and the target project ID.
Data Consistency and Timing
The system operates on eventual consistency. Metadata aligns with platform data over time, but temporary discrepancies may occur during updates.
File deletions remove associated data object records from the catalog regardless of whether project synchronization is enabled or disabled. File deletions are processed within an hour, while complex operations involving multiple entity relationships may take up to 24 hours.
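Because of this eventual consistency, scripts that depend on the catalog reflecting a platform change should poll and not assume an immediate update. The sketch below is a generic polling helper. The `check` callable stands in for whatever catalog query you use and is not a DNAnexus API. The `sleep` and `clock` parameters exist so the helper can be tested without waiting.

```python
import time


def wait_until(check, timeout_s, interval_s, sleep=time.sleep, clock=time.monotonic):
    """Call check() until it returns True or timeout_s elapses.

    Returns True if the condition was met, False on timeout.
    """
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


# Hypothetical usage: wait up to one hour, checking every five minutes, for a
# deleted file's record to disappear. record_is_gone is a placeholder you would
# implement with your own catalog query.
# wait_until(record_is_gone, timeout_s=3600, interval_s=300)
```

Choose the timeout to match the operation: about an hour for simple file deletions, and up to 24 hours for operations involving multiple entity relationships.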