Cohort Browser
An overview of the Cohort Browser's key features and how to use them.
Copyright 2024 DNAnexus
The Cohort Browser is accessible to all users of the UK Biobank Research Analysis Platform and the Our Future Health Trusted Research Environment.
For DNAnexus Platform users, an Apollo license is required to access the Cohort Browser. Contact DNAnexus Sales for more information.
DNAnexus Apollo builds on the technological foundation of the core DNAnexus Platform to offer scientists and bioinformaticians an environment to store and query large sets of genomic, phenotypic, multi-omic, and other structured data. Researchers can bring their data to the Platform and leverage DNAnexus apps to ingest the data into queryable databases.
These databases can then be explored using the Cohort Browser. Scientists can filter the dataset by any data field and save these filtered samples as cohorts. These cohorts can be shared with other scientists and also can be used as inputs to analysis apps to perform such tasks as calculating allele frequencies or performing a GWAS analysis.
The Cohort Browser dashboard can show up to three tabs based on the configuration of the dataset: Overview, Data Preview, and either Genomics (if the dataset contains germline genomic data) or Somatic Variants (if the dataset contains somatic variant data). Tabs are loaded as the user clicks on them, so if there is no change in filtering, the tabs will stay cached and will not need to reload.
Bioinformaticians who wish to perform ad hoc statistical analysis are able to create JupyterLab environments backed by Spark clusters to directly query their data and create dataframes within a Python or R environment for further analysis.
Datasets need to be prepared and ingested in order to be accessible via the Cohort Browser. See the Ingesting Data page for information on the ingestion process.
From the project where a dataset is located, go to the Manage tab and select your dataset of interest. Click the Explore Data action to open the dataset in the Cohort Browser.
You can also access datasets via the Datasets page, which is located under the Projects menu. The Datasets page displays all datasets you have access to, and enables you to browse and find a specific dataset without navigating through projects.
You can use the optional information panel to view further details about a selected dataset, such as its creator and sponsorship.
In the Cohort Browser's Overview tab, you'll see visualizations that provide an introduction to the dataset, and insights on the data it contains.
To create and view a chart visualizing data in a field, click the Add Tile button. The Add Tile dialog will open, showing a hierarchical view of all the data fields available in the dataset.
Browse the list, or search for an item by its title to narrow the list down.
Select a data field from the list. In the Data Field Details panel, you can see metadata for the selected data field, a visualization preview, and options to customize the chart type.
Confirm selection via the Add as Tile action. The new tile will appear on your dashboard.
For each data field, the range of available chart types depends on the type of data stored in the field. See the Chart Types pages for more information on how each chart type can be built. Note that no more than 15 tiles can be added to the dashboard.
Once you've selected a primary data field in the Add Tile dialog, you can add a secondary data field by clicking the “+” icon next to an eligible secondary data field.
The “+” icon only appears when at least one chart type is supported for the specified combination. See Chart Types - Multi-Variable Charts for more details.
For certain chart types - such as Stacked Row Chart and Scatter Plot - you can re-order the primary and secondary data fields by dragging the data field in the Data Field Details section.
Note that Cohort Browser performance can be affected when more than 10 tiles are displayed.
This video provides a detailed overview of exploring new datasets using the Cohort Browser:
When you start exploration on a dataset, an empty cohort is created automatically in the Cohort Browser. You can further narrow down your cohort by adding cohort filters. Cohorts can be saved and exported for later use.
From the cohort you wish to edit, click the Add Filter button.
Select the data field you want to filter by, then confirm by clicking Add as Filter.
Select operators and enter the values to filter by. Click Apply Filter to confirm.
Added filters are displayed in the corresponding cohort panels. You can edit a filter at any time by clicking on it, which opens the Edit Filter dialog.
The default logical operator is 'AND'. To switch the operator to 'OR', click on the operator. For a filter group (a set of filters tied to 1 specific entity), all operators will be the same: all 'OR' or all 'AND'.
Once filters are added or edited, the updated cohort size appears under the name of the affected cohort. The dashboard also refreshes automatically to fetch results based on the latest cohort selection.
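The AND/OR behavior of a filter group can be pictured with a small Python sketch. It is illustrative only: the record fields, the `apply_filter_group` helper, and the predicates are hypothetical and are not part of the Cohort Browser.

```python
def apply_filter_group(records, predicates, operator="AND"):
    """Keep records matching all (AND) or any (OR) predicates in one group."""
    if operator not in ("AND", "OR"):
        raise ValueError("operator must be 'AND' or 'OR'")
    # Every filter in a group shares the same operator, so one switch suffices.
    combine = all if operator == "AND" else any
    return [r for r in records if combine(p(r) for p in predicates)]

# Hypothetical participant records.
participants = [
    {"id": "P1", "age": 62, "sex": "F"},
    {"id": "P2", "age": 45, "sex": "F"},
    {"id": "P3", "age": 70, "sex": "M"},
]
filters = [lambda r: r["age"] >= 60, lambda r: r["sex"] == "F"]

and_ids = [r["id"] for r in apply_filter_group(participants, filters, "AND")]
or_ids = [r["id"] for r in apply_filter_group(participants, filters, "OR")]
print(and_ids, or_ids)
```

With 'AND', only participants who satisfy both filters remain; with 'OR', a participant who satisfies either filter is kept.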
If your dataset includes germline genomic data, then you will have the option to add a genomic filter to your cohort.
From the cohort you wish to edit, click the Add Filter button.
Toggle to the Geno tab.
Edit the filter in the Edit Genomic Filter dialog using one of the following criteria:
Filter by genes and variant effects: Filter your dataset by variants of certain types and consequences within specified genes and/or genomic ranges. A maximum of 5 genes/ranges can be entered.
Filter by a list of variant IDs. A maximum of 100 variants can be entered.
If you add more than one range, gene, or variant, separate the values with commas or place each value on its own line.
Confirm the edit by clicking the Apply Geno Filter button.
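As a rough illustration of the input rules above (comma- or newline-separated values, with a cap of 5 genes/ranges or 100 variant IDs), here is a minimal Python sketch. The `parse_filter_values` helper is hypothetical; it only mimics the documented limits and is not how the Cohort Browser itself parses input.

```python
import re

def parse_filter_values(text, max_items):
    """Split comma- or newline-separated input into a clean list of values."""
    values = [v.strip() for v in re.split(r"[,\n]", text)]
    values = [v for v in values if v]  # drop empties left by stray separators
    if len(values) > max_items:
        raise ValueError(
            f"{len(values)} values entered; at most {max_items} are allowed"
        )
    return values

# Genes and ranges are capped at 5 entries, variant IDs at 100.
genes = parse_filter_values("BRCA1, BRCA2\nTP53", max_items=5)
print(genes)
```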
Similarly to the other cohort filters, a genomic filter is applied to the main entity of your dataset (in most cases, patients or participants).
For datasets with canonical transcript information available, the Genomic Filter dialog shows an additional toggle titled "Match effects for canonical transcript only". The toggle may be set to YES initially, which restricts the results to variants that have canonical transcript information available.
Canonical transcripts, as defined by Ensembl, will be indicated with a blue marker next to their "Ensembl Transcript ID" in the Transcript column in the Genes/Transcripts table.
If your dataset includes data on somatic variants, the workflow for creating a genomic filter is very similar to that for datasets containing germline data.
In the Add Filter / Edit Filter pane, you will see options enabling you to:
Filter by genes and variant effects.
Filter by a particular HGVS DNA or HGVS protein notation, preceded by a gene symbol.
Filter by a list of variant IDs. A maximum of 10 variants can be entered.
Note that for each somatic variant filter, you can specify if matching variants are to be used as inclusion criteria or exclusion criteria for your cohort. By default, you will be selecting patients who have at least one detected variant that matches the specified criteria. To select patients or participants who do not have any matching variant, click the “WITH” dropdown button and change its value to “WITHOUT”.
You can create up to 10 somatic variant filters for each cohort.
When working with datasets that have multiple data entities, you can create a join filter by selecting data fields from a secondary entity and adding them as filters. An entity is a grouping of data around a unique item, event, or concept, such as a patient, visit, medication, or laboratory test.
Note that “Entity” in the Cohort Browser can refer either to a data model object (examples given above), or to the specific input parameter in the Table Exporter app. See the Table Exporter documentation for details.
Join filters are displayed as subrows deriving from the main entity. Depending on the entity to which your selected data field belongs, a join filter that reflects the relationship between those entities will be automatically created. To create a new cohort criterion using join filters, click + Add filter or Filter > Add filter on a tile. To add criteria to an existing criterion in a join, click Add additional criteria inline on the row of the chosen filter.
You can choose between the 'AND' or 'OR' logical operators when creating a cohort and comparing join filters. To switch between them, click on the logical operator. For a specific level of join filtering, joins are either all 'AND' or all 'OR'. Note that even when using 'OR' for two join filters, the implication that "this criteria exists" precedes the join level, i.e. “where exists, join 1 or join 2”.
Once a join filter is created, you can further define the secondary entity by adding additional criteria to the branch, or adding more layers of join filters deriving from the current branch. As you add more layers, the field selector automatically hides fields that are ineligible to be added based on the join.
For an example of interpreting join filters, consider the following:
The First Example cohort identifies all patients with a "high" or "medium" risk level who have a first (visit instance = 1) hospital visit and who also have had a lab test that was a "nasal swab". This lab test does not necessarily have to be conducted at the time of the patient’s first hospital visit. In the Second Example, the cohort includes all patients with a "high" or "medium" risk level who had the "nasal swab" test performed on the first visit.
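The difference between the two examples can be sketched in Python with sets. The data below is invented for illustration, and it assumes that each lab test row records the visit instance it belongs to; the actual data model of a dataset may differ.

```python
# Hypothetical data: patient risk levels, visits, and lab tests.
patients = {"P1": "high", "P2": "medium", "P3": "low", "P4": "high"}
visits = [("P1", 1), ("P1", 2), ("P2", 1), ("P3", 1), ("P4", 2)]
lab_tests = [  # (patient, visit instance, test name)
    ("P1", 2, "nasal swab"),
    ("P2", 1, "nasal swab"),
    ("P3", 1, "nasal swab"),
    ("P4", 2, "nasal swab"),
]

at_risk = {p for p, risk in patients.items() if risk in {"high", "medium"}}
had_first_visit = {p for p, instance in visits if instance == 1}
had_swab = {p for p, _, test in lab_tests if test == "nasal swab"}
swab_on_first_visit = {
    p for p, instance, test in lab_tests
    if instance == 1 and test == "nasal swab"
}

# First Example: the visit and the lab test are sibling joins on the patient.
first_example = at_risk & had_first_visit & had_swab
# Second Example: the lab test join is nested under the first-visit join.
second_example = at_risk & swab_on_first_visit

print(sorted(first_example), sorted(second_example))
```

Here P1 appears only in the first cohort, because the nasal swab was taken on the second visit rather than the first.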
This video provides an overview of setting up your dashboard as part of defining and refining a cohort:
You can create complex cohorts by combining existing cohorts from the same dataset. The combine function is available from the “Compare / Combine Cohorts” menu located at the top of the page.
You can also create a combined cohort based on the cohorts already being compared.
The Cohort Browser supports the following combination logic:
Logic | Description | Number of Cohorts Supported |
---|---|---|
Intersection | Select members that are present in ALL selected cohorts. Example: intersection of cohort A, B and C would be A∩B∩C | Up to 5 cohorts |
Union | Select members that are present in ANY of the selected cohorts. Example: union of cohort A, B and C would be A∪B∪C | Up to 5 cohorts |
Subtraction | Select members that are present only in the first selected cohort and not in the second. Example: Subtraction of cohort A, B would be A-B | 2 cohorts |
Unique | Select members that appear in exactly one of the selected cohorts. Example: Unique of cohort A, B would be (A-B) ∪ (B-A) | 2 cohorts |
Once a combined cohort is created, you can inspect the combination logic and its original cohorts in the cohort filters section.
Cohorts already combined cannot be combined a second time.
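The four combination types map directly onto set operations. The following Python sketch treats each cohort as a set of member IDs; the function names and the `_check_count` helper are illustrative and do not correspond to any Cohort Browser API.

```python
def _check_count(cohorts, maximum):
    if not 2 <= len(cohorts) <= maximum:
        raise ValueError(f"expected between 2 and {maximum} cohorts")

def intersection(*cohorts):
    _check_count(cohorts, 5)
    return set.intersection(*cohorts)  # members present in ALL cohorts

def union(*cohorts):
    _check_count(cohorts, 5)
    return set.union(*cohorts)  # members present in ANY cohort

def subtraction(a, b):
    return a - b  # members in the first cohort but not the second

def unique(a, b):
    return a ^ b  # symmetric difference: (A-B) | (B-A)

A, B, C = {1, 2, 3}, {2, 3, 4}, {3, 4, 5}
print(intersection(A, B, C), union(A, B, C), subtraction(A, B), unique(A, B))
```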
The Cohort Table can visualize up to 30 columns of data per tab. Tables with over 30 columns but under 200 will still show the column names and allow the user to save the cohort but data will not be queried. Tables with over 200 columns are not supported.
In the Data Preview tab, the Cohort Table shows records that are within your current cohort selection split up by the entity they are on. You can add or remove data fields as columns via the column customization menu, which is located in the top-right corner of the table. As you add fields, entities are automatically split out. You can have up to 5 different entities showing at a time. In the entity drop down, you can toggle between various entities you’ve added and remove them outright.
Click on table column headers to access more functionalities including sorting and searching in a specific column and the data field information. From the data field information, you can quickly add the field as a tile or a filter if it has not been added yet.
You can export table information either as a list of record IDs or a CSV file. Export options are available on the top-right corner of the table once you have selected a number of table rows.
Note that when exporting data, the resulting file will contain only those fields that are displayed in the Data Preview table.
The cohort table can display a maximum of 30,000 records. If your cohort is larger than this, the table may not show the full data. To export a table with more than 30,000 records, use the Table Exporter app.
Note that your view may contain more than one table (e.g. a participants table and a hospital records table). When you export the view to a CSV or TSV, doing so will yield a separate file for each table.
The Variant Browser shows variants that are present in the current cohort selection.
For datasets containing germline data, the Variant Browser appears in the Genomics tab. For datasets containing somatic variants data, it appears in the Somatic Variants tab.
For datasets containing germline data, the Variant Browser includes a lollipop plot displaying allele frequencies for variants in a specified genomic region.
The table below the lollipop plot displays a list of the same variants in tabular format, along with further annotation information including:
Type: whether the variant is a SNP, deletion, insertion, or mixed.
Consequences: The impact of the variant according to SnpEff. For variants with multiple gene annotations, this column displays the most severe consequence per gene.
Population Allele Frequency: Allele frequency calculated across the entire dataset from which the cohort is created.
Cohort Allele Frequency: Allele frequency calculated across the current cohort selection.
gnomAD Allele Frequency: Allele frequency of the specified allele from the public dataset gnomAD.
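To make the distinction between population and cohort allele frequency concrete, here is a minimal Python sketch. It uses the standard textbook definition for diploid genotypes (alternate allele count divided by twice the number of called samples); the exact calculation used by the Cohort Browser is not described here, so treat the formula, the genotype encoding, and the `allele_frequency` helper as assumptions.

```python
def allele_frequency(genotypes, sample_ids):
    """Alternate allele frequency over the given samples (diploid assumed).

    genotypes maps sample ID -> alternate allele count (0, 1, or 2),
    or None when the genotype is missing.
    """
    calls = [genotypes[s] for s in sample_ids if genotypes.get(s) is not None]
    if not calls:
        return None  # no called genotypes, frequency undefined
    return sum(calls) / (2 * len(calls))

# Hypothetical genotypes for one variant across the whole dataset.
genotypes = {"S1": 0, "S2": 1, "S3": 2, "S4": 0, "S5": None}

population_af = allele_frequency(genotypes, genotypes.keys())
cohort_af = allele_frequency(genotypes, ["S2", "S3"])
print(population_af, cohort_af)
```

The same variant can therefore show a different frequency in the cohort than in the dataset as a whole, which is what the two columns let you compare.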
If canonical transcript information is available, the following three columns with additional annotation information will appear in the Allele Table:
Consequences (Canonical Transcript): Canonical effects for each associated gene, according to SnpEff.
HGVS DNA (Canonical Transcript): HGVS (DNA) standard terminology for each gene associated with this variant.
HGVS Protein (Canonical Transcript): HGVS (Protein) standard terminology for each gene associated with this variant.
To view further annotation information, you can go to the detail page of a given variant by clicking on the link in the Location column.
Downloading genomic data through the visualization UI is not suitable for large datasets. You can use the SQL Runner app to download data more efficiently.
For datasets containing somatic variant data, the Variant Browser includes a lollipop plot showing the somatic variants that occur, in the cohort under examination, within the canonical protein of a single gene. Interpreting and working with this chart is very similar to working with the lollipop plot included in the Variant Browser for datasets containing germline data.
Note the following:
The lollipop plot displays variant information in one gene / canonical protein at a time.
Each “lollipop” represents amino acid changes at a given location (e.g. “Thr322Ala”), with location information visualized as horizontal position (X axis) and affected sample frequency in the current cohort visualized as height (Y axis).
Each lollipop is color-coded by consequence according to the canonical transcript. Lollipops that cover more than one consequence type are color-coded as “Multiple Consequences”.
You can inspect variant statistics under each specific consequence by interacting with the legend panel and selecting one color at a time. Selecting a particular lollipop in the plot applies a filter to the variants table, such that only variants corresponding to the selected lollipop are displayed.
You can navigate to a different gene by entering the gene symbol in the Go to Gene field. This will also update the variants table by automatically navigating to the corresponding genomic region.
For datasets containing somatic variant data, the Variant Browser includes a chart illustrating the overall variant landscape across the top-mutated genes, for the current cohort.
Note the following:
The genes are sorted in descending order of percent of affected samples.
Samples are displayed, from left to right, from those that have the greatest number of mutated genes across all genes, to those that have the least.
A sample is considered affected if it has at least one detected mutation of high or moderate impact within the canonical transcript for that gene.
Samples are color-coded by consequences. Samples with two or more detected variants are color-coded as “Multi Hit”.
Only consequences of high or moderate impact (as defined by Ensembl VEP version 109) are included in this visualization.
The variant frequency matrix plot displays up to 50 top mutated genes, for up to 500 samples for any given cohort.
You can inspect variant statistics under each specific consequence, by interacting with the legend panel and selecting one color at a time.
Hovering over a particular sample cell will open a popover window showing detailed information on the sample, including:
The Sample ID
The count of variants per consequence, within the respective gene
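The ordering rules for the variant frequency matrix can be sketched as follows. This is only an illustration of the rules listed above: the input structure, the `matrix_layout` helper, and the alphabetical tie-breaking are assumptions, and the input is taken to contain only high or moderate impact mutations in canonical transcripts.

```python
from collections import Counter

def matrix_layout(affected, max_genes=50, max_samples=500):
    """Order genes and samples for a variant frequency matrix.

    affected maps sample ID -> set of genes in which the sample is affected.
    """
    n_samples = len(affected)
    gene_counts = Counter(g for genes in affected.values() for g in genes)
    # Genes: descending percent of affected samples, capped at max_genes.
    gene_order = sorted(gene_counts, key=lambda g: (-gene_counts[g], g))
    gene_order = gene_order[:max_genes]
    percent = {g: 100 * gene_counts[g] / n_samples for g in gene_order}
    # Samples: most mutated genes first, capped at max_samples.
    sample_order = sorted(affected, key=lambda s: (-len(affected[s]), s))
    return gene_order, percent, sample_order[:max_samples]

# Hypothetical cohort of four samples.
affected = {
    "S1": {"TP53", "KRAS"},
    "S2": {"TP53"},
    "S3": set(),
    "S4": {"TP53", "KRAS", "EGFR"},
}
gene_order, percent, sample_order = matrix_layout(affected)
print(gene_order, percent, sample_order)
```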
For datasets containing somatic variant data, the Variant Browser includes a Variants Table that provides, in tabular format, details on the variants found in a particular genomic region, in samples for the current cohort.
Note that the table displays detail on the same genomic region as the lollipop plot.
Information displayed in the Variants Table includes:
Location of variant
Reference allele of variant
Alternate allele of variant
Type of variant
Variant consequences, with entries color-coded by level of severity
HGVS cDNA
HGVS Protein
You can export selected variants in the table as a list of variant IDs or a CSV file. Export options will appear at the top-right corner of the table once you have items selected.
You can save your cohort selection to a project as a Class: Record object by clicking the Save icon in the top-right corner of the cohort panel.
Cohorts will be saved with the filters applied, along with the latest set of visualizations and dashboard layout information. Similar to Dataset objects, Cohort objects can be found under the Manage tab in your selected projects, and can be re-opened via the Explore Data option.
You can export a list of main entity IDs in your current cohort selection as a CSV file. This action can be found next to the Save Cohort button, in the top-right corner of the cohort panel.
Dashboard views contain layout and configuration information that can be re-used during cohort browsing. You can save or load a dashboard view via the Dashboard Actions menu located at the top-right corner of the header area. Dashboard views are saved as "Type: DashboardView" objects, which once saved also show up in selected project folders.
You can compare two cohorts by adding both cohorts into the Cohort Browser. In cohort compare mode, all visualizations are converted to show data from both cohorts.
The Compare Cohort action can be found in the header area next to the cohort title. You can create a new cohort, duplicate the current cohort, or load a previously saved cohort.
In compare mode, you can continue to edit both cohorts and visualize the results dynamically.
Compare mode is supported only for cohorts created from the same dataset. Certain cohort browser sections and chart types are not supported in compare mode.
You can compare a cohort with its complement in the dataset by selecting the “Not In …” option in the Compare / Combine menu. Similar to combining cohorts, you must first save your current cohort before creating its not-in counterpart.
Logic | Description |
---|---|
Not In | Select members that are present in the dataset, but not in the current cohort. Example: In dataset U, the result of “Not In” A would be U-A |
Similar to combined cohorts, cohorts created using Not In cannot be used further for the creation of combined / not-in cohorts.
“Not In” cohorts are linked to the cohort they are originally based upon. Once a not-in cohort is created, further changes to the original cohort definition will no longer be reflected.
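In set terms, “Not In” is a complement taken against the dataset at the moment the cohort is created. The Python sketch below models that behavior with a frozen snapshot; the `not_in` helper and the IDs are hypothetical.

```python
def not_in(dataset_ids, cohort_ids):
    """Return an immutable snapshot of the members outside the cohort (U - A)."""
    return frozenset(set(dataset_ids) - set(cohort_ids))

dataset_u = {"P1", "P2", "P3", "P4", "P5"}
cohort_a = {"P1", "P2"}

not_in_a = not_in(dataset_u, cohort_a)

# A later edit to the original cohort does not change the not-in cohort.
cohort_a.add("P3")
print(sorted(not_in_a))
```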
The Cohort Browser adheres to a project's download policy restrictions.
If a database is in a project that has a restricted downloadPolicy, then a Cohort Browser that shows a dataset, cohorts, or dashboards pointing to that database should not allow downloads, regardless of project.
If all copies of the dataset are in projects that have a restricted downloadPolicy, then a Cohort Browser that shows a dataset, cohorts, or dashboards pointing to that dataset should not allow downloads, regardless of project. All copies of the dataset are checked: if at least one copy is in a "download allowed" project, then the download is allowed.
If the cohort or dashboard you are launching has a restricted downloadPolicy, then a Cohort Browser that shows the cohort or dashboard should not allow downloads.
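One way to read these rules together is as a series of checks that must all pass before downloads are allowed. The Python sketch below encodes that reading; the `downloads_allowed` function, its parameters, and the assumption that the three rules combine this way are illustrative, not a description of the platform's implementation.

```python
def downloads_allowed(database_restricted, dataset_copies_restricted,
                      launched_object_restricted):
    """Evaluate the download rules for a Cohort Browser session.

    database_restricted: True if the database's project restricts downloads.
    dataset_copies_restricted: one boolean per copy of the dataset, True if
        that copy's project restricts downloads.
    launched_object_restricted: True if the launched cohort or dashboard
        has a restricted download policy.
    """
    if database_restricted:
        return False
    # Blocked only when every copy is restricted; one "download allowed"
    # copy is enough. An empty list is treated as blocked.
    if all(dataset_copies_restricted):
        return False
    if launched_object_restricted:
        return False
    return True

print(downloads_allowed(False, [True, False], False))
```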
The dx command create_cohort generates a new Cohort object on the platform from an existing Dataset or Cohort object, using a list of primary IDs. The filters are applied to the global primary key of the dataset or cohort object. When the input is a CohortBrowser record, the existing filters are preserved and the output record has additional filters on the global primary key. The filters are combined such that the resulting record is the intersection of the IDs present in the original input and the IDs passed through the CLI. For additional details, please see create_cohort and the example notebooks in the public GitHub repo, dnanexus/OpenBio.
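The intersection semantics described above can be modeled in a few lines of Python. This sketch does not call the dx CLI; the `create_cohort_ids` helper, the record layout, and the `participant_id` key are hypothetical stand-ins for the input record's filters and global primary key.

```python
def create_cohort_ids(records, existing_filter, primary_ids,
                      key="participant_id"):
    """Return IDs that pass the input's filters AND appear in primary_ids."""
    wanted = set(primary_ids)
    return [r[key] for r in records if existing_filter(r) and r[key] in wanted]

# Hypothetical dataset rows.
records = [
    {"participant_id": "P1", "age": 62},
    {"participant_id": "P2", "age": 45},
    {"participant_id": "P3", "age": 70},
]
passed_ids = ["P1", "P2", "P9"]  # P9 does not exist in the dataset

# Input is a cohort with an existing filter (age >= 60): filters are preserved.
from_cohort = create_cohort_ids(records, lambda r: r["age"] >= 60, passed_ids)
# Input is a dataset: there are no existing filters to preserve.
from_dataset = create_cohort_ids(records, lambda r: True, passed_ids)
print(from_cohort, from_dataset)
```

In the cohort case, P2 is dropped because it fails the preserved filter even though its ID was passed in.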