Exploring and Querying Datasets

A DNAnexus Apollo license is required to access the features described on this page. Contact DNAnexus Sales for more information.

Extracting Data From a Dataset With Spark

The dx command extract_dataset provides the option to either return the data dictionary of a dataset or to retrieve the underlying data comprising a dataset that is described by the data dictionary. The command also provides options to list the name and title for entities and fields which form the dataset (results are returned alphabetically by entity and field names). Functionality is demonstrated here. When retrieving data the user will have the choice of using a private Spark resource. In most scenarios, retrieving data directly using extract_dataset may suffice, and additional compute resources may not be needed (see the example OpenBio notebooks). However when additional compute resources are needed, data is returned using the DNAnexus Thrift Server and though the server is highly available there is a fixed timeout which may prevent an extensive number of queries from executing. In scenarios where the data model has many relationships, the volume of stored data is high, and/or the volume of data to be extracted and returned is high, it may be necessary to extract data using additional private compute resources scaled accordingly so that timeouts enforced via the Thrift Server are avoided completely. If the flag --sql is provided the command will instead return a SQL statement (string) to then use when querying from a stand alone Spark-enabled App(let), such as JupyterLab.

Initiating a Spark Session

The most common way to use Spark on the DNAnexus Platform is via a Spark enabled JupyterLab notebook.
After creating a Jupyter notebook within a project, enter the commands shown below, to initiate a Spark session.
import pyspark
sc = pyspark.SparkContext()
spark = pyspark.sql.SparkSession(sc)
port <- Sys.getenv("SPARK_MASTER_PORT")
master <- paste("spark://master:", port, sep = '')
sc = spark_connect(master)

Executing SQL Queries

Once you've initiated a Spark session, you can run SQL queries on the database within your notebook, with the results written to a Spark DataFrame:
retrieve_sql = 'select .... from .... '
df = spark.sql(retrieve_sql)
retrieve_sql <- 'select .... from .... '
df = dbGetQuery(sc, retrieve_sql)

Query to Extract Data From Database Using extract_dataset

import subprocess
cmd = ["dx", "extract_dataset", dataset, "--fields", "entity1.field1, entity1.field2, entity2.field4", "-sql", "-o", "extracted_data.sql"]
Where dataset is the record-id or the path to the dataset or cohort, for example, “record-abc123” or “/mydirectory/mydataset.dataset.”
cmd <- paste("dx extract_dataset", dataset, " --fields", "entity1.field1, entity1.field2, entity2.field4", "--sql", "-o extracted_data.sql")
Where dataset is the record-id or the path to the dataset or cohort.

Run SQL Query to Extract Data

with open("extracted_data.sql", "r") as file:
for line in file:
retrieve_sql += line.strip()
df = spark.sql(retrieve_sql.strip(";"))
retrieve_sql <-read_file("extracted_data.sql")
retrieve_sql <- gsub("[;\n]", "", retrieve_sql)
df <- dbGetQuery(sc, retrieve_sql)

Best Practices

  • When querying large datasets - such as those containing genomic data - ensure that your Spark cluster is scaled up appropriately with multiple clusters to parallelize across.
  • Ensure that your Spark session is only initialized once per Jupyter session. If you initialize the Spark session in multiple notebooks in the same Jupyter Job - e.g. run notebook 1 and also run notebook 2 OR run a notebook from start to finish multiple times - the Spark session will be corrupted and you will need to restart the specific notebook's kernel. As a best practice, shut down the kernel of any notebook you are not using, before running a second notebook in the same session.
  • If you would like to use a database outside your project's scope, you must refer to it using it's unique database name (typically this will look something like database_fjf3y28066y5jxj2b0gz4g85__metabric_data) as opposed to the database name (metabric_data in this case).