# Exploring and Querying Datasets

{% hint style="info" %}
A license is required to access Spark functionality on the DNAnexus Platform. [Contact DNAnexus Sales](mailto:sales@dnanexus.com) for more information.
{% endhint %}

## Extracting Data From a Dataset With Spark

The `dx` commands, [`extract_dataset`](/user/helpstrings-of-sdk-command-line-utilities.md#extract_dataset) and [`extract_assay germline`](/user/helpstrings-of-sdk-command-line-utilities.md#extract_assay-germline), let you either retrieve the data dictionary of a dataset or extract the underlying data described by that dictionary. You can also use these commands to get dataset metadata, such as the names and titles of entities and fields, or to list all relevant assays in a dataset.

Often, you can retrieve data without using Spark, and extra compute resources are not required (see the [example OpenBio notebooks](https://github.com/dnanexus/OpenBio/tree/master/dx-toolkit)). However, if you need more compute power—such as when working with complex data models, large datasets, or extracting large volumes of data—you can use a private Spark resource. Using private compute resources helps avoid these timeouts by scaling resources as needed.

If you use the `--sql` flag, the command returns a SQL statement (as a string) that you can use in a standalone Spark-enabled application, such as JupyterLab.

### Initiating a Spark Session

The most common way to use Spark on the DNAnexus Platform is via a [Spark enabled JupyterLab notebook](/user/jupyter-notebooks/dxjupyterlab-spark-cluster.md).

After creating a Jupyter notebook within a project, enter the commands shown below, to start a Spark session.

Python:

```python
import pyspark
sc = pyspark.SparkContext()
spark = pyspark.sql.SparkSession(sc)
```

R:

```r
install.packages("sparklyr")
library(sparklyr)
port <- Sys.getenv("SPARK_MASTER_PORT")
master <- paste("spark://master:", port, sep = '')
sc = spark_connect(master)
```

### Executing SQL Queries

Once you've initiated a Spark session, you can run SQL queries on the database within your notebook, with the results written to a Spark DataFrame:

Python:

```python
retrieve_sql = 'select .... from .... '
df = spark.sql(retrieve_sql)
```

R:

```r
library(DBI)
retrieve_sql <- 'select .... from .... '
df = dbGetQuery(sc, retrieve_sql)
```

### Query to Extract Data From Database Using `extract_dataset`

Python:

```python
import subprocess
cmd = ["dx", "extract_dataset", dataset, "--fields", "entity1.field1, entity1.field2, entity2.field4", "--sql", "-o", "extracted_data.sql"]
subprocess.check_call(cmd)
```

Where `dataset` is the `record-id` or the path to the dataset or cohort, for example, "record-abc123" or "/mydirectory/mydataset.dataset."

R:

```r
cmd <- paste("dx extract_dataset", dataset, " --fields", "entity1.field1, entity1.field2, entity2.field4", "--sql", "-o extracted_data.sql")
system(cmd)
```

Where `dataset` is the `record-id` or the path to the dataset or cohort.

### Query to Filter and Extract Data from Database Using `extract_assay germline`

Python:

```python
import subprocess
cmd = ["dx", "extract_assay", "germline", dataset, "--retrieve-allele", "allele_filter.json", "--sql", "-o", "extract_allele.sql"]
subprocess.check_call(cmd)
```

R:

```r
cmd <- paste("dx extract_assay", "germline", dataset, "--retrieve-allele", "allele_filter.json", "--sql", "-o extracted_allele.sql")
system(cmd)
```

In the examples above, dataset is the `record-id` or the path to the dataset or cohort, for example, `record-abc123` or `/mydirectory/mydataset.dataset`. `allele_filter.json` is a JSON file that contains filters for the `--retrieve-allele` command. For more information, refer to the notebooks in the [DNAnexus OpenBio dx-toolkit examples](https://github.com/dnanexus/OpenBio/tree/master/dx-toolkit).

### Run SQL Query to Extract Data

Python:

```python
with open("extracted_data.sql", "r") as file:
    retrieve_sql=""
    for line in file:
        retrieve_sql += line.strip()
df = spark.sql(retrieve_sql.strip(";"))
```

R:

```r
install.packages("tidyverse")
library(readr)
retrieve_sql <-read_file("extracted_data.sql")
retrieve_sql <- gsub("[;\n]", "", retrieve_sql)
df <- dbGetQuery(sc, retrieve_sql)
```

## Best Practices

* When querying large datasets - such as those containing genomic data - ensure that your Spark cluster is scaled up appropriately with multiple worker nodes to parallelize across.
* Ensure that your Spark session is only [initialized](#initiating-a-spark-session) once per Jupyter session. If you initialize the Spark session in multiple notebooks in the same Jupyter Job - for example, run notebook 1 and also run notebook 2 OR run a notebook from start to finish multiple times - the Spark session becomes corrupted and you need to restart the specific notebook's kernel. As a best practice, shut down the kernel of any notebook you are not using, before running a second notebook in the same session.
* If you want to use a database outside your project's scope, you must refer to it using its unique database name (typically this looks something like `database_fjf3y28066y5jxj2b0gz4g85__metabric_data`) as opposed to the database name (`metabric_data` in this case).


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://documentation.dnanexus.com/user/jupyter-notebooks/dxjupyterlab-spark-cluster/exploring-and-querying-with-spark.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
