Exploring and Querying Datasets Using Spark
An introduction to using Spark with datasets on the DNAnexus Platform.
With DNAnexus Apollo, users can leverage Spark to explore and query large, complex datasets, in an environment that can scale to handle millions of rows and columns of data.
Not all features are included in all packages. Contact DNAnexus Sales for more information.

Querying In Python Using Native Spark within a Jupyter Notebook

Initiating a Spark Session

The most common way to use Spark on the DNAnexus Platform is via a Spark-enabled JupyterLab notebook.
After creating a Jupyter notebook within a project, enter the following commands to initiate a Spark session. It's good practice to do this at the very beginning of your analysis.
import pyspark

# Create a SparkContext, then wrap it in a SparkSession for SQL queries
sc = pyspark.SparkContext()
spark = pyspark.sql.SparkSession(sc)

Executing SQL Queries

Once you've initiated a Spark session, you can run SQL queries against the database from within your notebook, with the results returned as a Spark DataFrame:
retrieve_sql = 'select .... from .... '
df = spark.sql(retrieve_sql)

Using dxdata

dxdata is a Python client developed and maintained by DNAnexus. It is specific to the DNAnexus Platform and can only be used with datasets, cohorts, and dashboards. dxdata enables users to quickly and easily:
    Initiate a Spark session
    Connect to and explore datasets, cohorts, dashboards, or databases
    Extract data to a DataFrame

dxdata Documentation

See below for the dxdata library documentation. This information is also available in the docstrings of the functions.
dnanexus dxdata documentation 20210623.pdf

Useful Notebooks

Download this sample Jupyter notebook to explore and learn to use key dxdata functionality.

Best Practices

    When querying large datasets - such as those containing genomic data - ensure that your Spark cluster is scaled up appropriately, with multiple nodes across which the work can be parallelized.
    Initialize your Spark session only once per Jupyter session. If you initialize Spark in multiple notebooks within the same Jupyter job - e.g. by running notebook 1 and then notebook 2, or by running a single notebook from start to finish multiple times - the Spark session will be corrupted and you will need to restart the affected notebook's kernel. As a best practice, shut down the kernel of any notebook you are not using before running a second notebook in the same session.