Using Spark

Connect with Spark for database sharing, big data analytics, and rich visualizations.

A license is required to access Spark functionality on the DNAnexus Platform. Contact DNAnexus Sales for more information.

Apache Spark can help you tackle big data analytics combined with rich visualization. Sharing a database is as easy as sharing a project: our access levels on the platform map directly to SQL abilities, so you can fine-tune access control to your databases at either an individual or org level.

There are two ways to connect to our Spark service: through our Thrift server or, for more scalable throughput, using Spark applications.

Thrift Server

DNAnexus hosts a high-availability Thrift server to which you can connect over JDBC, using a client such as beeline, to run Spark SQL interactively. Refer to the Thrift Server page for more details.

Spark Applications

You can launch a Spark application distributed across a cluster of workers. Since this is all tightly integrated with the rest of the platform, Spark jobs will leverage the features of normal jobs. You'll have the same ways to monitor a job's progress, SSH into a job instance to debug, and use the features of dx-toolkit and the platform web UI. You'll additionally have access to logs from workers and be able to monitor the job in the Spark UI.

Visualization

With Spark, you can visualize your results in real time. You can save your queries as cohorts, share them with your team, or use them as inputs to Spark-based analysis apps. You can also create charts and shareable dashboards. The filter view lets you build cohorts quickly, without writing complex SQL queries by hand.

Databases

A database is a data object on the Platform. A database object is stored in a project.

Database Sharing

Databases can be shared with other users or organizations through project sharing. A project administrator can revoke access to a database at any time by revoking access to the project. If revoking access to the project is not possible, the database can be relocated to another project with a different set of collaborators.

Database and Project Policies

Project policies restrict how the data can be modified or copied to other projects. Databases follow the Delete Policy and the Copy Policy. If a database is in a restricted project, the database can be accessed for reading only from the same project context when connecting to Thrift. Databases also adhere to the project's PHI Data Protection policy. If a database is in a project for which PHI Data Protection is enabled (a "PHI project"), the database is subject to the following restrictions:

  • The database cannot be accessed by Spark apps launched in projects for which PHI Data Protection is not enabled ("non-PHI projects").

  • If a non-PHI project is provided as a project context when connecting to Thrift, only databases from non-PHI projects will be available for retrieving data.

  • If a PHI project is provided as a project context when connecting to Thrift, only databases from PHI projects will be available to add new data.

A license and a signed Business Associate Agreement are required to enable and use PHI Data Protection. Contact DNAnexus Sales for more information.

Database Access

As with all DNAnexus file objects, database access is controlled by project access. These access levels and database object states translate into specific SQL abilities for the database, tables, data and database object in the project.

The following tables list the supported actions on a database and on a database object, along with the lowest access level required for each action on an open and on a closed database.

| Spark SQL Function | Open Database | Closed Database |
| --- | --- | --- |
| ALTER DATABASE SET DBPROPERTIES | CONTRIBUTE | N/A |
| ALTER TABLE RENAME | CONTRIBUTE | N/A |
| ALTER TABLE DROP PARTITION | CONTRIBUTE (*) | N/A |
| ALTER TABLE RENAME PARTITION | CONTRIBUTE | N/A |
| ANALYZE TABLE COMPUTE STATISTICS | UPLOAD | N/A |
| CACHE TABLE, CLEAR CACHE | N/A | N/A |
| CREATE DATABASE | UPLOAD | UPLOAD |
| CREATE FUNCTION | N/A | N/A |
| CREATE TABLE | UPLOAD | N/A |
| CREATE VIEW | UPLOAD | UPLOAD |
| DESCRIBE DATABASE, TABLE, FUNCTION | VIEW | VIEW |
| DROP DATABASE | CONTRIBUTE (*) | ADMINISTER |
| DROP FUNCTION | N/A | N/A |
| DROP TABLE | CONTRIBUTE (*) | N/A |
| EXPLAIN | VIEW | VIEW |
| INSERT | UPLOAD | N/A |
| REFRESH TABLE | VIEW | VIEW |
| RESET | VIEW | VIEW |
| SELECT | VIEW | VIEW |
| SET | VIEW | VIEW |
| SHOW COLUMNS | VIEW | VIEW |
| SHOW DATABASES | VIEW | VIEW |
| SHOW FUNCTIONS | VIEW | VIEW |
| SHOW PARTITIONS | VIEW | VIEW |
| SHOW TABLES | VIEW | VIEW |
| TRUNCATE TABLE | UPLOAD | N/A |
| UNCACHE TABLE | N/A | N/A |

| Data Object Action | Open Database | Closed Database |
| --- | --- | --- |
| Add Tags | UPLOAD | CONTRIBUTE |
| Add Types | UPLOAD | N/A |
| Close | UPLOAD | N/A |
| Get Details | VIEW | VIEW |
| Remove | CONTRIBUTE (*) | ADMINISTER |
| Remove Tags | UPLOAD | CONTRIBUTE |
| Remove Types | UPLOAD | N/A |
| Rename | UPLOAD | CONTRIBUTE |
| Set Details | UPLOAD | N/A |
| Set Properties | UPLOAD | CONTRIBUTE |
| Set Visibility | UPLOAD | N/A |

(*) If a project is protected, then ADMINISTER access is required.
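
To make the access tables easier to reason about, here is a minimal Python sketch of the lookup they describe. It is illustrative only and does not call any DNAnexus API. The names LEVELS, MIN_ACCESS, required_level, and is_allowed are hypothetical, only a handful of rows are copied in, and "N/A" is interpreted here as "not available in that database state", which is an assumption about the tables rather than something they state.

```python
# Platform access levels, ordered from lowest to highest privilege.
LEVELS = ["VIEW", "UPLOAD", "CONTRIBUTE", "ADMINISTER"]

# A few rows copied from the tables above (hypothetical structure):
# statement -> (open-database level, closed-database level, has (*) footnote).
# None stands for "N/A", read here as "not available in that state".
MIN_ACCESS = {
    "SELECT": ("VIEW", "VIEW", False),
    "INSERT": ("UPLOAD", None, False),
    "CREATE TABLE": ("UPLOAD", None, False),
    "DROP TABLE": ("CONTRIBUTE", None, True),
    "DROP DATABASE": ("CONTRIBUTE", "ADMINISTER", True),
}


def required_level(statement, database_open=True, project_protected=False):
    """Return the lowest access level needed, or None if the entry is N/A."""
    open_level, closed_level, starred = MIN_ACCESS[statement]
    if database_open and starred and project_protected:
        # Footnote (*): in a protected project, ADMINISTER access is required.
        return "ADMINISTER"
    return open_level if database_open else closed_level


def is_allowed(user_level, statement, database_open=True, project_protected=False):
    """Check whether user_level meets the requirement for the statement."""
    needed = required_level(statement, database_open, project_protected)
    if needed is None:
        return False
    return LEVELS.index(user_level) >= LEVELS.index(needed)


if __name__ == "__main__":
    print(is_allowed("UPLOAD", "INSERT"))
    print(is_allowed("CONTRIBUTE", "DROP TABLE", project_protected=True))
```

Under these assumptions, a user with UPLOAD access can INSERT into an open database, while a user with CONTRIBUTE access cannot DROP TABLE in a protected project.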

Database Naming Conventions

When a user creates a database, the name the user provides is validated and converted to lowercase before it is stored as the databaseName attribute of the database object. In addition, a unique database name is generated by converting the database object ID to lowercase, replacing its hyphen with an underscore, and joining the result to the lowercased database name with two underscores. The unique database name is stored as the uniqueDatabaseName attribute of the database object.
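
The naming rule above can be sketched in Python as follows. This is an illustration rather than the Platform's implementation: the function name unique_database_name and the ID "database-ABC123" are hypothetical, the ordering (ID first, then two underscores, then the name) is an assumption drawn from the description, and the name validation step is omitted because its rules are not given here.

```python
def unique_database_name(database_id, database_name):
    """Build a unique name from a database object ID and a user-provided name.

    Assumes the ID comes first, followed by two underscores and the name.
    """
    # Lowercase the object ID and swap its hyphen for an underscore.
    normalized_id = database_id.lower().replace("-", "_")
    # Lowercase the user-provided name, mirroring the databaseName attribute.
    normalized_name = database_name.lower()
    return normalized_id + "__" + normalized_name


if __name__ == "__main__":
    # "database-ABC123" is a made-up ID used only for illustration.
    print(unique_database_name("database-ABC123", "My_DB"))
```

With the made-up inputs above, the sketch produces database_abc123__my_db.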

When a database is created using the following SQL statement and a user-provided database name (referred to below as db_name):

CREATE DATABASE IF NOT EXISTS {db_name} LOCATION 'dnax://

The Platform database object, database-xxxx, is created with its name in all lowercase characters. However, when looking up a database using dxpy, the Python module provided by the DNAnexus SDK (dx-toolkit), the lookup is case-sensitive. The following command returns a database ID based on the user-provided database name, assigned here to the variable db_name:

import dxpy
db_uri = dxpy.find_one_data_object(name=db_name, classname="database")['id']

With that in mind, either use only lowercase characters when assigning db_name or apply a normalizing method such as .lower() to the user-provided database name:

db_uri = dxpy.find_one_data_object(name=db_name.lower(), classname="database")['id']
