Spark

NOTE: Not all features are available in all packages. Please contact sales@dnanexus.com for more information.

Apache Spark can help you tackle big data analytics combined with rich visualization. Sharing a database is as easy as sharing a project: access levels on the platform map directly to SQL abilities, so you can fine-tune access control to your databases at either the individual or the org level.

There are two ways to connect to our Spark service: through our Thrift server or, for more scalable throughput, using Spark applications.

Thrift server

We host a high-availability Thrift server that you can connect to over JDBC with a client such as Beeline to run Spark SQL interactively. Refer to the Thrift Server page for more details.
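As a rough sketch of what such a connection can look like, the snippet below assembles a Beeline command line. The `jdbc:hive2://` URL scheme and the `-u`, `-n`, and `-e` options are standard for Beeline against a Spark Thrift server; the host, port, and user name are placeholders chosen for illustration, and the platform's real endpoint and authentication parameters (documented on the Thrift Server page) are not shown here.

```python
import shlex


def build_beeline_command(host, port, user, query):
    """Return the argument list for a non-interactive Beeline call.

    host, port and user are hypothetical placeholders, not the
    platform's actual connection details.
    """
    # Spark's Thrift server speaks the HiveServer2 protocol, hence "hive2".
    url = f"jdbc:hive2://{host}:{port}/"
    return ["beeline", "-u", url, "-n", user, "-e", query]


# Placeholder values for illustration only.
cmd = build_beeline_command("thrift.example.com", 10000, "alice", "SHOW DATABASES")
# Render the list as a shell-safe string that could be pasted into a terminal.
print(shlex.join(cmd))
```

Building the command as a list and quoting it with `shlex.join` keeps a query containing spaces or quotes intact when it is passed to the shell.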

Spark applications

You can launch a Spark application distributed across a cluster of workers. Since this is all tightly integrated with the rest of the platform, Spark jobs will leverage the features of normal jobs. You'll have the same ways to monitor a job's progress, SSH into a job instance to debug, and use the features of dx-toolkit and the platform web UI. You'll additionally have access to logs from workers and be able to monitor the job in the Spark UI.

Visualization

With Spark, you can visualize your results in real time. You can save your queries as cohorts, share them with your team, or use them as inputs to Spark-based analysis apps. You can also create charts and shareable dashboards. The filter view lets you build cohorts quickly without writing complex SQL queries by hand.

Databases

A database is a data object representing a Spark database in the platform. A database object is stored in a project.

Database sharing

Databases can be shared with other users or organizations through project sharing. The project administrator can revoke access to a database at any time by revoking access to the project. If revoking access to the project is not possible, the database can be relocated to another project with a different set of collaborators.

Database and project policies

Project policies restrict how data can be modified or copied to other projects. Databases follow the Delete Policy and the Copy Policy. If a database is in a restricted project, the database can be accessed for reading only from the same project context when connecting to Thrift.

Databases also adhere to the Protected Health Information (PHI) Data Protection policy. If a database is in a project marked as containing PHI (a "PHI project"), the database is subject to the following protective restrictions:

  • Spark apps launched in non-PHI projects will not be able to access any databases in PHI projects.

  • If a non-PHI project was provided as a project context when connecting to Thrift, only databases from non-PHI projects will be available for retrieving data.

  • If a PHI project was provided as a project context when connecting to Thrift, only databases from PHI projects will be available for adding new data.
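The three restrictions above can be read as simple boolean checks. The following is a minimal sketch of that reading, not the platform's implementation: the function names are made up for illustration, a True result only means the PHI rules do not forbid the action (project access levels still apply), and the handling of writes from a non-PHI context is an assumption, since the list does not state it.

```python
def spark_app_can_access(app_project_is_phi, db_project_is_phi):
    """Rule 1: apps launched in non-PHI projects cannot reach PHI databases."""
    return app_project_is_phi or not db_project_is_phi


def thrift_can_read(context_is_phi, db_project_is_phi):
    """Rule 2: a non-PHI project context can retrieve data only from
    databases in non-PHI projects."""
    return context_is_phi or not db_project_is_phi


def thrift_can_write(context_is_phi, db_project_is_phi):
    """Rule 3: a PHI project context can add data only to databases in
    PHI projects.

    ASSUMPTION: a non-PHI context is treated symmetrically and may add
    data only to non-PHI databases; the source does not say this.
    """
    return context_is_phi == db_project_is_phi


# A Spark app in a non-PHI project is blocked from a PHI database.
print(spark_app_can_access(app_project_is_phi=False, db_project_is_phi=True))
# A PHI Thrift context cannot add data to a non-PHI database.
print(thrift_can_write(context_is_phi=True, db_project_is_phi=False))
```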

Database access

As with all DNAnexus file objects, database access is controlled by project access. These access levels and database object states translate into specific SQL abilities for the database, tables, data and database object in the project.

The following tables list the supported actions on a Spark database and on the database object, together with the lowest access level required for each action on an open and on a closed database.

Spark SQL Function                  Open Database   Closed Database
----------------------------------  --------------  ---------------
ALTER DATABASE SET DBPROPERTIES     CONTRIBUTE      N/A
ALTER TABLE RENAME                  UPLOAD          N/A
ALTER TABLE DROP PARTITION          CONTRIBUTE (*)  N/A
ALTER TABLE RENAME PARTITION        CONTRIBUTE      N/A
ANALYZE TABLE COMPUTE STATISTICS    UPLOAD          N/A
CACHE TABLE, CLEAR CACHE            N/A             N/A
CREATE DATABASE                     UPLOAD          UPLOAD
CREATE FUNCTION                     N/A             N/A
CREATE TABLE                        UPLOAD          N/A
CREATE VIEW                         UPLOAD          UPLOAD
DESCRIBE DATABASE, TABLE, FUNCTION  VIEW            VIEW
DROP DATABASE                       CONTRIBUTE (*)  ADMINISTER
DROP FUNCTION                       N/A             N/A
DROP TABLE                          CONTRIBUTE (*)  N/A
EXPLAIN                             VIEW            VIEW
INSERT                              UPLOAD          N/A
REFRESH TABLE                       VIEW            VIEW
RESET                               VIEW            VIEW
SELECT                              VIEW            VIEW
SET                                 VIEW            VIEW
SHOW COLUMNS                        VIEW            VIEW
SHOW DATABASES                      VIEW            VIEW
SHOW FUNCTIONS                      VIEW            VIEW
SHOW PARTITIONS                     VIEW            VIEW
SHOW TABLES                         VIEW            VIEW
TRUNCATE TABLE                      UPLOAD          N/A
UNCACHE TABLE                       VIEW            VIEW

Data Object Action  Open Database   Closed Database
------------------  --------------  ---------------
Add Tags            UPLOAD          CONTRIBUTE
Add Types           UPLOAD          N/A
Close               UPLOAD          N/A
Get Details         VIEW            VIEW
Remove              CONTRIBUTE (*)  ADMINISTER
Remove Tags         UPLOAD          CONTRIBUTE
Remove Types        UPLOAD          N/A
Rename              UPLOAD          CONTRIBUTE
Set Details         UPLOAD          N/A
Set Properties      UPLOAD          CONTRIBUTE
Set Visibility      UPLOAD          N/A

(*) If a project is protected, then ADMINISTER access is required.
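To make the tables easier to apply, here is a small sketch that encodes a handful of rows and checks whether a given access level is sufficient. It is an illustration only: the dictionary copies just five rows, the helper names are hypothetical, and treating "N/A" as "the action is not available" is an interpretation rather than something the tables state. The ordering VIEW < UPLOAD < CONTRIBUTE < ADMINISTER follows the platform's project access levels.

```python
# Project access levels, lowest to highest.
ACCESS_LEVELS = ["VIEW", "UPLOAD", "CONTRIBUTE", "ADMINISTER"]

# A few rows copied from the Spark SQL table:
# statement -> (open-database level, closed-database level, marked with (*))
# None stands for "N/A" (interpreted here as "not available").
MIN_ACCESS = {
    "SELECT": ("VIEW", "VIEW", False),
    "INSERT": ("UPLOAD", None, False),
    "CREATE TABLE": ("UPLOAD", None, False),
    "DROP TABLE": ("CONTRIBUTE", None, True),
    "DROP DATABASE": ("CONTRIBUTE", "ADMINISTER", True),
}


def required_level(statement, closed=False, protected=False):
    """Return the lowest sufficient access level, or None for N/A."""
    open_level, closed_level, starred = MIN_ACCESS[statement]
    if closed:
        return closed_level
    # Footnote (*): in a protected project, ADMINISTER is required.
    if starred and protected:
        return "ADMINISTER"
    return open_level


def is_allowed(statement, user_level, closed=False, protected=False):
    """Check whether user_level meets the requirement for statement."""
    needed = required_level(statement, closed, protected)
    if needed is None:
        return False
    return ACCESS_LEVELS.index(user_level) >= ACCESS_LEVELS.index(needed)


print(is_allowed("SELECT", "VIEW"))                          # True
print(is_allowed("DROP TABLE", "CONTRIBUTE"))                # True
print(is_allowed("DROP TABLE", "CONTRIBUTE", protected=True))  # False
```

The same lookup could be extended with the data object actions from the second table, since they use the same three columns and the same footnote.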