Using Spark
Connect with Spark for database sharing, big data analytics, and rich visualizations.
NOTE: Not all features are available in all packages. Please contact [email protected] for more information.
Apache Spark can help you tackle big data analytics combined with rich visualization. Sharing a database is as easy as sharing a project: our access levels on the platform map directly to SQL abilities, so you can fine-tune access control to your databases at either an individual or org level.
There are two ways to connect to our Spark service: through our Thrift server or, for more scalable throughput, using Spark applications.

Thrift server

We host a high-availability Thrift server that you can connect to over JDBC, using a client such as beeline, to run Spark SQL interactively. Refer to the Thrift Server page for more details.

Spark applications

You can launch a Spark application distributed across a cluster of workers. Since this is all tightly integrated with the rest of the platform, Spark jobs will leverage the features of normal jobs. You'll have the same ways to monitor a job's progress, SSH into a job instance to debug, and use the features of dx-toolkit and the platform web UI. You'll additionally have access to logs from workers and be able to monitor the job in the Spark UI.

Visualization

With Spark, you can visualize your results in real time. You can save queries as cohorts, share them with your team, or use them as inputs to Spark-based analysis apps. You can also create charts and shareable dashboards. The filter view lets you build cohorts quickly, without writing complex SQL queries by hand.

Databases

A database is a data object representing a Spark database in the platform. A database object is stored in a project.

Database sharing

Databases can be shared with other users or organizations through project sharing. A project administrator can revoke access to a database at any time by revoking access to the project. If revoking access to the project is not possible, the database can be relocated to another project with a different set of collaborators.

Database and project policies

Project policies restrict how data can be modified or copied to other projects. Databases follow the Delete Policy and the Copy Policy. If a database is in a restricted project, it can be read only from that same project context when connecting to Thrift. Databases also adhere to the Protected Health Information (PHI) Data Protection policy. If a database is in a project for which Data Protection is enabled ("PHI project"), the database is subject to the following restrictions:
    The database cannot be accessed by Spark apps launched in projects for which PHI Data Protection is not enabled ("non-PHI projects").
    If a non-PHI project is provided as a project context when connecting to Thrift, only databases from non-PHI projects will be available for retrieving data.
    If a PHI project is provided as a project context when connecting to Thrift, only databases from PHI projects will be available to add new data.
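
The three restrictions above can be written down as small predicates. The following is a minimal sketch in Python, not platform code: the function names are hypothetical, and any case the text does not address returns None instead of guessing.

```python
from typing import Optional


def spark_app_can_access(db_in_phi_project: bool, app_in_phi_project: bool) -> bool:
    """Restriction 1: a database in a PHI project cannot be accessed by
    Spark apps launched in non-PHI projects."""
    return not (db_in_phi_project and not app_in_phi_project)


def thrift_can_read(context_is_phi: bool, db_in_phi_project: bool) -> Optional[bool]:
    """Restriction 2: with a non-PHI project context, only databases from
    non-PHI projects are available for retrieving data.
    The text does not cover reads under a PHI context, so return None there."""
    if not context_is_phi:
        return not db_in_phi_project
    return None


def thrift_can_add_data(context_is_phi: bool, db_in_phi_project: bool) -> Optional[bool]:
    """Restriction 3: with a PHI project context, only databases from
    PHI projects are available to add new data.
    The text does not cover writes under a non-PHI context, so return None there."""
    if context_is_phi:
        return db_in_phi_project
    return None
```

These checks cover only the PHI Data Protection policy. Project access levels and the Delete and Copy policies still apply on top of them.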

Database access

As with all DNAnexus file objects, database access is controlled by project access. Access levels and database object states translate into specific SQL abilities for the database, its tables and data, and the database object in the project.
The following tables list the supported actions on a Spark database and on the database object, together with the lowest access level required when the database is open and when it is closed.
| Spark SQL Function                 | Open Database  | Closed Database |
|------------------------------------|----------------|-----------------|
| ALTER DATABASE SET DBPROPERTIES    | CONTRIBUTE     | N/A             |
| ALTER TABLE RENAME                 | CONTRIBUTE     | N/A             |
| ALTER TABLE DROP PARTITION         | CONTRIBUTE (*) | N/A             |
| ALTER TABLE RENAME PARTITION       | CONTRIBUTE     | N/A             |
| ANALYZE TABLE COMPUTE STATISTICS   | UPLOAD         | N/A             |
| CACHE TABLE, CLEAR CACHE           | N/A            | N/A             |
| CREATE DATABASE                    | UPLOAD         | UPLOAD          |
| CREATE FUNCTION                    | N/A            | N/A             |
| CREATE TABLE                       | UPLOAD         | N/A             |
| CREATE VIEW                        | UPLOAD         | UPLOAD          |
| DESCRIBE DATABASE, TABLE, FUNCTION | VIEW           | VIEW            |
| DROP DATABASE                      | CONTRIBUTE (*) | ADMINISTER      |
| DROP FUNCTION                      | N/A            | N/A             |
| DROP TABLE                         | CONTRIBUTE (*) | N/A             |
| EXPLAIN                            | VIEW           | VIEW            |
| INSERT                             | UPLOAD         | N/A             |
| REFRESH TABLE                      | VIEW           | VIEW            |
| RESET                              | VIEW           | VIEW            |
| SELECT                             | VIEW           | VIEW            |
| SET                                | VIEW           | VIEW            |
| SHOW COLUMNS                       | VIEW           | VIEW            |
| SHOW DATABASES                     | VIEW           | VIEW            |
| SHOW FUNCTIONS                     | VIEW           | VIEW            |
| SHOW PARTITIONS                    | VIEW           | VIEW            |
| SHOW TABLES                        | VIEW           | VIEW            |
| TRUNCATE TABLE                     | UPLOAD         | N/A             |
| UNCACHE TABLE                      | VIEW           | VIEW            |

| Data Object Action | Open Database  | Closed Database |
|--------------------|----------------|-----------------|
| Add Tags           | UPLOAD         | CONTRIBUTE      |
| Add Types          | UPLOAD         | N/A             |
| Close              | UPLOAD         | N/A             |
| Get Details        | VIEW           | VIEW            |
| Remove             | CONTRIBUTE (*) | ADMINISTER      |
| Remove Tags        | UPLOAD         | CONTRIBUTE      |
| Remove Types       | UPLOAD         | N/A             |
| Rename             | UPLOAD         | CONTRIBUTE      |
| Set Details        | UPLOAD         | N/A             |
| Set Properties     | UPLOAD         | CONTRIBUTE      |
| Set Visibility     | UPLOAD         | N/A             |

(*) If a project is protected, then ADMINISTER access is required.
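
To make the tables easier to apply, the sketch below encodes a few of the rows as a lookup and checks a user's access level against it. It is illustrative only. The function and variable names are hypothetical, N/A is read as "action not available in that state", and the access levels are assumed to be ordered VIEW < UPLOAD < CONTRIBUTE < ADMINISTER.

```python
from typing import Optional

# Access levels in increasing order of privilege (assumed ordering).
LEVELS = ["VIEW", "UPLOAD", "CONTRIBUTE", "ADMINISTER"]

# A subset of rows from the Spark SQL table:
# action -> (level for open database, level for closed database, starred open entry?)
# None stands for N/A, read here as "not available in that state".
MIN_ACCESS = {
    "SELECT": ("VIEW", "VIEW", False),
    "INSERT": ("UPLOAD", None, False),
    "CREATE TABLE": ("UPLOAD", None, False),
    "ALTER TABLE RENAME": ("CONTRIBUTE", None, False),
    "DROP TABLE": ("CONTRIBUTE", None, True),
    "DROP DATABASE": ("CONTRIBUTE", "ADMINISTER", True),
}


def required_level(action: str, db_open: bool,
                   project_protected: bool = False) -> Optional[str]:
    """Return the lowest access level needed, or None if the action is N/A."""
    open_level, closed_level, starred = MIN_ACCESS[action]
    level = open_level if db_open else closed_level
    if level is None:
        return None
    # Footnote (*): in a protected project, starred entries need ADMINISTER.
    if db_open and starred and project_protected:
        return "ADMINISTER"
    return level


def is_permitted(user_level: str, action: str, db_open: bool,
                 project_protected: bool = False) -> bool:
    """Check whether user_level meets the requirement for the action."""
    needed = required_level(action, db_open, project_protected)
    if needed is None:
        return False
    return LEVELS.index(user_level) >= LEVELS.index(needed)
```

For example, `is_permitted("CONTRIBUTE", "DROP TABLE", db_open=True)` passes in an unprotected project, while the same call with `project_protected=True` fails because the footnote raises the requirement to ADMINISTER.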