Using Spark
Connect with Spark for database sharing, big data analytics, and rich visualizations.
A license is required to access Spark functionality on the DNAnexus Platform. Contact DNAnexus Sales for more information.
Apache Spark can help you tackle big data analytics combined with rich visualization. Sharing a database is as easy as sharing a project: our access levels on the platform map directly to SQL abilities, so you can fine-tune access control to your databases at either an individual or org level.
There are two ways to connect to our Spark service: through our Thrift server or, for more scalable throughput, using Spark applications.
DNAnexus hosts a high-availability Thrift server to which you can connect over JDBC with a client such as Beeline to run Spark SQL interactively. Refer to the Thrift Server page for more details.
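As a rough illustration, a Beeline connection over JDBC looks like the sketch below. The host, port, and credentials shown are placeholders, not the actual DNAnexus endpoint; see the Thrift Server page for the real connection details.

```bash
# Placeholder host, username, and token; substitute the values from the Thrift Server page.
beeline -u "jdbc:hive2://<thrift-server-host>:10000/;ssl=true" -n <username> -p <api-token>
```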
You can launch a Spark application distributed across a cluster of workers. Since this is all tightly integrated with the rest of the Platform, Spark jobs leverage the features of normal jobs: you have the same ways to monitor a job's progress, SSH into a job instance to debug, and use the features of dx-toolkit and the Platform web UI. You additionally have access to logs from workers and can monitor the job in the Spark UI.
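For example, the usual dx-toolkit commands for working with jobs apply to Spark jobs as well; the job ID below is a placeholder.

```bash
dx watch job-xxxx   # stream the job's log as it runs
dx ssh job-xxxx     # open an SSH session on the job's instance for debugging
```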
With Spark, you can visualize your results in real time. You can save queries as cohorts, share them with your team, or use them as inputs to Spark-based analysis apps. You can also create charts and shareable dashboards. The filter view lets you build cohorts quickly without writing complex SQL queries by hand.
A database is a data object on the Platform. A database object is stored in a project.
Databases can be shared with other users or organizations through project sharing. The project administrator can revoke access to a database at any time by revoking access to the project. If revoking access to the project is not possible, the database can be relocated to another project with a different set of collaborators.
Project policies restrict how the data can be modified or copied to other projects. Databases follow the Delete Policy and the Copy Policy. If a database is in a restricted project, it can be accessed only for reading, and only from that project's context when connecting to Thrift. Databases also adhere to the project's PHI Data Protection policy. If a database is in a project for which Data Protection is enabled (a "PHI project"), the database is subject to the following restrictions:
The database cannot be accessed by Spark apps launched in projects for which PHI Data Protection is not enabled ("non-PHI projects").
If a non-PHI project is provided as a project context when connecting to Thrift, only databases from non-PHI projects will be available for retrieving data.
If a PHI project is provided as a project context when connecting to Thrift, only databases from PHI projects will be available to add new data.
A license and a signed Business Associate Agreement are required to enable and use PHI Data Protection. Contact DNAnexus Sales for more information.
As with all DNAnexus data objects, database access is controlled by project access. Project access levels, combined with the database object's state (open or closed), translate into specific SQL abilities on the database, its tables, and its data, as well as specific actions on the database object in the project.
The following tables list the supported actions on a database and on the database object, together with the lowest access level required for an open and a closed database.
| Spark SQL Function | Open Database | Closed Database |
| --- | --- | --- |
| ALTER DATABASE SET DBPROPERTIES | CONTRIBUTE | N/A |
| ALTER TABLE RENAME | CONTRIBUTE | N/A |
| ALTER TABLE DROP PARTITION | CONTRIBUTE (*) | N/A |
| ALTER TABLE RENAME PARTITION | CONTRIBUTE | N/A |
| ANALYZE TABLE COMPUTE STATISTICS | UPLOAD | N/A |
| CACHE TABLE, CLEAR CACHE | N/A | N/A |
| CREATE DATABASE | UPLOAD | UPLOAD |
| CREATE FUNCTION | N/A | N/A |
| CREATE TABLE | UPLOAD | N/A |
| CREATE VIEW | UPLOAD | UPLOAD |
| DESCRIBE DATABASE, TABLE, FUNCTION | VIEW | VIEW |
| DROP DATABASE | CONTRIBUTE (*) | ADMINISTER |
| DROP FUNCTION | N/A | N/A |
| DROP TABLE | CONTRIBUTE (*) | N/A |
| EXPLAIN | VIEW | VIEW |
| INSERT | UPLOAD | N/A |
| REFRESH TABLE | VIEW | VIEW |
| RESET | VIEW | VIEW |
| SELECT | VIEW | VIEW |
| SET | VIEW | VIEW |
| SHOW COLUMNS | VIEW | VIEW |
| SHOW DATABASES | VIEW | VIEW |
| SHOW FUNCTIONS | VIEW | VIEW |
| SHOW PARTITIONS | VIEW | VIEW |
| SHOW TABLES | VIEW | VIEW |
| TRUNCATE TABLE | UPLOAD | N/A |
| UNCACHE TABLE | N/A | N/A |
| Data Object Action | Open Database | Closed Database |
| --- | --- | --- |
| Add Tags | UPLOAD | CONTRIBUTE |
| Add Types | UPLOAD | N/A |
| Close | UPLOAD | N/A |
| Get Details | VIEW | VIEW |
| Remove | CONTRIBUTE (*) | ADMINISTER |
| Remove Tags | UPLOAD | CONTRIBUTE |
| Remove Types | UPLOAD | N/A |
| Rename | UPLOAD | CONTRIBUTE |
| Set Details | UPLOAD | N/A |
| Set Properties | UPLOAD | CONTRIBUTE |
| Set Visibility | UPLOAD | N/A |
(*) If a project is protected, then ADMINISTER access is required.
When a user creates a database, the name the user provides is validated and downcased before it is stored as the databaseName attribute of the database object. In addition, a unique database name is generated by downcasing the database object ID, replacing its hyphen with an underscore, and concatenating it to the downcased database name with two underscores. The result is stored as the uniqueDatabaseName attribute of the database object.
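As an illustration of that derivation, the sketch below mimics the transformation in plain Python; the object ID and name are hypothetical values, and this is not a DNAnexus API call.

```python
# Hypothetical values for illustration only.
database_id = "database-GZk8Xy00bVQ2Qf2Jq8vK9QGp"  # platform object ID
database_name = "clinical_samples"                 # stored databaseName (already downcased)

# Downcase the object ID, replace its hyphen with an underscore,
# and join it to the database name with two underscores.
unique_database_name = database_id.lower().replace("-", "_") + "__" + database_name

print(unique_database_name)
# -> database_gzk8xy00bvq2qf2jq8vk9qgp__clinical_samples
```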
When a database is created using the following SQL statement with a user-generated database name (referenced below as db_name):
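(The statement below is a representative sketch; the dnax:// LOCATION URI is an assumption for DNAnexus-backed storage and may differ in your environment.)

```sql
-- Representative sketch of the CREATE DATABASE statement;
-- the LOCATION URI is an assumption, not a confirmed value.
CREATE DATABASE db_name LOCATION 'dnax://';
```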
The platform database object, database-xxxx, is created with all lowercase characters. However, when retrieving the database using dxpy, the Python module provided by the DNAnexus SDK (dx-toolkit), name matching is case-sensitive: the following command returns a database ID based on the user-generated database name, assigned here to the variable db_name:
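A minimal sketch of such a lookup, assuming the database object lives in the current project context; the name, project handling, and error handling here are illustrative rather than the only way to find the object.

```python
import dxpy

# db_name is the user-generated name; the lookup below matches it case-sensitively
# against the stored (downcased) databaseName.
db_name = "Clinical_Samples"

result = dxpy.find_one_data_object(
    classname="database",
    name=db_name,
    project=dxpy.PROJECT_CONTEXT_ID,  # assumes a project context is set
    zero_ok=True,
)

# With a mixed-case db_name this returns None, because the stored name is lowercase.
print(result["id"] if result else "no matching database found")
```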
With that in mind, it is suggested to either use lowercase characters in your db_name assignment or to apply a forcing function such as .lower() to the user-generated database name:
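A minimal sketch of the second approach, reusing the hypothetical lookup from above:

```python
import dxpy

user_db_name = "Clinical_Samples"   # name as entered by the user
db_name = user_db_name.lower()      # normalize to match the stored databaseName

result = dxpy.find_one_data_object(
    classname="database",
    name=db_name,
    project=dxpy.PROJECT_CONTEXT_ID,
    zero_ok=True,
)
print(result["id"] if result else "no matching database found")
```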