# Using Spark

{% hint style="info" %}
A license is required to access Spark functionality on the DNAnexus Platform. [Contact DNAnexus Sales](mailto:sales@dnanexus.com) for more information.
{% endhint %}

Apache Spark can help you tackle big data analytics combined with rich visualization. Sharing a database is straightforward: platform access levels map directly to SQL abilities, so you can fine-tune access control to your databases at either an individual or org level.

## Spark Applications

You can launch a Spark application distributed across a cluster of workers. Since this is all tightly integrated with the rest of the platform, Spark jobs leverage the features of normal jobs. You have the same ways to monitor a job's progress, SSH into a job instance to debug, and use the features of `dx-toolkit` and the platform web UI. You also have access to logs from workers and can monitor the job in the Spark UI.

![](https://1612471957-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-L_EsL_ie8XyZlLe_yf9%2Fuploads%2Fgit-blob-1da430b7bf9161235b31b75df235624574d16906%2Fspark-cluster-diagram.png?alt=media)

## Visualization

With Spark, you can visualize your results in real time. You can save the underlying queries as cohorts, share them with your team, or use them as inputs to Spark-based analysis apps. You can also create charts and shareable dashboards. The filter view lets you build cohorts quickly without writing complex SQL queries by hand.

## Databases

A database is a [data object](https://documentation.dnanexus.com/developer/api/introduction-to-data-object-classes) on the Platform. A [database](https://documentation.dnanexus.com/developer/api/introduction-to-data-object-classes/databases) object is stored in a project.

### Database Sharing

Databases can be shared with other users or organizations through project sharing. A project administrator can revoke access to a database at any time by revoking access to the project. If revoking access to the project is not possible, the database can be [relocated](https://documentation.dnanexus.com/developer/api/introduction-to-data-object-classes/databases#api-method-database-xxxx-relocate) to another project with a different set of collaborators.

### Database and Project Policies

Project policies restrict how the data can be modified or copied to other projects. Databases follow the Delete Policy and the Copy Policy. If a database is in a restricted project, the database can be accessed for reading only from the same project context. Databases also adhere to the project's [PHI Data Protection](https://documentation.dnanexus.com/getting-started/key-concepts/projects#phi-data-protection) policy. If a database is in a project for which Data Protection is enabled ("PHI project"), the database is subject to the following restrictions:

* The database cannot be accessed by Spark apps launched in projects for which PHI Data Protection is not enabled ("non-PHI projects").
* If a non-PHI project is provided as a project context, only databases from non-PHI projects are available for retrieving data.
* If a PHI project is provided as a project context, only databases from PHI projects are available to add new data.
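
The restrictions above can be read as a small decision rule. The following is a minimal sketch of one such reading, not the platform's implementation: the function name `database_allowed` and its parameters are hypothetical, and the case of reading a non-PHI database from a PHI project context is assumed to be allowed because the list does not restrict it.

```python
def database_allowed(operation: str, context_is_phi: bool, database_is_phi: bool) -> bool:
    """Return True if a Spark app in the given project context may use the database.

    `operation` is "read" (retrieve data) or "write" (add new data).
    This encodes one reading of the restrictions listed above; it is
    illustrative only and is not how the platform enforces the policy.
    """
    if operation not in ("read", "write"):
        raise ValueError(f"unknown operation: {operation!r}")

    if not context_is_phi:
        # Non-PHI project context: databases in PHI projects are not accessible.
        return not database_is_phi

    if operation == "write":
        # PHI project context: new data can only be added to databases in PHI projects.
        return database_is_phi

    # PHI project context, reading: assumed allowed for both kinds of database.
    return True


# A Spark app in a non-PHI project cannot read a database stored in a PHI project.
print(database_allowed("read", context_is_phi=False, database_is_phi=True))
```

Keeping the rule in a single predicate makes it straightforward to check a planned workflow against the policy before launching a job.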

{% hint style="info" %}
A license and a signed Business Associate Agreement are required to enable and use PHI Data Protection. [Contact DNAnexus Sales](mailto:sales@dnanexus.com) for more information.
{% endhint %}

### Database Access

As with all DNAnexus file objects, database access is controlled by project access. These access levels and database object [states](https://documentation.dnanexus.com/developer/api/data-object-lifecycle) translate into specific SQL abilities for the database, tables, data and database object in the project.

The following tables list the supported actions on a database and on the database object, along with the lowest access level required for each action on an open and on a closed database.

| Spark SQL Function                 | Open Database   | Closed Database |
| ---------------------------------- | --------------- | --------------- |
| ALTER DATABASE SET DBPROPERTIES    | CONTRIBUTE      | N/A             |
| ALTER TABLE RENAME                 | CONTRIBUTE      | N/A             |
| ALTER TABLE DROP PARTITION         | CONTRIBUTE (\*) | N/A             |
| ALTER TABLE RENAME PARTITION       | CONTRIBUTE      | N/A             |
| ANALYZE TABLE COMPUTE STATISTICS   | UPLOAD          | N/A             |
| CACHE TABLE, CLEAR CACHE           | N/A             | N/A             |
| CREATE DATABASE                    | UPLOAD          | UPLOAD          |
| CREATE FUNCTION                    | N/A             | N/A             |
| CREATE TABLE                       | UPLOAD          | N/A             |
| CREATE VIEW                        | UPLOAD          | UPLOAD          |
| DESCRIBE DATABASE, TABLE, FUNCTION | VIEW            | VIEW            |
| DROP DATABASE                      | CONTRIBUTE (\*) | ADMINISTER      |
| DROP FUNCTION                      | N/A             | N/A             |
| DROP TABLE                         | CONTRIBUTE (\*) | N/A             |
| EXPLAIN                            | VIEW            | VIEW            |
| INSERT                             | UPLOAD          | N/A             |
| REFRESH TABLE                      | VIEW            | VIEW            |
| RESET                              | VIEW            | VIEW            |
| SELECT                             | VIEW            | VIEW            |
| SET                                | VIEW            | VIEW            |
| SHOW COLUMNS                       | VIEW            | VIEW            |
| SHOW DATABASES                     | VIEW            | VIEW            |
| SHOW FUNCTIONS                     | VIEW            | VIEW            |
| SHOW PARTITIONS                    | VIEW            | VIEW            |
| SHOW TABLES                        | VIEW            | VIEW            |
| TRUNCATE TABLE                     | UPLOAD          | N/A             |
| UNCACHE TABLE                      | N/A             | N/A             |

| Data Object Action | Open Database   | Closed Database |
| ------------------ | --------------- | --------------- |
| Add Tags           | UPLOAD          | CONTRIBUTE      |
| Add Types          | UPLOAD          | N/A             |
| Close              | UPLOAD          | N/A             |
| Get Details        | VIEW            | VIEW            |
| Remove             | CONTRIBUTE (\*) | ADMINISTER      |
| Remove Tags        | UPLOAD          | CONTRIBUTE      |
| Remove Types       | UPLOAD          | N/A             |
| Rename             | UPLOAD          | CONTRIBUTE      |
| Set Details        | UPLOAD          | N/A             |
| Set Properties     | UPLOAD          | CONTRIBUTE      |
| Set Visibility     | UPLOAD          | N/A             |

(\*) If a project is protected, then ADMINISTER access is required.
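
To illustrate how the tables are read, here is a minimal sketch of a lookup over a few rows of the Spark SQL table. The names `ACCESS_ORDER`, `MIN_ACCESS`, `required_access`, and `is_permitted` are hypothetical helpers, not part of `dx-toolkit`. The sketch assumes that access levels are ordered VIEW < UPLOAD < CONTRIBUTE < ADMINISTER and treats N/A as "not available in that state".

```python
# Project access levels, from least to most permissive.
ACCESS_ORDER = ["VIEW", "UPLOAD", "CONTRIBUTE", "ADMINISTER"]

# A subset of the Spark SQL table above:
# statement -> (open-database level, closed-database level, marked with (*)).
# None stands for N/A.
MIN_ACCESS = {
    "SELECT": ("VIEW", "VIEW", False),
    "INSERT": ("UPLOAD", None, False),
    "CREATE TABLE": ("UPLOAD", None, False),
    "ALTER TABLE RENAME": ("CONTRIBUTE", None, False),
    "DROP TABLE": ("CONTRIBUTE", None, True),
    "DROP DATABASE": ("CONTRIBUTE", "ADMINISTER", True),
}


def required_access(statement, closed=False, protected=False):
    """Return the lowest access level needed, or None if the table says N/A."""
    open_level, closed_level, starred = MIN_ACCESS[statement]
    level = closed_level if closed else open_level
    if level is None:
        return None
    if starred and protected and not closed:
        # (*) In a protected project, ADMINISTER access is required.
        return "ADMINISTER"
    return level


def is_permitted(user_level, statement, closed=False, protected=False):
    """Return True if `user_level` meets the requirement for `statement`."""
    needed = required_access(statement, closed=closed, protected=protected)
    if needed is None:
        # Assumption: N/A means the statement is not available in this state.
        return False
    return ACCESS_ORDER.index(user_level) >= ACCESS_ORDER.index(needed)


print(required_access("DROP TABLE", protected=True))  # ADMINISTER
print(is_permitted("UPLOAD", "INSERT"))               # True
```

The remaining rows of both tables can be added to `MIN_ACCESS` in the same form.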

### Database Naming Conventions

The system handles database names in two ways:

* **User-provided name**: Your database name is converted to lowercase and stored as the `databaseName` attribute.
* **System-generated unique name**: A unique identifier is created by combining your lowercase database name with the database object ID (also converted to lowercase with hyphens changed to underscores) separated by two underscores. This is stored as the `uniqueDatabaseName` attribute.
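
As an illustration of the rule above, the following minimal sketch derives the unique name from a user-provided name and a database object ID. The function `unique_database_name` and the ID `database-ABC123` are hypothetical and used only for this example. The platform computes `uniqueDatabaseName` itself.

```python
def unique_database_name(database_name: str, database_id: str) -> str:
    """Combine a database name and object ID following the rule described above."""
    # The user-provided name is converted to lowercase (the `databaseName` attribute).
    normalized_name = database_name.lower()
    # The object ID is converted to lowercase, with hyphens changed to underscores.
    normalized_id = database_id.lower().replace("-", "_")
    # The two parts are separated by two underscores.
    return f"{normalized_name}__{normalized_id}"


# Hypothetical name and ID, for illustration only.
print(unique_database_name("My_DB", "database-ABC123"))  # my_db__database_abc123
```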

When a database is created using the following SQL statement and a user-provided database name (referred to below as `db_name`):

```sql
CREATE DATABASE IF NOT EXISTS {db_name} LOCATION 'dnax://'
```

The platform database object, `database-xxxx`, is created with an all-lowercase name. However, looking up a database by name using [`dxpy`](https://documentation.dnanexus.com/downloads), the Python module included in the DNAnexus SDK (`dx-toolkit`), is case-sensitive. The following command returns the database ID that matches the user-provided database name, assigned here to the variable `db_name`:

```python
import dxpy
db_uri = dxpy.find_one_data_object(name=db_name, classname="database")['id']
```

With that in mind, either use only lowercase characters when assigning `db_name`, or normalize the user-provided database name with a method such as `.lower()`:

```python
db_uri = dxpy.find_one_data_object(name=db_name.lower(), classname="database")['id']
```
