Connect to Thrift

Learn about the DNAnexus Thrift server, a service that allows JDBC and ODBC clients to run Spark SQL queries.

A license is required to access Spark functionality on the DNAnexus Platform. Contact DNAnexus Sales for more information.

About the DNAnexus Thrift Server

The DNAnexus Thrift server connects to a high-availability Apache Spark cluster integrated with the Platform. It uses the same security, permissions, and sharing features built into DNAnexus.

Connecting to the Thrift Server

Prerequisites:

The JDBC URL:

 AWS US (East): jdbc:hive2://query.us-east-1.apollo.dnanexus.com:10000/;ssl=true
 AWS London (UKB): jdbc:hive2://query.eu-west-2.apollo.dnanexus.com:10000/;ssl=true
 Azure US (West): jdbc:hive2://query.westus.apollo.dnanexus.com:10001/;ssl=true;transportMode=http;httpPath=cliservice
 AWS Frankfurt (General): jdbc:hive2://query.eu-central-1.apollo.dnanexus.com:10000/;ssl=true
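
As a convenience, the URLs above can be kept in a small lookup function. This is only a sketch: the region labels (aws-us-east, aws-london, azure-us-west, aws-frankfurt) and the function name jdbc_url_for_region are made up for illustration, while the URLs themselves are copied from the list above.

```shell
# Map an illustrative region label to its JDBC URL.
# The labels and the function name are hypothetical; the URLs come from this page.
jdbc_url_for_region() {
  case "$1" in
    aws-us-east)   echo 'jdbc:hive2://query.us-east-1.apollo.dnanexus.com:10000/;ssl=true' ;;
    aws-london)    echo 'jdbc:hive2://query.eu-west-2.apollo.dnanexus.com:10000/;ssl=true' ;;
    azure-us-west) echo 'jdbc:hive2://query.westus.apollo.dnanexus.com:10001/;ssl=true;transportMode=http;httpPath=cliservice' ;;
    aws-frankfurt) echo 'jdbc:hive2://query.eu-central-1.apollo.dnanexus.com:10000/;ssl=true' ;;
    *) echo "unknown region: $1" >&2; return 1 ;;
  esac
}

# Print the URL for the AWS US (East) region.
jdbc_url_for_region aws-us-east
```

Single quotes keep the semicolons in each URL from being interpreted by the shell.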

The username in the format TOKEN__PROJECTID, where:
- TOKEN is a DNAnexus authentication token that you generate.
- PROJECTID is the ID of the DNAnexus project used as the project context (for example, when creating databases).
- The two parts are joined by a double underscore (__).

The Thrift server and the project must be in the same region.
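
The username format can be illustrated with a short shell helper. This is a minimal sketch: build_username is a hypothetical function name, and the check for a "project-" prefix is an assumption based on the project-xxxx placeholder used on this page.

```shell
# Join a token and a project ID with a double underscore.
# build_username is a hypothetical helper; the "project-" prefix check is an
# assumption drawn from the project-xxxx placeholder in this page.
build_username() {
  token=$1
  project_id=$2
  case "$project_id" in
    project-*) ;;
    *) echo "expected a project ID such as project-xxxx, got: $project_id" >&2
       return 1 ;;
  esac
  printf '%s__%s\n' "$token" "$project_id"
}

# yourToken and project-xxxx are placeholders, not real credentials.
build_username yourToken project-xxxx
```

The result is the value to pass as the JDBC username; the password is left empty.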

Generate a DNAnexus Platform Authentication Token

See the Authentication tokens page.

Getting the Project ID

Navigate to https://platform.dnanexus.com and log in using your username and password.
In Projects > your project > Settings, find Project ID and click Copy to Clipboard.

Using Beeline

Beeline is a JDBC client bundled with Apache Spark that can be used to run interactive queries on the command line.

Installing Apache Spark

Download Apache Spark 3.5.2 for Hadoop 3.x (spark-3.5.2-bin-hadoop3.tgz), then extract the archive:

tar -zxvf spark-3.5.2-bin-hadoop3.tgz

Java must be available on your system PATH, or the JAVA_HOME environment variable must point to a Java installation.
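
Before starting Beeline, it can help to confirm that one of these two conditions holds. The following is a rough sketch: find_java is a hypothetical helper name, and checking JAVA_HOME before PATH is an assumption about lookup order, not something this page specifies.

```shell
# Print the path of the Java executable that would satisfy the requirement above.
# find_java is a hypothetical helper; checking JAVA_HOME first is an assumption.
find_java() {
  if [ -n "${JAVA_HOME:-}" ] && [ -x "$JAVA_HOME/bin/java" ]; then
    echo "$JAVA_HOME/bin/java"
  elif command -v java >/dev/null 2>&1; then
    command -v java
  else
    echo "Java not found: install it or set JAVA_HOME" >&2
    return 1
  fi
}

find_java || echo "Beeline will not start until Java is available"
```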

Single Command Connection

If you already have Beeline installed and have your credentials ready, you can connect with a single command:

<beeline> -u <thrift path> -n <token>__<project-id>

In the following AWS example, each semicolon in the URL is escaped with a backslash (\;) so that the shell does not treat it as a command separator:

$SPARK_HOME/bin/beeline -u jdbc:hive2://query.us-east-1.apollo.dnanexus.com:10000/\;ssl=true -n yourToken__project-xxxx

The command for connecting to Thrift on Azure uses port 10001 and adds the transportMode and httpPath parameters:

$SPARK_HOME/bin/beeline -u jdbc:hive2://query.westus.apollo.dnanexus.com:10001/\;ssl=true\;transportMode=http\;httpPath=cliservice -n yourToken__project-xxxx
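
Escaping each semicolon is one option; wrapping the whole URL in single quotes has the same effect in POSIX-style shells. The sketch below compares the two forms and prints the quoted variant of the command instead of running it. yourToken and project-xxxx are placeholders.

```shell
# Two equivalent ways to keep the shell from splitting the URL at ';'.
escaped=jdbc:hive2://query.us-east-1.apollo.dnanexus.com:10000/\;ssl=true
quoted='jdbc:hive2://query.us-east-1.apollo.dnanexus.com:10000/;ssl=true'

if [ "$escaped" = "$quoted" ]; then
  echo "both forms pass the same URL to beeline"
fi

# The single-command connection using the quoted form
# (printed rather than executed; the credentials are placeholders).
printf '%s\n' "\$SPARK_HOME/bin/beeline -u '$quoted' -n yourToken__project-xxxx"
```

Quoting tends to be easier to read for the longer Azure URL, which contains three semicolons.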

Running a Guided Beeline Connection

The beeline client is located under $SPARK_HOME/bin/.

cd spark-3.5.2-bin-hadoop3/bin
./beeline

At the Beeline prompt, connect to the Thrift server using the JDBC URL:

beeline> !connect jdbc:hive2://query.us-east-1.apollo.dnanexus.com:10000/;ssl=true

Enter username: <TOKEN__PROJECTID>
Enter password: <empty - press RETURN>

Once successfully connected, you should see the message:

Connected to: Spark SQL (version 3.5.2)
Driver: Hive JDBC (version 2.3.9)
Transaction isolation: TRANSACTION_REPEATABLE_READ

Querying in Beeline

After connecting to the Thrift server using your credentials, you can view all databases you have access to within your current region.

0: jdbc:hive2://query.us-east-1.apollo.dnanex> show databases;
+---------------------------------------------------------+--+
|                      databaseName                       |
+---------------------------------------------------------+--+
| database_fj7q18009xxzzzx0gjfk6vfz__genomics_180718_01   |
| database_fj8gygj0v10vj50j0gyfqk1x__af_result_180719_01  |
| database_fj96qx00v10vj50j0gyfv00z__af_result2           |
| database_fjf3y28066y5jxj2b0gz4g85__metabric_data        |
| database_fjj1jkj0v10p8pvx78vkkpz3__pchr1_test           |
| database_fjpz6fj0v10fjy3fjy282ybz__af_result1           |
+---------------------------------------------------------+--+

You can query using the unique database name, which includes the lowercase database ID (for example, database_fjf3y28066y5jxj2b0gz4g85__metabric_data). If the database is associated with the same username and project that you used to connect to the Thrift server, the short database name alone is enough (for example, metabric_data). For databases outside the project, use the unique database name.

0: jdbc:hive2://query.us-east-1.apollo.dnanex> use metabric_data;
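
The naming pattern can be sketched in shell. This assumes that a database ID has the form database-<alphanumeric string> and that the unique name is the lowercased ID with the hyphen replaced by an underscore, followed by a double underscore and the short name. That rule is inferred from the examples on this page; unique_db_name is a hypothetical helper.

```shell
# Compose a unique database name from a database ID and a short name.
# Assumption: lowercase the ID, replace '-' with '_', then append '__' and the name.
unique_db_name() {
  db_id=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]' | tr '-' '_')
  printf '%s__%s\n' "$db_id" "$2"
}

# The mixed-case ID is made up so that it lowercases to the example above.
unique_db_name database-Fjf3y28066Y5jXj2b0gz4g85 metabric_data
```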

Databases stored in other projects can be found by specifying the project context in the LIKE option of SHOW DATABASES, using the format '<project-id>:<database pattern>' as shown below:

0: jdbc:hive2://query.us-east-1.apollo.dnanex> SHOW DATABASES LIKE 'project-xxx:af*';
+---------------------------------------------------------+--+
|                      databaseName                       |
+---------------------------------------------------------+--+
| database_fj8gygj0v10vj50j0gyfqk1x__af_result_180719_01  |
| database_fj96qx00v10vj50j0gyfv00z__af_result2           |
| database_fjpz6fj0v10fjy3fjy282ybz__af_result1           |
+---------------------------------------------------------+--+
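
When scripting, the statement can be assembled from a project ID and a pattern. A small sketch follows; show_databases_like is a hypothetical helper, and the '<project-id>:<database pattern>' format is the one described above.

```shell
# Build a SHOW DATABASES statement scoped to another project.
# show_databases_like is a hypothetical helper name.
show_databases_like() {
  printf "SHOW DATABASES LIKE '%s:%s';\n" "$1" "$2"
}

# project-xxx is a placeholder project ID.
show_databases_like project-xxx 'af*'
```

The printed statement can be pasted at the Beeline prompt. Passing it to Beeline's -e option should also run it non-interactively, though that usage is not covered on this page.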

After selecting a database, you can run SQL queries against its tables.

0: jdbc:hive2://query.us-east-1.apollo.dnanex> select * from cna limit 10;
+--------------+-----------------+------------+--------+--+
| hugo_symbol  | entrez_gene_id  | sample_id  | value  |
+--------------+-----------------+------------+--------+--+
| MIR3675      | NULL            | MB-6179    | -1     |
| MIR3675      | NULL            | MB-6181    | 0      |
| MIR3675      | NULL            | MB-6182    | 0      |
| MIR3675      | NULL            | MB-6183    | 0      |
| MIR3675      | NULL            | MB-6184    | 0      |
| MIR3675      | NULL            | MB-6185    | -1     |
| MIR3675      | NULL            | MB-6187    | 0      |
| MIR3675      | NULL            | MB-6188    | 0      |
| MIR3675      | NULL            | MB-6189    | 0      |
| MIR3675      | NULL            | MB-6190    | 0      |
+--------------+-----------------+------------+--------+--+

