SQL Runner
A license is required to access Spark functionality on the DNAnexus Platform. Contact DNAnexus Sales for more information.
Overview
The Spark SQL Runner application brings up a Spark cluster and executes a list of SQL queries you provide. This is especially useful if you need to run a sequence of queries repeatedly or execute a complex set of queries. You can vary the size of your cluster to speed up your tasks.
How to Run Spark SQL Runner
Input:
sqlfile
: [Required] A SQL file which contains an ordered list of SQL queries.
substitutions
: A JSON file which contains the variable substitutions.
user_config
: User configuration JSON file, in case you want to set or override certain Spark configurations.
Other Options:
export
: (boolean) default false. Exports output files with results for the queries in the sqlfile.
export_options
: A JSON file which contains the export configurations.
collect_logs
: (boolean) default false. Collects cluster logs from all nodes.
executor_memory
: (string) Amount of memory to use per executor process, in MiB unless otherwise specified (e.g. 2g, 8g). This is passed as --executor-memory to Spark submit.
executor_cores
: (integer) Number of cores to use per executor process. This is passed as --executor-cores to Spark submit.
driver_memory
: (string) Amount of memory to use for the driver process (e.g. 2g, 8g). This is passed as --driver-memory to Spark submit.
log_level
: (string) default INFO. Logging level for both driver and executors. One of [ALL, TRACE, DEBUG, INFO].
Output:
output_files
: Output files include the report SQL file and the query export files.
Basic Run
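A minimal sketch of launching the app from the command line is shown below; the executable name and input file names are assumptions, and the actual app name on your Platform may differ.

```sh
# Sketch of a basic run; app name and file names are illustrative.
dx run spark-sql-runner \
  -isqlfile=queries.sql \
  -iexport=true
```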
Examples
sqlfile
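For instance, a sqlfile might look like the following; the database and table names are illustrative:

```sql
-- queries.sql: an ordered list of SQL commands, each ending with a semicolon
CREATE DATABASE IF NOT EXISTS demodb;
USE demodb;
SELECT count(*) FROM patients;
SELECT * FROM patients LIMIT 10;
```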
How sqlfile is Processed
The SQL runner extracts each command in sqlfile and runs them in sequential order. Every SQL command must be separated by a semicolon (;).
Any command starting with -- is ignored (comments). Any comment within a command should be placed inside /*...*/.
The following are examples of valid comments:
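For example (query content illustrative):

```sql
-- A full-line comment: this line is ignored by the runner
SELECT patient_id /* an inline comment inside a command */
FROM demodb.patients;
```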
Variable Substitution
Variable substitution can be done by specifying the variables to replace in substitutions.
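For example, a substitutions file mapping variable names to values might look like this (the exact schema is a sketch inferred from the behavior described below):

```json
{
  "srcdb": "sskrdemo1",
  "patient_table": "patients"
}
```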
In the above example, each reference to srcdb within ${...} in sqlfile will be substituted with sskrdemo1, for example select * from ${srcdb}.${patient_table};. The script adds the set command before executing any of the SQL commands in sqlfile, so select * from ${srcdb}.${patient_table}; would translate to:
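A sketch, assuming Spark SQL's SET-based variable substitution and the values from the substitutions example above:

```sql
set srcdb=sskrdemo1;
set patient_table=patients;
select * from ${srcdb}.${patient_table};
```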
Export
If enabled, the results of the SQL commands will be exported to a CSV file. export_options
defines an export configuration.
num_files
: default 1. This is specified to define the maximum output files you want to generate. This generally depends on how many executors you are running in the cluster as well as how many partitions of this file exist in the system. Each output file corresponds to a part file in parquet.fileprefix
: The filename prefix for every SQL output file. By default the output files will be prefixed with query_id which is the order in which the queries are listed in sqlfile (starting with 1). For example1-out.csv
. If we specify prefix, it will generate output files like<prefix>-1-out.csv
.header
: Default is true. If true, it add header to each exported file.
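An export_options file using these settings might look like this (a sketch; the schema is assumed from the options above):

```json
{
  "num_files": 1,
  "fileprefix": "demo",
  "header": true
}
```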
User Configuration
Values supplied under spark-defaults.conf will override or add to the default Spark configuration.
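A user_config file might look like the following; the JSON schema and the property values shown are assumptions for illustration:

```json
{
  "spark-defaults.conf": {
    "spark.sql.shuffle.partitions": "200",
    "spark.driver.maxResultSize": "2g"
  }
}
```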
Output Files
There are two files generated in the export folder:
<JobId>-export.tar
: Contains all the query results.
<JobId>-outfile.sql
: The SQL debug file.
Export Files
Extracting the export tar file will look like:
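A sketch of the extracted layout, assuming demo as the fileprefix (file and folder names are illustrative):

```
export/
├── demo-1/
│   ├── demo-1.sql
│   └── demo-1-out.csv/
│       └── part-00000.csv
└── demo-2/
    ├── demo-2.sql
    └── demo-2-out.csv/
        └── part-00000.csv
```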
In the above example, demo is the fileprefix used. There is one folder for each query, and each folder contains a .sql file with the executed query and a .csv folder containing the result CSV.
SQL Report File
Every SQL Runner execution generates a SQL runner debug report file. This is a .sql file.
It lists every query executed and the status of its execution (Success or Fail), along with the name of the output file for that command and the time taken. If there are any failures, it reports the failing query and stops executing subsequent commands.
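A sketch of the kind of information the report records; the exact layout shown here is illustrative:

```sql
-- Query 1: Success (demo-1-out.csv, 12s)
select * from sskrdemo1.patients;

-- Query 2: Success (demo-2-out.csv, 8s)
select count(*) from sskrdemo1.patients;
```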
SQL Errors
While executing the series of SQL commands, one of the commands could fail (runtime error, syntax error, etc.). In that case the app quits and uploads a SQL debug file to the project:
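An illustrative sketch of such a debug file; the error text depends on Spark's actual response:

```sql
-- Query 1: Success (demo-1-out.csv)
select * from sskrdemo1.patients;

-- Query 2: Fail
-- Error: Table or view not found: sskrdemo1.patinets
select * from sskrdemo1.patinets;
```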
As you can see, it identifies the line with the SQL error and the error response.
You can then fix the query in the .sql file and even use this report file as the input for a subsequent run, picking up where it left off.