`dxapp.json` with the following syntax:

| Key | Type | Description |
| --- | --- | --- |
| `type` | string | The cluster type: `"dxspark"` for an Apollo Spark cluster or `"apachespark"` for generic Spark. |
| `version` | string | Requested version for `dxspark` or `apachespark` clusters. Supported values are `2.4.4` and `3.2.0`. |
| `initialInstanceCount` | integer | The number of nodes in the cluster, including the driver node. It should be at least 1. |
| `ports` | string (optional) | Comma-separated list of ports (or port ranges) to be opened between the nodes of the cluster. |
| `bootstrapScript` | string (optional) | Path to the bootstrap script. The bootstrap script runs on all nodes of the cluster before the application code. It is recommended that the script be located in the same location as the application code. |

Example `clusterSpec` entries for `dxspark` and `apachespark` clusters are sketched below.
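A minimal sketch of such an entry inside `systemRequirements` (the instance type, port list, and bootstrap script path here are illustrative assumptions, not required values):

```json
"systemRequirements": {
  "*": {
    "instanceType": "mem1_ssd1_x16",
    "clusterSpec": {
      "type": "dxspark",
      "version": "3.2.0",
      "initialInstanceCount": 3,
      "ports": "9500, 9700-9750",
      "bootstrapScript": "src/startCluster.sh"
    }
  }
}
```

For a generic Spark cluster, only the `type` changes to `"apachespark"`.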
A Spark cluster consists of:

1. Spark Master: The driver (formerly "master") node is the most important component of the Spark cluster. It runs all of the services required to keep the cluster functioning, such as the Spark master service, the Spark driver, the HDFS NameNode, and the HDFS DataNode. Application code runs on this node.
2. One or more Spark Slaves: ClusterWorker (formerly "slave") nodes run the Spark ClusterWorker service. Spark executor processes, which process the data, are started on these nodes.

All nodes of the cluster are provisioned from the same app specification: `assetDepends`, `bundleDepends`, and files under `resources/` are made available on every node.
`dxspark` apps should have network access in order to connect to the DNAnexus Hive Metastore. To be able to read databases from the parent project, app(let)s should request VIEW access to the project; to create new databases or write into existing databases in the parent project, app(let)s should request UPLOAD access. These permissions are requested in the `access` section of `dxapp.json`:
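A sketch of what that section might look like (whether your app needs VIEW or UPLOAD, and how broad the network access should be, depends on the app):

```json
"access": {
  "network": ["*"],
  "project": "UPLOAD"
}
```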
For `dxspark` or `apachespark` clusters, additional ports to be opened between the nodes can be specified in `clusterSpec`.
A pre-bootstrap script can be provided at `/resources/cluster/hooks/prebootstrap.sh` (the `resources` folder being the app folder that gets extracted over the instance's root directory, so the `/cluster/` folder will be in the root directory of the app instance). The pre-bootstrap script runs before the setup of the `dxspark` package. If the pre-bootstrap script returns a non-zero exit code, the setup of the app will fail and the instance will terminate. In the case of a multi-node cluster configuration, if setup on a child node fails, that child node will terminate and another will be spun up to take its place, resulting in the pre-bootstrap script being tried again.

The output of the pre-bootstrap script is written to the file referenced by `$PRE_BOOTSTRAP_LOG` (default value of `/cluster/prebootstrap.log`). Below is an example of how to display it from within your app's startup script:
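A minimal sketch, assuming the default log location:

```bash
# Print the pre-bootstrap log so it appears in the job's log output.
echo "=== pre-bootstrap output ==="
cat "${PRE_BOOTSTRAP_LOG:-/cluster/prebootstrap.log}"
```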
Spark logging can be set to one of {"WARN", "INFO", "DEBUG", "TRACE"}. By default, the log level is WARN, but at times it may be desirable to see more detailed logs for debugging. More verbose logging can be enabled through the spark-submit call.
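One common way to do this with stock Apache Spark (2.4 and 3.2 use log4j 1.x) is to ship a custom `log4j.properties` with the job; the file name and level below are illustrative, and the platform's own submit wrapper may expose a simpler option:

```bash
# log4j.properties (illustrative) would contain, e.g.:
#   log4j.rootCategory=DEBUG, console

spark-submit \
  --files log4j.properties \
  --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=file:log4j.properties" \
  --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=file:log4j.properties" \
  pyspark_app.py
```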
Within the PySpark application code (`pyspark_app.py` in the previous examples), a Spark session must be instantiated, and it must have Hive support.
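A minimal sketch of such a session (the app name is arbitrary):

```python
from pyspark.sql import SparkSession

# Hive support is required so the app can work with DNAnexus databases.
spark = SparkSession.builder \
    .appName("pyspark_app") \
    .enableHiveSupport() \
    .getOrCreate()
```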
All DNAnexus databases use the custom `'dnax://'` scheme -- this is the scheme that must be used in order to integrate with the DNAnexus platform.
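A sketch of the pattern, reusing the session from above (the database name is illustrative and the exact DDL your app needs may differ):

```python
# Create a DNAnexus-backed database; tables written into it are stored on the platform.
spark.sql("CREATE DATABASE IF NOT EXISTS example_db LOCATION 'dnax://'")
```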
While the job is running, the Spark UI can be viewed at `https://job-xxxx.dnanexus.cloud:8081/jobs/`. To explore it after the job has finished, use the Spark history server with the event logs saved by the job:

1. In `tmp/clusterLogs/eventlogs/` you will see a file. Copy it to, say, `/userName/sparkUIEventLogs/`.
2. Create `/userName/histProperties.txt`, which has this single line: `spark.history.fs.logDirectory=file:/userName/sparkUIEventLogs`
3. Run `$SPARK_HOME/sbin/start-history-server.sh --properties-file /userName/histProperties.txt`.
4. Go to `localhost:18080` in your browser and check out the jobs, stages, and executor usage.

You can add more files to `/userName/sparkUIEventLogs/` and the history server will pick them up.
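The steps above, collected into one sketch (the paths are the same illustrative ones used in the steps):

```bash
mkdir -p /userName/sparkUIEventLogs
cp tmp/clusterLogs/eventlogs/* /userName/sparkUIEventLogs/

# Point the history server at the copied event logs.
echo "spark.history.fs.logDirectory=file:/userName/sparkUIEventLogs" > /userName/histProperties.txt
$SPARK_HOME/sbin/start-history-server.sh --properties-file /userName/histProperties.txt

# Then browse to http://localhost:18080
```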
A complete Spark cluster app consists of three files: `dxapp.json`, `bash_app.sh`, and `pyspark_app.py`. Illustrative sketches of each are given below.
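These are sketches only: the app name, instance type, and the contents of the PySpark job are assumptions made for illustration.

`dxapp.json`:

```json
{
  "name": "example_spark_app",
  "title": "Example Spark Cluster App",
  "dxapi": "1.0.0",
  "inputSpec": [],
  "outputSpec": [],
  "runSpec": {
    "interpreter": "bash",
    "file": "src/bash_app.sh",
    "distribution": "Ubuntu",
    "release": "20.04",
    "version": "0",
    "systemRequirements": {
      "*": {
        "instanceType": "mem1_ssd1_x4",
        "clusterSpec": {
          "type": "dxspark",
          "version": "3.2.0",
          "initialInstanceCount": 2
        }
      }
    }
  },
  "access": {
    "network": ["*"],
    "project": "UPLOAD"
  }
}
```

`bash_app.sh`:

```bash
#!/bin/bash
set -e -x -o pipefail

main() {
    # Show the pre-bootstrap log, if any (see the earlier section).
    cat "${PRE_BOOTSTRAP_LOG:-/cluster/prebootstrap.log}" || true

    # Submit the PySpark application to the cluster. The platform may
    # provide its own submit wrapper; a plain spark-submit is shown here.
    spark-submit ./pyspark_app.py
}
```

`pyspark_app.py`:

```python
from pyspark.sql import SparkSession

# Spark session with Hive support, required for DNAnexus databases.
spark = SparkSession.builder \
    .appName("example_spark_app") \
    .enableHiveSupport() \
    .getOrCreate()

# Create a DNAnexus-backed database and write a small table into it.
spark.sql("CREATE DATABASE IF NOT EXISTS example_db LOCATION 'dnax://'")
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
df.write.mode("overwrite").saveAsTable("example_db.example_table")
spark.sql("SELECT COUNT(*) FROM example_db.example_table").show()
```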