CSV Loader

A license is required to access Spark functionality on the DNAnexus Platform. Contact DNAnexus Sales for more information.

Overview

The CSV Loader ingests CSV files into a Parquet-format database, creating tables that can be queried using Spark SQL.

You can load a single CSV file or many CSV files. When loading multiple files, all of them must share the same syntax.

For example:

  • All files must have the same separator (e.g. comma, tab)

  • All files must include a header line, or all files must exclude it

NOTE: Each CSV file is loaded into its own table within the specified database.
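The consistency rule above can be spot-checked before submitting a job. The following is a minimal sketch, not part of the CSV Loader itself: it assumes the CSV files are available as local copies, and check_separator is a hypothetical helper name. It only confirms that the chosen separator appears on the first line of every file, so it will not catch every syntax difference.

```shell
# Hypothetical pre-flight check (not part of the CSV Loader app).
# Usage: check_separator SEPARATOR FILE...
# Succeeds only if SEPARATOR appears on the first line of every FILE.
check_separator() {
    sep="$1"
    shift
    for file in "$@"; do
        # grep -F treats the separator as a literal string, not a regex.
        if ! head -n 1 "$file" | grep -qF -- "$sep"; then
            echo "separator not found on first line of $file" >&2
            return 1
        fi
    done
    echo "ok: separator found in all $# file(s)"
}
```

For example, `check_separator , a.csv b.csv` checks two comma-separated files. A single-column file contains no separator at all, so it would be reported as a mismatch by this rough check.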

How to Run CSV Loader

Input:

  • csv (array of CSV files to load into the database)

Required Parameters:

  • database_name -> name of the database to load the CSV files into.

  • create_mode -> strict creates the database and tables from scratch; optimistic creates the database and tables only if they do not already exist.

  • insert_mode -> append adds data to the end of the tables; overwrite is equivalent to truncating the tables and then appending to them.

  • table_name -> array of table names, one for each corresponding CSV file by array index.

  • type -> the cluster type; use "spark" for Spark apps.
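Because table_name entries are matched to CSV files by array index, it helps to keep each file next to its table name when building a command. The sketch below assumes plain file IDs without a project prefix (so the first colon separates file from table); build_pair_args is a hypothetical helper, not a dx feature.

```shell
# Hypothetical helper: turn "file-id:table-name" pairs into dx input flags.
# Each pair produces one csv entry followed by its table_name entry,
# so the two arrays stay aligned by index.
build_pair_args() {
    out=""
    for pair in "$@"; do
        # ${pair%%:*} is the text before the first colon (the file ID);
        # ${pair#*:} is the text after it (the table name).
        out="$out -i csv=${pair%%:*} -i table_name=${pair#*:}"
    done
    # Drop the leading space before printing.
    printf '%s\n' "${out# }"
}
```

Calling `build_pair_args file-xxxx:sample_metadata file-yyyy:gwas_result` prints the four `-i` flags in file order, ready to append to a `dx run app-csv-loader` command.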

Other Options:

  • spark_read_csv_header -> default false -- whether the first line of each CSV should be used as column names for the corresponding table.

  • spark_read_csv_sep -> default , (comma) -- the separator character used by each CSV.

  • spark_read_csv_infer_schema -> default false -- whether the input schema should be inferred from the data.

Basic Run

The following example creates a brand new database and loads data into two new tables:

dx run app-csv-loader \
   -i database_name=pheno_db \
   -i create_mode=strict \
   -i insert_mode=append \
   -i spark_read_csv_header=true \
   -i spark_read_csv_sep=, \
   -i spark_read_csv_infer_schema=true \
   -i csv=file-xxxx \
   -i table_name=sample_metadata \
   -i csv=file-yyyy \
   -i table_name=gwas_result
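A later run that adds rows to a table in the same database would plausibly switch create_mode to optimistic, since that mode creates the database and tables only if they do not already exist. The sketch below is an illustration based on the parameter descriptions above, not a documented recipe: the file ID passed in (for example file-zzzz) is a placeholder, and DX_CMD is a hypothetical override that lets you preview the command with echo instead of submitting a job.

```shell
# Sketch: append one more CSV file to the existing sample_metadata table.
# DX_CMD is a hypothetical preview switch; it defaults to the real dx CLI.
append_more_samples() {
    "${DX_CMD:-dx}" run app-csv-loader \
        -i database_name=pheno_db \
        -i create_mode=optimistic \
        -i insert_mode=append \
        -i spark_read_csv_header=true \
        -i spark_read_csv_sep=, \
        -i spark_read_csv_infer_schema=true \
        -i csv="$1" \
        -i table_name=sample_metadata
}
```

Running `DX_CMD=echo append_more_samples file-zzzz` prints the full command for review; without DX_CMD set, the function calls dx directly.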
