CSV Loader

A license is required to access Spark functionality on the DNAnexus Platform. Contact DNAnexus Sales for more information.


The CSV Loader ingests CSV files into a database. The input CSV files are loaded into Parquet-format tables in a database that can be queried using Spark SQL.

You can load a single CSV file or many CSV files. When loading multiple files, all files must be syntactically consistent.

For example:

  • All files must have the same separator (e.g. comma, tab)

  • All files must include a header line, or all files must exclude it

NOTE: Each CSV file is loaded into its own table within the specified database.
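Before launching a load, it can help to verify that a set of local CSV files really is syntactically consistent. The helper below is a rough pre-flight sketch using Python's csv module, not part of the CSV Loader itself; it checks that every file parses with the chosen separator and yields the same number of columns in its first row.

```python
import csv

def check_csv_consistency(paths, sep=","):
    """Return True if every file parses with the given separator and
    yields the same number of columns in its first row.
    A rough pre-flight check, not the loader's own validation."""
    widths = set()
    for path in paths:
        with open(path, newline="") as fh:
            reader = csv.reader(fh, delimiter=sep)
            first = next(reader, None)
            if first is None:
                return False  # empty file cannot be checked
            widths.add(len(first))
    return len(widths) == 1
```

A file using a different separator typically parses to a different column count, so the mismatch surfaces before any data is loaded.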

How to Run CSV Loader


Required Inputs:

  • csv -> array of CSV files to load into the database.

Required Parameters:

  • database_name -> name of the database to load the CSV files into.

  • create_mode -> strict creates the database and tables from scratch, while optimistic creates the database and tables only if they do not already exist.

  • insert_mode -> append appends data to the end of existing tables, while overwrite is equivalent to truncating the tables and then appending to them.

  • table_name -> array of table names, one per CSV file, matched by array index.

  • type -> the cluster type, "spark" for Spark apps
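The interaction of create_mode and insert_mode can be sketched with a toy in-memory model. This is illustrative only (the app itself manages Spark/Parquet tables), and it assumes that strict mode refuses to touch a table that already exists:

```python
def load_table(db, name, rows, create_mode="strict", insert_mode="append"):
    """Toy model of table creation and insertion; `db` is a dict of
    table name -> list of rows.  Illustrative only, not the app's logic."""
    if name in db:
        if create_mode == "strict":
            # assumption: strict mode requires tables to not exist yet
            raise ValueError(f"table {name!r} already exists (strict mode)")
        # optimistic: reuse the existing table
    else:
        db[name] = []
    if insert_mode == "overwrite":
        db[name] = list(rows)      # truncate, then append
    else:
        db[name].extend(rows)      # append to the end of the table
    return db
```

Re-running a load with optimistic/append accumulates rows, while optimistic/overwrite replaces a table's contents in place.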

Other Options:

  • spark_read_csv_header -> default false -- whether the first line of each CSV should be used as column names for the corresponding table.

  • spark_read_csv_sep -> default , -- the separator character used by each CSV.

  • spark_read_csv_infer_schema -> default false -- whether the input schema should be inferred from the data.
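These three options correspond to the header, sep, and inferSchema options of Spark's CSV reader. The stand-in below uses Python's csv module to show their effect; it is not the Spark reader itself, though the `_c0`, `_c1`, ... default column names follow Spark's convention for headerless files.

```python
import csv
import io

def read_csv(text, header=False, sep=",", infer_schema=False):
    """Illustrative stand-in for the spark_read_csv_* options.
    Returns (column_names, rows)."""
    rows = list(csv.reader(io.StringIO(text), delimiter=sep))
    if header:
        cols, rows = rows[0], rows[1:]       # first line becomes column names
    else:
        cols = [f"_c{i}" for i in range(len(rows[0]))]  # Spark-style defaults
    if infer_schema:
        def coerce(value):
            for cast in (int, float):        # try numeric types in order
                try:
                    return cast(value)
                except ValueError:
                    pass
            return value                      # fall back to string
        rows = [[coerce(v) for v in row] for row in rows]
    return cols, rows
```

With infer_schema left at its default of false, every value stays a string, which is usually the safer choice for identifiers such as sample IDs with leading zeros.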

Basic Run

The following example creates a brand new database and loads data into two new tables:

dx run app-csv-loader \
   -i database_name=pheno_db \
   -i create_mode=strict \
   -i insert_mode=append \
   -i spark_read_csv_header=true \
   -i spark_read_csv_sep=, \
   -i spark_read_csv_infer_schema=true \
   -i csv=file-xxxx \
   -i table_name=sample_metadata \
   -i csv=file-yyyy \
   -i table_name=gwas_result
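Because csv and table_name are paired by array index, a small helper can assemble the repeated -i flags and keep the pairing explicit. This is a hypothetical convenience script, not part of the app; the file IDs are the placeholders from the example above.

```python
def build_dx_run_args(database, pairs, create_mode="strict", insert_mode="append"):
    """Return the argument list for `dx run app-csv-loader`, pairing each
    CSV file ID with its table name by position.  Illustrative helper."""
    args = [
        "dx", "run", "app-csv-loader",
        "-i", f"database_name={database}",
        "-i", f"create_mode={create_mode}",
        "-i", f"insert_mode={insert_mode}",
    ]
    for file_id, table in pairs:   # the order of pairs defines the index pairing
        args += ["-i", f"csv={file_id}", "-i", f"table_name={table}"]
    return args
```

Passing the list to subprocess.run would launch the same job as the shell command above, with the file-to-table mapping kept in one place.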
