CSV Loader
Overview
The CSV Loader ingests CSV files into a database. Input CSV files are loaded into a Parquet-format database whose tables can be queried using Spark SQL.
You can load a single CSV file or many CSV files. When loading many files, all files must be syntactically consistent.
For example:
All files must use the same separator; this can be a comma, tab, or another consistent delimiter.
All files must include a header line, or all files must omit it.
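As a rough sketch, this consistency requirement can be checked up front with Python's standard csv module before submitting a job. The helper names below are illustrative and not part of the loader:

```python
import csv
from pathlib import Path

def detect_delimiter(path, candidates=",;\t|"):
    """Guess a CSV file's delimiter from a sample of its contents."""
    sample = Path(path).read_text()[:4096]
    return csv.Sniffer().sniff(sample, delimiters=candidates).delimiter

def files_consistent(paths):
    """True when every file appears to use the same delimiter."""
    return len({detect_delimiter(p) for p in paths}) == 1
```

A header-presence check would follow the same pattern, though detecting headers reliably requires knowledge of the data.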
How to Run CSV Loader
Input:
csv -> array of CSV files to load into the database.
Required Parameters:
database_name -> name of the database to load the CSV files into.
create_mode -> "strict" mode creates the database and tables from scratch; "optimistic" mode creates the database and tables only if they do not already exist.
insert_mode -> "append" appends data to the end of tables; "overwrite" is equivalent to truncating the tables and then appending to them.
table_name -> array of table names, one for each corresponding CSV file by array index.
type -> the cluster type; "spark" for Spark apps.
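The index pairing between the csv and table_name arrays can be pictured as a simple zip. The file IDs below are placeholders, mirroring the run example later in this page:

```python
# Placeholder file IDs; each CSV lands in the table at the same array index.
csv_files = ["file-xxxx", "file-yyyy"]
table_names = ["sample_metadata", "gwas_result"]

mapping = dict(zip(csv_files, table_names))
# mapping == {"file-xxxx": "sample_metadata", "file-yyyy": "gwas_result"}
```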
Other Options:
spark_read_csv_header -> default false -- whether the first line of each CSV should be used as column names for the corresponding table.
spark_read_csv_sep -> default "," -- the separator character used by each CSV.
spark_read_csv_infer_schema -> default false -- whether the input schema should be inferred from the data.
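A rough illustration of what spark_read_csv_header and spark_read_csv_sep control, using a plain-Python stand-in for Spark's CSV reader (the _c0, _c1, ... default column names match Spark's convention for headerless files):

```python
def read_csv_lines(lines, header=False, sep=","):
    """Toy model of a CSV read: split rows, then pick column names."""
    rows = [line.rstrip("\n").split(sep) for line in lines]
    if header:
        # The first line supplies the column names.
        return rows[0], rows[1:]
    # Without a header, Spark assigns default names _c0, _c1, ...
    cols = [f"_c{i}" for i in range(len(rows[0]))]
    return cols, rows
```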
Basic Run
The following example creates a brand-new database and loads data into two new tables:
dx run app-csv-loader \
-i database_name=pheno_db \
-i create_mode=strict \
-i insert_mode=append \
-i spark_read_csv_header=true \
-i spark_read_csv_sep=, \
-i spark_read_csv_infer_schema=true \
-i csv=file-xxxx \
-i table_name=sample_metadata \
-i csv=file-yyyy \
  -i table_name=gwas_result
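The insert_mode semantics described under Required Parameters can be sketched as follows; the in-memory dict is only a stand-in for the Parquet-backed tables:

```python
tables = {}

def load(table, rows, insert_mode="append"):
    """Toy model of insert_mode: overwrite truncates, append extends."""
    if insert_mode == "overwrite":
        tables[table] = list(rows)
    else:
        tables[table] = tables.get(table, []) + list(rows)

load("gwas_result", [["rs1", 0.01]])
load("gwas_result", [["rs2", 0.05]])               # append: both rows kept
load("gwas_result", [["rs3", 0.20]], "overwrite")  # overwrite: only rs3 remains
```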