Read CSV files

This article provides examples for reading CSV files with Azure Databricks using Python, Scala, R, and SQL.

Note

Databricks recommends the read_files table-valued function for SQL users to read CSV files. read_files is available in Databricks Runtime 13.3 LTS and above.

You can also use a temporary view. If you use SQL to read CSV data directly without using temporary views or read_files, the following limitations apply:

Options

You can configure several options for CSV file data sources. See the following Apache Spark reference articles for supported read options:

This article only covers reading CSV, but you can learn about supported write options in the following Apache Spark reference articles:

Work with malformed CSV records

When reading CSV files with a specified schema, it is possible that the data in the files does not match the schema. For example, a field containing name of the city will not parse as an integer. The consequences depend on the mode that the parser runs in:

  • PERMISSIVE (default): nulls are inserted for fields that could not be parsed correctly
  • DROPMALFORMED: drops lines that contain fields that could not be parsed
  • FAILFAST: aborts the reading if any malformed data is found

To set the mode, use the mode option.

diamonds_df = (spark.read
  .format("csv")
  .option("mode", "PERMISSIVE")
  .load("/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv")
)

In the PERMISSIVE mode it is possible to inspect the rows that could not be parsed correctly using one of the following methods:

  • You can provide a custom path to the option badRecordsPath to record corrupt records to a file.
  • You can add the column _corrupt_record to the schema provided to the DataFrameReader to review corrupt records in the resultant DataFrame.

Note

The badRecordsPath option takes precedence over _corrupt_record, meaning that malformed rows written to the provided path do not appear in the resultant DataFrame.

Default behavior for malformed records changes when using the rescued data column.

Find malformed rows notebook

Get notebook

Rescued data column

Note

This feature is supported in Databricks Runtime 8.3 (unsupported) and above.

When using the PERMISSIVE mode, you can enable the rescued data column to capture any data that wasn’t parsed because one or more fields in a record have one of the following issues:

  • Absent from the provided schema.
  • Does not match the data type of the provided schema.
  • Has a case mismatch with the field names in the provided schema.

The rescued data column is returned as a JSON document containing the columns that were rescued, and the source file path of the record. To remove the source file path from the rescued data column, you can set the SQL configuration spark.conf.set("spark.databricks.sql.rescuedDataColumn.filePath.enabled", "false"). You can enable the rescued data column by setting the option rescuedDataColumn to a column name when reading data, such as _rescued_data with spark.read.option("rescuedDataColumn", "_rescued_data").format("csv").load(<path>).

The CSV parser supports three modes when parsing records: PERMISSIVE, DROPMALFORMED, and FAILFAST. When used together with rescuedDataColumn, data type mismatches do not cause records to be dropped in DROPMALFORMED mode or throw an error in FAILFAST mode. Only corrupt records—that is, incomplete or malformed CSV—are dropped or throw errors.

When rescuedDataColumn is used in PERMISSIVE mode, the following rules apply to corrupt records:

  • The first row of the file (either a header row or a data row) sets the expected row length.
  • A row with a different number of columns is considered incomplete.
  • Data type mismatches are not considered corrupt records.
  • Only incomplete and malformed CSV records are considered corrupt and recorded to the _corrupt_record column or badRecordsPath.

SQL example: Read CSV file

The following SQL example reads a CSV file using read_files.

-- mode "FAILFAST" aborts file parsing with a RuntimeException if malformed lines are encountered
SELECT * FROM read_files(
  's3://<bucket>/<path>/<file>.csv',
  format => 'csv',
  header => true,
  mode => 'FAILFAST')

Scala, R, and Python examples: Read CSV file

The following notebook shows how to read a file, display sample data, and print the data schema using Scala, R, and Python. The examples in this section use the diamonds dataset. Specify the path to the dataset as well as any options that you would like.

Read CSV files notebook

Get notebook

Example: Specify schema

When the schema of the CSV file is known, you can specify the desired schema to the CSV reader with the schema option.

Read CSV files with schema notebook

Get notebook

SQL example using read_files:

SELECT * FROM read_files(
  's3://<bucket>/<path>/<file>.csv',
  format => 'csv',
  header => false,
  schema => 'id string, date date, event_time timestamp')

Example: Pitfalls of reading a subset of columns

The behavior of the CSV parser depends on the set of columns that are read. If the specified schema is incorrect, the results might differ considerably depending on the subset of columns that is accessed. The following notebook presents the most common pitfalls.

Caveats of reading a subset of columns of a CSV file notebook

Get notebook