Handle bad records and files
Azure Databricks provides a number of options for dealing with files that contain bad records. Examples of bad data include:
- Incomplete or corrupt records: Mainly observed in text based file formats like JSON and CSV. For example, a JSON record that doesn’t have a closing brace or a CSV record that doesn’t have as many columns as the header or first record of the CSV file.
- Mismatched data types: When the value for a column doesn’t have the specified or inferred data type.
- Bad field names: Can happen in all file formats, when the column name specified in the file or record has a different casing than the specified or inferred schema.
- Corrupted files: When a file cannot be read, which might be due to metadata or data corruption in binary file types such as Avro, Parquet, and ORC. On rare occasion, might be caused by long-lasting transient failures in the underlying storage system.
- Missing files: A file that was discovered during query analysis time and no longer exists at processing time.
When you set badRecordsPath
, the specified path records exceptions for bad records or files encountered during data loading.
In addition to corrupt records and files, errors indicating deleted files, network connection exception, IO exception, and so on are ignored and recorded under the badRecordsPath
.
Note
Using the badRecordsPath
option in a file-based data source has a few important limitations:
- It is non-transactional and can lead to inconsistent results.
- Transient errors are treated as failures.
val df = spark.read
.option("badRecordsPath", "/tmp/badRecordsPath")
.format("parquet").load("/input/parquetFile")
// Delete the input parquet file '/input/parquetFile'
dbutils.fs.rm("/input/parquetFile")
df.show()
In the above example, since df.show()
is unable to find the input file, Spark creates an exception file in JSON format to record the error. For example, /tmp/badRecordsPath/20170724T101153/bad_files/xyz
is the path of the exception file. This file is under the specified badRecordsPath
directory, /tmp/badRecordsPath
. 20170724T101153
is the creation time of this DataFrameReader
. bad_files
is the exception type. xyz
is a file that contains a JSON record, which has the path of the bad file and the exception/reason message.
// Creates a json file containing both parsable and corrupted records
Seq("""{"a": 1, "b": 2}""", """{bad-record""").toDF().write.format("text").save("/tmp/input/jsonFile")
val df = spark.read
.option("badRecordsPath", "/tmp/badRecordsPath")
.schema("a int, b int")
.format("json")
.load("/tmp/input/jsonFile")
df.show()
In this example, the DataFrame contains only the first parsable record ({"a": 1, "b": 2}
). The second bad record ({bad-record
) is recorded in the exception file, which is a JSON file located in /tmp/badRecordsPath/20170724T114715/bad_records/xyz
. The exception file contains the bad record, the path of the file containing the record, and the exception/reason message. After you locate the exception files, you can use a JSON reader to process them.