JSON file
You can read JSON files in single-line or multi-line mode. In single-line mode, a file can be split into many parts and read in parallel. In multi-line mode, a file is loaded as a whole entity and cannot be split.
For further information, see JSON Files.
See the following Apache Spark reference articles for supported read and write options.
Note
This feature is supported in Databricks Runtime 8.2 (EoS) and above.
The rescued data column ensures that you never lose or miss out on data during ETL. The rescued data column contains any data that wasn’t parsed, either because it was missing from the given schema, or because there was a type mismatch, or because the casing of the column in the record or file didn’t match with that in the schema. The rescued data column is returned as a JSON blob containing the columns that were rescued, and the source file path of the record. To remove the source file path from the rescued data column, you can set the SQL configuration spark.conf.set("spark.databricks.sql.rescuedDataColumn.filePath.enabled", "false")
. You can enable the rescued data column by setting the option rescuedDataColumn
to a column name, such as _rescued_data
with spark.read.option("rescuedDataColumn", "_rescued_data").format("json").load(<path>)
.
The JSON parser supports three modes when parsing records: PERMISSIVE
, DROPMALFORMED
, and FAILFAST
. When used together with rescuedDataColumn
, data type mismatches do not cause records to be dropped in DROPMALFORMED
mode or throw an error in FAILFAST
mode. Only corrupt records—that is, incomplete or malformed JSON—are dropped or throw errors. If you use the option badRecordsPath
when parsing JSON, data type mismatches are not considered as bad records when using the rescuedDataColumn
. Only incomplete and malformed JSON records are stored in badRecordsPath
.
In this example, there is one JSON object per line:
{"string":"string1","int":1,"array":[1,2,3],"dict": {"key": "value1"}}
{"string":"string2","int":2,"array":[2,4,6],"dict": {"key": "value2"}}
{"string":"string3","int":3,"array":[3,6,9],"dict": {"key": "value3", "extra_key": "extra_value3"}}
To read the JSON data, use:
val df = spark.read.format("json").load("example.json")
Spark infers the schema automatically.
df.printSchema
root
|-- array: array (nullable = true)
| |-- element: long (containsNull = true)
|-- dict: struct (nullable = true)
| |-- extra_key: string (nullable = true)
| |-- key: string (nullable = true)
|-- int: long (nullable = true)
|-- string: string (nullable = true)
This JSON object occupies multiple lines:
[
{"string":"string1","int":1,"array":[1,2,3],"dict": {"key": "value1"}},
{"string":"string2","int":2,"array":[2,4,6],"dict": {"key": "value2"}},
{
"string": "string3",
"int": 3,
"array": [
3,
6,
9
],
"dict": {
"key": "value3",
"extra_key": "extra_value3"
}
}
]
To read this object, enable multi-line mode:
CREATE TEMPORARY VIEW multiLineJsonTable
USING json
OPTIONS (path="/tmp/multi-line.json",multiline=true)
val mdf = spark.read.option("multiline", "true").format("json").load("/tmp/multi-line.json")
mdf.show(false)
By default, the charset of input files is detected automatically. You can specify the charset explicitly using the charset
option:
spark.read.option("charset", "UTF-16BE").format("json").load("fileInUTF16.json")
Some supported charsets include: UTF-8
, UTF-16BE
, UTF-16LE
, UTF-16
, UTF-32BE
, UTF-32LE
, UTF-32
. For the full list of charsets supported by Oracle Java SE, see Supported Encodings.
The following notebook demonstrates single-line mode and multi-line mode.