How is variant different than JSON strings?
Important
This feature is in Public Preview.
This article describes the behavior changes and differences in syntax and semantics when working with the variant data type. This article assumes that you are familiar with working with JSON string data on Azure Databricks. For users new to Azure Databricks, you should use variant over JSON strings whenever storing semi-structured data that requires flexibility for changing or unknown schema. See Model semi-structured data.
In Databricks Runtime 15.3 and above, you can use the variant data type to encode and query semi-structured data. Databricks recommends variant as a replacement for storing semi-structured data using JSON strings. The improved read and write performance for variant allows it to replace native Spark complex types such as structs and arrays in some use cases.
How do you query variant data?
Variant data uses the same operators to query fields, subfields, and array elements.
To query a field, use :
. For example, column_name:field_name
.
To query a subfield, use .
. For example, column_name:field_name.subfield_name
.
To query an array element, use [n]
where n
is the integer index value of the element. For example, to query the first value in an array, column_name:array_name[0]
.
The following differences might break existing queries when upgrading from JSON strings to variant:
- All variant path elements are matched in a case-sensitive way. JSON strings are case-insensitive. This means that for variant,
column_name:FIELD_NAME
andcolumn_name:field_name
look for different fields in the stored data. - The
[*]
syntax is not support for identifying or unpacking all elements in an array. - Variant encodes
NULL
values differently than JSON strings. See Variant null rules.
Convert JSON strings to and from variant
In Databricks Runtime 15.3 and above, the to_json
function has additional functionality to cast VARIANT
types to JSON strings. Options are ignored when converting VARIANT
to JSON string. See to_json.
The parse_json
function transforms a JSON string to VARIANT
type. While parse_json(json_string_column)
is the logical inverse of to_json(variant_column)
, the following conversion rules describe why it is not the exact inverse:
- Whitespace is not perfectly preserved.
- Ordering of keys is arbitrary.
- Trailing zeros in numbers might be truncated.
The parse_json
function returns an error if the JSON string is malformed or exceeds the variant size limit. Use the try_parse_json
function to instead return a NULL
when an error in parsing occurs.
What are the SQL functions for working with variants?
Apache Spark SQL functions available in Databricks Runtime 15.3 and above provide methods for interacting with variant data. The following table includes the new function, the corresponding JSON string function, and notes on differences in behavior.
Note
To use these functions with PySpark DataFrames, import them from pyspark.sql.functions
. variant_explode
and variant_explode_outer
are not supported in PySpark.
Variant function | JSON string function | Notes |
---|---|---|
variant_get | cast and get_json_object | Takes an expression, path, and type. Follows all rules for variants paths, casting, and nulls. |
try_variant_get | try_cast and get_json_object | Takes an expression, path, and type. Follows all rules for variants paths, casting, and nulls. |
is_variant_null | is null | Checks whether the expression is storing a VARIANT encoded NULL . Use is null to check if the input expression is NULL . |
schema_of_variant | schema_of_json | When determining the schema for an ARRAY<elementType> , the elementType might be inferred as VARIANT if there are conflicting types found in the data. |
schema_of_variant_agg | schema_of_json_agg | When no least common type is identified, the type is derived as VARIANT . |
variant_explode | explode | Outputs pos , key , and value columns. When exploding an array, the output key is always null. |
variant_explode_outer | explode_outer | Outputs pos , key , and value columns. When exploding an array, the output key is always null. |
Variants handle casting and NULL
s differently than JSON strings. See Variant type casting rules and Variant null rules.