Query dati variant

Importante

Questa funzionalità è disponibile in anteprima pubblica.

Questo articolo descrive come eseguire query e trasformare i dati semistrutturati archiviati come VARIANT. Il VARIANT tipo di dati è disponibile in Databricks Runtime 15.3 e versioni successive.

Databricks consiglia di usare VARIANT le stringhe JSON. Per gli utenti che usano attualmente stringhe JSON che cercano di eseguire la migrazione, vedere Differenze tra varianti rispetto alle stringhe JSON.

Per visualizzare esempi per l'esecuzione di query su dati semistrutturati archiviati con stringhe JSON, vedere Eseguire query su stringhe JSON.

Nota

Le colonne VARIANT non possono essere utilizzate per chiavi di clustering, partizioni o chiavi con ordine Z. Non è possibile utilizzare il tipo di dati VARIANT per le operazioni di confronto, raggruppamento, ordinamento e set. Per un elenco completo delle limitazioni, vedere limitazioni .

Creare una tabella con una colonna variante

Per creare una colonna variante, usare la parse_json funzione (SQL o Python).

Eseguire il comando seguente per creare una tabella con dati altamente annidati archiviati come VARIANT. Questi dati vengono usati in altri esempi in questa pagina.

SQL

-- Create a table with a variant column
CREATE TABLE store_data AS
SELECT parse_json(
  '{
    "store":{
        "fruit": [
          {"weight":8,"type":"apple"},
          {"weight":9,"type":"pear"}
        ],
        "basket":[
          [1,2,{"b":"y","a":"x"}],
          [3,4],
          [5,6]
        ],
        "book":[
          {
            "author":"Nigel Rees",
            "title":"Sayings of the Century",
            "category":"reference",
            "price":8.95
          },
          {
            "author":"Herman Melville",
            "title":"Moby Dick",
            "category":"fiction",
            "price":8.99,
            "isbn":"0-553-21311-3"
          },
          {
            "author":"J. R. R. Tolkien",
            "title":"The Lord of the Rings",
            "category":"fiction",
            "reader":[
              {"age":25,"name":"bob"},
              {"age":26,"name":"jack"}
            ],
            "price":22.99,
            "isbn":"0-395-19395-8"
          }
        ],
        "bicycle":{
          "price":19.95,
          "color":"red"
        }
      },
      "owner":"amy",
      "zip code":"94025",
      "fb:testid":"1234"
  }'
) as raw

SELECT * FROM store_data

Pitone

# Create a table with a variant column
store_data='''
{
  "store":{
    "fruit":[
      {"weight":8,"type":"apple"},
      {"weight":9,"type":"pear"}
    ],
    "basket":[
      [1,2,{"b":"y","a":"x"}],
      [3,4],
      [5,6]
    ],
    "book":[
      {
        "author":"Nigel Rees",
        "title":"Sayings of the Century",
        "category":"reference",
        "price":8.95
      },
      {
        "author":"Herman Melville",
        "title":"Moby Dick",
        "category":"fiction",
        "price":8.99,
        "isbn":"0-553-21311-3"
      },
      {
        "author":"J. R. R. Tolkien",
        "title":"The Lord of the Rings",
        "category":"fiction",
        "reader":[
          {"age":25,"name":"bob"},
          {"age":26,"name":"jack"}
        ],
        "price":22.99,
        "isbn":"0-395-19395-8"
      }
    ],
    "bicycle":{
      "price":19.95,
      "color":"red"
    }
  },
  "owner":"amy",
  "zip code":"94025",
  "fb:testid":"1234"
}
'''

# Create a DataFrame
df = spark.createDataFrame([(store_data,)], ["json"])

# Convert to a variant
df_variant = df.select(parse_json(col("json")).alias("raw"))

# Alternatively, create the DataFrame directly
# df_variant = spark.range(1).select(parse_json(lit(store_data)))

df_variant.display()

# Write out as a table
df_variant.write.saveAsTable("store_data")

Campi di query in una colonna variante

Per estrarre i campi da una colonna variante, usare la variant_get funzione (SQL o Python) specificando il nome del campo JSON nel percorso di estrazione. I nomi dei campi fanno sempre distinzione tra maiuscole e minuscole.

SQL

-- Extract a top-level field
SELECT variant_get(store_data.raw, '$.owner') AS owner FROM store_data

È anche possibile usare la sintassi SQL per eseguire query sui campi in una colonna variante. Vedere la sintassi abbreviata di SQL per variant_get.

Pitone

# Extract a top-level field
df_variant.select(variant_get(col("raw"), "$.owner", "string")).display()

Notazione abbreviata SQL per variant_get

La sintassi SQL per l'esecuzione di query su stringhe JSON e altri tipi di dati complessi in Azure Databricks si applica ai VARIANT dati, incluso quanto segue:

Usare : per selezionare i campi di primo livello.
Usare . o [<key>] per selezionare i campi annidati con chiavi denominate.
Usare [<index>] per selezionare i valori dalle matrici.

SELECT raw:owner FROM store_data

+-------+
| owner |
+-------+
| "amy" |
+-------+

-- Use backticks to escape special characters.
SELECT raw:`zip code`, raw:`fb:testid` FROM store_data

+----------+-----------+
| zip code | fb:testid |
+----------+-----------+
| "94025"  | "1234"    |
+----------+-----------+

Se un nome di campo contiene un punto (.), è necessario eseguirne l'escape con parentesi quadre ([ ]). Ad esempio, la query seguente seleziona un campo denominato zip.code:

SELECT raw:['zip.code'] FROM store_data

Estrarre campi annidati varianti

Per estrarre campi annidati da una colonna variante, specificarli usando la notazione a punti o le parentesi quadre. I nomi dei campi fanno sempre distinzione tra maiuscole e minuscole.

SQL

-- Use dot notation
SELECT raw:store.bicycle FROM store_data

-- Use brackets
SELECT raw:store['bicycle'] FROM store_data

Se non è possibile trovare un percorso, il risultato è NULL di tipo VARIANT.

Pitone

# Use dot notation
df_variant.select(variant_get(col("raw"), "$.store.bicycle", "string")).display()

# Use brackets
df_variant.select(variant_get(col("raw"), "$.store['bicycle']", "string")).display()

Se non è possibile trovare un percorso, il risultato è null di tipo VariantVal.

+-----------------+
| bicycle         |
+-----------------+
| {               |
| "color":"red",  |
| "price":19.95   |
| }               |
+-----------------+

Estrarre valori da matrici varianti

Per estrarre elementi da matrici, indicizzare con parentesi quadre. Gli indici sono basati su 0.

SQL

-- Index elements
SELECT raw:store.fruit[0], raw:store.fruit[1] FROM store_data

Pitone

# Index elements
df_variant.select((variant_get(col("raw"), "$.store.fruit[0]", "string")),(variant_get(col("raw"), "$.store.fruit[1]", "string"))).display()

+-------------------+------------------+
| fruit             | fruit            |
+-------------------+------------------+
| {                 | {                |
|   "type":"apple", |   "type":"pear", |
|   "weight":8      |   "weight":9     |
| }                 | }                |
+-------------------+------------------+

Se non è possibile trovare il percorso o se l'indice di matrice non è compreso nei limiti, il risultato è Null.

Uso delle varianti in Python

È possibile estrarre varianti dai DataFrame Spark in Python come VariantVal e manipolarle singolarmente con i metodi toPython e toJson.

# toPython
data = [
    ('{"name": "Alice", "age": 25}',),
    ('["person", "electronic"]',),
    ('1',)
]

df_person = spark.createDataFrame(data, ["json"])

# Collect variants into a VariantVal
variants = df_person.select(parse_json(col("json")).alias("v")).collect()

Restituisci il VariantVal come stringa JSON:

print(variants[0].v.toJson())

{"age":25,"name":"Alice"}

Convertire un oggetto VariantVal in un oggetto Python:

# First element is a dictionary
print(variants[0].v.toPython()["age"])

# Second element is a List
print(variants[1].v.toPython()[1])

electronic

# Third element is an Integer
print(variants[2].v.toPython())

È anche possibile costruire VariantVal usando la VariantVal.parseJson funzione .

# parseJson to construct VariantVal's in Python
from pyspark.sql.types import VariantVal

variant = VariantVal.parseJson('{"a": 1}')

Visualizza la variante come stringa JSON:

print(variant.toJson())

{"a":1}

Convertire la variante in un oggetto Python e stampare un valore:

print(variant.toPython()["a"])

Restituire lo schema di una variante

Per restituire lo schema di una variante, usare la schema_of_variant funzione (SQL o Python).

SQL

-- Return the schema of the variant
SELECT schema_of_variant(raw) FROM store_data;

Pitone

# Return the schema of the variant
df_variant.select(schema_of_variant(col("raw"))).display()

Per restituire gli schemi combinati di tutte le varianti in un gruppo, usare la schema_of_variant_agg funzione (SQL o Python).

Negli esempi seguenti viene restituito lo schema e quindi lo schema combinato per i dati json_datadi esempio .

SQL

CREATE OR REPLACE TEMP VIEW json_data AS
SELECT '{"name": "Alice", "age": 25}' AS json UNION ALL
SELECT '{"id": 101, "department": "HR"}' UNION ALL
SELECT '{"product": "Laptop", "price": 1200.50, "in_stock": true}';

-- Return the schema
SELECT schema_of_variant(parse_json(json)) FROM json_data;

Pitone


json_data = [
    ('{"name": "Alice", "age": 25}',),
    ('{"id": 101, "department": "HR"}',),
    ('{"product": "Laptop", "price": 1200.50, "in_stock": true}',)
]

df_item = spark.createDataFrame(json_data, ["json"])

# Return the schema
df_item.select(parse_json(col("json")).alias("v")).select(schema_of_variant(col("v"))).display()

+-----------------------------------------------------------------+
| schema_of_variant(v)                                            |
+-----------------------------------------------------------------+
| OBJECT<age: BIGINT, name: STRING>                               |
| OBJECT<department: STRING, id: BIGINT>                          |
| OBJECT<in_stock: BOOLEAN, price: DECIMAL(5,1), product: STRING> |
+-----------------------------------------------------------------+

SQL

-- Return the combined schema
SELECT schema_of_variant_agg(parse_json(json)) FROM json_data;

Pitone

# Return the combined schema
df.select(parse_json(col("json")).alias("v")).select(schema_of_variant_agg(col("v"))).display()

+----------------------------------------------------------------------------------------------------------------------------+
| schema_of_variant(v)                                                                                                       |
+----------------------------------------------------------------------------------------------------------------------------+
| OBJECT<age: BIGINT, department: STRING, id: BIGINT, in_stock: BOOLEAN, name: STRING, price: DECIMAL(5,1), product: STRING> |
+----------------------------------------------------------------------------------------------------------------------------+

Appiattire oggetti varianti e matrici

La variant_explode funzione generatore a valore di tabella (SQL o Python) può essere usata per appiattire matrici e oggetti varianti.

SQL

Poiché variant_explode è una funzione generatore, viene usata come parte della clausola FROM anziché nell'elenco SELECT, come negli esempi seguenti:

SELECT key, value
  FROM store_data,
  LATERAL variant_explode(store_data.raw);

SELECT pos, value
  FROM store_data,
  LATERAL variant_explode(store_data.raw:store.basket[0]);

Pitone

Usare l'API DataFrame con valori di tabella (TVF) per espandere una variante in più righe:

spark.tvf.variant_explode(parse_json(lit(store_data))).display()

# To explode a nested field, first create a DataFrame with just the field
df_store_col = df_variant.select(variant_get(col("raw"), "$.store", "variant").alias("store"))

# Perform the explode with a lateral join and the outer function to return the new exploded DataFrame
df_store_exploded_lj = df_store_col.lateralJoin(spark.tvf.variant_explode(col("store").outer()))
df_store_exploded = df_store_exploded_lj.drop("store")
df_store_exploded.display()

Regole di cast dei tipi varianti

È possibile archiviare matrici e scalari usando VARIANT il tipo . Quando si tenta di eseguire il cast di tipi varianti ad altri tipi, le regole di cast normali si applicano a singoli valori e campi, con le regole aggiuntive seguenti.

Nota

variant_get e try_variant_get accettare argomenti di tipo e seguire queste regole di cast.

Tipo di origine	Comportamento
`VOID`	Il risultato è di `NULL` tipo `VARIANT`.
`ARRAY<elementType>`	Deve `elementType` essere un tipo di cui è possibile eseguire il cast a `VARIANT`.

Quando si deduce il tipo con schema_of_variant o schema_of_variant_agg, le funzioni effettuano il fallback al tipo VARIANT anziché al tipo STRING quando sono presenti tipi in conflitto che non possono essere risolti.

SQL

Usare la funzione try_variant_get (SQL) per eseguire il cast.

-- price is returned as a double, not a string
SELECT try_variant_get(raw, '$.store.bicycle.price', 'double') as price FROM store_data

+------------------+
| price            |
+------------------+
| 19.95            |
+------------------+

È anche possibile usare :: o cast per eseguire il cast dei valori ai tipi di dati supportati:

-- cast into more complex types
SELECT cast(raw:store.bicycle AS STRUCT<price DOUBLE, color STRING>) bicycle FROM store_data;
-- `::` also supported
SELECT raw:store.bicycle::STRUCT<price DOUBLE, color STRING> bicycle FROM store_data;

+------------------+
| bicycle          |
+------------------+
| {                |
|   "price":19.95, |
|   "color":"red"  |
| }                |
+------------------+

Pitone

Usare la funzione try_variant_get (Python) per eseguire il cast:

# price is returned as a double, not a string
df_variant.select(try_variant_get(col("raw"), "$.store.bicycle.price", "double").alias("price"))

+------------------+
| price            |
+------------------+
| 19.95            |
+------------------+

Usare anche la try_variant_get funzione (SQL o Python) per gestire gli errori di cast:

SQL

SELECT try_variant_get(
  parse_json('{"a" : "c", "b" : 2}'),
  '$.a',
  'boolean'
)

Pitone

spark.range(1).select(parse_json(lit('{"a" : "c", "b" : 2}')).alias("v")).select(try_variant_get(col('v'), '$.a', 'boolean')).display()

Regole null varianti

Usare la is_variant_null funzione (SQL o Python) per determinare se il valore variant è una variante null.

SQL

Le varianti possono contenere due tipi di valori Null:

SQL NULL: SQL NULLindica che il valore è mancante. Questi sono gli stessi NULLdi quando si gestiscono dati strutturati.
Variant NULL: variant NULLs indicano che la variante contiene in modo esplicito un NULL valore. Non sono uguali a SQL NULL, perché il NULL valore viene archiviato nei dati.

SELECT
  is_variant_null(parse_json(NULL)) AS sql_null,
  is_variant_null(parse_json('null')) AS variant_null,
  is_variant_null(parse_json('{ "field_a": null }'):field_a) AS variant_null_value,
  is_variant_null(parse_json('{ "field_a": null }'):missing) AS missing_sql_value_null

+--------+------------+------------------+----------------------+
|sql_null|variant_null|variant_null_value|missing_sql_value_null|
+--------+------------+------------------+----------------------+
|   false|        true|              true|                 false|
+--------+------------+------------------+----------------------+

Pitone

data = [
    ('null',),
    (None,),
    ('{"field_a" : 1, "field_b" : 2}',)
]

df = spark.createDataFrame(data, ["null_data"])
df.select(parse_json(col("null_data")).alias("v")).select(is_variant_null(col("v"))).display()

+------------------+
|is_variant_null(v)|
+------------------+
|              true|
+------------------+
|             false|
+------------------+
|             false|
+------------------+

Commenti e suggerimenti

Questa pagina è stata utile?

Last updated on 2026-01-20

Condividi tramite

Query dati variant

Creare una tabella con una colonna variante

SQL

Pitone

Campi di query in una colonna variante

SQL

Pitone

Notazione abbreviata SQL per variant_get

Estrarre campi annidati varianti

SQL

Pitone

Estrarre valori da matrici varianti

SQL

Pitone

Uso delle varianti in Python

Restituire lo schema di una variante

SQL

Pitone

SQL

Pitone

SQL

Pitone

Appiattire oggetti varianti e matrici

SQL

Pitone

Regole di cast dei tipi varianti

SQL

Pitone

SQL

Pitone

Regole null varianti

SQL

Pitone

Commenti e suggerimenti

Risorse aggiuntive