Hi @Vineet S
Greetings & Welcome to Microsoft Q&A forum! Thanks for posting your query!
To create an Auto Loader stream that handles the nested JSON structure you provided, you can use the following approach in Databricks. Auto Loader can automatically infer the schema of the JSON data, including nested structures, and evolve the schema as new fields are introduced.
- Set Up Auto Loader - Use the following code snippet to initialize Auto Loader to read your nested JSON data. This code will automatically infer the schema from the source JSON files.
Python:
(spark.readStream.format("cloudFiles")
  .option("cloudFiles.format", "json")
  # The schema location directory keeps track of your data schema over time
  .option("cloudFiles.schemaLocation", "<path-to-checkpoint>")
  .load("<path-to-source-data>")
  .writeStream
  # Allow the target schema to evolve as new columns are added to the source
  .option("mergeSchema", "true")
  .option("checkpointLocation", "<path-to-checkpoint>")
  .start("<path-to-target>")
)
Scala:
spark.readStream.format("cloudFiles")
  .option("cloudFiles.format", "json")
  // The schema location directory keeps track of your data schema over time
  .option("cloudFiles.schemaLocation", "<path-to-checkpoint>")
  .load("<path-to-source-data>")
  .writeStream
  // Allow the target schema to evolve as new columns are added to the source
  .option("mergeSchema", "true")
  .option("checkpointLocation", "<path-to-checkpoint>")
  .start("<path-to-target>")
- cloudFiles.format - Set to "json" to specify the input format.
- cloudFiles.schemaLocation - A path where the schema information will be stored.
- mergeSchema - Set to "true" to allow schema evolution when new columns are added.
- checkpointLocation - A path for storing the checkpoint information to maintain the state of the stream.
- start - Specifies the target location where the processed data will be written.
- Schema Inference - Auto Loader will automatically infer the schema of the nested JSON data by sampling the files. By default it treats all columns as strings, but you can enable type inference by setting the option cloudFiles.inferColumnTypes to true, as shown in the snippet below.
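For example, here is a minimal sketch of the read side with type inference enabled (the paths are placeholders you would fill in):
Python:
(spark.readStream.format("cloudFiles")
  .option("cloudFiles.format", "json")
  # Infer actual column types (ints, doubles, structs, ...) instead of defaulting everything to string
  .option("cloudFiles.inferColumnTypes", "true")
  .option("cloudFiles.schemaLocation", "<path-to-checkpoint>")
  .load("<path-to-source-data>")
)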
- Triggering the Stream - The stream will be triggered automatically based on the incoming data in the specified source path. Whenever new files are added to the source directory, Auto Loader will process them, infer the schema, and update the target location accordingly. If you prefer to run the pipeline as a scheduled job rather than a continuously running stream, you can add an explicit trigger, as in the example below.
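As an optional variation (not part of the snippet above), the standard Structured Streaming availableNow trigger processes everything that has arrived since the last run and then stops, which suits scheduled jobs:
Python:
(spark.readStream.format("cloudFiles")
  .option("cloudFiles.format", "json")
  .option("cloudFiles.schemaLocation", "<path-to-checkpoint>")
  .load("<path-to-source-data>")
  .writeStream
  # Process all files discovered since the last run, then stop the stream
  .trigger(availableNow=True)
  .option("checkpointLocation", "<path-to-checkpoint>")
  .start("<path-to-target>")
)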
- Handling Nested Data - If you have nested JSON objects, you may need to perform additional transformations after loading the data. You can use the semi-structured data access APIs, or ordinary Spark column expressions, to manipulate the nested structure as needed; a sketch follows below.
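For illustration, assuming the inferred schema contains a struct column named payload with a nested user object and an items array (these field names are hypothetical, substitute your own), you could flatten it with standard Spark functions:
Python:
from pyspark.sql.functions import col, explode

df = (spark.readStream.format("cloudFiles")
  .option("cloudFiles.format", "json")
  .option("cloudFiles.inferColumnTypes", "true")
  .option("cloudFiles.schemaLocation", "<path-to-checkpoint>")
  .load("<path-to-source-data>"))

flattened = df.select(
  # Dot notation reaches into nested struct fields
  col("payload.user.id").alias("user_id"),
  col("payload.user.name").alias("user_name"),
  # explode() produces one output row per element of the items array
  explode(col("payload.items")).alias("item"))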
For additional information, please refer to the documentation below:
Configure schema inference and evolution in Auto Loader
How does Auto Loader infer schema?
By following these steps, you can successfully create an Auto Loader for your nested JSON data and ensure that it processes incoming data efficiently.
I hope this information helps. Please do let us know if you have any further queries.
If this answers your query, do click Accept Answer and Yes if the answer was helpful.