How to flatten a nested JSON file into a CSV file using an ADF pipeline
How can I flatten a nested JSON file into a CSV file using an ADF pipeline? Using the example below, I need to get each table's details into an Excel/CSV file, built from the example JSON below.
Ex:
{
"Table1": {
"lastSuccessful": "12467",
"totalCount": 11,
"dataPath": "s3://ddwe-dev-3-gwcpdev-gw-52/qa/pc/171897363360/Table1",
"schemaHistory": {
"1see4c4fe3ac4830a9cd09e4rt4a": "171897363360"
}
},
"Table2": {
"lastSuccessful": "12467",
"totalCount": 11,
"dataPath": "s3://ddwe-dev-3-gwcpdev-gw-52/qa/pc/171897363360/Table2",
"schemaHistory": {
"1see4c4fe3ac4830a9cd09e4rt4a": "171897363360"
}
}
}
Azure Data Factory
-
phemanth 10,335 Reputation points • Microsoft Vendor
2024-03-28T09:58:52.8233333+00:00 @Bommisetty, Rakesh (Hudson IT Consultant)
Thanks for reaching out to Microsoft Q&A.
Here's a detailed guide incorporating the best aspects of previous responses and addressing potential issues:
Steps:
- Create an ADF Pipeline: In the Azure portal, navigate to your Data Factory service. Click on "Pipelines" and then "Create pipeline." Give your pipeline a descriptive name (e.g., "FlattenNestedJsonToCsv").
- Add a Data Flow Activity: Drag and drop a "Data Flow" activity from the "Activities" pane onto the designer canvas.
- Configure the Source Dataset: Double-click the "Data Flow" activity. Click on "Source" and choose the data source where your nested JSON file resides. This could be Azure Blob Storage, ADLS Gen2, or another supported source. Select the appropriate format (e.g., JSON) and configure the file path or connection details.
- Add a Flatten Transformation: Drag and drop a "Flatten" transformation from the "Transformations" pane onto the data flow canvas. Connect the source dataset to the "Flatten" transformation.
- Configure the Flatten Transformation: Double-click the "Flatten" transformation. In the "Unroll by" section, select the nested array you want to flatten. In your example, this would be the top-level object with keys like "Table1" and "Table2." Optionally, under "Output Columns," you can: Rename columns for better readability in the CSV. Select specific nested properties within the arrays to include in the output.
For details, see: https://learn.microsoft.com/en-us/azure/data-factory/data-flow-flatten
- Add a Sink Dataset: Drag and drop a "Sink" onto the data flow canvas. Connect the "Flatten" transformation to the "Sink" dataset.
- Configure the Sink Dataset: Double-click the "Sink" dataset. Choose the data store where you want to save the flattened CSV file. This could be Azure Blob Storage, ADLS Gen2, or another supported destination. Specify the file format as "CSV" and configure the file path or connection details. Set the appropriate delimiter (e.g., comma) for your CSV file.
- Preview Data and Publish: Click on "Run" at the top of the data flow canvas to preview the flattened data. This helps ensure the transformation works as expected. Once satisfied, click "Publish" to publish your ADF pipeline. Hope this helps. Do let us know if you have any further queries.
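Outside ADF, the flattening these steps aim for can be sketched in plain Python to make the expected CSV shape concrete (a sketch only, not ADF code; the column names are taken from the example JSON in the question):

```python
import csv
import io
import json

# The example JSON from the question: a top-level object keyed by table name.
raw = """{
  "Table1": {"lastSuccessful": "12467", "totalCount": 11,
             "dataPath": "s3://ddwe-dev-3-gwcpdev-gw-52/qa/pc/171897363360/Table1",
             "schemaHistory": {"1see4c4fe3ac4830a9cd09e4rt4a": "171897363360"}},
  "Table2": {"lastSuccessful": "12467", "totalCount": 11,
             "dataPath": "s3://ddwe-dev-3-gwcpdev-gw-52/qa/pc/171897363360/Table2",
             "schemaHistory": {"1see4c4fe3ac4830a9cd09e4rt4a": "171897363360"}}
}"""

data = json.loads(raw)

# One CSV row per table: the table name becomes a column instead of a key.
rows = [
    {
        "tableName": name,
        "lastSuccessful": details["lastSuccessful"],
        "totalCount": details["totalCount"],
        "dataPath": details["dataPath"],
    }
    for name, details in data.items()
]

# Write the rows to CSV (an in-memory buffer here; a file in practice).
buf = io.StringIO()
writer = csv.DictWriter(
    buf, fieldnames=["tableName", "lastSuccessful", "totalCount", "dataPath"])
writer.writeheader()
writer.writerows(rows)
csv_text = buf.getvalue()
```

The output has a header row plus one row per table, which is the shape the pipeline above should produce.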
-
phemanth 10,335 Reputation points • Microsoft Vendor
2024-03-29T07:15:14.22+00:00 @Bommisetty, Rakesh (Hudson IT Consultant) We haven't heard from you on the last response and were just checking back to see if you have a resolution yet. If you do, please share it with the community, as it can be helpful to others. Otherwise, reply with more details and we will try to help.
-
Bommisetty, Rakesh (Hudson IT Consultant) 170 Reputation points
2024-04-01T04:47:37.9766667+00:00 The Flatten transformation is not working because of the {...} structure, and the "Unroll by" option is not highlighting the table names; it is disabled. That's the reason I mentioned the example. Can you please try with my example?
-
phemanth 10,335 Reputation points • Microsoft Vendor
2024-04-01T11:43:46.49+00:00 @Bommisetty, Rakesh (Hudson IT Consultant) You're absolutely right. The ellipsis (...) in your JSON example indicates there might be additional nested structures within the "Table1" and "Table2" objects. The "Unroll by" option in the Flatten transformation only works if you're directly flattening an array at the top level. Here's how to adjust your ADF pipeline to handle the nested structure in your specific JSON:
1. Use Nested Flatten Transformation:
- Instead of a single Flatten transformation, add two nested Flatten transformations.
2. First Flatten Transformation:
- Configure the first Flatten transformation with: Unroll by: Select the root array name ("GoCopy").
3. Second Flatten Transformation for Each Table Object:
- After the first Flatten transformation, add a separate Flatten transformation for each table object ("Table1" and "Table2").
- For each nested Flatten transformation: Source: Connect it to the output of the previous Flatten transformation. Unroll by: Select the specific table object name (e.g., "Table1" for the first nested Flatten and "Table2" for the second).
4. Select Transformation (Optional):
- If you only need specific data from each table object, add a "Select" transformation after each nested Flatten transformation.
- Choose the desired fields (e.g., "lastSuccessful", "totalCount", etc.) in the "Selected fields" section.
5. Sink Transformation (CSV Output):
- Finally, add a "Sink" transformation after the last nested Flatten or Select transformation (depending on your choice).
- Configure the Sink settings for your CSV output format.
Here's a revised breakdown of the pipeline structure:
Source -> Flatten ("GoCopy") -> Flatten ("Table1") -> Optional: Select -> Sink (CSV)
                             -> Flatten ("Table2") -> Optional: Select -> (same Sink)
By using nested Flatten transformations, you'll be able to unroll each table object within the root array and then optionally select specific fields before writing them to the CSV file. This approach effectively addresses the limitation of the "Unroll by" option with nested structures.
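The two-stage unroll described above can be sketched in plain Python (illustrative only; field names come from the example JSON, and dataPath is omitted for brevity):

```python
import json

raw = json.loads("""{
  "Table1": {"lastSuccessful": "12467", "totalCount": 11,
             "schemaHistory": {"1see4c4fe3ac4830a9cd09e4rt4a": "171897363360"}},
  "Table2": {"lastSuccessful": "12467", "totalCount": 11,
             "schemaHistory": {"1see4c4fe3ac4830a9cd09e4rt4a": "171897363360"}}
}""")

# Stage 1: unroll the top-level map into one record per table.
stage1 = [{"tableName": name, **details} for name, details in raw.items()]

# Stage 2: unroll each schemaHistory map into one row per schema entry.
stage2 = [
    {
        "tableName": r["tableName"],
        "lastSuccessful": r["lastSuccessful"],
        "totalCount": r["totalCount"],
        "schemaId": schema_id,
        "schemaValue": schema_value,
    }
    for r in stage1
    for schema_id, schema_value in r["schemaHistory"].items()
]
```

Each output row in `stage2` pairs one table with one schemaHistory entry, which is what chained Flatten transformations would emit.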
-
Bommisetty, Rakesh (Hudson IT Consultant) 170 Reputation points
2024-04-02T03:41:10.3333333+00:00 The Flatten transformation is not helping in my case; it is still showing un-highlighted objects in the "Unroll by" option. Can you please use the same example I provided and let me know?
-
phemanth 10,335 Reputation points • Microsoft Vendor
2024-04-02T11:49:50.37+00:00 Hi @Bommisetty, Rakesh (Hudson IT Consultant)
It looks like the way your JSON data is organized is not quite like a list, but more like separate sections (Table1, Table2, etc.), each with their own details like lastSuccessful, totalCount, dataPath, and schemaHistory.
The tool we're using, Azure Data Factory’s Mapping Data Flow, can help flatten arrays in JSON, but it might struggle with complex nested structures like yours. There's an option called "unroll by" that helps with flattening arrays, but it might not work if the tool doesn't recognize your data as an array.
Could you please confirm if you still want to flatten this data? You mentioned you wanted to flatten a JSON array, but it seems your data structure is a bit different.
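To illustrate the point above: the top level of the example parses as a JSON object (a dict keyed by table name), not an array, which is why "Unroll by" has nothing to select. One possible workaround, shown here as an assumption (the wrapper name "tables" is hypothetical), is a small preprocessing step that rewrites the object into an array of records, which a Flatten transformation could then unroll:

```python
import json

# A trimmed version of the example: the top level is an object, not an array.
source = json.loads('{"Table1": {"totalCount": 11}, "Table2": {"totalCount": 11}}')

# This is why "Unroll by" stays disabled: there is no array to unroll.
assert not isinstance(source, list)

# Rewrite the object into an array of records under a wrapper key,
# so that "Unroll by" can target the "tables" array.
converted = {
    "tables": [{"tableName": name, **details}
               for name, details in source.items()]
}
converted_json = json.dumps(converted)
```

After this rewrite the JSON contains a real array, so the Flatten transformation's "Unroll by" option can select it.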
-
Bommisetty, Rakesh (Hudson IT Consultant) 170 Reputation points
2024-04-02T14:00:45.5966667+00:00 Hi Hemanth,
It's not necessary to flatten the nested JSON with the Flatten transformation; we can use another activity instead.
-
phemanth 10,335 Reputation points • Microsoft Vendor
2024-04-03T14:18:35.22+00:00 @Bommisetty, Rakesh (Hudson IT Consultant)
Here are two options you can consider:
1. Select Activity:
- This approach leverages the Select activity in your data flow to manipulate the JSON data without using the Flatten transformation.
Steps:
- After adding the JSON source to the data flow, drag and drop a "Select" activity.
- Connect the JSON source to the Select activity.
- In the Select activity, configure the output schema: create new columns for each desired element from the top level (e.g., "Table1", "Table2"). For each top-level element column, use dynamic content to access its nested properties. For example, to access "lastSuccessful" under "Table1", use the expression item()['Table1']['lastSuccessful'].
{
  "select": [
    { "name": "Table1_lastSuccessful", "expression": "item()['Table1']['lastSuccessful']" },
    { "name": "Table1_totalCount", "expression": "item()['Table1']['totalCount']" }
    // ... add expressions for other desired properties, and a similar set for "Table2"
  ]
}
2. Derived Column Activity:
- This approach utilizes the Derived Column activity to create new columns based on expressions that extract data from the nested JSON.
Steps:
- After adding the JSON source to the data flow, drag and drop a "Derived Column" activity.
- Connect the JSON source to the Derived Column activity.
- In the Derived Column activity, create new columns for each desired element from the top level (similar to the Select activity).
- Use dynamic content expressions within the derived column definitions to access nested properties.
{
  "columns": [
    { "name": "Table1", "dataType": "string", "defaultValue": null },
    { "name": "Table2", "dataType": "string", "defaultValue": null },
    { "name": "Table1_lastSuccessful", "dataType": "string", "expression": "item()['Table1']['lastSuccessful']" },
    { "name": "Table1_totalCount", "dataType": "int64", "expression": "item()['Table1']['totalCount']" }
    // ... add expressions for other desired properties
  ]
}
Both approaches achieve the same outcome of extracting data from the nested JSON without using the Flatten transformation. The best choice depends on your specific needs and familiarity with ADF features:
- If you prefer a visual approach with pre-built options, the Select activity might be easier to work with.
- If you have more complex expressions or prefer programmatic control, the Derived Column activity could be a better fit.
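For concreteness, here is what the Select / Derived Column expressions above compute, sketched in plain Python (item() corresponds to the source record; the values come from the example JSON):

```python
# The source record as the data flow sees it: one item with the full object.
item = {
    "Table1": {"lastSuccessful": "12467", "totalCount": 11},
    "Table2": {"lastSuccessful": "12467", "totalCount": 11},
}

# Each derived column is one expression applied to the source record,
# mirroring e.g. item()['Table1']['lastSuccessful'] in the ADF config.
derived = {
    "Table1_lastSuccessful": item["Table1"]["lastSuccessful"],
    "Table1_totalCount": item["Table1"]["totalCount"],
    "Table2_lastSuccessful": item["Table2"]["lastSuccessful"],
    "Table2_totalCount": item["Table2"]["totalCount"],
}
```

Either ADF activity produces this one wide record, which the Sink then writes as a CSV row.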