How to create a dynamic JSON file (for Elasticsearch bulk API upsert) using Data Factory

Clifford Gentiles 21 Reputation points
2022-11-15T15:17:44.16+00:00

Hello, I am new to Azure Data Factory and I need to create a JSON file for the Elasticsearch bulk API upsert, with the following considerations:

  1. The input is in JSON format and will be used as the payload for the upsert API; each row consists of an array and several objects.
  2. I need to create a dynamic JSON output, with 2 rows of output for each row of input; please see the sample below.
  3. The output JSON file should end with a newline (\n).
  4. I have tested the bulk API upsert using Postman; I just need to create the payload dynamically using either a pipeline activity, a dataflow, or a PySpark notebook.

I am also open to editing the dataflow that created the input JSON file (which sources from a Parquet file) so that it produces the desired output JSON.

Sample:
input json:

    row1: {"array1":["value1a","value1b"],"object2":"value2","object3":"value3","object4":"value4","object5":"value5"}  
    row2: {"array6":["value6a","value6b"],"object7":"value7","object8":"value8","object9":"value9","object10":"value10"}  

output json:

    {"update": {"_id": <object2_object3>, "_index": <constant literal>}}       <--- need to get the objects from input row1  
    {"doc": <whole input json row1>, "doc_as_upsert" : true}                      <--- will use the whole input row1  
    {"update": {"_id": <object7_object8>, "_index": <constant literal>}}       <--- need to get the objects from input row2  
    {"doc": <whole input json row2>, "doc_as_upsert" : true}                       <--- will use the whole input row2  
                                                                                                                     <--- should have empty line at the end  
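
For reference, here is the target payload rendered with the sample values filled in (the index name `my-index` is just a placeholder). The bulk API requires the `Content-Type: application/x-ndjson` header and the trailing newline:

    POST /_bulk HTTP/1.1
    Content-Type: application/x-ndjson

    {"update": {"_id": "value2_value3", "_index": "my-index"}}
    {"doc": {"array1":["value1a","value1b"],"object2":"value2","object3":"value3","object4":"value4","object5":"value5"}, "doc_as_upsert": true}
    {"update": {"_id": "value7_value8", "_index": "my-index"}}
    {"doc": {"array6":["value6a","value6b"],"object7":"value7","object8":"value8","object9":"value9","object10":"value10"}, "doc_as_upsert": true}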

Thanks for your help, guys.

Azure Data Factory
1 answer

  1. AnnuKumari-MSFT 34,556 Reputation points Microsoft Employee Moderator
    2022-11-16T13:01:28.267+00:00

    Hi @Clifford Gentiles,

    Welcome to the Microsoft Q&A platform, and thanks for posting your question here.

    Thank you for providing the additional details requested. There are a few things in the requirement that we need to modify in order to make it feasible.

    1. The input data is a combination of two JSON documents; however, the two are not enclosed within a single JSON object. In order to treat this as one JSON file, we need to combine the two into a nested JSON:

    {"row1":{"array1":["value1a","value1b"],"object2":"value2","object3":"value3","object4":"value4","object5":"value5"}, "row2":{"array6":["value6a","value6b"],"object7":"value7","object8":"value8","object9":"value9","object10":"value10"}}

    2. The expected output is again a combination of 4 JSON documents. If you check it in any JSON validator, it will throw an error, as it is not enclosed in a key-value pair structure. So we should transform the data into a valid JSON format. Please let me know whether this output format meets the requirement:

    {"json":{"json1":{"update":{"_id":"value2_value3","_index":1},"docJSON":{"doc":{"array1":["value1a","value1b"],"object2":"value2","object3":"value3","object4":"value4","object5":"value5"},"doc_as_upsert":"true"}},"json2":{"json2":{"update":{"_id":"value7_value8","_index":1},"docJSON":{"doc":{"array6":["value6a","value6b"],"object7":"value7","object8":"value8","object9":"value9","object10":"value10"},"doc_as_upsert":"true"}}}}}  
    

    Kindly follow the below steps in the mapping dataflow:
    1. Add a source transformation pointing to the input JSON dataset. Select 'Array of documents' as the Document form in the Source options tab.

    2. Add a Derived Column transformation to create a column called JSON with the following expression: @(json1=@(update=@({_id}=row1.object2 + '_' + row1.object3, {_index}=1), docJSON=@(doc=row1, doc_as_upsert='true')), json2=@(update=@({_id}=row2.object7 + '_' + row2.object8, {_index}=1), docJSON=@(doc=row2, doc_as_upsert='true')))

    3. Add a Select transformation to remove row1 and row2 and keep only the JSON column.
    4. Add a Sink transformation with a JSON dataset, select 'Output to single file' as the File name option, and provide the file name in the 'Output to single file' textbox. Set Single partition in the Optimize tab.
    5. Create a new ADF pipeline, add the dataflow, and execute it.

    [Animated GIF: 260957-transformjson1.gif, a walkthrough of the dataflow steps above]
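
    Alternatively, since you mentioned that a PySpark notebook is also an option: as noted in point 2 above, the dataflow sink expects valid JSON, so if you strictly need the raw NDJSON bulk format (two lines per input row, ending with a newline), a notebook gives you full control over the output text. Below is a minimal sketch, assuming a Synapse notebook (where `spark` and `mssparkutils` are available), an input file with one JSON document per line, and placeholder paths and index name:

        import json
        from notebookutils import mssparkutils  # Synapse notebook utilities

        # Placeholder paths and index name -- adjust to your storage and index.
        INPUT_PATH = "abfss://container@account.dfs.core.windows.net/input/rows.json"
        OUTPUT_PATH = "abfss://container@account.dfs.core.windows.net/output/bulk.json"
        INDEX_NAME = "my-index"

        # Read the input as raw text: one JSON document per line.
        # collect() pulls everything to the driver, so this sketch suits
        # modest file sizes only.
        rows = spark.read.text(INPUT_PATH).collect()

        bulk_lines = []
        for r in rows:
            doc = json.loads(r.value)
            # Build the _id from the 2nd and 3rd keys of each row
            # (object2/object3, object7/object8, ... as in your sample);
            # this relies on the key order in the file being preserved.
            keys = list(doc.keys())
            doc_id = f"{doc[keys[1]]}_{doc[keys[2]]}"
            bulk_lines.append(json.dumps({"update": {"_id": doc_id, "_index": INDEX_NAME}}))
            bulk_lines.append(json.dumps({"doc": doc, "doc_as_upsert": True}))

        # Join with newlines and add the trailing newline the bulk API requires.
        payload = "\n".join(bulk_lines) + "\n"
        mssparkutils.fs.put(OUTPUT_PATH, payload, True)  # overwrite if it exists

    The resulting file can then be posted to the _bulk endpoint as-is, for example from a Web activity or from the notebook itself.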

    Hope this helps. Please let us know if you have any further queries.

