Read a JSON file generated from a RavenDB export that has duplicate columns

Shailendra Kad 11

Hi Team,

I want to load a JSON file generated from a RavenDB export.
It is a rather complex file with a lot of arrays and strings in it.
The only issue is that it has 2 duplicate columns.
I mean, ideally this JSON is not valid, as those 2 columns appear in the file multiple times.
The sample structure is as below:
Docs[]
Attachments
Docs[]
Attachments
Indexes[]
Transformers[]
Docs[]

As you can see, the Docs column is repeated multiple times.
And Docs is the imp column, which is an array of documents.

In the source of the data flow, I am getting a duplicate column error:
{"message":"Job failed due to reason: at Source 'Json': org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the data schema: Attachments, Docs;.

I am also trying to read this file as a delimited file to see whether I can remove the duplicates that way.
Do you have any suggestion for how I can process it?

Or is there any other way I can load it?

MartinJaffer-MSFT 26,026 Reputation points

2021-11-03T21:40:57.75+00:00
Hello @Anonymous and welcome to Microsoft Q&A.

If you can read as a delimited file, then try using index / ordinal based mapping rather than named mapping.
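
To illustrate the difference between the two mapping styles, here is a minimal Python sketch. It is not Data Factory itself, and the header and values are invented for illustration. With named mapping a repeated column name collapses into one entry, while ordinal mapping addresses each column by position, so both copies remain reachable.

```python
import csv
import io

# Hypothetical delimited data with a repeated header name ("Docs").
raw = "Docs,Attachments,Docs\nd1,123.jpg,d2\n"

# Named mapping: each row becomes a dict keyed by header name,
# so the later "Docs" value overwrites the earlier one.
named_rows = list(csv.DictReader(io.StringIO(raw)))

# Ordinal mapping: columns are addressed by position, so duplicates survive.
reader = csv.reader(io.StringIO(raw))
header = next(reader)          # skip the header row
ordinal_rows = list(reader)

first_docs = ordinal_rows[0][0]   # column 1
second_docs = ordinal_rows[0][2]  # column 3
```

In this sketch `named_rows[0]` keeps only two keys, whereas `first_docs` and `second_docs` hold both values.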

What is an "imp column" ?

Can you help me get a better idea of what a sample record would look like? You said it was JSON, so I am guessing something like:

{ "Docs":["d1","d2","d3"], "Attachments": "123.jpg", "Docs":["d1","d2","d3"], "Attachments": "123.jpg", "Indexes":[1,5,12], "Transformers":["foo","bar"], "Docs":["d1","d2","d3"] }

It is important to recognize whether the data is also duplicated, or the data is different and in need of merging.
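
If the repeated keys hold different data that needs merging, one possible approach outside the data flow is to pre-process the file before loading it. The following is a minimal sketch, assuming the file fits in memory and that repeated array-valued keys should simply be concatenated. The helper name `merge_duplicate_keys` and the sample text are hypothetical; `object_pairs_hook` is a standard parameter of Python's `json.loads`.

```python
import json

def merge_duplicate_keys(pairs):
    """Build a dict from (key, value) pairs, combining repeated keys into one list."""
    merged = {}
    for key, value in pairs:
        if key not in merged:
            merged[key] = value
            continue
        # Repeated key: normalise both sides to lists and concatenate.
        existing = merged[key] if isinstance(merged[key], list) else [merged[key]]
        incoming = value if isinstance(value, list) else [value]
        merged[key] = existing + incoming
    return merged

# Hypothetical input shaped like the guess above, with different data per repeat.
text = (
    '{"Docs": ["d1"], "Attachments": "a.jpg", '
    '"Docs": ["d2", "d3"], "Attachments": "b.jpg", "Indexes": [1]}'
)

# The hook is called for every JSON object, receiving its pairs in file order.
doc = json.loads(text, object_pairs_hook=merge_duplicate_keys)

# Re-serialising now yields JSON with a single "Docs" key.
cleaned = json.dumps(doc)
```

The cleaned output could then be written back to storage and used as the source instead of the raw export. For very large exports a streaming parser would be a better fit than `json.loads`.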
Shailendra Kad 11 Reputation points

2021-11-04T07:32:40.353+00:00

Thanks for the reply.
Yes, it is JSON.
The data is not duplicated, but the columns are.

A sample record looks like this:
{
  "Docs": [
    {
      "RunOn": "2020-04-03T04:50:28.1064257Z",
      "Version": 1,
      "Client": "All",
      "DatabaseType": "Client",
      "IndexName": "DeclarationLogs/Search",
      "@metadata": {
        "Raven-Entity-Name": "IndexUpdates",
        "Raven-Clr-Type": "Cas.Common.Domain.DbModel.IndexUpdate.IndexUpdate, Cas.Root",
        "Ensure-Unique-Constraints": [],
        "@id": "IndexUpdates/0001",
        "Last-Modified": "2020-04-03T04:50:28.1072484Z",
        "Raven-Last-Modified": "2020-04-03T04:50:28.1072484",
        "@etag": "01000000-0000-0001-0000-000000000001",
        "Non-Authoritative-Information": false
      }
    }

The other columns are almost the same,
but I am interested only in the Docs column.

I did not understand what index-based mapping is.
I tried to read this JSON file as a delimited file, but it reads every line separately and requires row and column delimiters.
MartinJaffer-MSFT 26,026 Reputation points

2021-11-09T17:59:46.11+00:00

I do not see the duplicates in your sample record, @Anonymous.

Is this something you want to bring in through copy activity or data flow? Synapse or Data Factory?
Shailendra Kad 11 Reputation points

2021-11-10T07:25:08.717+00:00

Hi,
Thanks for the reply.
The sample JSON entry I posted above does not contain the duplicate.
Attached is a sample that has the duplicate column, Docs.

Also, the JSON has multiple arrays, and also arrays nested inside arrays.
Hence, the copy activity is not useful. I tried using Copy, but it takes only the first column and does not support arrays / complex types. This can only be done in data flows using the Flatten transformation.
[148083-sample.txt][1]
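
For readers unfamiliar with flattening, the idea of unrolling an array into one row per element can be sketched in plain Python. This is only a conceptual illustration, not how the Flatten transformation is implemented; the function name `flatten` and the field names `Client` and `Lines` are hypothetical, and in this sketch records with an empty array produce no rows.

```python
def flatten(records, array_field):
    """Yield one row per element of record[array_field], copying the other fields."""
    for record in records:
        parent = {k: v for k, v in record.items() if k != array_field}
        for item in record.get(array_field, []):
            yield {**parent, array_field: item}

# Hypothetical documents, each holding an array of nested objects.
docs = [
    {"Client": "All", "Lines": [{"No": 1}, {"No": 2}]},
    {"Client": "Other", "Lines": [{"No": 3}]},
]

rows = list(flatten(docs, "Lines"))
```

An array inside an array can be handled by applying the same step again to the inner field of each resulting row.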
Shailendra Kad 11 Reputation points

2021-11-10T16:28:24.787+00:00

Also, one more thing to note: if I try to increase the cluster size to more than 64, I get the error below.

{
"errorCode": "134",
"message": "Internal Server Error:Failed to submit job on job cluster. Integration Runtime 'D9F15919-6B5A-4194-98FE-B5D80B6B65B2', ActivityId: 'e27c34b8-8c98-425a-afd1-22a1ff83546a'.",
"failureType": "SystemError",
"target": "DF_Load_CustomsShipment",
"details": []
}