ShailendraKad-8896 asked ShailendraKad-8896 commented

Read a JSON file generated from a RavenDB export that has duplicate columns

Hi Team,

I want to load a JSON file generated from a RavenDB export.
It is a rather complex file with a lot of arrays and strings in it.
The only issue is that two of its columns are duplicated.
Ideally this JSON would not be considered valid, since those two columns are present in the file multiple times.
The sample structure is as below:
Docs[]
Attachments
Docs[]
Attachments
Indexes[]
Transformers[]
Docs[]


As you can see, the Docs column is repeated multiple times.
And Docs is the imp column, which is an array of documents.

In the source of the data flow, I am getting a duplicate column error:
{"message":"Job failed due to reason: at Source 'Json': org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the data schema: Attachments, Docs;.

I am also trying to read this file as a delimited file, to see whether I can remove the duplicates that way.
Do you have any solution for how I can process it?

Or is there any other way I can load it?

azure-data-factory azure-synapse-analytics

Hello @ShailendraKad-8896 and welcome to Microsoft Q&A.

If you can read as a delimited file, then try using index / ordinal based mapping rather than named mapping.
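To illustrate the difference between named and ordinal mapping, here is a minimal Python sketch. It is only an analogy, not Data Factory configuration: the three-column delimited text and the variable names are made up for the example. With named mapping a repeated column name collides, while positional access keeps every column.

```python
import csv
import io

# Hypothetical delimited text whose header repeats the "Docs" name.
text = "Docs,Attachments,Docs\nd1,a1,d2\n"

# Named mapping: each row becomes a dict keyed by header name,
# so the second "Docs" value overwrites the first.
named = next(csv.DictReader(io.StringIO(text)))

# Ordinal mapping: read the header and the row as plain lists and
# address columns by position, so nothing is lost.
reader = csv.reader(io.StringIO(text))
header = next(reader)
row = next(reader)
by_position = {
    f"{name}_{index}": value
    for index, (name, value) in enumerate(zip(header, row))
}

print(named)
print(by_position)
```

The positional version gives each column a unique name by appending its index, which is roughly what an index-based mapping does for you.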

What is an "imp column"?

Can you help me get a better idea of what a sample record would look like? You said it was JSON, so I am guessing something like:

 {
   "Docs":["d1","d2","d3"],
   "Attachments": "123.jpg",
   "Docs":["d1","d2","d3"],
   "Attachments": "123.jpg",
   "Indexes":[1,5,12],
   "Transformers":["foo","bar"],
   "Docs":["d1","d2","d3"]
 }

It is important to recognize whether the data is also duplicated, or the data is different and in need of merging.
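One way to check this outside of the data flow is a small Python sketch along these lines. It assumes the file fits in memory, and the sample text and the names `make_duplicate_reporter` and `report` are hypothetical. The standard `json` module silently keeps only the last value for a repeated key, but its `object_pairs_hook` parameter receives every key/value pair, so repeats can be detected and compared.

```python
import json


def make_duplicate_reporter(report):
    """Build an object_pairs_hook that records repeated keys.

    For every repeated key, (key, same_as_previous) is appended to
    `report`, where same_as_previous tells whether the repeated value
    equals the value seen just before it for that key.
    """
    def hook(pairs):
        seen = {}
        for key, value in pairs:
            if key in seen:
                report.append((key, seen[key] == value))
            seen[key] = value
        return seen
    return hook


# Made-up sample: "Docs" repeats with different data,
# "Attachments" repeats with identical data.
sample = """
{
  "Docs": ["d1", "d2"],
  "Attachments": "123.jpg",
  "Docs": ["d3"],
  "Attachments": "123.jpg",
  "Indexes": [1, 5, 12]
}
"""

report = []
json.loads(sample, object_pairs_hook=make_duplicate_reporter(report))
print(report)
```

A `False` entry for a key means the repeated occurrences carry different data and would need merging rather than de-duplication.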


Thanks for the reply.
Yes, it is JSON.
The data is not duplicated, but the columns are.

A sample record looks like this:
 {
   "Docs": [
     {
       "RunOn": "2020-04-03T04:50:28.1064257Z",
       "Version": 1,
       "Client": "All",
       "DatabaseType": "Client",
       "IndexName": "DeclarationLogs/Search",
       "@metadata": {
         "Raven-Entity-Name": "IndexUpdates",
         "Raven-Clr-Type": "Cas.Common.Domain.DbModel.IndexUpdate.IndexUpdate, Cas.Root",
         "Ensure-Unique-Constraints": [],
         "@id": "IndexUpdates/0001",
         "Last-Modified": "2020-04-03T04:50:28.1072484Z",
         "Raven-Last-Modified": "2020-04-03T04:50:28.1072484",
         "@etag": "01000000-0000-0001-0000-000000000001",
         "Non-Authoritative-Information": false
       }
     }

The other columns are almost the same,
but I am interested only in the Docs column.

I did not understand what index-based mapping is.
I tried to read this JSON file as a delimited file, but it reads every line and needs a row and column delimiter.
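Since only the Docs column is of interest, one possible workaround is to pre-process the export before the data flow reads it. The following is a minimal Python sketch under some assumptions: the file fits in memory, every Docs occurrence is an array, and the document ids other than IndexUpdates/0001 are invented for the example. The helper names `merge_repeated_arrays` and `extract_docs` are hypothetical.

```python
import json


def merge_repeated_arrays(pairs):
    """object_pairs_hook: concatenate arrays that share a key.

    If a key repeats and both values are lists, the lists are joined;
    for any other repeated key the last value wins, as in a plain dict.
    Note that the hook is applied to nested objects as well.
    """
    merged = {}
    for key, value in pairs:
        if key in merged and isinstance(merged[key], list) and isinstance(value, list):
            merged[key] = merged[key] + value
        else:
            merged[key] = value
    return merged


def extract_docs(raw_text):
    """Return every document from all "Docs" arrays as one list."""
    export = json.loads(raw_text, object_pairs_hook=merge_repeated_arrays)
    return export.get("Docs", [])


# Made-up export with the repeated top-level keys described above.
raw = """
{
  "Docs": [{"@metadata": {"@id": "IndexUpdates/0001"}}],
  "Attachments": [],
  "Docs": [{"@metadata": {"@id": "IndexUpdates/0002"}}],
  "Attachments": [],
  "Indexes": [],
  "Transformers": [],
  "Docs": [{"@metadata": {"@id": "IndexUpdates/0003"}}]
}
"""

docs = extract_docs(raw)
ids = [doc["@metadata"]["@id"] for doc in docs]

# Re-serialize with a single "Docs" key so the result has no duplicate columns.
clean_json = json.dumps({"Docs": docs})
print(ids)
```

The resulting `clean_json` text has only one Docs key, so a JSON source should no longer report duplicate columns for it. For very large exports a streaming parser would likely be needed instead of loading the whole file.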


I do not see the duplicates in your sample record, @ShailendraKad-8896.

Is this something you want to bring in through copy activity or data flow? Synapse or Data Factory?


0 Answers