External Table throws error when queried in Synapse Dedicated SQL Pool When FIRST_ROW is set as 1

Question

I am trying to create an external table to access CSV files in Azure data lake (gen 2).

Background : We use the trickefeed offering by Microsoft Dynamics 365 Finance which moves all our data from Dynamics 365 to a data lake. The files in the data lake are in CDM format with no headers. This is not something I can change.
We now want to get this data into our dedicated SQL pool to use in our data warehouse.
External tables seem like the simplest way of getting this done.

I have followed the documentation mentioned to create external tables

Created a Credential
Created a data source
Created a file format
Created an external table

Strange thing is that it works. But only when I use FIRT_ROW=2 while creating the file format.
The issue is FIRT_ROW=2 leaves out the header row and I don't have a header row in my CSV, so there is a data loss of one row * The number of csv files.
If I leave out this configuration while creating the file format it sets a default of 1 after which when I query the table I get the error -
'HdfsBridge::recordReaderFillBuffer - Unexpected error encountered filling record reader buffer: HadoopExecutionException: Non-matching string delimiter.'

This works -

CREATE EXTERNAL FILE FORMAT [CSVHeader]
WITH (
FORMAT_TYPE = DELIMITEDTEXT
,FORMAT_OPTIONS(FIELD_TERMINATOR = ',', STRING_DELIMITER = '"', USE_TYPE_DEFAULT = False, FIRST_ROW=2)
)

But this doesn't -

CREATE EXTERNAL FILE FORMAT [CSVNoHeader]
WITH (
FORMAT_TYPE = DELIMITEDTEXT
,FORMAT_OPTIONS(FIELD_TERMINATOR = ',', STRING_DELIMITER = '"', USE_TYPE_DEFAULT = False)
)

Any help with this will be greatly appreciated.
Let me know if I can help with any other information.

Answer

Hi @Saurabh Sharma ,

Thank you for the quick reply.

I figured out what the issue was with this one, unfortunately I have hit another issue.
But let me first explain what resolved this one -
There was another file in the same folder called index.json , this is created by default by the tricklefeed (Microsoft d365 to the data lake) to save metadata on the table load.
Since this json file has one row, when I set FIRT_ROW=1 it errors out because that row does not have the same format as the csv files.

The data in the json file looked like
{"partitions":[{"partitionId":"CASHDISC_00001.csv","startRecId":0,"endRecId":5637149077,"completed":true}],"entityKey":"a28208f3-005a-4d91-817b-1eff94506ce8"}

To get by this I added ,REJECT_TYPE = VALUE ,REJECT_VALUE = 1 to the table creation.
This it rejects that one row in index.json and returns all the data from the CSV.
Let me know if there is a better way to do this?

Now to my current issue -

For one folder X (Calling it X for simplicity)
There are three csv's in the folder
X_1, X_2 and X_3
At one point there was a new column introduced from Dynamics 365 application side, this caused the new column to only come in from csv X_3.
X_1 and X_2 have 40 columns but X_3 has 42 columns from midway.
Example data in X_3 -
1,"Hello",52
2,"Hello1",53
1,"Hello",52,"cc"
1,"Hello",52,"dd"

When I run a select of the external table I get this-
Rows were rejected while reading from external source(s).
7 rows rejected from external table [DimensionAttributeValueCombination_Ext] in plan step 2 of query execution:
Msg 107090, Level 16, State 1, Line 78
107090;Query aborted-- the maximum reject threshold (1 rows) was reached while reading from an external source: 3 rows rejected out of total 3 rows processed. Ensure external data's file format and delimiters/terminators are consistent with source data (and where applicable, that source column counts/types match target).

Is there any way around this?
If I create a view in the serverless SQL pool on the same csv's this works fine , it's smart enough to understand this and not reject rows the but the external table fails.

The entire point of this exercise is that we want data from the lake to be easily accessible from the dedicated SQL pool to join with other tables in the current warehouse. We are stuck on external tables because of this issue.
The alternative is having pipelines which would create a copy of the data and we want to avoid that.

External Table throws error when queried in Synapse Dedicated SQL Pool When FIRST_ROW is set as 1

1 answer