Default Copy / Merge Activity Behaviour - Different Columns

Todd Lazure 46 Reputation points
2022-04-06T17:03:00.677+00:00

I am encountering issues where the ADF Copy activity's default behaviour is not mapping columns by name in a case-sensitive manner, as documented at https://learn.microsoft.com/en-us/azure/data-factory/copy-activity-schema-and-type-mapping

I have a set of files that I need to merge - for example, 2020.parquet, 2021.parquet, and 2022.parquet.

2020.parquet has the following schema:
timeentry, createdby, actualdate

2021.parquet has the following schema:
createdby, timeentry, actualdate

2022.parquet has the following schema:
actualdate, createdby, timeentry, comments
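
(For anyone who wants to reproduce this outside ADF, the three input files can be generated with a short pyarrow sketch like the one below; the row values are invented for illustration.)

```python
import pyarrow as pa
import pyarrow.parquet as pq
from datetime import datetime

# 2020.parquet: timeentry, createdby, actualdate
pq.write_table(pa.table({
    "timeentry": ["08:00"],
    "createdby": ["user_a"],
    "actualdate": [datetime(2020, 1, 6)],
}), "2020.parquet")

# 2021.parquet: same columns, different order
pq.write_table(pa.table({
    "createdby": ["user_b"],
    "timeentry": ["06:30"],
    "actualdate": [datetime(2021, 1, 4)],
}), "2021.parquet")

# 2022.parquet: different order again, plus an extra comments column
pq.write_table(pa.table({
    "actualdate": [datetime(2022, 1, 3)],
    "createdby": ["user_c"],
    "timeentry": ["07:15"],
    "comments": ["example comment"],
}), "2022.parquet")
```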

  • If I attempt to merge these files using a Parquet dataset with no defined schema and a copy activity with no defined mapping and the Merge files copy behaviour, the copy fails because it tries to write actualdate (a timestamp) into timeentry (a string) and vice versa. The error message is as follows: Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=An unknown error occurred.,Source=Microsoft.DataTransfer.Common,''Type=System.InvalidOperationException,Message=Invalid datetime object,Source=Microsoft.DataTransfer.Richfile.ParquetBridge
  • When I troubleshoot with an intermediate CSV step, I get a file with the four columns timeentry, createdby, actualdate, comments, but the files are merged positionally: 2021's createdby is written in the timeentry column, 2022's timeentry ends up in the actualdate column, and so on. The CSV files have a header row and First row as header is checked in the dataset.
  • I've cleared the schemas on the datasets and the mapping on the activity multiple times, but the issue persists.
  • I cannot explicitly map the schema, as doing so results in a 'column comments not found' error for 2020.parquet and 2021.parquet, and leaving comments out of the mapping prevents that data from flowing to the sink.

Isn't the intended effect of the Merge files copy behaviour to produce a single file with timeentry, createdby, actualdate, and comments columns, with the data mapped appropriately?
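
For comparison, this is the by-name alignment I'm expecting. pandas does exactly this, for example: concat matches columns by name regardless of order and fills the missing comments values with nulls (a local sketch, not ADF itself):

```python
import pandas as pd

# Read the three files and union them by column name, not position.
frames = [pd.read_parquet(f) for f in ("2020.parquet", "2021.parquet", "2022.parquet")]
merged = pd.concat(frames, ignore_index=True)

print(merged.columns.tolist())
# ['timeentry', 'createdby', 'actualdate', 'comments']
# Rows from 2020/2021 get null in the comments column.
merged.to_parquet("merged.parquet")
```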

Accepted answer
MartinJaffer-MSFT 26,061 Reputation points
2022-04-08T19:39:46.05+00:00

Hello @Todd Lazure and welcome to Microsoft Q&A.

So the goal is to merge several files whose columns are in differing order, one of which has more columns than the rest. The mapping, whether default or explicit, takes the column order of one file and applies it to the others, when it should be matching each column by name.

I have reproduced your findings. The advertised behavior of default schema mapping does not cooperate with merging files. I suspect that when the merge feature was designed, it was assumed it would be used for merging like files with the same number and order of columns.

I have found a way around the comments column. In the source I added the "additional column" option, chose "comments" for the name, and gave it a static, custom value. I found the custom value showed up for the rows which didn't have comments in the source; rows that did have comments in the source kept their data and did not use the custom value. However, I found the mismatched data persisted!
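
For reference, that setting lives under the copy activity source's additionalColumns property. Sketched below as a Python dict mirroring the pipeline JSON; the static value "n/a" is just a placeholder I chose:

```python
# Sketch of the copy activity's source block (a Python dict mirroring
# the pipeline JSON). "n/a" is an arbitrary placeholder value.
copy_source = {
    "type": "ParquetSource",
    "additionalColumns": [
        # In my test, this value appeared only on rows whose source file
        # lacked a comments column; rows that had one kept their data.
        {"name": "comments", "value": "n/a"},
    ],
}
```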

I agree this is particularly frustrating. I will report this internally and submit a feature request.

I also tried this in Data Flow and found that the data mismatch persisted there too, though it did handle the fourth column better.
