RanjanVineet-4412 asked

How to merge files by picking them dynamically

There are a lot of files with a timestamp suffixed to their filenames. I receive a few of those files each day, and some of them arrive in multiple copies whose timestamps differ only in the hh, mm, or ss component.

I have a pipeline that copies each of those files every day into time-partitioned folders with a granularity of "yyyy/mm/dd", so if multiple copies of a file with different timestamps are present on a day, they overwrite each other when copied into the time-partitioned folder.

I am thinking of merging all copies of a file during the copy, removing duplicate rows after the merge, and suffixing the merged filename with the most recent timestamp among all copies of the file.

How can I achieve this, please?
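For illustration, the desired grouping-and-naming logic can be sketched in Python. This is only a sketch under an assumed filename pattern of `<base>_<YYYYMMDDHHMMSS>.json` (the real pattern may differ); `group_latest` is a hypothetical helper, not part of ADF:

```python
import re
from collections import defaultdict

# Assumed (hypothetical) filename pattern: <base>_<YYYYMMDDHHMMSS>.json
PATTERN = re.compile(r"^(?P<base>.+)_(?P<ts>\d{14})\.json$")

def group_latest(filenames):
    """Group timestamped filenames by base name and, for each base,
    build the merged target name using the most recent timestamp
    among all copies of that file."""
    groups = defaultdict(list)
    for name in filenames:
        m = PATTERN.match(name)
        if m:
            groups[m.group("base")].append(m.group("ts"))
    # Fixed-width YYYYMMDDHHMMSS strings compare correctly lexicographically
    return {base: f"{base}_{max(ts_list)}.json" for base, ts_list in groups.items()}

files = [
    "0CRM_SALES_ACT_1_20211004060000.json",
    "0CRM_SALES_ACT_1_20211004063000.json",
    "ZDSOUTPAYMETH_20211004060000.json",
]
print(group_latest(files))
```

Here the two copies of 0CRM_SALES_ACT_1 collapse to a single merged name carrying the later timestamp.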

Sample file in a container -
[screenshot: image.png]


azure-data-factory

Hello @RanjanVineet-4412,

Welcome to the Microsoft Q&A platform.

I just need one small clarification. You mentioned that the pipelines are copying and overwriting at the destination. Currently, how are you selecting the files to be copied? I also see other .json files there - is there any wildcard you are making use of, or are you looking for ideas on filtering the files? Also, what are you currently using to copy the data - a Copy activity or a mapping data flow?




RanjanVineet-4412 answered

Hi Sathyamoorthy,

Thanks for looking into this. I truly appreciate it.

I am developing in PRE-PROD now, where SAP extract jobs are failing, and hence they unfortunately send multiple copies of a few files with different HHMMSS components in the timestamp.

There were no issues in the DEV and QA environments, so I did not face this problem there.

In PRE-PROD, I am copying each copy of a filename separately; once it is loaded into the SQL DB at the end of the flow, I come back, pick up the next copy of that filename, and process it. Fortunately, only a few files have multiple copies on a given day, but I still need to absorb this use case into my pipeline so that, once scheduled, the pipelines work in all scenarios.


RanjanVineet-4412 answered

I am thinking along the following lines at the moment.

  1. Use a "Get Metadata" activity with "ChildItems" to list all files to copy.

  2. Write this list of files to an Azure SQL table.

  3. Use a "Lookup" activity with a "query" to get the list of distinct filenames without the timestamp.
    Referring to the screenshot of files, the Lookup activity should get me the following -
    0CRM_SALES_ACT_1.json
    CRM_0BP_DEF_ADDRESS_ATTR.json
    ZDSOUTPAYMETH.json

  4. Now I will iterate over this list one file at a time and use a "Copy" activity with the SOURCE File Path type set to "Wildcard File Path", passing <filename>_20*.json, which will pick up all copies of the <file>.

    At the Sink tab of the above "Copy" activity, I will choose "Merge files" for Copy behaviour, set the merged filename to <file>_<latest_timestamp>.json, and write to another container of the storage account.

  5. I will then use a data flow to remove duplicate rows and write to another container. From there I can do the usual processing I have been doing.
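The row-level dedupe in step 5 (what the data flow's distinct/aggregate transformation would effectively do) can be sketched in Python. This assumes the merged file is JSON lines, one row per line; `dedupe_rows` is a hypothetical helper that keeps the first occurrence of each row:

```python
import json

def dedupe_rows(merged_lines):
    """Drop duplicate rows from a merged JSON-lines file,
    keeping the first occurrence and preserving order."""
    seen = set()
    unique = []
    for line in merged_lines:
        # Canonicalise each row so field order does not affect identity
        key = json.dumps(json.loads(line), sort_keys=True)
        if key not in seen:
            seen.add(key)
            unique.append(line)
    return unique

merged = [
    '{"id": 1, "amount": 10}',
    '{"amount": 10, "id": 1}',
    '{"id": 2, "amount": 20}',
]
print(dedupe_rows(merged))
```

Note the first two lines count as the same row even though their fields are in a different order, because each row is canonicalised before comparison.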

Please advise how this sounds. If it's all right, then I need to figure out how to do all of these steps in ADF code.

Thanks
Vineet






