How to merge files by picking them dynamically

Ranjan, Vineet 1 Reputation point
2021-10-17T10:50:11.217+00:00

There are many files with a timestamp suffixed to their filenames. I receive a few of these files each day, and some of them arrive as multiple copies with different timestamps (the hh, mm, or ss component of the timestamp differs between copies).

I have a pipeline that copies each of those files every day into time-partitioned folders with a granularity of "yyyy/mm/dd", so if multiple copies of a file with different timestamps are present on a given day, they overwrite one another when copied into the time-partitioned folder.

I am thinking of merging all copies of a file during the copy, removing duplicate rows after the merge, and suffixing the merged filename with the most recent timestamp among all copies of the file.

How can I achieve this, please?

Sample files in a container (screenshot): 141070-image.png

Azure Data Factory

2 answers

  1. Ranjan, Vineet 1 Reputation point
    2021-10-19T03:14:48.457+00:00

    Hi Sathyamoorthy,

    Thanks for looking into this; I truly appreciate it.

    I am developing in PRE-PROD now, where SAP extract jobs are failing, and unfortunately they are sending multiple copies of a few files with a different HHMMSS component in the timestamp.

    There were no issues in the DEV and QA environments, so I did not face this problem there.

    In PRE-PROD, I am copying each copy of a filename separately; once it is loaded into the SQL DB at the end of the flow, I go back, pick up the next copy of that filename, and process it. Fortunately, only a few files have multiple copies on a given day, but I still need to absorb this use case into my pipeline so that, when I schedule it, the pipeline works in every scenario.


  2. Ranjan, Vineet 1 Reputation point
    2021-10-19T03:57:01.387+00:00

    I am thinking along the following lines at the moment.

    1. Use "Get Metadata" activity with "ChildItems" to list all files for copy.
    2. Write this list of files to a Azure SQL table.
    3. Use a "Lookup" with "query" to get the list of distinct filenames without the timestamp.
      So, referring to the screenshot of files in the question, the Lookup activity should return the following:
      0CRM_SALES_ACT_1.json
      CRM_0BP_DEF_ADDRESS_ATTR.json
      ZDSOUTPAYMETH.json
    4. Now I will iterate over this list one file at a time and use a "Copy" activity with the source File path type set to "Wildcard file path", passing <filename>_20*.json so that all copies of the <file> are picked up (see the third sketch after this list).

    On the Sink tab of the above "Copy" activity, I will choose "Merge files" for the Copy behaviour, pass the merged filename as <file>_<latest_timestamp>.json, and write to another container in the storage account.

    5. I will then use a data flow to remove duplicate rows and write the result to another container. From there I can do the usual processing I have been doing.
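
    For step 1, here is a minimal sketch of the "Get Metadata" activity, assuming a hypothetical dataset named SourceFolder that points at the landing container:

    ```json
    {
        "name": "ListFiles",
        "type": "GetMetadata",
        "typeProperties": {
            "dataset": { "referenceName": "SourceFolder", "type": "DatasetReference" },
            "fieldList": [ "childItems" ]
        }
    }
    ```

    The childItems output is an array of name/type pairs, one per file; the plumbing that writes it to the SQL table in step 2 (for example a ForEach with a stored-procedure activity) is omitted here.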
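    For step 3, a sketch of the Lookup activity. Everything here is assumed for illustration: the list from step 2 landed in a table dbo.FileList with a single FileName column, every name ends in a 14-digit _yyyyMMddHHmmss timestamp plus .json (e.g. 0CRM_SALES_ACT_1_20211017093045.json, so the suffix is a fixed 20 characters), and the dataset is named FileListTable:

    ```json
    {
        "name": "LookupDistinctFiles",
        "type": "Lookup",
        "typeProperties": {
            "source": {
                "type": "AzureSqlSource",
                "sqlReaderQuery": "SELECT BaseName, MAX(Ts) AS LatestTs FROM (SELECT LEFT(FileName, LEN(FileName) - 20) AS BaseName, SUBSTRING(FileName, LEN(FileName) - 18, 14) AS Ts FROM dbo.FileList) f GROUP BY BaseName"
            },
            "dataset": { "referenceName": "FileListTable", "type": "DatasetReference" },
            "firstRowOnly": false
        }
    }
    ```

    The inner query strips the fixed-length suffix to get the base name and extracts the 14-digit timestamp; the GROUP BY then returns one row per distinct file together with its latest timestamp, so step 4 can build both the wildcard and the merged filename from a single lookup. The base name is returned without .json (append + '.json' in the SELECT if you want the list exactly as shown above), and firstRowOnly is false so a ForEach receives the whole list.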
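    For step 4, a sketch of the Copy activity inside a ForEach over the Lookup output. The dataset names SourceFolder and MergedFolder are again hypothetical, the sink dataset is assumed to expose a fileName parameter, and the exact storeSettings property names depend on the connector and file format in use:

    ```json
    {
        "name": "CopyAndMergeOneFile",
        "type": "Copy",
        "typeProperties": {
            "source": {
                "type": "JsonSource",
                "storeSettings": {
                    "type": "AzureBlobStorageReadSettings",
                    "wildcardFileName": {
                        "value": "@concat(item().BaseName, '_20*.json')",
                        "type": "Expression"
                    }
                }
            },
            "sink": {
                "type": "JsonSink",
                "storeSettings": {
                    "type": "AzureBlobStorageWriteSettings",
                    "copyBehavior": "MergeFiles"
                }
            }
        },
        "inputs": [ { "referenceName": "SourceFolder", "type": "DatasetReference" } ],
        "outputs": [ {
            "referenceName": "MergedFolder",
            "type": "DatasetReference",
            "parameters": {
                "fileName": "@concat(item().BaseName, '_', item().LatestTs, '.json')"
            }
        } ]
    }
    ```

    Here item().BaseName and item().LatestTs are the two columns from the Lookup sketch, so each iteration picks up every timestamped copy of one file via the wildcard, merges them, and names the result <file>_<latest_timestamp>.json as planned.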

    Please advise how this sounds. If it is all right, then I need to work out how to do all of these steps in ADF.

    Thanks
    Vineet
