ADF - Recursive COPY

Aneesh Kumar A L 21 Reputation points
2020-12-18T08:12:45.57+00:00

Hi Team,
We have created an ADF Copy activity that copies CSV files from a source directory in one ADLS container to a destination directory in another, converting the CSV files to Parquet.
Multiple files are read recursively from the source directories using a wildcard filter, and the same folder hierarchy is preserved in the destination.
Eg:
49347-image.png

In ADF we have set the File path type to ‘Wildcard file path’ and the Wildcard folder path to the root directory, e.g. ‘RootFolder\’.

Question:

  1. Is it possible to make the destination filename user-defined while keeping the same source folder hierarchy? From the above example, we want the destination file names to be FileT1.parquet and FileT2.parquet respectively.
  2. We have enabled fault tolerance in the Copy data settings to skip incompatible rows (since we pass a dynamic JSON for data-type mapping when the files are transformed from CSV to Parquet). The skipped-row details are saved to a blob location, but the log does not indicate which file a skipped row came from. Is it possible to get that?
  3. Also, if any one of the files fails during the copy process, the log and the error do not identify the file that caused the issue when we copy the files recursively. Is it possible to get that?
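For reference, the fault-tolerance setting described above corresponds to roughly this fragment in the Copy activity's typeProperties (a sketch, not our full pipeline; the linked service name `LogBlobStorage` and the path are illustrative placeholders for the actual log location):

```json
{
  "enableSkipIncompatibleRow": true,
  "redirectIncompatibleRowSettings": {
    "linkedServiceName": {
      "referenceName": "LogBlobStorage",
      "type": "LinkedServiceReference"
    },
    "path": "copylogs/skippedrows"
  }
}
```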
Azure Data Factory
An Azure service for ingesting, preparing, and transforming data at scale.

Answer accepted by question author
  1. Nasreen Akter 10,891 Reputation points Volunteer Moderator
    2020-12-30T15:51:08.253+00:00

    Hi @Aneesh Kumar A L ,

    1: Yes, it's possible to keep the target file in the same hierarchy, but the filename follows the source — source: rootFolder/category/.../test.csv --> dest: rootFolder/category/.../test.parquet — provided you keep the same file path in the Source and Sink datasets. Using only the Copy activity with Preserve Hierarchy, I'm afraid it's not yet possible to make the filename user-defined.

    2 and 3: It is possible with only the Copy activity. Add the additional column $$FILEPATH in the Copy activity --> Source; no additional change needs to be made on the mapping side unless your source file type changes to something other than .csv (e.g. .json — in that case, just reset the mapping and re-map, and when the filepath column appears, exclude it from the mapping). You will get the relative file path information in the log file. Thanks! :)
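    In pipeline JSON, that additional column corresponds to a fragment like the one below in the Copy activity source (a sketch; the column name `filepath` and the storeSettings values are illustrative, and the storeSettings type depends on the connector — AzureBlobFSReadSettings here assumes ADLS Gen2):

    ```json
    {
      "source": {
        "type": "DelimitedTextSource",
        "storeSettings": {
          "type": "AzureBlobFSReadSettings",
          "recursive": true,
          "wildcardFolderPath": "RootFolder",
          "wildcardFileName": "*.csv"
        },
        "additionalColumns": [
          {
            "name": "filepath",
            "value": "$$FILEPATH"
          }
        ]
      }
    }
    ```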

    Please see the screenshots for details:

    52209-copy-1.jpg
    52198-copy-2.jpg

    Please let me know if this helps. If the above response helps, please "accept the answer" and "up-vote" the same! Thank you!

    2 people found this answer helpful.

4 additional answers

  1. HimanshuSinha 19,527 Reputation points Microsoft Employee Moderator
    2020-12-21T14:47:18.877+00:00

    Hello @Aneesh Kumar A L ,
    Thanks for the ask and using the forum .

    As already mentioned, you are using the recursive option. Can you tell us how many files you are trying to copy?
    One option I can see is to use the Get Metadata activity to get all the files and then pass them to a ForEach loop; inside the ForEach we can add the Copy activity. This way you should be able to create the folder structure on the sink side. The Get Metadata activity has a 4 MB limit, which we will have to consider while planning.
    This should also take care of point 3 of the ask.
    I am not sure we can do much about #2.
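    A rough sketch of that Get Metadata + ForEach pattern in pipeline JSON (dataset and activity names are illustrative; the Copy activity inside the ForEach would be parameterized with @item().name):

    ```json
    {
      "activities": [
        {
          "name": "GetFileList",
          "type": "GetMetadata",
          "typeProperties": {
            "dataset": {
              "referenceName": "SourceFolderDataset",
              "type": "DatasetReference"
            },
            "fieldList": [ "childItems" ]
          }
        },
        {
          "name": "ForEachFile",
          "type": "ForEach",
          "dependsOn": [
            { "activity": "GetFileList", "dependencyConditions": [ "Succeeded" ] }
          ],
          "typeProperties": {
            "items": {
              "value": "@activity('GetFileList').output.childItems",
              "type": "Expression"
            },
            "activities": []
          }
        }
      ]
    }
    ```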

    Let me know if this helps .

    Thanks
    Himanshu


  2. Aneesh Kumar A L 21 Reputation points
    2020-12-24T11:21:22.35+00:00

    Hi @HimanshuSinha ,
    Thanks for your update!
    The number of files is huge — close to 14,000. Also, does the Get Metadata activity retrieve the file information recursively if we provide a root directory? I think it won't.

    I have already tried one workaround that resolves all three questions, but it has a bigger performance impact than the recursive Copy activity. What I did: get all the file details from the root directory using the 'azcopy.exe list ' command, load them into a table, iterate over the files using a ForEach in ADF, and pass each file to a Copy activity in parallel. But it processes only 20 files at a time, and each Copy activity does not scale DIUs, parallel copies, or peak connections, since we are processing one file per Copy activity.

    So it would be helpful to have a solution for the above three questions without a performance impact.


  3. MartinJaffer-MSFT 26,161 Reputation points
    2020-12-30T01:09:34.943+00:00

    For Question #1 I was able to copy and have the suffix change (file.csv -> file.parquet) but I was not able to cause it to be appended (file.csv -> file.csv.parquet).

    In my wildcard filepath folder I used *
    In my wildcard filename I used *.csv

    I tried this with blob, data lake gen 2, and data lake gen 1.
    What settings and data store were you using?


  4. Aneesh Kumar A L 21 Reputation points
    2021-01-04T07:01:51.213+00:00

    Hi @Nasreen Akter , Thanks for your valuable input!
    For Question #1, based on your input I set up the Copy activity with Preserve Hierarchy on the sink side and ran it, but it still produces the file name 'test.csv.parquet'. I have attached the screenshots for your reference. Please let me know if I missed anything.

    53118-image.png

    53151-image.png

    53161-image.png

    For Questions #2 and #3, I tried the file path option, but in the mapping tab we were not able to exclude the destination column name for the file path. Please see the screenshot below. Are you using custom JSON for the mapping?
    53181-image.png

