Download all files in a folder from an HTTP source

Christopher Mühl 106 Reputation points
2021-12-02T10:11:06.937+00:00

Hello,

I would like to use Data Factory to download all files in a folder from an HTTP source on a daily basis. The data will be stored temporarily in ADLS Gen2.
The source is this URL: https://opendata.dwd.de/climate_environment/CDC/observations_germany/climate/daily/kl/recent/

Currently I have a Copy activity with an HTTP linked service source and a binary dataset.
The Copy activity works if I set a correct path to a specific file in the Relative URL setting of the dataset (e.g. Base URL: https://opendata.dwd.de/climate_environment/CDC/observations_germany/climate/daily/kl/recent/ Relative URL: tageswerte_KL_00011_akt.zip).
If I only use the Base URL, the download doesn't work.

So I tried to get all the file names from the HTTP source. I tried to use "Get Metadata", but that activity doesn't work for HTTP sources.

What options do I have to download all files from this URL?

Thank you in advance!

Best regards
Christopher

Azure Data Factory

Accepted answer
  1. MartinJaffer-MSFT 26,046 Reputation points
    2021-12-02T20:39:39.507+00:00

    Hello @Christopher Mühl and welcome to Microsoft Q&A!

    I have some good news and bad news. The good news is that a workaround to get all the sub-files is possible. The bad news is that the workaround is a bit clunky and awkward, unless you want to do some work outside of Azure Data Factory.

    I can explain why Get Metadata doesn't work here, or why you can't wildcard HTTP sources, but I do not know if you want to read the explanation. Let me know if you do.

    So the high-level view of the workaround is:

    1. Use a Web activity to fetch the contents of your Base URL.
    2. Use a Set Variable activity to split the output of the Web activity into individual entries, one per file, and store them in an Array-type variable.
    3. Clean up the entries (stored in the array variable) so they are usable as Relative URLs.
    4. Iterate over the entries, passing each one into a Copy activity.

    I am still working out the implementation details. As of writing this, my progress is between steps 2 and 3.
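
    Just to make the idea concrete, a rough Python sketch of what steps 1 to 3 amount to (for example, if you decided to do that part outside of Data Factory) could look like the following. The regular expression is only an assumption about how the directory listing page is formatted, not a tested implementation:

        import re
        import urllib.request

        # Base URL from the question; the page is a plain HTML directory listing.
        BASE_URL = ("https://opendata.dwd.de/climate_environment/CDC/"
                    "observations_germany/climate/daily/kl/recent/")

        def list_zip_files(base_url):
            """Fetch the HTML directory listing and return the .zip file names."""
            with urllib.request.urlopen(base_url) as response:
                html = response.read().decode("utf-8", errors="replace")
            # The listing links each file as <a href="filename">, so extract the
            # hrefs that end in .zip.
            return sorted(set(re.findall(r'href="([^"]+\.zip)"', html)))

        for name in list_zip_files(BASE_URL):
            print(name)  # each name can be used as the Relative URL of the Copy activity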


1 additional answer

  1. Christopher Mühl 106 Reputation points
    2021-12-03T11:29:48.937+00:00

    Hello @MartinJaffer-MSFT

    First of all, thank you for your input and effort.
    In the meantime, I was already on a very similar path to solving the problem.
    Yes, the solution is a bit clunky and awkward, but it works for now - at least until something changes at the source.

    You wrote: "The bad news is that the workaround is a bit clunky and awkward, unless you want to do some work outside of Azure Data Factory."

    What options do you see here? I was thinking of pre-processing in Azure Functions or possibly Azure Logic Apps.
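
    For example, I could imagine an HTTP-triggered Azure Function (Python) roughly like the sketch below, which Data Factory could then call via a Web activity and iterate over the returned list. This is only a sketch of the idea; the "url" query parameter name and the regular expression are made up:

        import json
        import re
        import urllib.request

        import azure.functions as func

        def main(req: func.HttpRequest) -> func.HttpResponse:
            # Directory URL to list, taken from the query string (assumed parameter name).
            base_url = req.params.get("url")
            if not base_url:
                return func.HttpResponse("Missing 'url' query parameter.", status_code=400)

            # Fetch the HTML directory listing and extract the .zip file names.
            with urllib.request.urlopen(base_url) as response:
                html = response.read().decode("utf-8", errors="replace")
            files = sorted(set(re.findall(r'href="([^"]+\.zip)"', html)))

            # Return the names as a JSON array that a ForEach activity can iterate over.
            return func.HttpResponse(json.dumps(files), mimetype="application/json")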

    I would also be very interested to know why Get Metadata doesn't work here and why I can't wildcard HTTP sources.

    And as a last point, I have two questions about the current solution:

    • Is there any way to tell the Delete activity to delete all files in this folder except, for example, product_climate_tag_*.txt? Or do you always have to specify the files you want to delete explicitly or via a wildcard?
    • After splitting the Web activity output into an array, is there a way to filter out certain array elements? In my case I only want the .zip files (see the sketch below). This would also have the advantage that I could use the Copy activity directly to unzip the archives.
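
    To illustrate the second point, in Python the filtering I have in mind would simply be something like the sketch below; I am looking for the equivalent in Data Factory (presumably a Filter activity with a condition like @endswith(item(), '.zip'), but I have not tried that yet):

        # Hypothetical example data: the cleaned-up entries from the array variable.
        entries = [
            "tageswerte_KL_00011_akt.zip",
            "KL_Tageswerte_Beschreibung_Stationen.txt",
            "tageswerte_KL_00044_akt.zip",
        ]

        # Keep only the .zip entries.
        zip_files = [name for name in entries if name.endswith(".zip")]
        print(zip_files)  # ['tageswerte_KL_00011_akt.zip', 'tageswerte_KL_00044_akt.zip']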

    Thanks again for your effort!

    Best regards,
    Christopher