Want to filter csv files based on condition in adf

Amar Agnihotri 926 Reputation points
2022-05-18T13:46:34.277+00:00

Hi,
I am having 300 csv files stored in adls container. Now i am trying to copy only 50 files out of 300 files to another container . Also out of those 50 files i want to filter some files based on some conditions like date >= '01-01-2019' for one file and Total count >= 100 for another file. Here, date is a column in one file and total count is a column in another file. I want to do this in adf. Earlier when i was having source data coming from Mysql db i did this successfully using lookup activity which was picking up file name and schema name from the look up file and i also defined the condition column in lookup file and then used If activity to apply multiple condition on different tables. But here the source is csv file and i don't know how to apply those condition on csv files. Please suggest something to get it done .

Azure Data Lake Storage
Azure Data Lake Storage
An Azure service that provides an enterprise-wide hyper-scale repository for big data analytic workloads and is integrated with Azure Blob Storage.
1,562 questions
Azure Data Factory
Azure Data Factory
An Azure service for ingesting, preparing, and transforming data at scale.
11,655 questions
{count} votes

5 answers

Sort by: Most helpful
  1. Nandan Hegde 36,156 Reputation points MVP Volunteer Moderator
    2022-05-18T17:12:40.1+00:00

    Hey,
    Based on my understanding it is not possible directly via copy activity. You would need to use dataflow.

    Below link might help:
    https://learn.microsoft.com/en-us/answers/questions/552825/copy-activity-select-from-csv-dataset.html

    2 people found this answer helpful.

  2. ShaikMaheer-MSFT 38,551 Reputation points Microsoft Employee Moderator
    2022-05-23T10:38:15.54+00:00

    Hi @Amar Agnihotri ,

    Thank you for posting query in Microsoft Q&A Platform.

    You can achieve this using below steps.
    Step1: Lookup file on storage which tells us which files to process and along with conditions. use Lookup activity in ADF pipeline to read this file data.
    204627-image.png

    Step2: Continue your foreach & If activities as it is.

    Step3: In data flow activity have parameters to receive values from pipeline and inside Data flow activity pass values to those parameters.
    204654-image.png

    Hope this helps. Please let us know if any further queries.

    -----------

    Please consider hitting Accept Answer. Accepted answers help community as well.

    1 person found this answer helpful.

  3. Amar Agnihotri 926 Reputation points
    2022-05-22T15:16:13.31+00:00

    @Nandan Hegde

    In the above snap you can see that If activity will check the condition IF_FILTERED =1 then true will execute otherwise FALSE will execute according to the column defined in lookup file as shown
    204377-image.png

    and you can see that IS_FILTERED = 1 is only for task, user_location and booking_by_day csv so only the condition for these csv will be executed in the data flow to filter. This is the logic that i am trying to implement
    Now inside the true activity i am calling the dataflow
    204343-image.png
    and i want to pass the table name dynamically using the parameter defined in the dataflow -

    204441-image.png

    In the red circle i want to pass the table name dynamically coming from the adf but don't know how to pass that . you suggested the blow link to take help from

    https://learn.microsoft.com/en-us/answers/questions/552825/copy-activity-select-from-csv-dataset.html

    but in that link foiltering is done on suing color and size condition but color and size condition is present in every table . In my scenario i want to filter three different tables using three different conditions for each table as defined in lookup file . Also the schema of all the tables are different . so i am stuck at these points -

    1) - How to pass dynamic table names to dataflow on each execution of if activity for true condition ?
    2) - and how to use filter condition between source and sink for three different tables .

    Hope you understand my point now

    0 comments No comments

  4. Amar Agnihotri 926 Reputation points
    2022-05-23T12:52:40.063+00:00

    @ShaikMaheer-MSFT ,
    Hi there is one more thing to tell you. I tried too define parameters in dataflow as you did but don't know why i got this error.
    204685-image.png

    Also want to tell you that for the data flow i am using dataset as a container which is having all the 305 csv files.

    204707-image.png
    204650-image.png

    This is the source container for dataflow

    204713-image.png

    It seems from your description that to define condition in a parameter in dataflow i have to select lookup file as source data set for dataflow but if i do so then what about the other files that i want to copy as it is in the sink . Please clear this

    0 comments No comments

  5. Amar Agnihotri 926 Reputation points
    2022-05-23T14:48:54.157+00:00

    @ShaikMaheer-MSFT
    Hi,

    This is my adf pipeline and you can see that i have created parameter here
    204724-image.png

    but i am not able to call these parameters inside dataflow

    204725-image.png


Your answer

Answers can be marked as Accepted Answers by the question author, which helps users to know the answer solved the author's problem.