How to get row counts from files stored in Azure Storage Account / Data Lake

AzureHero 41 Reputation points
2020-06-11T18:22:38.483+00:00

I know you used to be able to get row counts from files in directories with Data Lake Analytics U-SQL, but is there a way to get row counts on all files in a given directory directly from Azure Data Factory? I need to perform some validation tasks and don't have the ability to use U-SQL in the environment and Mapping Dataflows doesn't support MSI so I also can't use that. I feel like there's no options other than spinning up a Databricks resource and mounting directories and writing python. It seems like there should be an easy way to do this though in ADF but I'm not finding one.

Azure Data Factory
Azure Data Factory
An Azure service for ingesting, preparing, and transforming data at scale.
10,643 questions
0 comments No comments
{count} votes

Accepted answer
  1. John Aherne 516 Reputation points
    2020-06-15T15:05:12.217+00:00

    Probably not the cheapest or quickest option and assuming that you are using CSV, but you could create a generic dataset (one column) and run a copy activity over whatever folders you want to count to a temp folder. Get the rowcount of the copy activity and save it.
    At the end, delete everything in your temp folder.

    Something like this:

    Lookup Activity (Get's your list of base folders - Just for easy rerunning)

    For Each (Base Folder)

    Copy Recursively to temp folder

    Store proc activity which stores the Copy Activity.output.rowsCopied

    Delete temp files recursively.


3 additional answers

Sort by: Most helpful
  1. ChiragMishra-MSFT 956 Reputation points
    2020-06-12T11:06:25.463+00:00

    Hi @AzureHero ,

    Yes you can very easily get the row counts from files stored in an Azure Blob Storage account. To do so, you would do the following :

    • Create a dataset to point to your blob storage till the folder (not the file) level as shown below :
      9876-dataset-without-file-name.gif
    • Have a Get Metadata activity pointing to this dataset and in the fields, select "childItems".
    • Chain a ForEach activity to the Get Metadata with the items property as the output of the GetMetadata activity. Eg - @activity('getListOfFiles').output.childItems as shown below :

    9942-get-metadata-with-child-items-and-for-each.gif

    • Within the ForEach loop, you can do anything at each file's level. All the file level validation can be handled here. In your case, to count number of rows, you would have a Lookup activity, with a wildcard file path set as "@item().name" as shown below :

    9838-inner-lookup-activity.gif

    Please note that the lookup activity has a limitation of only 5000 rows per dataset by default. Here's a workaround to overcome this :

    9951-lookup-limitation.jpg

    Hope this helps. Stay safe!

    1 person found this answer helpful.

  2. ChiragMishra-MSFT 956 Reputation points
    2020-06-15T09:34:45.883+00:00
    • Have a Get Metadata activity pointing to this dataset and in the fields, select "childItems".
    • Chain a ForEach activity to the Get Metadata with the items property as the output of the GetMetadata activity. Eg - @activity('getListOfFiles').output.childItems.
    • Within the ForEach loop, you can do anything at each file's level. All the file level validation can be handled here. In your case, to count number of rows, you would have a Lookup activity, with a wildcard file path set as "@item().name".

    Please note that the lookup activity has a limitation of only 5000 rows per dataset by default.

    Please use the links below to see images for the same :


  3. AzureHero 41 Reputation points
    2020-06-15T14:24:07.43+00:00

    Thank you for the response. I have thousands of files and many of them will likely have over 5000 records which is why I couldn't use the lookup task to accomplish this. But I will look closer at the workaround on splitting the pipelines up into two inner/outer pipelines to see if that can work.


Your answer

Answers can be marked as Accepted Answers by the question author, which helps users to know the answer solved the author's problem.