How to get row counts from files stored in Azure Storage Account / Data Lake

Question

AzureHero 41

I know you used to be able to get row counts from files in directories with Data Lake Analytics U-SQL, but is there a way to get row counts for all files in a given directory directly from Azure Data Factory? I need to perform some validation tasks, but I can't use U-SQL in this environment, and Mapping Data Flows doesn't support MSI, so I can't use that either. I feel like there are no options other than spinning up a Databricks resource, mounting directories, and writing Python. It seems like there should be an easy way to do this in ADF, but I'm not finding one.

Answer 1

John Aherne 516

This is probably not the cheapest or quickest option, and it assumes that you are using CSV, but you could create a generic dataset (one column) and run a copy activity from whatever folders you want to count to a temp folder. Get the row count from the copy activity and save it.
At the end, delete everything in your temp folder.

Something like this:

Lookup Activity (Gets your list of base folders - Just for easy rerunning)

For Each (Base Folder)

Copy Recursively to temp folder

Store proc activity which stores the Copy Activity.output.rowsCopied

Delete temp files recursively.
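
The steps above boil down to a loop that records one row count per base folder. Here is a minimal Python sketch of that bookkeeping only, not ADF code. It assumes each copy run returns an output object carrying the `rowsCopied` value mentioned above; `tally_row_counts`, `fake_copy`, and the folder names are hypothetical stand-ins.

```python
from typing import Callable, Dict, Iterable


def tally_row_counts(
    base_folders: Iterable[str],
    run_copy: Callable[[str], dict],
) -> Dict[str, int]:
    """Run one copy per base folder and keep its rowsCopied value."""
    counts: Dict[str, int] = {}
    for folder in base_folders:
        output = run_copy(folder)  # stands in for the Copy activity
        counts[folder] = int(output["rowsCopied"])  # the value the stored proc would save
    return counts


def fake_copy(folder: str) -> dict:
    """Hypothetical stand-in for a Copy activity output, for illustration only."""
    sample = {"sales/2020": 1200, "sales/2021": 3400}
    return {"rowsCopied": sample[folder]}


counts = tally_row_counts(["sales/2020", "sales/2021"], fake_copy)
print(counts)
```

Keeping one count per base folder means a failed validation can be traced back to the folder it came from.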

AzureHero 41 Reputation points

2020-06-15T15:43:48.007+00:00

I actually thought about doing something like this as well. You're saying to iterate over all the files with a Source that represents my files, but a SINK that is a dummy file with one column so that it doesn't take up much storage and processes faster. That might be an option, thanks!

John Aherne 516 Reputation points

2020-06-15T16:17:12.63+00:00

The source dataset would be the single column. That way it would not matter what the actual columns are. This is assuming that you have different columns in your source files. If they all have the same structure, then yes, just copy one column for improved speed and smaller storage.

Answer 2

Hi @AzureHero ,

Yes, you can very easily get the row counts from files stored in an Azure Blob Storage account. To do so, you would do the following:

Create a dataset that points to your blob storage down to the folder (not the file) level.
Have a Get Metadata activity pointing to this dataset and in the fields, select "childItems".
Chain a ForEach activity to the Get Metadata with the items property set to the output of the Get Metadata activity, for example @activity('getListOfFiles').output.childItems.

Within the ForEach loop, you can do anything at each file's level. All the file-level validation can be handled here. In your case, to count the number of rows, you would have a Lookup activity with a wildcard file path set to "@item().name".
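
As a rough local illustration of the flow above (list the folder, loop over each file, count its rows), here is a minimal Python sketch. It is not ADF code: `get_child_items` only mimics the name/type shape of `childItems`, `count_rows` and the sample files are hypothetical, and it assumes plain-text CSV files with one header row and no embedded newlines.

```python
import os
import tempfile
from typing import List


def get_child_items(folder: str) -> List[dict]:
    """Mimic the shape of Get Metadata's childItems: a name plus a type."""
    items = []
    for name in sorted(os.listdir(folder)):
        is_dir = os.path.isdir(os.path.join(folder, name))
        items.append({"name": name, "type": "Folder" if is_dir else "File"})
    return items


def count_rows(path: str, has_header: bool = True) -> int:
    """Count data rows by streaming the file line by line."""
    with open(path, encoding="utf-8") as handle:
        total = sum(1 for _ in handle)
    return max(total - 1, 0) if has_header else total


# Build two small sample files, then loop over them like the ForEach would.
with tempfile.TemporaryDirectory() as folder:
    samples = {
        "a.csv": "id,name\n1,x\n2,y\n",
        "b.csv": "id,name\n1,x\n2,y\n3,z\n",
    }
    for name, text in samples.items():
        with open(os.path.join(folder, name), "w", encoding="utf-8") as out:
            out.write(text)
    row_counts = {
        item["name"]: count_rows(os.path.join(folder, item["name"]))
        for item in get_child_items(folder)
        if item["type"] == "File"
    }

print(row_counts)
```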

Please note that the Lookup activity is limited to 5000 rows per dataset by default. Here's a workaround to overcome this:

Hope this helps. Stay safe!

Jeff vG 96 Reputation points

2021-05-17T21:44:23.777+00:00

What if I have to recurse through folders before I can take row counts, e.g. customer/day/file.xls? I have the examples working to display the type: Folder items in my customer parent directory, but now I need to loop through each file in each day subfolder and take a row count, right?
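
For a nested layout like customer/day/file, the traversal amounts to a recursive walk: descend into every item of type Folder and count rows for every item of type File. The Python sketch below shows only that logic on a local temporary directory. `walk_files`, `count_lines`, and the sample folder names are hypothetical, and it assumes line-delimited text files without a header (an .xls file would need an Excel reader instead of line counting).

```python
import os
import tempfile
from typing import Iterator


def walk_files(folder: str) -> Iterator[str]:
    """Yield every file path under folder, descending into subfolders."""
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        if os.path.isdir(path):  # like a childItems entry of type "Folder"
            yield from walk_files(path)
        else:  # like a childItems entry of type "File"
            yield path


def count_lines(path: str) -> int:
    """Count lines by streaming the file."""
    with open(path, encoding="utf-8") as handle:
        return sum(1 for _ in handle)


# Build a small customer/day/file tree, then count rows per file.
with tempfile.TemporaryDirectory() as root:
    layout = [("acme", "2021-05-16", 2), ("acme", "2021-05-17", 4), ("zen", "2021-05-17", 1)]
    for customer, day, rows in layout:
        day_dir = os.path.join(root, customer, day)
        os.makedirs(day_dir)
        with open(os.path.join(day_dir, "file.csv"), "w", encoding="utf-8") as out:
            out.write("value\n" * rows)
    nested_counts = {
        os.path.relpath(path, root).replace(os.sep, "/"): count_lines(path)
        for path in walk_files(root)
    }

print(nested_counts)
```

In ADF itself, ForEach activities cannot be nested directly, so the inner loop over each day folder would typically live in a child pipeline invoked with an Execute Pipeline activity.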

Answer 3

ChiragMishra-MSFT 956

Have a Get Metadata activity pointing to this dataset and in the fields, select "childItems".
Chain a ForEach activity to the Get Metadata with the items property set to the output of the Get Metadata activity, for example @activity('getListOfFiles').output.childItems.
Within the ForEach loop, you can do anything at each file's level. All the file-level validation can be handled here. In your case, to count the number of rows, you would have a Lookup activity with a wildcard file path set to "@item().name".

Please note that the Lookup activity is limited to 5000 rows per dataset by default.

Please use the links below to see images for the same:

Narasimha Murthy Pujari 1 Reputation point

2020-10-15T07:42:47.92+00:00

Hi ChiragMishra, could you please elaborate on the workaround and how to set up inner/outer pipelines in order to get row count details?

Monica BB 1 Reputation point

2020-11-26T23:39:30.103+00:00

Hi ChiragMishra-MSFT. I have the same problem with more than 5000 rows. Could you please upload an example for this case?

Jatinder Luthra 130 Reputation points

2024-02-03T00:14:44.74+00:00

@ChiragMishra-MSFT, are there any examples of inner/outer pipelines for overcoming the 5000-row limit? I have files with 10M+ records.

Answer 4

AzureHero 41

Thank you for the response. I have thousands of files and many of them will likely have over 5000 records which is why I couldn't use the lookup task to accomplish this. But I will look closer at the workaround on splitting the pipelines up into two inner/outer pipelines to see if that can work.

Narasimha Murthy Pujari 1 Reputation point

2020-10-15T07:40:30.007+00:00

Hi, were you able to get the row count of a file with more than 5k rows using inner/outer pipelines?

Monica BB 1 Reputation point

2020-11-26T23:42:08.927+00:00

Hi AzureHero and NarasimhaMurthyPujari-6355. Do you have an example?
