Is there any way to push down a filter when running "dbutils.fs.ls" in a Databricks notebook?

Mohammad Saber 591 Reputation points
2023-12-09T06:36:34.13+00:00

I have a container in Azure Blob Storage that contains around 10,000,000 CSV and ZIP files.

I want to use "dbutils.fs.ls" in a Databricks notebook to get a list of files. However, after running the command and waiting for more than 30 minutes, I got the below error:

The spark driver has stopped unexpectedly and is restarting. Your notebook will be automatically reattached.

I have used a multi-node cluster:

  • Driver: Standard_D4ds_v5
  • Worker: Standard_D4ds_v5
  • Min Workers: 2
  • Max Workers: 8

It seems the cluster cannot handle listing all of the files. I was wondering if I could push down a filter on filenames and get the list of files after filtering; I am interested in files whose names start with "Energy". That way, it might be possible to get the list of desired files without hitting the above error.


2 answers

  1. Amira Bedhiafi 29,946 Reputation points
    2023-12-09T15:53:35.8533333+00:00

    Try using a list comprehension in Python. It might not be efficient for a very large number of files, but it can be a quick fix if the number of files is manageable.

    # dbutils.fs.ls returns FileInfo objects; filter on the .name attribute.
    files = dbutils.fs.ls("/path/to/your/directory")
    filtered_files = [file for file in files if file.name.startswith("Energy")]
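
    The filtered list can then be fed straight into a reader. A minimal sketch of that follow-up step, assuming the "Energy*" files are CSVs with a header row (adjust the options to your data):

    # Hypothetical follow-up: read only the filtered CSV files into one DataFrame.
    csv_paths = [f.path for f in filtered_files if f.name.endswith(".csv")]
    energy_df = spark.read.csv(csv_paths, header=True, inferSchema=True)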
    

    If the number of files is extremely large, another approach is to parallelize the operation using Spark. You can read the directory as a DataFrame and then filter it based on your requirements.

    # In a Databricks notebook, `spark` is already defined; getOrCreate() simply reuses it.
    from pyspark.sql import SparkSession
    spark = SparkSession.builder.appName("FilterFiles").getOrCreate()

    # List the directory with the binaryFile source, keeping only the path column.
    df = spark.read.format("binaryFile").load("/path/to/your/directory").select("path")

    # Keep only paths containing "Energy" and collect them to the driver.
    filtered_df = df.filter(df.path.contains("Energy"))
    filtered_files = [row.path for row in filtered_df.collect()]
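
    Note that the snippet above still enumerates everything and filters afterwards. If you want Spark to apply the filename pattern while it discovers the files, Spark 3.0+ file sources accept a pathGlobFilter option (and recursiveFileLookup for nested folders). A minimal sketch, assuming the same directory path as above:

    # Sketch: let Spark apply the filename pattern during file discovery.
    filtered_df = (
        spark.read.format("binaryFile")
        .option("pathGlobFilter", "Energy*")      # only files whose names start with "Energy"
        .option("recursiveFileLookup", "true")    # also pick up files in subfolders, if any
        .load("/path/to/your/directory")
        .select("path")
    )
    filtered_files = [row.path for row in filtered_df.collect()]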
    

  2. Bhargava-MSFT 31,226 Reputation points Microsoft Employee
    2023-12-27T16:34:53.71+00:00

    Hi Mohammad Saber,

    I see you have accepted the answer for this question on Stack Overflow.

    I'll repost the solution here in case you'd like to "Accept" the answer, as it helps other community members looking for answers to similar questions.

    Solution:

    One way to achieve this is by using the list_blobs method from the Azure SDK. This method allows you to filter the results based on the blob name.

    https://learn.microsoft.com/en-us/python/api/azure-storage-blob/azure.storage.blob.containerclient?view=azure-python#azure-storage-blob-containerclient-list-blobs
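
    As a sketch of that approach (the connection string and container name below are placeholders; adjust the authentication to your setup), list_blobs accepts a name_starts_with argument so the prefix filter is applied server-side:

    from azure.storage.blob import ContainerClient

    # Placeholders: replace with your own connection string and container name.
    container = ContainerClient.from_connection_string(
        conn_str="<connection-string>",
        container_name="<container-name>",
    )

    # name_starts_with pushes the prefix filter to the storage service,
    # so only blob names starting with "Energy" are returned.
    energy_blobs = [blob.name for blob in container.list_blobs(name_starts_with="Energy")]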

    If I missed anything, please let me know and I'd be happy to add it to my answer, or feel free to comment below with any additional information.

    I hope this helps!

