Is there any way to push down a filter when running "dbutils.fs.ls" in a Databricks notebook?

Mohammad Saber 591 Reputation points
2023-12-09T06:36:34.13+00:00

I have a container in Azure Blob Storage that contains around 10,000,000 CSV and ZIP files.

I want to use "dbutils.fs.ls" in a Databricks notebook to get a list of files. However, after running the command and waiting for more than 30 minutes, I got the below error:

The spark driver has stopped unexpectedly and is restarting. Your notebook will be automatically reattached.

I have used a multi-node cluster:

  • Driver: Standard_D4ds_v5
  • Worker: Standard_D4ds_v5
  • Min Workers: 2
  • Max Workers: 8

It seems the cluster cannot handle listing all of the files. I was wondering if I could push down a filter on filenames and get the list of files after filtering; I am interested in files whose names start with "Energy". That way, it might be possible to get the list of desired files without hitting the above error.


2 answers

  1. Amira Bedhiafi 29,946 Reputation points
    2023-12-09T15:53:35.8533333+00:00

    Try using a list comprehension in Python. It might not be efficient for a very large number of files, but it can be a quick fix if the number of files is manageable.

    # dbutils.fs.ls returns FileInfo objects; filter on the .name attribute.
    files = dbutils.fs.ls("/path/to/your/directory")
    filtered_files = [file for file in files if file.name.startswith("Energy")]
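
    The filtered list can then be fed straight into a reader. A minimal sketch of that follow-up step, assuming the "Energy*" files are CSVs with a header row (adjust the options to your data):

    # Hypothetical follow-up: read only the filtered CSV files into one DataFrame.
    csv_paths = [f.path for f in filtered_files if f.name.endswith(".csv")]
    energy_df = spark.read.csv(csv_paths, header=True, inferSchema=True)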
    

    If the number of files is extremely large, another approach is to parallelize the operation using Spark. You can read the directory as a DataFrame and then filter it based on your requirements.

    # In a Databricks notebook, `spark` is already defined; getOrCreate() simply reuses it.
    from pyspark.sql import SparkSession
    spark = SparkSession.builder.appName("FilterFiles").getOrCreate()

    # List the directory with the binaryFile source, keeping only the path column.
    df = spark.read.format("binaryFile").load("/path/to/your/directory").select("path")

    # Keep only paths containing "Energy" and collect them to the driver.
    filtered_df = df.filter(df.path.contains("Energy"))
    filtered_files = [row.path for row in filtered_df.collect()]
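
    Note that the snippet above still enumerates everything and filters afterwards. If you want Spark to apply the filename pattern while it discovers the files, Spark 3.0+ file sources accept a pathGlobFilter option (and recursiveFileLookup for nested folders). A minimal sketch, assuming the same directory path as above:

    # Sketch: let Spark apply the filename pattern during file discovery.
    filtered_df = (
        spark.read.format("binaryFile")
        .option("pathGlobFilter", "Energy*")      # only files whose names start with "Energy"
        .option("recursiveFileLookup", "true")    # also pick up files in subfolders, if any
        .load("/path/to/your/directory")
        .select("path")
    )
    filtered_files = [row.path for row in filtered_df.collect()]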
    

  2. Bhargava-MSFT 31,226 Reputation points Microsoft Employee
    2023-12-27T16:34:53.71+00:00

    Hi Mohammad Saber,

    I see you have accepted the answer for this question on Stack Overflow.

    I'll repost the solution here in case you'd like to "Accept" the answer, as it helps other community members looking for answers to similar questions.

    Solution:

    One way to achieve this is by using the list_blobs method from the Azure SDK. This method allows you to filter the results based on the blob name.

    https://learn.microsoft.com/en-us/python/api/azure-storage-blob/azure.storage.blob.containerclient?view=azure-python#azure-storage-blob-containerclient-list-blobs
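
    As a sketch of that approach (the connection string and container name below are placeholders; adjust the authentication to your setup), list_blobs accepts a name_starts_with argument so the prefix filter is applied server-side:

    from azure.storage.blob import ContainerClient

    # Placeholders: replace with your own connection string and container name.
    container = ContainerClient.from_connection_string(
        conn_str="<connection-string>",
        container_name="<container-name>",
    )

    # name_starts_with pushes the prefix filter to the storage service,
    # so only blob names starting with "Energy" are returned.
    energy_blobs = [blob.name for blob in container.list_blobs(name_starts_with="Energy")]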

    If I missed anything, please let me know and I'd be happy to add it to my answer, or feel free to comment below with any additional information.

    I hope this helps!

