Last modified date in Databricks

Shambhu Rai 1,411 Reputation points
2023-11-23T18:42:43.8766667+00:00

Hi Experts,

How can I pick up files from blob storage in Databricks based on their last modified date? If two files arrive within a 2-minute interval, how will they be loaded?

/mnt/delta//Test.csv"

Accepted answer
  1. Amira Bedhiafi 33,631 Reputation points Volunteer Moderator
    2023-11-24T12:33:10.02+00:00

    First, you need to mount your Azure Blob Storage container in Databricks to access its files. You can do this with the dbutils.fs.mount() method, specifying the storage account name, container name, and access key.

    Once it is mounted, you can use dbutils.fs.ls() to list all files in a directory. It returns a list of FileInfo objects, each containing details such as the path, name, size, and modification time.

    To filter them based on the last modified time, you can write a function that compares each file's modification time with the desired timestamp.

    If you receive files every two minutes, consider setting up a scheduled job in Databricks that runs at the same interval.

    # Mount the blob storage container (run once; fails if the mount point already exists)
    dbutils.fs.mount(
        source="wasbs://<container-name>@<storage-account-name>.blob.core.windows.net",
        mount_point="/mnt/my_mount_point",
        extra_configs={"fs.azure.account.key.<storage-account-name>.blob.core.windows.net": "<access-key>"}
    )

    # Function to filter and load files newer than a given timestamp
    def load_recent_files(desired_timestamp):
        # dbutils.fs.ls returns FileInfo objects; modificationTime is in
        # milliseconds since the Unix epoch
        files = dbutils.fs.ls("/mnt/my_mount_point/")
        for file in files:
            if file.modificationTime >= desired_timestamp:
                df = spark.read.csv(file.path, header=True)
                # Process and load the DataFrame as needed

    # NB: Don't forget to schedule this function to run at your desired interval
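    To avoid reloading the same files on every scheduled run, one option is to persist the timestamp of the last successful run and pass it into load_recent_files on the next run. Here is a minimal sketch, assuming a hypothetical checkpoint file at /mnt/my_mount_point/_checkpoint/last_run.txt (the path and helper names are illustrative, not part of the original answer):

    import time

    # Hypothetical checkpoint location storing the last processed timestamp (ms)
    CHECKPOINT = "/mnt/my_mount_point/_checkpoint/last_run.txt"

    def read_last_run_ts():
        try:
            return int(dbutils.fs.head(CHECKPOINT))
        except Exception:
            return 0  # first run: process everything

    def run_incremental_load():
        last_ts = read_last_run_ts()
        now_ms = int(time.time() * 1000)
        load_recent_files(last_ts)  # pick up anything new since the last run
        dbutils.fs.put(CHECKPOINT, str(now_ms), overwrite=True)

    Scheduling run_incremental_load every 2 minutes means that two files arriving within the same interval are simply both newer than the checkpoint and get loaded together on the next run.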
    
    2 people found this answer helpful.

1 additional answer

  1. ShaikMaheer-MSFT 38,546 Reputation points Microsoft Employee Moderator
    2023-11-27T10:25:22.7266667+00:00

    Hi Shambhu Rai,

    You can consider creating a function that takes the file name (or path) as a parameter and calling that function each time a new file arrives to create a view over it.
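    A minimal sketch of such a helper, assuming CSV input; the function and view names are hypothetical:

    # Hypothetical helper: read a file and expose it as a temporary view
    def create_view_for_file(file_path, view_name):
        df = spark.read.csv(file_path, header=True)
        df.createOrReplaceTempView(view_name)

    # Example call using the path from the question
    create_view_for_file("/mnt/delta//Test.csv", "test_csv_view")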

    Hope this helps. Please let me know if you have any further queries.


    Please consider hitting the Accept Answer button. Accepted answers help the community as well.

    1 person found this answer helpful.
