First, you need to mount your Azure Blob storage container to Databricks so you can access its files. You can do this with the dbutils.fs.mount() method, specifying the storage account name, container name, and access key.
Once the mount is in place, you can use dbutils.fs.ls() to list all files in a directory. It returns a list of FileInfo objects, each containing details like the path, name, size, and modification time (modificationTime, in milliseconds since the epoch).
To filter them by modification time, write a function that compares each file's modificationTime with your desired cutoff timestamp.
If you receive files every two minutes, consider setting up a scheduled job in Databricks that runs every 2 minutes.
# Mount the blob storage container
dbutils.fs.mount(
    source="wasbs://<container-name>@<storage-account-name>.blob.core.windows.net",
    mount_point="/mnt/my_mount_point",
    extra_configs={"fs.azure.account.key.<storage-account-name>.blob.core.windows.net": "<access-key>"}
)

# Filter and load files modified on or after the given timestamp (epoch milliseconds)
def load_recent_files(desired_timestamp):
    files = dbutils.fs.ls("/mnt/my_mount_point/")
    for file in files:
        if file.modificationTime >= desired_timestamp:
            df = spark.read.csv(file.path)
            # Process and load the dataframe as needed

# NB: Don't forget to schedule this function to run at your desired interval
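As a rough usage sketch (the two-minute window and the call to load_recent_files are my assumptions, matching the schedule described above), you could compute the cutoff from the current time, since modificationTime is expressed in milliseconds since the epoch:

import time

# Hypothetical driver code: pick up files modified within the last 2 minutes.
# Assumes the notebook itself runs every 2 minutes, so each run only processes
# files that arrived since the previous run.
cutoff_ms = int(time.time() * 1000) - 2 * 60 * 1000
load_recent_files(cutoff_ms)

For the schedule itself, create a Databricks job that runs this notebook and give it a cron expression along the lines of 0 0/2 * * * ? (Quartz syntax, every two minutes).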