How to dynamically access files from a mounted data lake in a Databricks notebook?

Varun S Kumar 50 Reputation points
2023-10-29T13:46:39.73+00:00

Hello everyone,

I have a Databricks notebook running Python code for ETL transformation of data from CSV files. The CSV files are in Azure Blob Storage, and I have mounted that storage for my notebook using dbutils.fs.mount.
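For context, the mount was created roughly like this (the container, storage account, secret scope, and key names below are placeholders, not my real values):

    # Hypothetical mount call -- container, account, scope, and key names
    # are placeholders. wasbs:// is the Azure Blob Storage scheme.
    dbutils.fs.mount(
        source="wasbs://<container>@<storage-account>.blob.core.windows.net/",
        mount_point="/mnt/root",
        extra_configs={
            "fs.azure.account.key.<storage-account>.blob.core.windows.net":
                dbutils.secrets.get(scope="<scope>", key="<key-name>")
        }
    )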

Now, the csv files are stored in the following directory structure: root/year/month/day/file.csv

For example, today being 29 October 2023, a file will be stored inside the blob with the following path: root/2023/10/29/file.csv

I have mounted the root of the storage. I want to access the latest date's file every time I run the notebook. So today I need to read the CSV inside root/2023/10/29/, but tomorrow when I run the notebook it should be root/2023/10/30/, and so on.

How can I achieve this using Python code?


Accepted answer
  1. PRADEEPCHEEKATLA-MSFT 80,096 Reputation points Microsoft Employee
    2023-10-30T05:39:58.1433333+00:00

    @Varun S Kumar - Thanks for the question and for using the MS Q&A platform.

    To dynamically access the latest date's file from the mounted blob storage in a Databricks notebook, you can use Python's datetime module to get the current date and then construct the file path from it.

    Here's an example code snippet that you can use:

    import datetime
    
    # Get the current date
    now = datetime.datetime.now()
    
    # Construct the path to the file based on the current date.
    # :02d zero-pads the month and day (e.g. 2023/01/05); drop the padding
    # if your folders are named without leading zeros.
    path = f"/mnt/<mount-point>/root/{now.year}/{now.month:02d}/{now.day:02d}/file.csv"
    
    # Read the CSV file using the constructed path
    df = spark.read.format("csv").option("header", "true").load(path)
    

    In the above code, replace <mount-point> with the name of the mount point you used when mounting the blob storage. The datetime.datetime.now() function returns the current date and time, and the f-string syntax builds the path string from the current year, month, and day.

    You can then use the constructed path to read the CSV file with spark.read. This reads the file for the current date every time you run the notebook.
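    If the current date's folder might not exist yet when the notebook runs, a variation is to list the mounted folders and pick the most recent date that actually exists. This is a sketch beyond the original answer; it assumes the mount point is /mnt/root and that the year/month/day folder names are purely numeric:

    # Sketch: descend year -> month -> day, picking the numerically largest
    # folder name at each level. Assumes /mnt/root is the mount point and
    # raises ValueError if a level has no numeric folder names.
    def latest_partition(base="/mnt/root"):
        path = base
        for _ in range(3):  # year, then month, then day
            names = [f.name.strip("/") for f in dbutils.fs.ls(path)]
            latest = max((n for n in names if n.isdigit()), key=int)
            path = f"{path}/{latest}"
        return path

    df = (spark.read.format("csv")
          .option("header", "true")
          .load(f"{latest_partition()}/file.csv"))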

    As per the repro, I created a similar folder structure in an ADLS Gen2 account and was able to get the data as per your requirement.
    [Screenshots of the ADLS Gen2 folder structure and the resulting output were attached here.]

    Hope this helps. Do let us know if you have any further queries.


    If this answers your query, do click Accept Answer and Yes for "Was this answer helpful". And, if you have any further queries, do let us know.

    1 person found this answer helpful.

0 additional answers
