get latest date

arkiboys 9,186 Reputation points
2022-03-31T14:42:32.06+00:00

Hello,
Using ADF, I populate an ADLS Gen2 directory with parquet files as follows:

xxx/2022/02/26/parquet files
...
xxx/2022/03/30/parquet files
xxx/2022/03/31/parquet files

Question:
In a Databricks notebook, how can I query only the latest date directory?
For example, in the scenario above, I would like to query the latest date, which is 2022/03/31/*
I do not want to type in the date; the PySpark notebook should find the latest date directory and read the .parquet files in it.

Thank you

Azure Databricks

Accepted answer
  1. AnnuKumari-MSFT 28,001 Reputation points Microsoft Employee
    2022-04-04T12:36:24.797+00:00

    Hi @arkiboys ,

    Thank you for using the Microsoft Q&A platform and posting your query.

    As I understand it, you want to fetch the files in the latest date folder of the ADLS directory. Please correct me if my understanding is incorrect.

    For this requirement, navigate to the latest year folder, find the latest month subfolder within it, then find the latest date subfolder within that month, and finally get all the files in that date folder.
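
    The descent described above can be sketched as a small helper. This is only a sketch under assumptions: `latest_dir`, `list_dir`, `Entry` and `fake_ls` are hypothetical names used for illustration, folder names are assumed to be zero-padded (so string order matches date order), and directory paths are assumed to end with '/' as they do in dbutils.fs.ls output. In a notebook, `list_dir` would be dbutils.fs.ls; here a fake listing stands in for it so the logic can be run anywhere.

```python
from collections import namedtuple

# Stand-in for the entries returned by dbutils.fs.ls, which expose a .path attribute.
Entry = namedtuple("Entry", "path")


def latest_dir(list_dir, root, levels=3):
    """Descend `levels` folder levels (year/month/day), taking the
    lexicographically greatest subfolder at each level."""
    current = root
    for _ in range(levels):
        # Assumption: directory paths end with '/', plain files do not.
        subdirs = [e.path for e in list_dir(current) if e.path.endswith("/")]
        if not subdirs:
            raise FileNotFoundError(f"no subfolders under {current}")
        # Zero-padded names mean the greatest string is the latest date part.
        current = max(subdirs)
    return current


# Fake listing used only for this demo (hypothetical data).
tree = {
    "data/": ["data/2021/", "data/2022/"],
    "data/2022/": ["data/2022/02/", "data/2022/03/"],
    "data/2022/03/": ["data/2022/03/01/", "data/2022/03/02/"],
}


def fake_ls(path):
    return [Entry(p) for p in tree[path]]


print(latest_dir(fake_ls, "data/"))  # prints data/2022/03/02/
```

    Unlike the steps below, this version also picks the latest year instead of hard-coding it.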

    I have files in the following folder structure for the demo:
    data/2022/02/26/parquet files
    ...
    data/2022/03/02/parquet files

    I created a notebook in my Databricks workspace and executed the following steps:

    1. Use dbutils.fs.ls to list all the month subfolders within the 2022 year folder (the year is hard-coded in this demo). Collect their paths in a for loop and sort them in reverse order, so that the latest month comes first.

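    # List the month folders under the 2022 year folder.
    fileInfos = dbutils.fs.ls('/FileStore/data/2022/')
    monthPaths = []
    for fileinfo in fileInfos:
        monthPaths.append(fileinfo.path)

    # Zero-padded month names sort chronologically; reverse puts the latest first.
    monthPaths.sort(reverse=True)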
    


    2. Use dbutils.fs.ls to list all the date subfolders within the latest month folder. Collect their paths in a for loop and sort them in reverse order, so that the latest date comes first.

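    # monthPaths[0] is the latest month; list its date folders.
    fileInfos = dbutils.fs.ls(monthPaths[0])
    dayPaths = []
    for fileInfo in fileInfos:
        dayPaths.append(fileInfo.path)

    # Zero-padded day names sort chronologically; reverse puts the latest first.
    dayPaths.sort(reverse=True)
    print(dayPaths)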
    


    3. Append '*.parquet' to the latest date path. The result is a glob pattern that matches all the parquet files in the latest date folder.

    # The folder path from dbutils.fs.ls ends with '/', so the glob is appended directly.
    latestDatePath = dayPaths[0] + '*.parquet'
    print(latestDatePath)
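
    To actually query the data, a path like this can be handed to spark.read.parquet, which accepts glob patterns. A minimal sketch, assuming the `spark` session that Databricks notebooks provide; `load_latest` is a hypothetical helper name and not part of the steps above.

```python
def load_latest(spark, day_path):
    """Read every .parquet file directly under `day_path` into a DataFrame."""
    # Normalize so there is exactly one '/' between the folder and the glob,
    # whether or not the listed path already ends with a slash.
    glob_path = day_path.rstrip("/") + "/*.parquet"
    return spark.read.parquet(glob_path)
```

    In the notebook this would be called as df = load_latest(spark, dayPaths[0]), after which df can be queried like any other DataFrame.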
    


    Hope this helps. Please let us know if you have any further queries.
