Share via

How to query only most recent parquet file in ADLS g2 with folder partitions by year/month/day

JasonW-5564 161 Reputation points
Nov 1, 2023, 7:06 PM

I have a pipeline that drops a new parquet file into a folder structure that looks like this:

/MyContainer/DataSourceName/year=2023/month=11/day=01

and the file name has the pipeline execution timestamp embedded as well

image

I am curious how to write an OPENROWSET type query I can use with serverless Synapse SQL pool that will only show results for the most recent .parquet file.

For instance, this query show results for all parquet files in the path:

SELECT * FROM OPENROWSET(BULK 'https/someWSname.dfs.core.windows.net/someFSname/DSname_TimekeeperAttribute/*.parquet', FORMAT= 'PARQUET') AS [result]

Azure Synapse Analytics
Azure Synapse Analytics
An Azure analytics service that brings together data integration, enterprise data warehousing, and big data analytics. Previously known as Azure SQL Data Warehouse.
5,191 questions
{count} votes

4 answers

Sort by: Most helpful
  1. Konstantinos Passadis 19,386 Reputation points MVP
    Nov 1, 2023, 10:31 PM

    Hello @JasonW-5564 !

    Suppose your recent file is 'https://<storage-account>.dfs.core.windows.net/<file-system>/<path>/recentfile.parquet'

    As you already know we can use :

    SELECT * FROM 
    OPENROWSET(
      BULK '<file-path>',
      FORMAT => 'PARQUET'
    ) AS [result]
    
    

    To identify the file use the serverless SQL pool :

    WITH FileNames AS (
        SELECT 
            filename() AS FileName,
            -- Parse the timestamp from the file name
            SUBSTRING(
                filename(),
                CHARINDEX('TimekeeperAttribute_', filename()) + 20, 
                24
            ) AS FileTimestamp
        FROM 
            OPENROWSET(
                BULK 'https://<storage-account-name>.dfs.core.windows.net/<file-system-name>/<directory-path>/*.parquet',
                FORMAT='PARQUET'
            ) AS [result]
    )
    SELECT TOP 1
        FileName
    FROM 
        FileNames
    ORDER BY
        FileTimestamp DESC;
    
    
    

    I took the code from another Job with CSV i was using and modified it

    Can you please give it a go ?


    I hope this helps!

    Kindly mark the answer as Accepted and Upvote in case it helped!

    Regards

    0 comments No comments

  2. JasonW-5564 161 Reputation points
    Nov 2, 2023, 5:04 AM

    Thanks. This helps. I had to put [result]. in front of the filename() function to get the script to run. How can I then pass that filename value into an OPENROWSET command so that it only opens\queries that file? In other words, that query you sent helps me identify the most recent file, but now how do I limit my queries to only that file's content and not all .parquet files in the directory?

    0 comments No comments

  3. Konstantinos Passadis 19,386 Reputation points MVP
    Nov 2, 2023, 3:52 PM

    Hello @JasonW-5564 !

    Thank you for your message

    You need to store the filename obtained from your initial query in a variable

    • In your ADF pipeline, go to the “Variables” tab.
    • Click “+ Add” to add a new variable.
    • Name your variable (e.g., FileName) and set the Type to String.

    Then :

    Add a Lookup activity to your pipeline.

    Configure the Lookup activity to use the appropriate linked service and dataset.

    Enter your identification query in the “Settings” tab of the Lookup activity.

    Then :

    Add a Set Variable activity to your pipeline.

    Connect the Lookup activity to the Set Variable activity.

    In the Set Variable activity settings:

    Choose the variable that you created earlier

     

    Use an expression to set the variable value to the output of the Lookup activity. For example:

    @activity('LookupActivityName').output.firstRow.FileName

    Now that you have stored the filename in a variable, you can use this variable in subsequent activities in your pipeline. For example, if you want to pass it to a stored procedure, you can reference the variable in the stored procedure activity’s parameters section.

    Kindly check the procedures and come back with your feedback !

    References :

    https://learn.microsoft.com/en-us/azure/data-factory/concepts-parameters-variables

    https://learn.microsoft.com/en-us/answers/questions/783890/set-variable-from-lookup-activity-result

    https://michalmolka.medium.com/azure-data-factory-a-lookup-for-each-and-variables-865707906680


    I hope this helps!

    Kindly mark the answer as Accepted and Upvote in case it helped!

    Regards

     


  4. Konstantinos Passadis 19,386 Reputation points MVP
    Nov 8, 2023, 10:17 PM

    Hello @JasonW-5564 !

    Thanks for the message !

    First of all

    You are aware that OPENROWSET does not directly support ordering or filtering based on file properties like timestamps.

    That only points to a direction where we need an external mechanism to identify the most recent file , and then use the result in an OPENROWSET query

    So the closest thing would be to parse all files but as you already said ...not efficient !

    Another possbile solution would be a Stored Procedure , but again wont run on Serverless as SP but it could be done possibly via a Notebook and T-SQL . so you need something to do the listing and find the latest timestamped file ( Azure Functions or Logic Apps) and also construct the T-SQL

    all and all natively in Synapse i think it is not possible


    I hope this helps!

    Kindly mark the answer as Accepted and Upvote in case it helped!

    Regards

    0 comments No comments

Your answer

Answers can be marked as Accepted Answers by the question author, which helps users to know the answer solved the author's problem.