How to get the schema of a list of Parquet files in a folder

Jaganathan, NK. (Naveen Kumar) 0 Reputation points
2023-07-14T16:58:25.4833333+00:00

Hi All,

I need to create a CSV file that contains each Parquet file name and its schema. I can use the Get Metadata activity with Child Items to get the list of files in a folder, but I am stuck on getting the schema of each file. Can you please suggest some solutions?

Thanks,

Naveen

Azure Analysis Services

1 answer

  1. ShaikMaheer-MSFT 37,896 Reputation points Microsoft Employee
    2023-07-17T05:57:35.88+00:00

    Hi Jaganathan, NK. (Naveen Kumar),

    Thank you for posting your query on the Microsoft Q&A platform.

    I assume you are using Azure Data Factory for this case. Please try the approach below.

    Use a Get Metadata activity with the Child Items field selected to list all Parquet file names in the folder. Then use a ForEach activity to loop over each file name. Inside the ForEach activity, use another Get Metadata activity with the Structure field selected; this returns the schema of the current Parquet file. Next, still inside the ForEach activity, add a Copy activity that uses a dummy file as its source and the additional columns option to pass the file name and its schema to the sink. Use a SQL table as the sink of this Copy activity.

    Once the loop completes, the SQL table holds each file name with its schema. You can then use another Copy activity, outside the ForEach activity, to copy that table's data to a CSV file.
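
To make the target CSV layout concrete, here is a minimal Python sketch of the flattening step only; it is not part of the pipeline itself. It assumes the Structure field comes back as a list of objects with `name` and `type` keys. The file names, column names, types, and the helper names `schema_to_text` and `write_schema_csv` are all hypothetical, chosen for illustration.

```python
import csv
import io
import json


def schema_to_text(structure):
    """Flatten a list of {"name": ..., "type": ...} entries into one string."""
    return "; ".join(f"{col['name']}:{col['type']}" for col in structure)


def write_schema_csv(file_structures, out):
    """Write one CSV row per file: the file name and its flattened schema."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["file_name", "schema"])
    for file_name, structure in file_structures.items():
        writer.writerow([file_name, schema_to_text(structure)])


# Hypothetical sample data standing in for the per-file Structure output.
# The shape (a list of name/type objects) is an assumption, not real output.
sample_outputs = {
    "sales.parquet": json.loads(
        '[{"name": "id", "type": "Int64"}, {"name": "amount", "type": "Double"}]'
    ),
    "customers.parquet": json.loads('[{"name": "name", "type": "String"}]'),
}

buffer = io.StringIO()
write_schema_csv(sample_outputs, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

Each file becomes one row, with its columns joined as `name:type` pairs. Storing the schema as a single text value is one possible choice; it keeps the SQL table and the final CSV at two columns regardless of how many columns each Parquet file has.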

    Please check the videos below to understand some of the components of this implementation.

    Add additional columns during copy in Azure Data Factory

    Get Metadata Activity in Azure Data Factory

    For Each activity in Azure Data Factory

    Hope this helps. Please let me know if you have any further queries.


    Please consider clicking the Accept Answer button. Accepted answers help the community as well. Thank you.