Reading Parquet file in c# from Azure Blob Storage

Admin (KK) 136 Reputation points
2021-07-23T10:00:51.347+00:00

Hello,

I am trying to read parquet files from a storage account. I am using the Parquet.Net library for reading the parquet files. My goal is to read all the parquet files in the storage account and check which columns have null values.

I tried using Azure Databricks with PySpark, but since some of the column names have special characters it does not work. I tried pandas in Azure Databricks, but it takes a long time to process. Hence I tried using Azure Functions with C#. However, I am getting an error because each parquet file has a different column order.

Could someone help me with what other options I have, or how I can fix this?

    string connectionString = "<<storage account connection string>>";
    log.LogInformation($"C# Timer trigger function executed at: {DateTime.Now}");
    BlobServiceClient blobServiceClient = new BlobServiceClient(connectionString);
    string containerName = "containername";
    BlobContainerClient containerClient = blobServiceClient.GetBlobContainerClient(containerName);

    // Getting the sample structure of one blob to create the final table
    string sampleBlobName = "Blobname.parquet";
    BlobClient sampleblobClient = containerClient.GetBlobClient(sampleBlobName);
    var sampleStream = sampleblobClient.OpenRead();
    var sampleReader = new ParquetReader(sampleStream);
    Table sampleTable = sampleReader.ReadAsTable();
    DataField[] datafields = sampleTable.Schema.GetDataFields();
    Table finalTable = new Table(datafields);

    // Looping through the container, reading each blob and adding its rows to the final table
    await foreach (BlobItem blobItem in containerClient.GetBlobsAsync())
    {
        if (blobItem.Name.Contains("2021072")) // for testing
        {
            log.LogInformation("\t" + blobItem.Name);
            BlobClient blobClient = containerClient.GetBlobClient(blobItem.Name);
            var stream = blobClient.OpenRead();
            var reader = new ParquetReader(stream);

            Table table = reader.ReadAsTable();

            foreach (Row row in table)
            {
                finalTable.Add(row);
            }
            log.LogInformation(finalTable.Count.ToString());
        }
    }

1 answer

  1. MayankBargali-MSFT 70,936 Reputation points Moderator
    2021-07-29T04:45:44.7+00:00

    @Admin (KK) Apologies for the delay. As I understand it, the issue is with the usage of the parquet-dotnet library. I am not an expert on parquet-dotnet, but looking at the code I can see that you are looping through the BlobItems, and, as you mentioned, you are getting the exception because different blobs can have different columns/schemas. The code below should therefore be moved inside the foreach loop, and you need to update your other code references accordingly. That way each blob is read with its own schema, and I think this should help you.

    var sampleStream = sampleblobClient.OpenRead();  
    var sampleReader = new ParquetReader(sampleStream);  
    Table sampleTable = sampleReader.ReadAsTable();  
    DataField[] datafields = sampleTable.Schema.GetDataFields();  
    Table finalTable = new Table(datafields);  
    
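    For illustration, the loop might then look like the sketch below. Reading each blob with its own schema avoids the mismatched-column exception, and since your original goal was to find columns with null values, I have added a per-column null count; that counting logic is only my assumption of what you intend, and the sketch is untested:

    ```csharp
    // Sketch only: each blob gets its own reader, schema, and table,
    // so differing column order across files no longer causes an exception.
    await foreach (BlobItem blobItem in containerClient.GetBlobsAsync())
    {
        BlobClient blobClient = containerClient.GetBlobClient(blobItem.Name);
        using var stream = blobClient.OpenRead();
        using var reader = new ParquetReader(stream);

        Table table = reader.ReadAsTable();
        DataField[] dataFields = table.Schema.GetDataFields();

        // Assumed approach to the original goal: count nulls per column in this file.
        var nullCounts = new Dictionary<string, int>();
        foreach (Row row in table)
        {
            for (int i = 0; i < dataFields.Length; i++)
            {
                if (row[i] == null)
                {
                    string col = dataFields[i].Name;
                    nullCounts[col] = nullCounts.TryGetValue(col, out var c) ? c + 1 : 1;
                }
            }
        }

        foreach (var kvp in nullCounts)
            log.LogInformation($"{blobItem.Name}: column '{kvp.Key}' has {kvp.Value} null(s)");
    }
    ```

    The `using` declarations also make sure each blob's stream and reader are disposed before the next iteration.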

    If you have already resolved the issue, feel free to post your solution as an answer so it can help the community.

