Reading Parquet file in c# from Azure Blob Storage

Admin (KK) 136 Reputation points
2021-07-23T10:00:51.347+00:00

Hello,

I am trying to read parquet files from a storage account. I am using the Parquet.Net library for reading the parquet files. My goal is to read all the parquet files in the storage account and check which columns have null values.

I tried using Azure Databricks with PySpark, but since some of the column names have special characters it does not work. I tried pandas in Azure Databricks, but it takes a long time to process. Hence I tried using Azure Functions with C#. However, I am getting an error because each parquet file has a different column order.

Could someone help me with what other options I have, or how I can fix this?

    string connectionString = "<<storage account connection string>>";
    log.LogInformation($"C# Timer trigger function executed at: {DateTime.Now}");
    BlobServiceClient blobServiceClient = new BlobServiceClient(connectionString);
    string containerName = "containername";
    BlobContainerClient containerClient = blobServiceClient.GetBlobContainerClient(containerName);

    // Getting the sample structure of one blob to create the final table
    string sampleBlobName = "Blobname.parquet";
    BlobClient sampleblobClient = containerClient.GetBlobClient(sampleBlobName);
    var sampleStream = sampleblobClient.OpenRead();
    var sampleReader = new ParquetReader(sampleStream);
    Table sampleTable = sampleReader.ReadAsTable();
    DataField[] datafields = sampleTable.Schema.GetDataFields();
    Table finalTable = new Table(datafields);

    // Looping through the container, reading each blob and adding its rows to the final table
    await foreach (BlobItem blobItem in containerClient.GetBlobsAsync())
    {
        if (blobItem.Name.Contains("2021072")) // for testing
        {
            log.LogInformation("\t" + blobItem.Name);
            BlobClient blobClient = containerClient.GetBlobClient(blobItem.Name);
            var stream = blobClient.OpenRead();
            var reader = new ParquetReader(stream);

            Table table = reader.ReadAsTable();

            foreach (Row row in table)
            {
                finalTable.Add(row);
            }
            log.LogInformation(finalTable.Count.ToString());
        }
    }

1 answer

  1. MayankBargali-MSFT 70,936 Reputation points Moderator
    2021-07-29T04:45:44.7+00:00

    @Admin (KK) Apologies for the delay. As I understand it, the issue is with the usage of the parquet-dotnet library. I am not an expert on parquet-dotnet, but looking at the code I can see that you are looping through the BlobItems, and, as you mentioned, you are getting the exception because different blobs can have different columns/schemas. The code below should therefore be moved inside the foreach loop, and you need to update your other code references accordingly. That way each blob is read with its own schema, and I think this should help you.

    var sampleStream = sampleblobClient.OpenRead();  
    var sampleReader = new ParquetReader(sampleStream);  
    Table sampleTable = sampleReader.ReadAsTable();  
    DataField[] datafields = sampleTable.Schema.GetDataFields();  
    Table finalTable = new Table(datafields);  
    
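    For illustration, the loop might then look like the sketch below. Reading each blob with its own schema avoids the mismatched-column exception, and since your original goal was to find columns with null values, I have added a per-column null count; that counting logic is only my assumption of what you intend, and the sketch is untested:

    ```csharp
    // Sketch only: each blob gets its own reader, schema, and table,
    // so differing column order across files no longer causes an exception.
    await foreach (BlobItem blobItem in containerClient.GetBlobsAsync())
    {
        BlobClient blobClient = containerClient.GetBlobClient(blobItem.Name);
        using var stream = blobClient.OpenRead();
        using var reader = new ParquetReader(stream);

        Table table = reader.ReadAsTable();
        DataField[] dataFields = table.Schema.GetDataFields();

        // Assumed approach to the original goal: count nulls per column in this file.
        var nullCounts = new Dictionary<string, int>();
        foreach (Row row in table)
        {
            for (int i = 0; i < dataFields.Length; i++)
            {
                if (row[i] == null)
                {
                    string col = dataFields[i].Name;
                    nullCounts[col] = nullCounts.TryGetValue(col, out var c) ? c + 1 : 1;
                }
            }
        }

        foreach (var kvp in nullCounts)
            log.LogInformation($"{blobItem.Name}: column '{kvp.Key}' has {kvp.Value} null(s)");
    }
    ```

    The `using` declarations also make sure each blob's stream and reader are disposed before the next iteration.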

    If you have already resolved the issue, feel free to post your solution as an answer so it can help the community.

