Parquet in data lake vs. .csv file

bk 466 Reputation points
2021-05-03T20:46:45.3+00:00

Hi All,
What are the advantages of loading source data through Azure Storage Explorer in Parquet format vs. as a .csv file? Are there any disadvantages as well? Please help.
Thanks

Azure Data Lake Storage
An Azure service that provides an enterprise-wide hyper-scale repository for big data analytic workloads and is integrated with Azure Blob Storage.

1 answer

  1. HimanshuSinha-msft 19,486 Reputation points Microsoft Employee Moderator
    2021-05-03T23:04:28.29+00:00

    Hello @bk,
    Thanks for the question and for using the Microsoft Q&A platform.

    Advantages of Storing Data in a Columnar Format:

    • Columnar storage such as Apache Parquet is designed to be more efficient than row-based files such as CSV. When querying columnar storage, you can skip over the irrelevant data very quickly. As a result, aggregation queries take less time than they do on row-oriented storage. This translates into hardware savings and lower latency when accessing data.
    • Apache Parquet was built from the ground up with nested data in mind, so it supports advanced nested data structures. The layout of Parquet data files is optimized for queries that process large volumes of data, in the gigabyte range for each individual file.
    • Parquet is built to support flexible compression options and efficient encoding schemes. Because all values in a column share the same data type, each column compresses well (which makes queries even faster). Data can be compressed with any of the several codecs available, so different data files can be compressed differently.
    • Apache Parquet works well with interactive and serverless technologies such as Amazon Athena, Amazon Redshift Spectrum, Google BigQuery, and Google Dataproc.
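
    To make the "skip over the irrelevant data" point concrete, here is a minimal sketch that uses only the Python standard library. It does not read or write real Parquet files; it only contrasts a row layout (CSV text) with a column layout (one list per column). The sample data and all function names are hypothetical and chosen for illustration.

```python
import csv
import io

# Hypothetical sample data (not from the original post).
ROWS = [
    {"order_id": 1, "country": "DE", "amount": 10.0},
    {"order_id": 2, "country": "FR", "amount": 20.5},
    {"order_id": 3, "country": "DE", "amount": 4.5},
]


def to_csv_text(rows):
    """Row layout: serialize the rows as CSV text, one record per line."""
    buf = io.StringIO(newline="")
    writer = csv.DictWriter(buf, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def to_columns(rows):
    """Column layout: one list of values per column name."""
    return {name: [row[name] for row in rows] for name in rows[0]}


def sum_amount_row_based(csv_text):
    """Sum 'amount' from CSV; every field of every row gets parsed."""
    fields_parsed = 0
    total = 0.0
    for row in csv.DictReader(io.StringIO(csv_text)):
        fields_parsed += len(row)  # all columns are parsed, used or not
        total += float(row["amount"])
    return total, fields_parsed


def sum_amount_columnar(columns):
    """Sum 'amount' from the column layout; other columns are never touched."""
    values = columns["amount"]
    return sum(values), len(values)


if __name__ == "__main__":
    print(sum_amount_row_based(to_csv_text(ROWS)))
    print(sum_amount_columnar(to_columns(ROWS)))
```

    Both functions compute the same total, but the row-based one has to parse all three fields of every record, while the columnar one only touches the values of the single column it needs. Real Parquet readers apply the same idea at the file level by reading only the requested column chunks.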

    Difference Between Parquet and CSV

    CSV is a simple and widespread format: many tools, such as Excel and Google Sheets, can read and generate CSV files. Even though CSV is often the default format for data processing pipelines, it has some disadvantages: it is row-based plain text, so a query has to read and parse entire rows even when it needs only a few columns.

    Parquet has helped its users reduce storage requirements by at least one-third on large datasets. In addition, it greatly improves scan and deserialization time, and hence reduces overall costs.
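
    Part of the storage saving comes from column-level encodings such as dictionary encoding and run-length encoding, which Parquet supports. The sketch below is a toy illustration of those two ideas on a single low-cardinality column. It is not Parquet's actual binary format, and the function names and sample column are hypothetical.

```python
from itertools import groupby


def dictionary_encode(values):
    """Replace each value with an integer index into a list of distinct values."""
    dictionary = []
    index_of = {}
    indices = []
    for value in values:
        if value not in index_of:
            index_of[value] = len(dictionary)
            dictionary.append(value)
        indices.append(index_of[value])
    return dictionary, indices


def run_length_encode(indices):
    """Collapse consecutive repeats into (index, run_length) pairs."""
    return [(index, len(list(run))) for index, run in groupby(indices)]


def decode(dictionary, runs):
    """Reverse both encodings to recover the original column."""
    return [dictionary[index] for index, length in runs for _ in range(length)]


if __name__ == "__main__":
    # Hypothetical column with few distinct values and long runs.
    column = ["DE", "DE", "DE", "FR", "FR", "DE"]
    dictionary, indices = dictionary_encode(column)
    runs = run_length_encode(indices)
    print(dictionary, runs)
    print(decode(dictionary, runs) == column)
```

    The fewer distinct values a column has and the longer its runs, the more these encodings shrink it. A row-based text format like CSV cannot apply them per column, because values from different columns are interleaved on every line.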

    You can read more here.

    Please do let me know how it goes.
    Thanks
    Himanshu
    Please consider clicking "Accept Answer" and "Up-vote" on the post that helps you, as it can be beneficial to other community members.

