Parquet in data lake vs. .csv file

Question

bk 466

Hi All,
What are the advantages of loading source data into Azure storage (via Azure Storage Explorer) in Parquet format vs. as a .csv file? Are there any disadvantages as well? Please help.
Thanks

HimanshuSinha-msft 19,486 Reputation points Microsoft Employee Moderator

2021-05-18T22:12:36.647+00:00

Hello @bk ,
We haven’t heard from you since the last response and were just checking back to see if you have a resolution yet. If you do, please share it with the community, as it can be helpful to others. Otherwise, please respond with more details and we will try to help.
Thanks
Himanshu

Answer 1

Hello @bk ,
Thanks for the question and for using the Microsoft Q&A platform.

Advantages of Storing Data in a Columnar Format:

Columnar storage like Apache Parquet is designed to be more efficient than row-based files like CSV. When querying columnar storage, you can skip over non-relevant data very quickly. As a result, aggregation queries are less time-consuming than on row-oriented data. This way of storing data translates into hardware savings and minimized latency for accessing data.
Apache Parquet is built from the ground up to support advanced nested data structures. The layout of Parquet data files is optimized for queries that process large volumes of data, in the gigabyte range for each individual file.
Parquet is built to support flexible compression options and efficient encoding schemes. Because the values within a column share a data type, compressing each column is straightforward (which makes queries even faster). Data can be compressed with one of several available codecs; as a result, different data files can be compressed differently.
Apache Parquet works best with interactive and serverless technologies like Amazon Athena, Amazon Redshift Spectrum, Google BigQuery and Google Dataproc.
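
To make the "skip over the non-relevant data" point concrete, here is a minimal sketch using only the Python standard library. It does not read or write real Parquet files; it is a toy model that stores the same hypothetical table once row by row (CSV-like) and once column by column (the idea behind Parquet), then sums one column from each layout. The sample data, the column names, and the function names are all assumptions made up for this illustration.

```python
import csv
import io

# Hypothetical sample table: 1,000 rows with the columns id, country, amount.
rows = [(i, "DE" if i % 2 else "US", round(i * 0.5, 1)) for i in range(1000)]

# Row-oriented layout (CSV-like): all fields of a record are stored together.
buf = io.StringIO()
csv.writer(buf).writerows(rows)
row_bytes = buf.getvalue().encode("utf-8")

# Column-oriented layout (toy model): each column is stored contiguously.
columns = {
    "id": [r[0] for r in rows],
    "country": [r[1] for r in rows],
    "amount": [r[2] for r in rows],
}
column_bytes = {
    name: "\n".join(map(str, values)).encode("utf-8")
    for name, values in columns.items()
}


def total_amount_from_rows(data: bytes) -> float:
    """Row layout: every row is parsed in full just to reach one field."""
    reader = csv.reader(io.StringIO(data.decode("utf-8")))
    return sum(float(record[2]) for record in reader)


def total_amount_from_columns(chunks: dict) -> float:
    """Column layout: only the 'amount' chunk is touched; others are skipped."""
    return sum(float(v) for v in chunks["amount"].decode("utf-8").split("\n"))


row_total = total_amount_from_rows(row_bytes)
col_total = total_amount_from_columns(column_bytes)

# Bytes that had to be scanned to answer the same aggregation query.
bytes_scanned_rows = len(row_bytes)
bytes_scanned_columns = len(column_bytes["amount"])

print("same result:", row_total == col_total)
print("bytes scanned (row layout):   ", bytes_scanned_rows)
print("bytes scanned (column layout):", bytes_scanned_columns)
```

Both layouts give the same total, but the column layout only scans the bytes of the one column the query needs. Real Parquet adds per-column encoding, compression, and metadata on top of this idea, so the sketch only illustrates the access pattern and not actual file sizes or speeds.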

Difference Between Parquet and CSV

CSV is a simple and widespread format; many tools, such as Excel, Google Sheets, and numerous others, can generate CSV files. Even though CSV files are the default format for data processing pipelines, they have some disadvantages compared with Parquet.

Parquet has helped its users reduce storage requirements by at least one-third on large datasets. In addition, it greatly improves scan and deserialization time, and hence reduces overall costs.

You can read more here .

Please do let me know how it goes.
Thanks
Himanshu
Please consider clicking "Accept Answer" and "Up-vote" on the post that helps you, as it can be beneficial to other community members.

bk 466 Reputation points

2021-05-04T13:25:16.56+00:00

Thanks @Himanshu Sinha
I am getting the following error when I debug the pipeline:
Failure happened on 'Sink' side. ErrorCode=JreNotFound,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=Java Runtime Environment cannot be found on the Self-hosted Integration Runtime machine. It is required for parsing or writing to Parquet/ORC files. Make sure Java Runtime Environment has been installed on the Self-hosted Integration Runtime machine.,Source=Microsoft.DataTransfer.Common,''Type=System.DllNotFoundException,Message=Unable to load DLL 'jvm.dll': The specified module could not be found. (Exception from HRESULT: 0x8007007E),Source=Microsoft.DataTransfer.Richfile.HiveOrcBridge,'

Where can I download the JRE, and do I need to restart the server? Will it cause any issues?
Thanks
HimanshuSinha-msft 19,486 Reputation points Microsoft Employee Moderator

2021-05-11T18:30:31.243+00:00

Hello @bk ,

Here is the fix for the error: https://learn.microsoft.com/en-us/troubleshoot/azure/general/error-run-copy-activity-azure
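
Before or after applying the fix, a quick check like the one below may help confirm whether a `jvm.dll` is visible on the Self-hosted Integration Runtime machine. This is only a diagnostic sketch, not part of the official fix. It assumes the JRE location is exposed through the `JAVA_HOME` environment variable, so a JRE that is registered only in the Windows registry would not be detected. The function name `find_jvm_dll` is hypothetical.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def find_jvm_dll(env: Mapping[str, str] = os.environ) -> Optional[Path]:
    """Return the first jvm.dll found under JAVA_HOME, or None.

    Assumption: the JRE is located via the JAVA_HOME environment variable.
    A JRE registered only in the Windows registry is not detected here.
    """
    java_home = env.get("JAVA_HOME")
    if not java_home:
        return None
    root = Path(java_home)
    if not root.is_dir():
        return None
    # jvm.dll usually sits in a subfolder (for example bin\server),
    # so search the whole JAVA_HOME tree.
    return next(root.rglob("jvm.dll"), None)


if __name__ == "__main__":
    found = find_jvm_dll()
    if found:
        print(f"jvm.dll found at: {found}")
    else:
        print("No jvm.dll found under JAVA_HOME (or JAVA_HOME is not set).")
```

If the script reports that no `jvm.dll` was found, that is consistent with the `JreNotFound` / `Unable to load DLL 'jvm.dll'` message in the error above, and the linked article describes the installation steps.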

On a side note, I request that you always start a new thread for a new question. This improves discoverability for other users and helps a lot in growing the community.

Thanks
Himanshu
Please consider clicking "Accept Answer" and "Up-vote" on the post that helps you, as it can be beneficial to other community members.
HimanshuSinha-msft 19,486 Reputation points Microsoft Employee Moderator

2021-05-12T17:13:49.17+00:00

Hello @bk ,
We haven’t heard from you since the last response and were just checking back to see if you have a resolution yet. If you do, please share it with the community, as it can be helpful to others. Otherwise, please respond with more details and we will try to help.
Thanks
Himanshu
