How to read this multiline CSV file as a spark data frame


Vishal D 5

Hello,

Below is the multiline CSV file sample data delimited with semicolons (;)

Reference no;"Status";"Proj";"Series";"Note";

99V2A0001;"Draft";"PEV";"VP";"PVO";

89V2Z0001;"Accepted";"L541";"VP1";"Person could not catch the delay. Supplier deliver LU:2019/12/23 WD3

Moden Wood:20/01/98 W15

Fl";

99C939993;"Accepted";"V31";"V12";"frigerant, ThermalHeater and Coordinates. The interim sol is to "run the time"

VU1 plann";

99V2A0B01;"Accepted";"519A";"B89";"Problem 1: The 59 TT series were planned to get the "73"/TT but RT 18w44

they VP series "1"/ZP, this";

I tried creating a Spark data frame with the code in the attached image, but the output data frame has 6 rows, whereas the input file has only 4 data rows.

Screenshot (172)

I believe that, because of the extra quotes ("") in the last column of rows 3 and 4, Spark could not read each of those records as a single row. Please help me resolve this issue. (The expected number of output rows is 4.)

Note: I've highlighted the extra quotes in bold for your reference

Regards,

Vishal

AnnuKumari-MSFT 34,566 Reputation points Microsoft Employee Moderator

2023-06-28T17:58:45.6733333+00:00
Hi Vishal D,

Welcome to Microsoft Q&A platform and thanks for posting your question here.

I understand that the multiline records are causing problems when the data is parsed into your data frame.

One way to resolve this issue is to use a custom CSV parser that can handle the extra quotes and the extra line breaks. You can use the "spark-csv" package to read the CSV file with a custom parser.

Here is an example code snippet that reads the CSV file with a custom parser:

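from pyspark.sql.types import StructType, StructField, StringType

# Explicit schema: all five columns are read as nullable strings.
customSchema = StructType([
    StructField("Reference no", StringType(), True),
    StructField("Status", StringType(), True),
    StructField("Proj", StringType(), True),
    StructField("Series", StringType(), True),
    StructField("Note", StringType(), True),
])

# Assumes an existing SparkSession named `spark` (as in a notebook).
df = (
    spark.read
    .option("header", "true")           # first line holds the column names
    .option("delimiter", ";")           # fields are separated by semicolons
    .option("parserLib", "univocity")   # parser selection option from the spark-csv package
    .option("quote", "\"")              # fields are wrapped in double quotes
    .option("escape", "\"")             # treat a double quote as the escape character
    .schema(customSchema)
    .csv("path/to/csv/file")
)
df.show()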

Hope it helps. Please let us know how it goes. Thank you.
Vishal D 5 Reputation points

2023-06-29T05:21:48.17+00:00

Hello AnnuKumari-MSFT,

I tried creating the Spark data frame with the code you provided, but unfortunately it didn't help. It still reads each line as a separate record. Please refer to the attached picture.

I'm also attaching the sample file (sample.txt) that I used to create the data frame. Could you please share other possible ways to fix this? :)

Regards,

Vishal
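
One other possible approach, sketched here under stated assumptions rather than as a verified fix: pre-process the file so that each logical record is rebuilt before Spark sees it. Spark's CSV reader does have a `multiLine` option for quoted fields that span lines, but the unescaped inner quotes in rows 3 and 4 may still confuse its quote handling, and that behaviour is not verified here. The sketch below instead relies on two patterns visible in the sample: every record ends with `";` at the end of a line, and every field after the first opens with `;"`. The helper names `merge_records` and `split_record` are hypothetical, and the rule breaks if a note line happens to end with `";` in the middle of a record.

```python
from typing import Iterable, List

RECORD_END = '";'  # assumption: each logical record ends with quote + semicolon
FIELD_SEP = ';"'   # assumption: each field after the first opens with a quote


def merge_records(lines: Iterable[str]) -> List[str]:
    """Join physical lines until one ends with RECORD_END."""
    records: List[str] = []
    buffer: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not buffer and not line.strip():
            continue  # skip blank lines between records
        buffer.append(line)
        if line.rstrip().endswith(RECORD_END):
            records.append("\n".join(buffer))
            buffer = []
    if buffer:  # keep an unterminated trailing record instead of dropping it
        records.append("\n".join(buffer))
    return records


def split_record(record: str, n_fields: int = 5) -> List[str]:
    """Split one merged record into fields, leaving inner quotes untouched."""
    body = record.rstrip()
    if body.endswith(RECORD_END):
        body = body[: -len(RECORD_END)]
    # maxsplit keeps everything after the last separator in the final field
    first, *rest = body.split(FIELD_SEP, n_fields - 1)
    if not rest:
        return [first]
    # middle fields still carry their closing quote; the last one does not
    middle = [p[:-1] if p.endswith('"') else p for p in rest[:-1]]
    return [first, *middle, rest[-1]]


# Sample data copied from the question.
SAMPLE = '''Reference no;"Status";"Proj";"Series";"Note";
99V2A0001;"Draft";"PEV";"VP";"PVO";
89V2Z0001;"Accepted";"L541";"VP1";"Person could not catch the delay. Supplier deliver LU:2019/12/23 WD3
Moden Wood:20/01/98 W15
Fl";
99C939993;"Accepted";"V31";"V12";"frigerant, ThermalHeater and Coordinates. The interim sol is to "run the time"
VU1 plann";
99V2A0B01;"Accepted";"519A";"B89";"Problem 1: The 59 TT series were planned to get the "73"/TT but RT 18w44
they VP series "1"/ZP, this";
'''

records = merge_records(SAMPLE.splitlines())
header = split_record(records[0])
rows = [split_record(r) for r in records[1:]]
print(len(rows), "data rows")
```

The resulting `rows` and `header` could then be handed to `spark.createDataFrame(rows, header)`. This runs on the driver, so it only suits files that fit comfortably in memory.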