skip header row in pipe delimited file using synapse pyspark

rajendar erabathini 616 Reputation points
2023-03-11T05:16:26.3633333+00:00

Hi - I have a pipe-delimited file with a header in the first row followed by detail rows. I am developing a PySpark notebook in Synapse Analytics and trying to skip the first row using skipRows, but it is not working. Please note that I am NOT using Databricks; I am writing code in a Spark notebook in Synapse Analytics.

df1 = spark.read.options(delimiter='\r', header="true", skipRows=1) \
    .csv("abfss://******@usfdmdls.dfs.core.windows.net/folder1/folder2/filename")

As a workaround, I have filtered the header row out of the dataframe using a where clause.

header = df1.first()[0]
df2 = df1.where(df1['_c0'] != header)

Now I have a dataframe with one pipe-delimited column. Can someone please help me with code to split each row into multiple columns based on the pipe delimiter and to assign column names like Prop0, Prop1, ... dynamically based on the number of columns? Please note that I have several files, and each file has a different number of columns, so I do not want to assign the column names manually; rather, I want to assign them dynamically as a generalized approach.

Please review my approach above and let me know if there is a better way to meet the goal. Please help me with a code snippet.

Please let me know if you have any further questions and thanks for your help.

Thanks,

RR netha

Azure Synapse Analytics

Accepted answer
  AnnuKumari-MSFT 34,361 Reputation points Microsoft Employee
    2023-03-13T19:09:32.4833333+00:00

    @rajendar erabathini ,

    Thank you for using the Microsoft Q&A platform and for posting your query here.

    From the description of your query, I understand that you want to skip rows from the dataframe using a Synapse notebook, and also to split a single column into multiple columns on the pipe delimiter. Please let me know if my understanding is incorrect.

    • To skip the rows, try filtering with a lambda function in PySpark (a minimal sketch of this approach follows the reference link below). Here are a few helpful resources:
    1. how to skip first few rows from data file in pyspark
    2. How to skip lines while reading a CSV file as a dataFrame using PySpark?
    • For splitting the column, you can make use of the split function on top of df1:

       from pyspark.sql.functions import split

       # split() treats its pattern as a regex, so the pipe must be escaped;
       # an unescaped '|' would split between every character.
       # '_c0' is the single pipe-delimited column from your dataframe.
       df2 = df1.withColumn("temp", split(df1['_c0'], '\\|'))

       # Count the columns from the first row, then build Prop0, Prop1, ... names.
       num_cols = len(df2.select("temp").first()[0])
       col_names = ["Prop" + str(i) for i in range(num_cols)]

       # Promote each array element to its own column.
       for i in range(num_cols):
           df2 = df2.withColumn(col_names[i], df2.temp[i])

       df2 = df2.drop("temp")
       df2.show()
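    The same result can also be produced with a single select instead of a loop of withColumn calls. This is only a sketch of an equivalent variant, again assuming the single input column is named _c0 as in your workaround:

       from pyspark.sql.functions import split, col

       # Split once and reuse the array expression for every output column.
       arr = split(col('_c0'), '\\|')
       num_cols = len(df1.select(arr.alias('t')).first()['t'])
       df2 = df1.select(*[arr.getItem(i).alias("Prop" + str(i)) for i in range(num_cols)])
       df2.show()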
       
    

    Here is the reference documentation: PySpark split() Column into Multiple Columns
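    Regarding the row-skipping part mentioned above: skipRows is not among the documented options of the open-source Spark CSV reader, which would explain why it appears to be silently ignored in a Synapse notebook. A minimal sketch of the approach described in the linked resources, reusing the path from your question, could look like this:

       # Read the raw lines, tag each with its global index,
       # and drop the first n_skip rows.
       n_skip = 1

       rdd = spark.sparkContext.textFile(
           "abfss://******@usfdmdls.dfs.core.windows.net/folder1/folder2/filename")

       rdd_body = (rdd.zipWithIndex()
                      .filter(lambda pair: pair[1] >= n_skip)
                      .map(lambda pair: pair[0]))

       # Wrap each line in a 1-tuple so it becomes a single-column dataframe
       # (_c0), matching the shape used in the split example above.
       df1 = rdd_body.map(lambda line: (line,)).toDF(["_c0"])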


    Hope it helps. Please accept the answer if it was helpful. Thanks.

