skip header row in pipe delimited file using synapse pyspark

rajendar erabathini 616 Reputation points
2023-03-11T05:16:26.3633333+00:00

Hi - I have a pipe-delimited file with a header in the first row followed by detail rows. I am developing a PySpark notebook in Synapse Analytics and trying to skip the first row using skipRows, but it is not working. Please note that I am NOT using Databricks; I am writing code in a Spark notebook in Synapse Analytics.

df1 = spark.read.options(delimiter='\r', header="true", skipRows=1) \
    .csv("abfss://******@usfdmdls.dfs.core.windows.net/folder1/folder2/filename")

As a workaround, I have filtered the header row out of the dataframe using a where clause.

header = df1.first()[0]
df2 = df1.where(df1['_c0'] != header)

Now I have a dataframe with one pipe-delimited column. Can someone please help me with code to split each row into multiple columns based on the pipe delimiter and to assign column names like Prop0, Prop1, ... dynamically based on the number of columns? Please note that I have several files, and each file has a different number of columns, so I do not want to assign the column names manually; rather, I want to assign them dynamically as a generalized approach.

Please review my approach above and let me know if there is a better way to meet the goal. Please help me with a code snippet.

Please let me know if you have any further questions and thanks for your help.

Thanks,

RR netha

Azure Synapse Analytics

Accepted answer
  AnnuKumari-MSFT 34,361 Reputation points Microsoft Employee
    2023-03-13T19:09:32.4833333+00:00

    @rajendar erabathini ,

    Thank you for using the Microsoft Q&A platform and for posting your query here.

    From the description of your query, I understand that you want to skip rows from the dataframe using a Synapse notebook, and also to split a single column into multiple columns on the pipe delimiter. Please let me know if my understanding is incorrect.

    • To skip the rows, try filtering with a lambda function in PySpark (a minimal sketch of this approach follows the reference link below). Here are a few helpful resources:
    1. how to skip first few rows from data file in pyspark
    2. How to skip lines while reading a CSV file as a dataFrame using PySpark?
    • For splitting the column, you can make use of the split function on top of df1:

       from pyspark.sql.functions import split

       # split() treats its pattern as a regex, so the pipe must be escaped;
       # an unescaped '|' would split between every character.
       # '_c0' is the single pipe-delimited column from your dataframe.
       df2 = df1.withColumn("temp", split(df1['_c0'], '\\|'))

       # Count the columns from the first row, then build Prop0, Prop1, ... names.
       num_cols = len(df2.select("temp").first()[0])
       col_names = ["Prop" + str(i) for i in range(num_cols)]

       # Promote each array element to its own column.
       for i in range(num_cols):
           df2 = df2.withColumn(col_names[i], df2.temp[i])

       df2 = df2.drop("temp")
       df2.show()
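    The same result can also be produced with a single select instead of a loop of withColumn calls. This is only a sketch of an equivalent variant, again assuming the single input column is named _c0 as in your workaround:

       from pyspark.sql.functions import split, col

       # Split once and reuse the array expression for every output column.
       arr = split(col('_c0'), '\\|')
       num_cols = len(df1.select(arr.alias('t')).first()['t'])
       df2 = df1.select(*[arr.getItem(i).alias("Prop" + str(i)) for i in range(num_cols)])
       df2.show()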
       
    

    Here is the reference documentation: PySpark split() Column into Multiple Columns
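    Regarding the row-skipping part mentioned above: skipRows is not among the documented options of the open-source Spark CSV reader, which would explain why it appears to be silently ignored in a Synapse notebook. A minimal sketch of the approach described in the linked resources, reusing the path from your question, could look like this:

       # Read the raw lines, tag each with its global index,
       # and drop the first n_skip rows.
       n_skip = 1

       rdd = spark.sparkContext.textFile(
           "abfss://******@usfdmdls.dfs.core.windows.net/folder1/folder2/filename")

       rdd_body = (rdd.zipWithIndex()
                      .filter(lambda pair: pair[1] >= n_skip)
                      .map(lambda pair: pair[0]))

       # Wrap each line in a 1-tuple so it becomes a single-column dataframe
       # (_c0), matching the shape used in the split example above.
       df1 = rdd_body.map(lambda line: (line,)).toDF(["_c0"])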


    Hope it helps. Please accept the answer if it was helpful. Thanks.

