Azure Databricks PySpark losing rows when renaming columns

Matthieu Marshall 6 Reputation points
2022-09-21T14:39:12.613+00:00

I can no longer re-create this issue, apologies.


I have a line of PySpark that I am running in Databricks:

df = df.toDF(*[format_column(c) for c in df.columns])

where format_column is a Python function that upper-cases and strips each column name and removes the full stop (.) and backtick (`) characters from it.
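
For reference, here is a minimal sketch of what such a function might look like. The original implementation was not posted, so the body below is an assumption based only on the description above, and the sample column name is made up.

```python
def format_column(name: str) -> str:
    """Hypothetical reconstruction: normalise a single column name."""
    # Remove the two characters that are special in Spark SQL identifiers.
    cleaned = name.replace(".", "").replace("`", "")
    # Trim surrounding whitespace and upper-case the result.
    return cleaned.strip().upper()


print(format_column(" order.id` "))  # ORDERID
```

Note that a function written this way leaves pipe characters untouched, so a name such as "net|gross" would come out as "NET|GROSS".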

Across this line of code, the DataFrame loses rows, seemingly at random: if I do a count immediately before and immediately after the line, the number of rows drops.
I did some more digging and found the same behaviour when I tried the following:

import pyspark.sql.functions as F
df = df.toDF(*[F.col(column_name).alias(column_name) for column_name in df.columns])

although the following, without the aliasing, is fine:

import pyspark.sql.functions as F
df = df.toDF(*[F.col(column_name) for column_name in df.columns])

and it is also fine if I don't rename all of the columns, for example:

import pyspark.sql.functions as F
df = df.toDF(*[F.col(column_name).alias(column_name) for column_name in df.columns[:-1]])

Finally, some of the column names contained pipe (|) characters; when I removed these manually beforehand, the issue did not occur.
As far as I know, pipe is not a special character in Spark SQL column names (unlike full stop and backtick).
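
To narrow down which names are involved, a small sketch along these lines can list the columns that contain the characters in question, and show how a name would be backtick-quoted if it were referenced in a Spark SQL expression. Both helpers and the sample column names are hypothetical and were not part of my original code.

```python
def quote_identifier(name: str) -> str:
    """Wrap a column name in backticks for use in a Spark SQL expression."""
    # A literal backtick inside a quoted identifier is escaped by doubling it.
    return "`" + name.replace("`", "``") + "`"


def names_containing(columns, chars=".`|"):
    """Return the column names that contain any character in `chars`."""
    return [c for c in columns if any(ch in c for ch in chars)]


columns = ["id", "net|gross", "unit.price"]  # made-up example names
print(names_containing(columns))       # ['net|gross', 'unit.price']
print(quote_identifier("unit.price"))  # `unit.price`
```

In a notebook, names_containing(df.columns) would give the list of columns worth testing one at a time.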

Has anyone seen this kind of behaviour before, and does anyone know of a solution other than removing the pipe characters manually beforehand?

Running on Databricks Runtime 10.4 LTS.
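
For anyone trying to reproduce this, the before-and-after count can be sketched as below. The helper name is hypothetical. Caching before the first count is my assumption about how to make the two counts more comparable: without it, each count may re-read the source, so a difference would not necessarily come from the rename. Caching is not a hard guarantee either, since evicted partitions can be recomputed.

```python
def count_around_rename(df, rename):
    """Return (rows_before, rows_after) for a rename applied via toDF.

    `df` is expected to be a PySpark DataFrame; `rename` maps an old
    column name to a new one (for example, format_column).
    """
    # Cache first so that both counts are computed from the same
    # materialised data where possible.
    df = df.cache()
    rows_before = df.count()

    # Apply the rename to every column, as in the line under test.
    renamed = df.toDF(*[rename(c) for c in df.columns])
    rows_after = renamed.count()
    return rows_before, rows_after
```

Usage would be something like before, after = count_around_rename(df, format_column). If the two numbers still differ with a cached input, the rename step looks implicated; if they match, the way the source is being re-read is worth checking first.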

Azure Databricks

1 answer

  1. Matthieu Marshall 6 Reputation points
    2022-09-30T07:15:43.557+00:00

    @HimanshuSinha-msft thanks for your questions. I have since tried to re-create this issue without success.

    I would therefore prefer to close this and will follow up if it occurs again. Thanks.
