Azure Databricks pyspark losing rows when renaming columns

Question

I can no longer re-create this issue, apologies.

I have a line of PySpark that I am running in Databricks:

df = df.toDF(*[format_column(c) for c in df.columns])

where format_column is a Python function that upper-cases and strips the column names and removes the full stop (.) and backtick (`) characters from them.
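For reference, a minimal sketch of what such a function might look like. The actual implementation is not shown here, so the body below, the order of the operations, and the sample column names are all assumptions made for illustration.

```python
def format_column(name: str) -> str:
    """Hypothetical reconstruction: strip whitespace, upper-case,
    then drop full stops and backticks from a column name."""
    cleaned = name.strip().upper()
    # Remove the two characters that are special in Spark SQL column names.
    return cleaned.replace(".", "").replace("`", "")


# Made-up column names, including one containing a pipe character.
columns = [" order.id ", "`total|net`"]
formatted = [format_column(c) for c in columns]
print(formatted)
```

Note that a function like this leaves pipe characters in place, so they are still present in the names passed to toDF.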

Across this line of code, the dataframe randomly loses a number of rows: if I do a count immediately before and immediately after the line, the row count drops.
I did some more digging and found the same behaviour when I tried the following:

import pyspark.sql.functions as F
df = df.toDF(*[F.col(column_name).alias(column_name) for column_name in df.columns])

although the following is ok without the aliasing:

import pyspark.sql.functions as F
df = df.toDF(*[F.col(column_name) for column_name in df.columns])

and it is also ok if I don't rename all columns such as:

import pyspark.sql.functions as F
df = df.toDF(*[F.col(column_name).alias(column_name) for column_name in df.columns[:-1]])

Finally, there were some pipe (|) characters in the column names. When I removed them manually beforehand, the issue did not occur.
As far as I know, pipe is not actually a special character in Spark SQL column names (unlike full stop and backtick).

Has anyone seen this kind of behaviour before, and does anyone know of a solution other than removing the pipe character manually beforehand?

Running on Databricks Runtime 10.4 LTS.

Answer

@HimanshuSinha-msft thanks for your questions. I have since tried to re-create this issue without success.

I would therefore prefer to close this and will follow up if it occurs again. Thanks.
