remove chars from integer columns in databricks delta

Question

Hi Expert,

how to remove chars from column in databricks delta when datatype is not null

col1

1

2

3

er

ge

e

Answer

Hello Shambhu Rai,

To remove characters from a column in Databricks Delta, you can use the regexp_replace function from PySpark. This function replaces all substrings of the column’s value that match the pattern regex with the replacement string.

from pyspark.sql.functions import regexp_replace, when
from pyspark.sql.types import IntegerType

# create a sample dataframe with col1
data = [("1",), ("2",), ("3",), ("er",), ("ge",), ("e",)]
df = spark.createDataFrame(data, ["col1"])

# remove non-numeric characters from col1
df = df.withColumn("col1", regexp_replace("col1", "[^0-9]", ""))

# cast col1 to integer type
df = df.withColumn("col1", df["col1"].cast(IntegerType()))

# replace empty strings with null
df = df.withColumn("col1", when(df["col1"] == "", None).otherwise(df["col1"]))

# display the output
df.show()

User's image

I hope this helps.

Answer

I would go for using regexp to replace unwanted characters :

from pyspark.sql import SparkSession
from pyspark.sql.functions import regexp_replace

spark = SparkSession.builder.appName("DataCleaning").getOrCreate()

df = spark.read.format("delta").load("/path/to/your/delta/table")

df_cleaned = df.withColumn("col1_clean", regexp_replace("col1", "er|ge", ""))

df_cleaned.show()

Share via

remove chars from integer columns in databricks delta

2 answers

Your answer