remove chars from integer columns in databricks delta

Shambhu Rai 1,406 Reputation points
2023-12-08T12:02:08.32+00:00

Hi Expert,

How do I remove character values from a column in Databricks Delta when the column's datatype is not nullable?

col1
1
2
3
er
ge
e


2 answers

  1. BhargavaGunnam-MSFT 26,221 Reputation points Microsoft Employee
    2023-12-08T22:16:40.0766667+00:00

    Hello Shambhu Rai,

    To remove characters from a column in Databricks Delta, you can use the regexp_replace function from PySpark. It replaces every substring of the column value that matches the regex pattern with the replacement string. Note that the empty-string-to-null replacement has to run before the cast to integer: once col1 is an integer column, comparing it to "" never matches.

    from pyspark.sql.functions import regexp_replace, when, col
    from pyspark.sql.types import IntegerType

    # create a sample dataframe with col1
    data = [("1",), ("2",), ("3",), ("er",), ("ge",), ("e",)]
    df = spark.createDataFrame(data, ["col1"])

    # remove non-numeric characters from col1
    df = df.withColumn("col1", regexp_replace("col1", "[^0-9]", ""))

    # replace empty strings (values that were entirely non-numeric) with null
    df = df.withColumn("col1", when(col("col1") == "", None).otherwise(col("col1")))

    # cast col1 to integer type
    df = df.withColumn("col1", col("col1").cast(IntegerType()))

    # display the output
    df.show()

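    With the sample data above, df.show() should print something like this (the exact null rendering varies by Spark version):

    +----+
    |col1|
    +----+
    |   1|
    |   2|
    |   3|
    |null|
    |null|
    |null|
    +----+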

    I hope this helps.


  2. Amira Bedhiafi 15,446 Reputation points
    2023-12-08T22:41:33.2+00:00

    I would go for regexp_replace to strip the unwanted characters:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import regexp_replace

    spark = SparkSession.builder.appName("DataCleaning").getOrCreate()

    # load the Delta table containing the column to clean
    df = spark.read.format("delta").load("/path/to/your/delta/table")

    # strip the unwanted character sequences; use the pattern "[^0-9]" instead
    # to remove every non-numeric character (the lone "e" value, for example)
    df_cleaned = df.withColumn("col1_clean", regexp_replace("col1", "er|ge", ""))

    df_cleaned.show()
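
    If you need to persist the cleaned values back to the Delta table, here is a minimal sketch, assuming the same placeholder path as above and that overwriting the table is acceptable:

    from pyspark.sql.types import IntegerType

    # cast the cleaned column to integer; empty strings become null
    df_final = df_cleaned.withColumn("col1_clean", df_cleaned["col1_clean"].cast(IntegerType()))

    # overwrite the Delta table in place; overwriteSchema is needed because
    # col1_clean is a new column that is not in the existing table schema
    df_final.write.format("delta") \
        .mode("overwrite") \
        .option("overwriteSchema", "true") \
        .save("/path/to/your/delta/table")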