Azure Data Lake Storage
An Azure service that provides an enterprise-wide hyper-scale repository for big data analytic workloads and is integrated with Azure Blob Storage.
1,481 questions
This browser is no longer supported.
Upgrade to Microsoft Edge to take advantage of the latest features, security updates, and technical support.
Hi Expert,
how to remove chars from column in databricks delta when datatype is not null
col1
1
2
3
er
ge
e
Hello Shambhu Rai,
To remove characters from a column in Databricks Delta, you can use the regexp_replace function from PySpark. This function replaces all substrings of the column’s value that match the pattern regex with the replacement string.
from pyspark.sql.functions import regexp_replace, when
from pyspark.sql.types import IntegerType
# create a sample dataframe with col1
data = [("1",), ("2",), ("3",), ("er",), ("ge",), ("e",)]
df = spark.createDataFrame(data, ["col1"])
# remove non-numeric characters from col1
df = df.withColumn("col1", regexp_replace("col1", "[^0-9]", ""))
# cast col1 to integer type
df = df.withColumn("col1", df["col1"].cast(IntegerType()))
# replace empty strings with null
df = df.withColumn("col1", when(df["col1"] == "", None).otherwise(df["col1"]))
# display the output
df.show()
I hope this helps.
I would go for using regexp to replace unwanted characters :
from pyspark.sql import SparkSession
from pyspark.sql.functions import regexp_replace
spark = SparkSession.builder.appName("DataCleaning").getOrCreate()
df = spark.read.format("delta").load("/path/to/your/delta/table")
df_cleaned = df.withColumn("col1_clean", regexp_replace("col1", "er|ge", ""))
df_cleaned.show()