remove chars from integer columns in databricks delta

Shambhu Rai 1,406 Reputation points
2023-12-08T12:02:08.32+00:00

Hi Expert,

How do I remove character values from a column in Databricks Delta when the column's datatype is not nullable?

col1
1
2
3
er
ge
e


2 answers

  1. BhargavaGunnam-MSFT 26,221 Reputation points Microsoft Employee
    2023-12-08T22:16:40.0766667+00:00

    Hello Shambhu Rai,

    To remove characters from a column in Databricks Delta, you can use the regexp_replace function from PySpark. It replaces every substring of the column value that matches the regex pattern with the replacement string. Note that the empty-string-to-null replacement has to run before the cast to integer: once col1 is an integer column, comparing it to "" never matches.

    from pyspark.sql.functions import regexp_replace, when, col
    from pyspark.sql.types import IntegerType

    # create a sample dataframe with col1
    data = [("1",), ("2",), ("3",), ("er",), ("ge",), ("e",)]
    df = spark.createDataFrame(data, ["col1"])

    # remove non-numeric characters from col1
    df = df.withColumn("col1", regexp_replace("col1", "[^0-9]", ""))

    # replace empty strings (values that were entirely non-numeric) with null
    df = df.withColumn("col1", when(col("col1") == "", None).otherwise(col("col1")))

    # cast col1 to integer type
    df = df.withColumn("col1", col("col1").cast(IntegerType()))

    # display the output
    df.show()

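    With the sample data above, df.show() should print something like this (the exact null rendering varies by Spark version):

    +----+
    |col1|
    +----+
    |   1|
    |   2|
    |   3|
    |null|
    |null|
    |null|
    +----+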

    I hope this helps.


  2. Amira Bedhiafi 15,446 Reputation points
    2023-12-08T22:41:33.2+00:00

    I would go for regexp_replace to strip the unwanted characters:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import regexp_replace

    spark = SparkSession.builder.appName("DataCleaning").getOrCreate()

    # load the Delta table containing the column to clean
    df = spark.read.format("delta").load("/path/to/your/delta/table")

    # strip the unwanted character sequences; use the pattern "[^0-9]" instead
    # to remove every non-numeric character (the lone "e" value, for example)
    df_cleaned = df.withColumn("col1_clean", regexp_replace("col1", "er|ge", ""))

    df_cleaned.show()
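
    If you need to persist the cleaned values back to the Delta table, here is a minimal sketch, assuming the same placeholder path as above and that overwriting the table is acceptable:

    from pyspark.sql.types import IntegerType

    # cast the cleaned column to integer; empty strings become null
    df_final = df_cleaned.withColumn("col1_clean", df_cleaned["col1_clean"].cast(IntegerType()))

    # overwrite the Delta table in place; overwriteSchema is needed because
    # col1_clean is a new column that is not in the existing table schema
    df_final.write.format("delta") \
        .mode("overwrite") \
        .option("overwriteSchema", "true") \
        .save("/path/to/your/delta/table")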