Remove characters in from values pyspark

Shambhu Rai 1,406 Reputation points
2022-08-18T18:32:32.387+00:00

Hi Expert,

How to remove characters from column values pyspark sql

I.e gffg546, gfg6544

Azure Data Lake Storage
Azure Data Lake Storage
An Azure service that provides an enterprise-wide hyper-scale repository for big data analytic workloads and is integrated with Azure Blob Storage.
1,336 questions
Azure Synapse Analytics
Azure Synapse Analytics
An Azure analytics service that brings together data integration, enterprise data warehousing, and big data analytics. Previously known as Azure SQL Data Warehouse.
4,340 questions
Azure Databricks
Azure Databricks
An Apache Spark-based analytics platform optimized for Azure.
1,907 questions
{count} votes

Accepted answer
  1. Dillon Silzer 54,286 Reputation points
    2022-08-18T20:24:27.46+00:00

    Hi @Shambhu Rai

    You can use this with Spark Tables + Pandas DataFrames:

    Example:

    import pandas as pd  
      
    df=spark.table("your.table_name")  
      
    pddf = df.toPandas()  
      
    pddf['ColumnName']=pddf['ColumnName'].replace(regex=[r'\D+'], value="")   
      
    display(pddf)  
    

    Output without replace:

    232574-image.png

    Output after using the replace function:

    232560-image.png

    Documentation for converting spark to pd

    https://docs.databricks.com/spark/latest/spark-sql/spark-pandas.html

    --------------------------

    If this is helpful please accept answer.

    0 comments No comments

5 additional answers

Sort by: Most helpful
  1. Shambhu Rai 1,406 Reputation points
    2022-08-18T20:02:15.07+00:00

    Suggestion please

    0 comments No comments

  2. Dillon Silzer 54,286 Reputation points
    2022-08-18T20:02:45.637+00:00

    Hi @Shambhu Rai

    You can process the pyspark table in panda frames to remove non-numeric characters as seen below:

    Example code: (replace with your pyspark statement)

    import pandas as pd  
       
    df = pd.DataFrame({  
         'A': ['gffg546', 'gfg6544', 'gfg65443213123'],  
    })  
      
    df['A'] = df['A'].replace(regex=[r'\D+'], value="")   
    display(df)  
    

    232596-image.png

    Cited from: https://stackoverflow.com/questions/44117326/how-can-i-remove-all-non-numeric-characters-from-all-the-values-in-a-particular

    ----------------------------

    If this is helpful please accept answer.

    0 comments No comments

  3. Shambhu Rai 1,406 Reputation points
    2022-08-18T20:07:47.3+00:00

    Hi Expert,

    How to do it on column level and get values 10-25 as it is in target column. Istead of 'A' can we add column


  4. Shambhu Rai 1,406 Reputation points
    2022-08-18T20:08:03.997+00:00

    Hi Expert,

    How to do it on column level and get values 10-25 as it is in target column. Istead of 'A' can we add column

    0 comments No comments