Need T-sql User Defined Function to remove specified words or chars from a table field.

SQL9 246 Reputation points
2021-05-09T02:31:48.827+00:00

Hi All,

I am trying to replace below words or chars from a given name field(company name or person's name) and I want a generic SQL user defined function which should remove all these given words or chars that are passed to the UDF function.

Basically I am storing all these chars or words in a Title table and in future if I add any new word or char to the Titles table then the function should remove the given word or char from the FullName field.

My Source field FullName values looks like below..

FullName:
John Doe Corporate
Mary Jones llp
Doctor Whu ltd
Leia & Organa MD
A&O Office
Dr. Ray Wed
James Adams Jr.
William Duke iii
Mr.John King Sr.
First Step, Inc.
A-Z Needs, Inc.
ABC 100 Sales, ltd

I need a UDF function to show output in below format. I am using SQL server 2016 & 2019 version

Note: 1) I have millions of records to process, so need something which is faster to process(not cursor functionality) the records.
2) It should remove numeric values (reg exp) if any.

FullName NewFullName
John Doe Corporate John Doe
Mary Jones llp Mary Jones
Doctor Whu ltd Doctor Whu
Leia & Organa MD Leia Organa
A&O Office AO Office
Dr.Ray Wed Ray Wed
James Adams Jr. James Adams
William Duke iii William Duke
Mr.John King Sr. John King
First Step, Inc. First Step
A-Z Needs, Inc. AZ Needs
ABC 100 Sales, ltd ABC Sales

CREATE TABLE #titles (title nvarchar(50));
INSERT INTO #titles (title)
VALUES (' & '),('&'),('inc'),( 'incorp'),( 'incorporated'),( 'corporation'),( 'corp'),( 'corporate'),( 'llp'),( 'llc'),( 'ltd'),( 'limited'),( 'the'),( 'dr'),( 'md'),( 'rn'),( 'phd'),( 'jr'),( 'sr'),( 'ii'),( 'iii'),( 'iv'),( '3rd'),( 'aprn'),( 'pa-c'),( 'dds'),( 'dmd'),( 'dpm'),( 'dc'),( 'rnp'),( 'crnp'),( 'arnp'),( 'mpt'),( 'cfnp'),( 'canp'),( 'lnp'),( 'fnp'),( 'mba' ),('.'),(','),('-')

CREATE TABLE #Data (fullname nvarchar(100));
INSERT INTO #Data (fullname) VALUES
('John Doe Corporate'),
('Mary Jones llp'),
('Doctor Whu ltd'),
('Leia & Organa MD'),
('A&O Office'),
('Dr.Ray Wed'),
('James Adams Jr.'),
('William Duke iii'),
('Mr.John King Sr.'),
('First Step, Inc.'),
('A-Z Needs, Inc.'),
('ABC 100 Sales, ltd')

Thanks in advance,
RH

Transact-SQL
Transact-SQL
A Microsoft extension to the ANSI SQL language that includes procedural programming, local variables, and various support functions.
4,552 questions
{count} votes

Accepted answer
  1. Yitzhak Khabinsky 24,941 Reputation points
    2021-05-09T16:38:25.18+00:00

    Hi @SQL9 ,

    Please try the following solution.
    It is pure T-SQL and will work starting from SQL Server 2017 onwards.
    It is up to you to package it as a user defined function.

    I separated the entire process into two steps:

    1. words cleansing.
    2. chars cleansing.

    SQL

    -- DDL and sample data population, start  
    DECLARE @titles TABLE  (ID INT IDENTITY PRIMARY KEY, title NVARCHAR(50));  
    INSERT INTO @titles (title) VALUES   
    (' & '),('&'),('inc'),( 'incorp'),( 'incorporated'),( 'corporation'),( 'corp'),( 'Dr.'),  
    ('corporate'),( 'llp'),( 'llc'),( 'ltd'),('Inc.'),( 'limited'),( 'the'),( 'dr'),( 'md'),  
    ('rn'),( 'phd'),( 'jr'),( 'sr'),( 'ii'),( 'iii'),( 'iv'),( '3rd'),( 'aprn'),  
    ('pa-c'),( 'dds'),( 'dmd'),( 'dpm'),( 'dc'),( 'rnp'),( 'crnp'),( 'arnp'),( 'mpt'),  
    ('cfnp'),( 'canp'),( 'lnp'),( 'fnp'),( 'mba' ),('.'),(','),('-');  
      
    DECLARE @badChars NVARCHAR(100) = '0123456789&.,-';  
    -- DDL and sample data population, end  
      
    DECLARE @inputParameter NVARCHAR(100) = N'A-Z Needs, Inc.';  
    DECLARE @output NVARCHAR(100);  
      
    -- Step #1, cleanse words  
    SELECT @output = STRING_AGG(cleansedWords, SPACE(1))  
    FROM (SELECT value FROM STRING_SPLIT(@inputParameter, SPACE(1))  
     WHERE value NOT IN (  
     SELECT title FROM @titles)) AS t(cleansedWords);  
      
    -- test, cleansed words  
    SELECT @inputParameter, @output;  
      
    -- Step #2, cleanse chars  
    SELECT @output = TRIM((SELECT STRING_AGG(REPLACE(TRANSLATE(value, @badChars, SPACE(LEN(@badChars))),SPACE(1),''), SPACE(1))  
     FROM STRING_SPLIT(@output, SPACE(1))  
     ));  
      
    -- test, cleansed chars  
    SELECT @output AS Result;  
    

3 additional answers

Sort by: Most helpful
  1. EchoLiu-MSFT 14,571 Reputation points
    2021-05-10T07:33:36.757+00:00

    Hi @SQL9 ,

    As ErlandSommarskog said, tsql alone seems to be difficult to achieve your needs. Nested replacement may be possible, but that efficiency is not necessarily higher than your existing cursor.

    Regards
    Echo


  2. Tom Phillips 17,716 Reputation points
    2021-05-10T20:52:01.2+00:00

    With only the information you have given, it is going to be extremely difficult to do in any language.

    If in fact you have other attributes, like address, you are likely better off using something like a white pages or USPS database to "standardize" your information. Something like https://www.smartystreets.com/. (I have used SmartyStreets in the past, but have no affiliation to them).

    Without knowing more about how you would use this information, it is impossible to guess if that will work for your situation or not.

    0 comments No comments

  3. Erland Sommarskog 100.9K Reputation points MVP
    2021-05-09T12:30:55.993+00:00

    Let's get this clear from the start: there is no good T-SQL-only solution for this.

    You can write a user-defined function where you loop over the Titles table and run a replace for each entry. If you say:

    UPDATE tbl SET fullname = dbo.StripJunk(fullname)
    

    SQL Server will essentially implement this as a cursor under the hood. You may have heard of inlining of scalar functions in SQL 2019, but because of the loop in the function it will not happen here.

    You could isntead build a dynamic SQL statement with an insane amount of nested replace() calls. I would expect this to run faster.

    However, both these solution has the problem that replace() does not understand word boundaries, so if you have Faultdata Inc., it will become Fauata . with your Titles table. Trying to handle the word boundaries in T-SQL? Loop over each character in the string? How efficient will that be? You don't want to go there.

    It still possible to run this from inside SQL Server, though. There are two options:

    1. Using the RegEx classes in .NET through a CLR function.
    2. Calling an external language with sp_execute_external_script.

    Using the CLR might be the most efficient solution. It also has the advantage that it works on all SQL Server versions from SQL 2005 and up. The disadvantage is that it comes with quite a bit of red tape, and more so in recent versions.

    The second option became available in SQL 2016. In SQL 2016, the only available language is R. SQL 2017 added Python and SQL 2019 added Java. (And also exposed the mechanisms to add your own language.)

    R is a specialised language for statistics, and I don't know whether there is support for regular expressions. Below is a solution that uses Python, so you cannot use it on SQL 2016. In the solution, I am only considering words, and handling punctuation characters are left as an exercise to the reader.

    DROP TABLE IF EXISTS Titles, Data
    
    CREATE TABLE Titles (title nvarchar(50));
    INSERT INTO Titles (title)
    VALUES ('inc'),( 'incorp'),( 'incorporated'),( 'corporation'),( 'corp'),( 'corporate'),( 'llp'),( 'llc'),
    ( 'ltd'),( 'limited'),( 'the'),( 'dr'),( 'md'),( 'rn'),( 'phd'),( 'jr'),( 'sr'),( 'ii'),( 'iii'),( 'iv'),
    ( '3rd'),( 'aprn'),( 'pa-c'),( 'dds'),( 'dmd'),( 'dpm'),( 'dc'),( 'rnp'),( 'crnp'),( 'arnp'),( 'mpt'),( 'cfnp'),( 'canp'),( 'lnp'),( 'fnp'),( 'mba' )
    
    
    CREATE TABLE Data (id int IDENTITY PRIMARY KEY, fullname nvarchar(100));
    INSERT INTO Data (fullname) VALUES
    ('John Doe Corporate'),
    ('Mary Jones llp'),
    ('Doctor Whu ltd'),
    ('Leia & Organa MD'),
    ('A&O Office'),
    ('Dr.Ray Wed'),
    ('James Adams Jr.'),
    ('William Duke iii'),
    ('Mr.John King Sr.'),
    ('First Step, Inc.'),
    ('A-Z Needs, Inc.'),
    ('ABC 100 Sales, ltd')
    
    DECLARE @pattern nvarchar(MAX)
    SELECT @pattern = '\b(' + string_agg(title, '|') + '|\d+)\b(?i)'
    FROM   Titles
    
    DROP TABLE IF EXISTS #result
    CREATE TABLE #result (id     int NOT NULL PRIMARY KEY,
                          result varchar(100) NULL)
    
    INSERT #result(id, result)
     EXEC sp_execute_external_script @language = N'Python', 
         @input_data_1 = N'SELECT id, fullname FROM Data', 
         @output_data_1_name = N'Ret',
         @script = N'import re, pandas
    Ret = InputDataSet
    Ret["fullname"] = pandas.Series([re.sub(Pattern, "", i) for i in Ret["fullname"]])
    ', 
         @params = N'@Pattern nvarchar(MAX)',
         @Pattern = @pattern
    
    UPDATE Data
    SET    fullname = isnull(r.result, '')
    FROM   Data d
    JOIN   #result r ON d.id = r.id
    WHERE  d.fullname <> isnull(r.result, '')
    
    SELECT * FROM Data
    

    In the end, you may prefer to read the data to a client-side program to do the processing there and write back.