Need T-sql User Defined Function to remove specified words or chars from a table field.

Question

Need T-sql User Defined Function to remove specified words or chars from a table field.

SQL9 246

Hi All,

I am trying to replace below words or chars from a given name field(company name or person's name) and I want a generic SQL user defined function which should remove all these given words or chars that are passed to the UDF function.

Basically I am storing all these chars or words in a Title table and in future if I add any new word or char to the Titles table then the function should remove the given word or char from the FullName field.

My Source field FullName values looks like below..

FullName:
John Doe Corporate
Mary Jones llp
Doctor Whu ltd
Leia & Organa MD
A&O Office
Dr. Ray Wed
James Adams Jr.
William Duke iii
Mr.John King Sr.
First Step, Inc.
A-Z Needs, Inc.
ABC 100 Sales, ltd

I need a UDF function to show output in below format. I am using SQL server 2016 & 2019 version

Note: 1) I have millions of records to process, so need something which is faster to process(not cursor functionality) the records.
2) It should remove numeric values (reg exp) if any.

FullName NewFullName
John Doe Corporate John Doe
Mary Jones llp Mary Jones
Doctor Whu ltd Doctor Whu
Leia & Organa MD Leia Organa
A&O Office AO Office
Dr.Ray Wed Ray Wed
James Adams Jr. James Adams
William Duke iii William Duke
Mr.John King Sr. John King
First Step, Inc. First Step
A-Z Needs, Inc. AZ Needs
ABC 100 Sales, ltd ABC Sales

CREATE TABLE #titles (title nvarchar(50));
INSERT INTO #titles (title)
VALUES (' & '),('&'),('inc'),( 'incorp'),( 'incorporated'),( 'corporation'),( 'corp'),( 'corporate'),( 'llp'),( 'llc'),( 'ltd'),( 'limited'),( 'the'),( 'dr'),( 'md'),( 'rn'),( 'phd'),( 'jr'),( 'sr'),( 'ii'),( 'iii'),( 'iv'),( '3rd'),( 'aprn'),( 'pa-c'),( 'dds'),( 'dmd'),( 'dpm'),( 'dc'),( 'rnp'),( 'crnp'),( 'arnp'),( 'mpt'),( 'cfnp'),( 'canp'),( 'lnp'),( 'fnp'),( 'mba' ),('.'),(','),('-')

CREATE TABLE #Data (fullname nvarchar(100));
INSERT INTO #Data (fullname) VALUES
('John Doe Corporate'),
('Mary Jones llp'),
('Doctor Whu ltd'),
('Leia & Organa MD'),
('A&O Office'),
('Dr.Ray Wed'),
('James Adams Jr.'),
('William Duke iii'),
('Mr.John King Sr.'),
('First Step, Inc.'),
('A-Z Needs, Inc.'),
('ABC 100 Sales, ltd')

Thanks in advance,
RH

Yitzhak Khabinsky 26,586 Reputation points

2021-05-10T14:56:13.477+00:00

@SQL9 ,

Did you have a chance to try the proposed solution?
EchoLiu-MSFT 14,621 Reputation points

2021-05-13T09:25:05.943+00:00

Do you have any update?
EchoLiu-MSFT 14,621 Reputation points

2021-05-14T09:39:21.67+00:00

Please also remember to accept the answers if they helped.
Your action would be helpful to other users who encounter the same issue and read this thread.

Accepted answer

3 additional answers

Your answer

Yitzhak Khabinsky 26,586 Reputation points

2021-05-10T14:56:13.477+00:00

@SQL9 ,

Did you have a chance to try the proposed solution?
EchoLiu-MSFT 14,621 Reputation points

2021-05-13T09:25:05.943+00:00

Do you have any update?
EchoLiu-MSFT 14,621 Reputation points

2021-05-14T09:39:21.67+00:00

Please also remember to accept the answers if they helped.
Your action would be helpful to other users who encounter the same issue and read this thread.

Answer 1

Hi @SQL9 ,

Please try the following solution.
It is pure T-SQL and will work starting from SQL Server 2017 onwards.
It is up to you to package it as a user defined function.

I separated the entire process into two steps:

words cleansing.
chars cleansing.

SQL

-- DDL and sample data population, start  
DECLARE @titles TABLE  (ID INT IDENTITY PRIMARY KEY, title NVARCHAR(50));  
INSERT INTO @titles (title) VALUES   
(' & '),('&'),('inc'),( 'incorp'),( 'incorporated'),( 'corporation'),( 'corp'),( 'Dr.'),  
('corporate'),( 'llp'),( 'llc'),( 'ltd'),('Inc.'),( 'limited'),( 'the'),( 'dr'),( 'md'),  
('rn'),( 'phd'),( 'jr'),( 'sr'),( 'ii'),( 'iii'),( 'iv'),( '3rd'),( 'aprn'),  
('pa-c'),( 'dds'),( 'dmd'),( 'dpm'),( 'dc'),( 'rnp'),( 'crnp'),( 'arnp'),( 'mpt'),  
('cfnp'),( 'canp'),( 'lnp'),( 'fnp'),( 'mba' ),('.'),(','),('-');  
  
DECLARE @badChars NVARCHAR(100) = '0123456789&.,-';  
-- DDL and sample data population, end  
  
DECLARE @inputParameter NVARCHAR(100) = N'A-Z Needs, Inc.';  
DECLARE @output NVARCHAR(100);  
  
-- Step #1, cleanse words  
SELECT @output = STRING_AGG(cleansedWords, SPACE(1))  
FROM (SELECT value FROM STRING_SPLIT(@inputParameter, SPACE(1))  
 WHERE value NOT IN (  
 SELECT title FROM @titles)) AS t(cleansedWords);  
  
-- test, cleansed words  
SELECT @inputParameter, @output;  
  
-- Step #2, cleanse chars  
SELECT @output = TRIM((SELECT STRING_AGG(REPLACE(TRANSLATE(value, @badChars, SPACE(LEN(@badChars))),SPACE(1),''), SPACE(1))  
 FROM STRING_SPLIT(@output, SPACE(1))  
 ));  
  
-- test, cleansed chars  
SELECT @output AS Result;

Erland Sommarskog 121.4K Reputation points MVP Volunteer Moderator

2021-05-09T18:48:50.797+00:00

On the sample points that SQL9 has is Dr. Ray Wed, which your code does not cope with. Possibly you could fix that by first swapping the two steps. But my experience tells me that this sort of problem requires regular expression.

There is also an issue with names like Limited Resources Inc.. limited is on SQL9's list, but in this fictional example you want to retain that word.
Yitzhak Khabinsky 26,586 Reputation points

2021-05-09T19:12:50.753+00:00

@Erland Sommarskog ,

Thanks. Good catch.
I saw it too. My expectations were that the OP will add a full list of the forbidden words to the titles table.
SQL9 246 Reputation points

2021-05-14T16:47:11.82+00:00

Hi YitzhakKhabinsky-0887 - Right now this solution worked for my current scenario as expected. I appreciate your help working on this.
I spoke to my client and they are satisfied with it. They are not expecting 100% accurate results when we use this function because like ErlandSommarskog mentioned there could be lot of other scenarios where we could miss some words or chars but that is acceptable in my current scenario.

Thanks,
RH

Answer 2

EchoLiu-MSFT 14,621

Hi @SQL9 ,

As ErlandSommarskog said, tsql alone seems to be difficult to achieve your needs. Nested replacement may be possible, but that efficiency is not necessarily higher than your existing cursor.

Regards
Echo

Yitzhak Khabinsky 26,586 Reputation points

2021-05-10T14:58:11.267+00:00

@EchoLiu-MSFT ,

I hope you didn't miss the proposed solution.
It will work starting from SQL Server 2017 onwards.

Answer 3

With only the information you have given, it is going to be extremely difficult to do in any language.

If in fact you have other attributes, like address, you are likely better off using something like a white pages or USPS database to "standardize" your information. Something like https://www.smartystreets.com/. (I have used SmartyStreets in the past, but have no affiliation to them).

Without knowing more about how you would use this information, it is impossible to guess if that will work for your situation or not.

Answer 4

Let's get this clear from the start: there is no good T-SQL-only solution for this.

You can write a user-defined function where you loop over the Titles table and run a replace for each entry. If you say:

UPDATE tbl SET fullname = dbo.StripJunk(fullname)

SQL Server will essentially implement this as a cursor under the hood. You may have heard of inlining of scalar functions in SQL 2019, but because of the loop in the function it will not happen here.

You could isntead build a dynamic SQL statement with an insane amount of nested replace() calls. I would expect this to run faster.

However, both these solution has the problem that replace() does not understand word boundaries, so if you have Faultdata Inc., it will become Fauata . with your Titles table. Trying to handle the word boundaries in T-SQL? Loop over each character in the string? How efficient will that be? You don't want to go there.

It still possible to run this from inside SQL Server, though. There are two options:

Using the RegEx classes in .NET through a CLR function.
Calling an external language with sp_execute_external_script.

Using the CLR might be the most efficient solution. It also has the advantage that it works on all SQL Server versions from SQL 2005 and up. The disadvantage is that it comes with quite a bit of red tape, and more so in recent versions.

The second option became available in SQL 2016. In SQL 2016, the only available language is R. SQL 2017 added Python and SQL 2019 added Java. (And also exposed the mechanisms to add your own language.)

R is a specialised language for statistics, and I don't know whether there is support for regular expressions. Below is a solution that uses Python, so you cannot use it on SQL 2016. In the solution, I am only considering words, and handling punctuation characters are left as an exercise to the reader.

DROP TABLE IF EXISTS Titles, Data

CREATE TABLE Titles (title nvarchar(50));
INSERT INTO Titles (title)
VALUES ('inc'),( 'incorp'),( 'incorporated'),( 'corporation'),( 'corp'),( 'corporate'),( 'llp'),( 'llc'),
( 'ltd'),( 'limited'),( 'the'),( 'dr'),( 'md'),( 'rn'),( 'phd'),( 'jr'),( 'sr'),( 'ii'),( 'iii'),( 'iv'),
( '3rd'),( 'aprn'),( 'pa-c'),( 'dds'),( 'dmd'),( 'dpm'),( 'dc'),( 'rnp'),( 'crnp'),( 'arnp'),( 'mpt'),( 'cfnp'),( 'canp'),( 'lnp'),( 'fnp'),( 'mba' )


CREATE TABLE Data (id int IDENTITY PRIMARY KEY, fullname nvarchar(100));
INSERT INTO Data (fullname) VALUES
('John Doe Corporate'),
('Mary Jones llp'),
('Doctor Whu ltd'),
('Leia & Organa MD'),
('A&O Office'),
('Dr.Ray Wed'),
('James Adams Jr.'),
('William Duke iii'),
('Mr.John King Sr.'),
('First Step, Inc.'),
('A-Z Needs, Inc.'),
('ABC 100 Sales, ltd')

DECLARE @pattern nvarchar(MAX)
SELECT @pattern = '\b(' + string_agg(title, '|') + '|\d+)\b(?i)'
FROM   Titles

DROP TABLE IF EXISTS #result
CREATE TABLE #result (id     int NOT NULL PRIMARY KEY,
                      result varchar(100) NULL)

INSERT #result(id, result)
 EXEC sp_execute_external_script @language = N'Python', 
     @input_data_1 = N'SELECT id, fullname FROM Data', 
     @output_data_1_name = N'Ret',
     @script = N'import re, pandas
Ret = InputDataSet
Ret["fullname"] = pandas.Series([re.sub(Pattern, "", i) for i in Ret["fullname"]])
', 
     @params = N'@Pattern nvarchar(MAX)',
     @Pattern = @pattern

UPDATE Data
SET    fullname = isnull(r.result, '')
FROM   Data d
JOIN   #result r ON d.id = r.id
WHERE  d.fullname <> isnull(r.result, '')

SELECT * FROM Data

In the end, you may prefer to read the data to a client-side program to do the processing there and write back.

Erland Sommarskog 121.4K Reputation points MVP Volunteer Moderator

2021-05-09T12:39:33.977+00:00

I should add that for the solution above to work, three things need to be fulfilled:

The Python support must be available in the instance. It's an optional component when you install SQL Server.

The server configuration parameter external scripts enabled must be 1.

The user must have the permission EXECUTE EXTERNAL SCRIPT.
SQL9 246 Reputation points

2021-05-09T14:41:31.163+00:00

Thank you ErlandSommarskog for taking time and reply back with solutions. Unfortunately client would not go with python or R script because nobody knows about how to code them and support down the line. They need a regular t-sql as a solution. We already have cursor which is taking long time just to return clean full name. I appreciate your help.
Erland Sommarskog 121.4K Reputation points MVP Volunteer Moderator

2021-05-09T16:00:30.51+00:00

That's understandable, and I agree with not adding technology which is new and unknown only for the sake of it. Or for that matter even if the solution makes perfect sense in this case. If no one understands the setup and configuration, it will be too brittle.

So that leaves you with a single option: Write a client-side program that fits in with the current infrastructure and mindset. But don't do it T-SQL. T-SQL is a good hammer for processing data in general. But this problem requires a screwdriver.

Share via

Need T-sql User Defined Function to remove specified words or chars from a table field.

3 additional answers

Your answer