Let's get this clear from the start: there is no good T-SQL-only solution for this.
You can write a user-defined function where you loop over the Titles table and run a replace for each entry. If you say:
UPDATE tbl SET fullname = dbo.StripJunk(fullname)
SQL Server will essentially implement this as a cursor under the hood. You may have heard of inlining of scalar functions in SQL 2019, but because of the loop in the function it will not happen here.
You could isntead build a dynamic SQL statement with an insane amount of nested replace() calls. I would expect this to run faster.
However, both these solution has the problem that replace() does not understand word boundaries, so if you have Faultdata Inc.
, it will become Fauata .
with your Titles table. Trying to handle the word boundaries in T-SQL? Loop over each character in the string? How efficient will that be? You don't want to go there.
It still possible to run this from inside SQL Server, though. There are two options:
- Using the RegEx classes in .NET through a CLR function.
- Calling an external language with sp_execute_external_script.
Using the CLR might be the most efficient solution. It also has the advantage that it works on all SQL Server versions from SQL 2005 and up. The disadvantage is that it comes with quite a bit of red tape, and more so in recent versions.
The second option became available in SQL 2016. In SQL 2016, the only available language is R. SQL 2017 added Python and SQL 2019 added Java. (And also exposed the mechanisms to add your own language.)
R is a specialised language for statistics, and I don't know whether there is support for regular expressions. Below is a solution that uses Python, so you cannot use it on SQL 2016. In the solution, I am only considering words, and handling punctuation characters are left as an exercise to the reader.
DROP TABLE IF EXISTS Titles, Data
CREATE TABLE Titles (title nvarchar(50));
INSERT INTO Titles (title)
VALUES ('inc'),( 'incorp'),( 'incorporated'),( 'corporation'),( 'corp'),( 'corporate'),( 'llp'),( 'llc'),
( 'ltd'),( 'limited'),( 'the'),( 'dr'),( 'md'),( 'rn'),( 'phd'),( 'jr'),( 'sr'),( 'ii'),( 'iii'),( 'iv'),
( '3rd'),( 'aprn'),( 'pa-c'),( 'dds'),( 'dmd'),( 'dpm'),( 'dc'),( 'rnp'),( 'crnp'),( 'arnp'),( 'mpt'),( 'cfnp'),( 'canp'),( 'lnp'),( 'fnp'),( 'mba' )
CREATE TABLE Data (id int IDENTITY PRIMARY KEY, fullname nvarchar(100));
INSERT INTO Data (fullname) VALUES
('John Doe Corporate'),
('Mary Jones llp'),
('Doctor Whu ltd'),
('Leia & Organa MD'),
('A&O Office'),
('Dr.Ray Wed'),
('James Adams Jr.'),
('William Duke iii'),
('Mr.John King Sr.'),
('First Step, Inc.'),
('A-Z Needs, Inc.'),
('ABC 100 Sales, ltd')
DECLARE @pattern nvarchar(MAX)
SELECT @pattern = '\b(' + string_agg(title, '|') + '|\d+)\b(?i)'
FROM Titles
DROP TABLE IF EXISTS #result
CREATE TABLE #result (id int NOT NULL PRIMARY KEY,
result varchar(100) NULL)
INSERT #result(id, result)
EXEC sp_execute_external_script @language = N'Python',
@input_data_1 = N'SELECT id, fullname FROM Data',
@output_data_1_name = N'Ret',
@script = N'import re, pandas
Ret = InputDataSet
Ret["fullname"] = pandas.Series([re.sub(Pattern, "", i) for i in Ret["fullname"]])
',
@params = N'@Pattern nvarchar(MAX)',
@Pattern = @pattern
UPDATE Data
SET fullname = isnull(r.result, '')
FROM Data d
JOIN #result r ON d.id = r.id
WHERE d.fullname <> isnull(r.result, '')
SELECT * FROM Data
In the end, you may prefer to read the data to a client-side program to do the processing there and write back.