SQL Server 2005
Regular Expressions Make Pattern Matching And Data Extraction Easier
David Banister
This article uses the following technologies: SQL Server 2005, .NET Framework
Code download available at: Regex 2007_02.exe (154 KB)
Contents
CLR User-Defined Functions
Pattern Matching
Data Extraction
Pattern Storage
Matches
Data Extraction within Matches
Conclusion
Although T-SQL is extremely powerful for most data processing, it provides little support for text analysis or manipulation. Attempting to perform any sophisticated text analysis using the built-in string functions results in massively large functions and stored procedures that are difficult to debug and maintain. Is there a better way?
In fact, regular expressions provide a much more efficient and elegant solution. It is easy to see how they would be helpful for comparing text to identify records, but there is so much more they can do. I'll show you how to perform a variety of tasks, some simple and some amazing, that were considered impractical or impossible in SQL Server™ 2000, but are now possible in SQL Server 2005 due to its support for hosting the common language runtime (CLR).
Regular expressions are not new to SQL. Oracle introduced built-in regular expressions in 10g, and many open source database solutions use some kind of regular expressions library. Regular expressions could actually be used in earlier versions of SQL Server, but the process was inefficient.
Using the sp_OACreate stored procedure, any OLE automation object that implemented regular expressions could be used, but you had to create a COM object first, then make at least one IDispatch call, then destroy the object. For most purposes this was too inefficient and caused too many performance problems. The only alternative was to create an extended stored procedure. Now, however, SQLCLR lets you write CLR user-defined functions (UDFs), an efficient and less error-prone way to bring the regular expression support of the Microsoft® .NET Framework to T-SQL.
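For comparison, the legacy OLE automation approach looked roughly like the following sketch, which uses the VBScript.RegExp automation object (error checking omitted; OLE automation procedures must be enabled on the server):

```sql
-- Legacy sp_OA approach: a COM object is created, called, and
-- destroyed on every invocation, which is what made it so costly
DECLARE @obj int, @result int
EXEC sp_OACreate N'VBScript.RegExp', @obj OUT
EXEC sp_OASetProperty @obj, N'Pattern', N'^\d{3}-\d{2}-\d{4}$'
EXEC sp_OAMethod @obj, N'Test', @result OUT, N'123-45-6789'
EXEC sp_OADestroy @obj
SELECT @result
```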
CLR User-Defined Functions
CLR user-defined functions are simply static methods (shared functions in Visual Basic) defined within a .NET assembly. To use the SQLCLR objects, you must register the assembly with SQL Server using the new CREATE ASSEMBLY statement and then create each object pointing to its implementation within the assembly. For functions, the CREATE FUNCTION statement has been extended to support the creation of CLR user-defined functions. To make things easier, Visual Studio® 2005 takes care of all of the registration processes on your behalf when you use a SQL Server Project. This kind of project is different from most Visual Studio projects because when you attempt to debug (or start without debugging), the project is recompiled and the resulting assembly, as well as all of the SQLCLR objects defined within it, are then deployed to and registered with SQL Server. The IDE then runs the test script designated for the project. Breakpoints can be set in both the SQL script and in your .NET code, making debugging a simple process.
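If you want to see what Visual Studio does on your behalf, manual registration looks roughly like this sketch (the assembly name and path here are placeholders for your own):

```sql
-- Enable CLR integration once per server
EXEC sp_configure 'clr enabled', 1
RECONFIGURE

-- Register the assembly, then bind a T-SQL function to its method
CREATE ASSEMBLY [RegexLibrary]
FROM N'C:\Assemblies\RegexLibrary.dll'
WITH PERMISSION_SET = SAFE

CREATE FUNCTION dbo.RegexMatch
    ( @input nvarchar(max), @pattern nvarchar(max) )
RETURNS bit
AS EXTERNAL NAME [RegexLibrary].[UserDefinedFunctions].[RegexMatch]
```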
Adding a function is just like adding a new class to any other project type. Simply add a new item to the project and select User-Defined Function when prompted. The new method is added to the partial class that contains all of your functions. Your new method will also have a SqlFunction attribute applied to it. It is used by Visual Studio to create the SQL statements necessary to register the function. The IsDeterministic, IsPrecise, DataAccess, and SystemDataAccess fields on SqlFunction are also used by SQL Server for a variety of purposes.
Pattern Matching
Determining if a string matches a pattern is the simplest use of regular expressions and, as you see in Figure 1, it's easy to do.
Figure 1 String Matching
public static partial class UserDefinedFunctions
{
    public static readonly RegexOptions Options =
        RegexOptions.IgnorePatternWhitespace |
        RegexOptions.Singleline;

    [SqlFunction]
    public static SqlBoolean RegexMatch(
        SqlChars input, SqlString pattern )
    {
        Regex regex = new Regex( pattern.Value, Options );
        return regex.IsMatch( new string( input.Value ) );
    }
}
First I use the Options field to store the regular expression options for the functions. In this case, I've selected RegexOptions.Singleline and RegexOptions.IgnorePatternWhitespace. The former specifies single-line mode, and the latter eliminates unescaped whitespace from the regular expression and enables comments marked with a pound sign. Another option you might want to use, after careful thought and analysis, is RegexOptions.Compiled. Expressions that are used over and over again see a significant performance gain when compiled, as long as there aren't too many of them. For regular expressions that are rarely used, however, do not use Compiled, as it adds startup cost and memory overhead. As such, you might want to augment my general-purpose RegexMatch function with an additional parameter that specifies whether you want the expression to be compiled; that way, you can decide on a case-by-case basis whether the additional overhead will be worth the resulting performance gains.
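Such a variant might look like the following sketch. The name RegexMatchCompiled is my own invention for illustration, since SQL Server does not allow two UDFs with the same name but different parameters:

```csharp
// Sketch: same logic as RegexMatch, but the caller decides whether
// the expression is worth compiling
[SqlFunction]
public static SqlBoolean RegexMatchCompiled(
    SqlChars input, SqlString pattern, SqlBoolean compile )
{
    RegexOptions options = Options;
    if (compile.IsTrue)
        options |= RegexOptions.Compiled;
    Regex regex = new Regex( pattern.Value, options );
    return regex.IsMatch( new string( input.Value ) );
}
```

Note that to actually realize the gain, a production version would cache the compiled Regex instance (for example, in a static dictionary keyed by pattern) rather than recreate it on every call; the built-in regular expression cache only applies to the static Regex methods.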
After specifying the RegexOptions to be used, I define the RegexMatch function, using the SqlChars data type instead of SqlString. The SqlString data type translates into nvarchar(4,000) while SqlChars translates into nvarchar(max). The new max size functionality allows strings to extend beyond the 8,000 byte limit of SQL Server 2000. Throughout this article, I use nvarchar(max) to be as generic as possible and to allow for the most flexibility. However, performance can be significantly better using nvarchar(4,000) if all the relevant strings contain less than 4,000 characters. You should examine your specific needs and code to them appropriately.
The remaining code in the method is simple. A Regex instance is created with the defined options and the provided pattern, and its IsMatch method is then used to determine whether the specified input matches the pattern. Now you need to add a simple query to the test script:
select dbo.RegexMatch( N'123-45-6789', N'^\d{3}-\d{2}-\d{4}$' )
The pattern in this statement is a simple test for a US Social Security Number. Set a breakpoint on the new query and then start debugging to step through the function. This function lets you do many different tests, but I'll show you some things most people don't consider. For example, it's very important to maintain consistent naming conventions within a database, and writing a query to validate that all of your stored procedures meet your organization's guidelines is difficult. The RegexMatch function makes this task much simpler. The following query performs this task:
select ROUTINE_NAME
from INFORMATION_SCHEMA.ROUTINES
where ROUTINE_TYPE = N'PROCEDURE'
and dbo.RegexMatch( ROUTINE_NAME,
N'^usp_(Insert|Update|Delete|Select)([A-Z][a-z]+)+$' ) = 0
This query tests that every stored procedure is prefixed with "usp_" followed by "Insert", "Update", "Delete", or "Select", followed by at least one entity name. It also verifies that each word in the entity begins with a capital letter. Compare those four lines with this oversimplified version using only built-in functions:
select ROUTINE_NAME
from INFORMATION_SCHEMA.ROUTINES
where ROUTINE_TYPE = N'PROCEDURE'
and ( LEN( ROUTINE_NAME ) < 11
or LEFT( ROUTINE_NAME, 4 ) <> N'usp_'
or SUBSTRING( ROUTINE_NAME, 5, 6 ) not in
( N'Insert', N'Update', N'Delete', N'Select' ) )
Even though this is more code, this query is actually missing several features present in the regular expressions version. First, it is not case-sensitive, and using collations within the query to perform the tests would make it unwieldy. Second, it does not perform any tests on the actual entity name contained within the procedure name. Third, the four strings tested in the query are all six characters long, which let me simplify the code by pulling out a single six-character substring to compare against each of the acceptable operations. That shortcut only works because all of the operation names happen to be the same length; imagine a standard that also allows verbs like "Get", "List", or "Find". Those verbs are easily handled by the RegexMatch function, because they are just additional alternatives in the list.
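For instance, accommodating those extra verbs is just a change to the alternation in the pattern:

```sql
select ROUTINE_NAME
from INFORMATION_SCHEMA.ROUTINES
where ROUTINE_TYPE = N'PROCEDURE'
and dbo.RegexMatch( ROUTINE_NAME,
    N'^usp_(Insert|Update|Delete|Select|Get|List|Find)([A-Z][a-z]+)+$' ) = 0
```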
Validation is a very common use of regular expressions, for anything from a phone number to a ZIP code to a custom account number format. The CHECK constraint is perfect for this, as the following table definition shows.
CREATE TABLE [Account]
(
    [AccountNumber] nvarchar(20) CHECK (dbo.RegexMatch(
        [AccountNumber], '^[A-Z]{3,5}\d{5}-\d{3}$' ) = 1),
    [PhoneNumber] nchar(13) CHECK (dbo.RegexMatch(
        [PhoneNumber], '^\(\d{3}\)\d{3}-\d{4}$' ) = 1),
    [ZipCode] nvarchar(10) CHECK (dbo.RegexMatch(
        [ZipCode], '^\d{5}(\-\d{4})?$' ) = 1)
)
The AccountNumber column is validated against an arbitrary convention that begins with three to five letters followed by five numbers, then a dash, and finally three more numbers. Both the phone number and ZIP codes are validated against standard US phone number and ZIP code formats. The RegexMatch function provides many features to SQL Server, but the regular expressions implementation in .NET provides much more, as you'll see next.
Data Extraction
The grouping features of regular expressions can be used to extract data from a string. My RegexGroup function provides that functionality to T-SQL:
[SqlFunction]
public static SqlChars RegexGroup(
    SqlChars input, SqlString pattern, SqlString name )
{
    Regex regex = new Regex( pattern.Value, Options );
    Match match = regex.Match( new string( input.Value ) );
    return match.Success ?
        new SqlChars( match.Groups[name.Value].Value ) : SqlChars.Null;
}
This function creates a Regex object just as the RegexMatch function did. Instead of testing for a match, however, a Match object is created for the first match found in the input string. The Match object is used to retrieve the group that is specified. If no match is found within the input, a null value is returned. If you prefer to use numbered groups instead of named groups, this function will still work for you. Just pass an integer value to the function within your SQL code and it will be implicitly cast to an nvarchar and the appropriate group will be returned.
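For example, this call extracts the area code from a phone number by group number instead of by name; the integer 1 is implicitly cast to nvarchar and resolved to the first capture group:

```sql
-- Returns N'404'
select dbo.RegexGroup( N'(404)555-1212', N'\((\d{3})\)', 1 )
```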
You can use the RegexGroup function within a SELECT list to extract specific pieces of information from some other piece of data. For example, if you had a column that stored a URL, you can now easily parse the URL to determine individual pieces. This query uses grouping to determine every distinct server stored in the Url column of the UrlTable table.
select distinct dbo.RegexGroup( [Url],
N'https?://(?<server>([\w-]+\.)*[\w-]+)', N'server' )
from [UrlTable]
You can also use this function within computed columns. The following table definition divides e-mail addresses into the mailbox and the domain.
CREATE TABLE [Email]
(
    [Address] nvarchar(max),
    [Mailbox] as dbo.RegexGroup( [Address],
        N'(?<mailbox>[^@]*)@', N'mailbox' ),
    [Domain] as dbo.RegexGroup( [Address], N'@(?<domain>.*)', N'domain' )
)
The Mailbox column will return the mailbox, or username, of the e-mail address, and the Domain column will return its domain.
Pattern Storage
All of the patterns used by these functions are just strings, which means that any of them can be stored in a table within your database. Most databases that store international data have a table representing countries. By adding a few extra columns to that table, you could store country-specific validation patterns. That would allow the constraint applied to an address row to vary based on the country for that row.
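A sketch of the idea, with hypothetical table and column names:

```sql
-- Hypothetical schema: each country row carries its own postal
-- code validation pattern
CREATE TABLE [Country]
(
    [CountryCode] nchar(2) PRIMARY KEY,
    [PostalCodePattern] nvarchar(200)
)

-- Find addresses whose postal code violates their country's pattern.
-- A CHECK constraint cannot query another table directly, so to
-- enforce this as a constraint you would wrap the lookup in a UDF.
select a.*
from [Address] a
inner join [Country] c on c.[CountryCode] = a.[CountryCode]
where dbo.RegexMatch( a.[PostalCode], c.[PostalCodePattern] ) = 0
```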
In databases that store data on behalf of clients, there is typically already a table representing a client. That table can be used to store grouping patterns that let you describe the way raw client data is stored within the database, and this allows you to create computed columns to pull the data you actually need from the client data. For example, if each of your clients has unique schemes for account numbers and you only need specific pieces of that account number, you could easily create an expression that pulls the correct piece of information for each client.
Matches
Rather than determining if a string matches a pattern, it is sometimes desirable to extract every match. Previously this kind of extraction would require cursors iterating over sections of a string. Not only is that process slow, but the code is also difficult to understand and maintain. Regular expressions are a much better means of performing this operation. The problem at hand is how to return all of the required data within a SQL construct. Table-valued functions are the answer.
Table-valued functions are somewhat similar to the previous functions, but differ in two distinct ways. First, the attributes applied to the method must fully declare the structure of the table that is returned. Second, two methods are involved: the first returns an enumerable object instead of the actual result of the function, and the second is passed each enumerated object in turn to populate the fields of a row. Each value retrieved via the enumerator corresponds to one row of the resultset. The ICollection interface in the .NET Framework inherits from IEnumerable, which means any collection can be returned by the first method. The Regex class contains a Matches method that returns a MatchCollection you could use. The problem with MatchCollection is that the entire string must be processed before Matches returns. SQL Server includes optimizations that depend on processing occurring as needed, so instead of returning the entire collection up front, I prefer to write my own enumerator that yields each match as requested. Whether that is worthwhile really depends on how the function is used, so test heavily before optimizing the enumerator.
The code in Figure 2 shows the enumerator. The MatchNode class wraps an individual match in the string while tracking its position within the set of matches returned. The MatchIterator class is enumerable and handles the regular expression processing. It uses the new yield keyword, which makes creating the enumerator much easier than in previous versions of the Framework. It returns each match detected within the input string as requested.
Figure 2 Custom Enumerable Object for Matches
internal class MatchNode
{
    private int _index;
    public int Index { get { return _index; } }

    private string _value;
    public string Value { get { return _value; } }

    public MatchNode( int index, string value )
    {
        _index = index;
        _value = value;
    }
}

internal class MatchIterator : IEnumerable
{
    private Regex _regex;
    private string _input;

    public MatchIterator( string input, string pattern )
    {
        _regex = new Regex( pattern, UserDefinedFunctions.Options );
        _input = input;
    }

    public IEnumerator GetEnumerator()
    {
        int index = 0;
        Match current = null;
        do
        {
            current = (current == null) ?
                _regex.Match( _input ) : current.NextMatch( );
            if (current.Success)
            {
                yield return new MatchNode( ++index, current.Value );
            }
        }
        while (current.Success);
    }
}
The code in Figure 3 defines the table-valued CLR UDF. The RegexMatches method returns a new MatchIterator. The SqlFunctionAttribute on the RegexMatches method also includes some additional properties. The TableDefinition property is set to the table definition of the function. The FillRowMethodName is set to the name of the method to call for each iteration of the enumerable object returned. In this case, that method is FillMatchRow.
Figure 3 Table-Valued CLR UDF for Matches
[SqlFunction( FillRowMethodName = "FillMatchRow",
    TableDefinition = "[Index] int,[Text] nvarchar(max)" )]
public static IEnumerable RegexMatches( SqlChars input, SqlString pattern )
{
    return new MatchIterator( new string( input.Value ), pattern.Value );
}

[SuppressMessage( "Microsoft.Design", "CA1021:AvoidOutParameters" )]
public static void FillMatchRow( object data,
    out SqlInt32 index, out SqlChars text )
{
    MatchNode node = (MatchNode)data;
    index = new SqlInt32( node.Index );
    text = new SqlChars( node.Value.ToCharArray( ) );
}
For each iteration of the MatchIterator, a MatchNode is passed to the FillMatchRow method as its first argument. The remaining parameters of the FillMatchRow method must be declared as out parameters and must match the table definition defined in the first function. The FillMatchRow function simply uses the properties of the MatchNode to populate the field data.
With this function you can finally extract multiple pieces of data from a string with ease. To illustrate the use of the RegexMatches function, let's process a string to determine how many distinct words are contained within it using this query:
declare @text nvarchar(max), @pattern nvarchar(max)
select
@text = N'Here are four words.',
@pattern = '\w+'
select count(distinct [Text])
from dbo.RegexMatches( @text, @pattern )
This example is rather straightforward; removing the distinct keyword returns the total word count of the string instead. Many Web sites limit text entry to what seems like an arbitrary length. With this kind of test combined with the new nvarchar(max) types, it becomes possible to limit input to a word count instead. This kind of query can be used for various analytic processing needs, and the RegexMatches function can also be used for more common tasks. Unfortunately, this query also represents an overzealous use of regular expressions: the splitting performed by the "\w+" expression could just as easily be accomplished with the String.Split method, which would be much faster. Regular expressions are a very powerful tool, but make sure when you use them that you're using them for a good reason; a simpler tool might yield better performance for a specific case.
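As a sketch of the word-count limit idea, validation inside a stored procedure might look like this (a CHECK constraint cannot contain a subquery, so the test lives in procedural code; @text and the 100-word limit are hypothetical):

```sql
-- Reject input of more than 100 words instead of N characters
if ( select count(*) from dbo.RegexMatches( @text, N'\w+' ) ) > 100
    raiserror( N'Input exceeds the 100-word limit.', 16, 1 )
```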
I often see questions in the MSDN® forums about how to pass a list of values to a stored procedure. I have also seen various convoluted methods of parsing such a list into an actual list to determine the correlated records. The RegexMatches function provides a much cleaner approach.
declare @pattern nvarchar(max), @list nvarchar(max)
select @pattern = N'[^,]+', @list = N'2,4,6'
select d.* from [Data] d
inner join dbo.RegexMatches( @list, @pattern ) re
on d.[ID] = re.[Text]
The pattern matches any group of characters not containing a comma. Given a table named Data with an integer column named ID, this query will return each record identified in the list. This becomes more useful when considering the implicit casting features within SQL Server. The same query can be used for integer, date/time, GUID, or floating-point data types. Other methods of processing a list of values would require multiple functions or stored procedures to be this flexible. This function can also be used for lists that are not comma-delimited. A list separated by spaces, semicolons, tabs, carriage returns, or any other identifiable character can be processed.
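For example, a semicolon-delimited list needs only a different pattern:

```sql
declare @pattern nvarchar(max), @list nvarchar(max)
select @pattern = N'[^;]+', @list = N'2;4;6'
select [Index], [Text] from dbo.RegexMatches( @list, @pattern )
```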
Data Extraction within Matches
Similar to returning matches, we can also extract data from each match. Attempting to do this using SQL is very difficult. Normally, this kind of task would be implemented within an application instead of the database, which causes problems because each application using the database would have to implement the required processing. In such scenarios, a reasonable approach might be to implement this functionality within stored procedures.
As with the RegexMatches implementation, I prefer to use a custom enumerable object to return group information. Grouping is only slightly more complicated because we also have to iterate over the groups within each match. In Figure 4, the GroupNode class is just like the MatchNode class except that it also includes the name of the group it represents. The GroupIterator class is similar to the MatchIterator class but includes an additional loop to return each group. Now that I have an enumerable object, I define a table-valued function just as I did with the RegexMatches function.
Figure 4 Custom Enumerable Object for Groups
internal class GroupNode
{
    private int _index;
    public int Index { get { return _index; } }

    private string _name;
    public string Name { get { return _name; } }

    private string _value;
    public string Value { get { return _value; } }

    public GroupNode( int index, string group, string value )
    {
        _index = index;
        _name = group;
        _value = value;
    }
}

internal class GroupIterator : IEnumerable
{
    private Regex _regex;
    private string _input;

    public GroupIterator( string input, string pattern )
    {
        _regex = new Regex( pattern, UserDefinedFunctions.Options );
        _input = input;
    }

    public IEnumerator GetEnumerator()
    {
        int index = 0;
        Match current = null;
        string[] names = _regex.GetGroupNames();
        do
        {
            index++;
            current = (current == null) ?
                _regex.Match( _input ) : current.NextMatch( );
            if (current.Success)
            {
                foreach (string name in names)
                {
                    Group group = current.Groups[name];
                    if (group.Success)
                    {
                        yield return new GroupNode(
                            index, name, group.Value );
                    }
                }
            }
        }
        while (current.Success);
    }
}
In Figure 5, the RegexGroups function is defined like the RegexMatches function except that it returns an additional column of data containing the name of the group within the match. With this function we can now find multiple matches within a string and extract specific pieces of the information from within each match.
Figure 5 Table-Valued CLR UDF for Groups
[SqlFunction( FillRowMethodName = "FillGroupRow", TableDefinition =
    "[Index] int,[Group] nvarchar(max),[Text] nvarchar(max)" )]
public static IEnumerable RegexGroups( SqlChars input, SqlString pattern )
{
    return new GroupIterator( new string( input.Value ), pattern.Value );
}

[SuppressMessage( "Microsoft.Design", "CA1021:AvoidOutParameters" )]
public static void FillGroupRow( object data,
    out SqlInt32 index, out SqlChars group, out SqlChars text )
{
    GroupNode node = (GroupNode)data;
    index = new SqlInt32( node.Index );
    group = new SqlChars( node.Name.ToCharArray( ) );
    text = new SqlChars( node.Value.ToCharArray( ) );
}
Importing data in various formats is a common task when dealing with databases. Importing files in a comma-delimited format is much more of a chore than it should be. Most developers create an application that processes each line, extracts the data, and then executes a stored procedure for each line. Although that process works, I'd like to propose another solution. What if you could pass the entire file to a stored procedure and let the stored procedure handle the entire process? This idea is usually considered too complicated to implement, but with the RegexGroups function you can actually perform this insert with a single query. For example, consider the following customer data.
2309478,Janet Leverling,J
2039748,Nancy Davolio,N
0798124,Andrew Fuller,M
4027392,Robert King,L
There are three pieces of information you need from each line: the seven-digit customer number, the customer name, and the single-character customer type. With the following expression you can extract all three.
(?<CustomerNumber>\d{7}),(?<CustomerName>[^,]*),(?<CustomerType>[A-Z])\r?\n
The problem you now have is that the results returned by the RegexGroups function are not directly usable. Instead of using a cursor to iterate over the results, you can use the pivot functionality in SQL Server 2005. Putting all of this together into a stored procedure, you have everything you need. The stored procedure in Figure 6 accepts the text of an entire comma-delimited file containing up to 2GB of Unicode data. It processes the entire file and inserts each line of the file as a row into the Customer table. Any delimited text file could be processed the same way. With some small changes to the pattern, escape sequences could be added to support commas within strings.
Figure 6 Processing a Comma-Delimited File
create proc ImportCustomers
(
@file nvarchar(max)
)
as
declare @pattern nvarchar(max)
set @pattern = N'(?<CustomerNumber>\d{7}),
(?<CustomerName>[^,]*),(?<CustomerType>[A-Z])\r?\n'
insert [Customer]
(
[CustomerNumber],
[CustomerName],
[CustomerType]
)
select
f.[CustomerNumber],
f.[CustomerName],
f.[CustomerType]
from dbo.RegexGroups( @file, @pattern ) regex
pivot
(
max([Text])
for [Group]
in ( [CustomerNumber], [CustomerName], [CustomerType] )
) as f
Again, however, this procedure demonstrates that there are multiple ways to do the same task, and regular expressions aren't always the best option. In this example, the pivot is, in effect, undoing all the work RegexGroups did to return the data in the special grouped format. You could insert the data directly into the table using a much simpler and faster TVF that just reads each line, splits it on commas with String.Split, and returns each row.
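That simpler alternative might be sketched as follows; SplitCustomers and FillCustomerRow are my own illustrative names, and field validation is omitted for brevity:

```csharp
// Sketch: import comma-delimited customer lines without regular
// expressions; each line is split once and streamed out as a row
[SqlFunction( FillRowMethodName = "FillCustomerRow",
    TableDefinition = "[CustomerNumber] nvarchar(7)," +
        "[CustomerName] nvarchar(max),[CustomerType] nchar(1)" )]
public static IEnumerable SplitCustomers( SqlChars input )
{
    string[] lines = new string( input.Value ).Split(
        new string[] { "\r\n", "\n" },
        StringSplitOptions.RemoveEmptyEntries );
    foreach (string line in lines)
        yield return line.Split( ',' );
}

public static void FillCustomerRow( object data, out SqlChars number,
    out SqlChars name, out SqlChars type )
{
    string[] fields = (string[])data;
    number = new SqlChars( fields[0].ToCharArray( ) );
    name = new SqlChars( fields[1].ToCharArray( ) );
    type = new SqlChars( fields[2].ToCharArray( ) );
}
```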
Conclusion
Although these matching functions are very powerful, they are not complete. There are many possible options that determine the exact way matches are performed. If your database collation is case-insensitive, you may want the functions to perform their matches in a case-insensitive manner as well. The explicit capture option may be required to reduce some resultsets. The multi-line option can allow you to create more precise patterns for some tasks. You may even want to create a user-defined type to pass the exact required options to each of the functions, which would allow each execution of a function to use a different set of options.
You should also be aware that there are localization issues when processing text. For example, the .NET Framework Regex class is aware of many more characters than the Latin ones used in my examples, so care should be taken when developing patterns for databases that use international data.
And of course, as mentioned several times in the article, while regular expressions are immensely powerful, make sure you actually need that power. Some tasks can be performed faster and more simply with more basic tool sets.
For simplicity, the examples I provided lack validation and error handling, which should be included in any production system. Each of the inputs to the function should be validated and your requirements should determine the responses to null or empty string inputs. The Regex class can throw exceptions when the pattern can't be parsed or the options are invalid. Those exceptions should be handled gracefully.
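As a sketch of that kind of guard code, here is a defensive variant of RegexMatch; returning null for bad input is one possible policy, not the only one, and RegexMatchSafe is an illustrative name:

```csharp
[SqlFunction]
public static SqlBoolean RegexMatchSafe(
    SqlChars input, SqlString pattern )
{
    // Policy decision: null input or null pattern yields a null result
    if (input == null || input.IsNull || pattern.IsNull)
        return SqlBoolean.Null;
    try
    {
        Regex regex = new Regex( pattern.Value, Options );
        return regex.IsMatch( new string( input.Value ) );
    }
    catch (ArgumentException)
    {
        // The pattern could not be parsed or an option was invalid
        return SqlBoolean.Null;
    }
}
```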
Combining regular expressions with SQL provides many alternative means of processing data. Using these functions can reduce the amount of time required to add functionality to your database as well as make the system more maintainable. Any database could use regular expressions, and I recommend that you experiment with these functions to find new and even more creative uses.
David Banister is a Senior Software Developer for a top-4 accounting firm in Atlanta. He has been writing software for many years. In his spare time he enjoys reading ECMA language specifications, playing tennis, and helping local bands.