Pattern matching and text extraction

Completed

SQL Server 2025 adds new regular expression (REGEXP) functions and enhances the SUBSTRING function to simplify working with structured text. These additions allow developers to perform flexible text searches, extract complex patterns, and clean or transform data directly in T-SQL without relying on client-side scripting or CLR functions.

Overview of Pattern Matching and Text Extraction Functions

The new REGEXP family of functions supports regular expression evaluation directly inside SQL Server. Each function enables specific pattern-based text operations.

REGEXP_LIKE

Checks if a string matches a regular expression pattern. Returns 1 if it matches, 0 otherwise.

REGEXP_SUBSTR

Extracts the first substring that matches a regular expression pattern. Useful for pulling out specific information such as phone numbers, dates, or email addresses.

REGEXP_REPLACE

Searches a string for a pattern and replaces all occurrences that match the pattern. It can be used for data cleanup and formatting.

REGEXP_INSTR

Returns the starting position of the first substring that matches the pattern within the given text. Ideal for locating key markers in structured text.

REGEXP_COUNT

Counts the number of matches of a regular expression pattern in a given string. This count is useful when you need to measure pattern frequency, such as counting digits, words, or symbols in text.

REGEXP_MATCHES

Returns all substrings that match a regular expression pattern as a table result. Use it when you need multiple captures from a single string rather than only the first match.

REGEXP_SPLIT_TO_TABLE

Splits a string into multiple rows using a regular expression delimiter.

SUBSTRING Enhancement

The SUBSTRING function now supports an optional length parameter. When omitted, it automatically extracts from the specified start position to the end of the string, reducing the need for manual LEN() calculations.

Together, these capabilities let you search, extract, and manipulate text patterns directly in SQL Server 2025 with concise, readable T-SQL.


Example Scenario: Extracting and Cleaning Contact Data

A marketing team maintains a database of customer messages in a column named MessageText. Many entries include phone numbers in different formats. You need to extract the first phone number from each message, normalize it to a standard format, and identify messages that contain invalid phone numbers.

Sample Data

Let's assume you have the following sample data in a table called CustomerMessages:

MessageID MessageText
1 "Call me at (713) 555-1298 or office 555-8811."
2 "Reach out to me: +1-832-555-7821 thank you!"
3 "My number is 713-555-9876 ext. 33"
4 "No phone listed yet."

Query: Identify, Extract, and Standardize

-- Extract the first phone number pattern and format it consistently
SELECT MessageID,
       REGEXP_SUBSTR(MessageText, '\d{3}[)\-\s]*\d{3}[\-\s]*\d{4}') AS RawNumber,
       REGEXP_REPLACE(
           REGEXP_SUBSTR(MessageText, '\d{3}[)\-\s]*\d{3}[\-\s]*\d{4}'),
           '\D', ''
       ) AS DigitsOnly,
       CASE 
           WHEN REGEXP_LIKE(MessageText, '\d{3}[)\-\s]*\d{3}[\-\s]*\d{4}') = 1 THEN 'Valid'
           ELSE 'Missing'
       END AS PhoneStatus
FROM dbo.CustomerMessages;

Results

MessageID RawNumber DigitsOnly PhoneStatus
1 (713) 555-1298 7135551298 Valid
2 +1-832-555-7821 18325557821 Valid
3 713-555-9876 7135559876 Valid
4 NULL NULL Missing

This example uses REGEXP_SUBSTR to extract the first matching pattern, REGEXP_REPLACE to strip non-numeric characters, and REGEXP_LIKE to verify valid numbers. The query standardizes phone numbers into a consistent digits-only format directly in T-SQL.


Example 2: Splitting Data with REGEXP_SPLIT_TO_TABLE and SUBSTRING

Suppose another table, CustomerFeedback, stores comma-separated tags describing customer interests. You want to separate them into individual rows and extract the first keyword for quick indexing.

SELECT FeedbackID,
       value AS Tag,
       SUBSTRING(value, 1) AS FirstWord
FROM CustomerFeedback
CROSS APPLY REGEXP_SPLIT_TO_TABLE(Tags, '\s*,\s*');

This query splits each comma-delimited string into rows while using the new SUBSTRING behavior (without specifying length) to extract the entire remaining text from the start position.

Results

FeedbackID Tag FirstWord
1 travel travel
1 photography photography
2 hiking hiking
2 camping camping

Summary

The new REGEXP functions and the enhanced SUBSTRING function in SQL Server 2025 deliver native pattern-matching and text-extraction capabilities. These additions eliminate the need for external string processing, making it easier to clean, parse, and analyze textual data inside the database engine. With these tools, developers can simplify ETL pipelines, improve data quality, and enable advanced text-driven analytics directly in T-SQL.