.NET Framework Regular Expressions

2011-08-09

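Regular expressions provide a powerful, flexible, and efficient way to process text. Their extensive pattern-matching notation lets you quickly parse large amounts of text to find specific character patterns; validate text to confirm that it matches a predefined pattern (such as an e-mail address); extract, edit, replace, or delete text substrings; and add the extracted strings to a collection in order to generate a report. For many applications that work with strings or parse large blocks of text, regular expressions are an indispensable tool.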

How Regular Expressions Work

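The centerpiece of text processing with regular expressions is the regular expression engine, which the .NET Framework represents with the System.Text.RegularExpressions.Regex object. At a minimum, processing text with regular expressions requires that the engine be given the following two items of information: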

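The regular expression pattern to identify in the text.

In the .NET Framework, regular expression patterns are defined by a special syntax or language that is compatible with Perl 5 regular expressions and adds some features of its own, such as right-to-left matching. For more information, see Regular Expression Language Elements.
The text to parse for the regular expression pattern.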
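As a small illustration of the right-to-left matching mentioned above, the following sketch runs the same pattern over the same text twice, once with the default scan direction and once with RegexOptions.RightToLeft. The class name and the sample input are made up for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public class RightToLeftSketch
{
   public static void Main()
   {
      // Hypothetical sample text.
      string input = "one two three";

      // Default direction: the scan starts at the beginning, so the
      // first word found is "one".
      Console.WriteLine(Regex.Match(input, @"\w+").Value);

      // RightToLeft: the scan starts at the end of the input, so the
      // first word found is "three".
      Console.WriteLine(Regex.Match(input, @"\w+", RegexOptions.RightToLeft).Value);
   }
}
```

Both calls supply the same two pieces of information, a pattern and the text to parse; only the option changes where the engine starts looking.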

The methods of the Regex class let you perform the following operations:

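Determine whether the regular expression pattern occurs in the input text by calling the IsMatch method. For an example that uses the IsMatch method to validate text, see How to: Verify that Strings Are in Valid Email Format.
Retrieve one or all occurrences of text that matches the regular expression pattern by calling the Match or Matches method. Match returns a Match object that provides information about the matching text. Matches returns a MatchCollection object that contains one Match object for each match found in the parsed text.
Replace text that matches the regular expression pattern by calling the Replace method. For examples that use the Replace method to change date formats and remove invalid characters from a string, see How to: Strip Invalid Characters from a String and Example: Changing Date Formats.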
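The operations listed above can be sketched side by side with the static overloads of each method. The input string, the pattern, and the class name below are illustrative assumptions and do not come from the linked topics.

```csharp
using System;
using System.Text.RegularExpressions;

public class RegexMethodsSketch
{
   public static void Main()
   {
      // Hypothetical input and pattern: find runs of decimal digits.
      string input = "Order 66 shipped, order 1138 pending.";
      string pattern = @"\d+";

      // IsMatch: does the pattern occur anywhere in the input?
      Console.WriteLine(Regex.IsMatch(input, pattern));

      // Match: information about the first occurrence only.
      Match first = Regex.Match(input, pattern);
      Console.WriteLine("{0} at position {1}", first.Value, first.Index);

      // Matches: one Match object per occurrence.
      foreach (Match m in Regex.Matches(input, pattern))
         Console.WriteLine(m.Value);

      // Replace: substitute every occurrence with a replacement string.
      Console.WriteLine(Regex.Replace(input, pattern, "#"));
   }
}
// The sketch displays the following output:
//    True
//    66 at position 6
//    66
//    1138
//    Order # shipped, order # pending.
```

The static overloads are used here for brevity. If one pattern is applied many times, constructing a Regex instance once and calling its instance methods may be preferable.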

For an overview of the regular expression object model, see The Regular Expression Object Model.

Regular Expression Examples

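The String class includes a number of search and replacement methods that you can use when you want to locate literal strings in a larger string. Regular expressions are most useful when you want to locate one of several substrings in a larger string, or when you want to identify patterns in a string, as the following examples illustrate.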

Example 1: Replacing Substrings

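Assume that a mailing list contains names that sometimes include a title (Mr., Mrs., Miss, or Ms.) along with a first and last name. If you do not want the titles to appear when you generate envelope labels from the list, you can use a regular expression to remove them, as the following example shows.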

Imports System.Text.RegularExpressions

Module Example
   Public Sub Main()
      Dim pattern As String = "(Mr\.? |Mrs\.? |Miss |Ms\.? )"
      Dim names() As String = { "Mr. Henry Hunt", "Ms. Sara Samuels", _
                                "Abraham Adams", "Ms. Nicole Norris" }
      For Each name As String In names
         Console.WriteLine(Regex.Replace(name, pattern, String.Empty))
      Next                                
   End Sub
End Module
' The example displays the following output:
'    Henry Hunt
'    Sara Samuels
'    Abraham Adams
'    Nicole Norris

using System;
using System.Text.RegularExpressions;

public class Example
{
   public static void Main()
   {
      string pattern = "(Mr\\.? |Mrs\\.? |Miss |Ms\\.? )";
      string[] names = { "Mr. Henry Hunt", "Ms. Sara Samuels", 
                         "Abraham Adams", "Ms. Nicole Norris" };
      foreach (string name in names)
         Console.WriteLine(Regex.Replace(name, pattern, String.Empty));
   }
}
// The example displays the following output:
//    Henry Hunt
//    Sara Samuels
//    Abraham Adams
//    Nicole Norris

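The regular expression pattern (Mr\.? |Mrs\.? |Miss |Ms\.? ) matches any occurrence of "Mr ", "Mr. ", "Mrs ", "Mrs. ", "Miss ", "Ms ", or "Ms. ". The call to the Regex.Replace method replaces the matched string with String.Empty; in other words, it removes the matched string from the original string.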

Example 2: Identifying Duplicated Words

Accidentally duplicating a word is a common error that writers make. You can use a regular expression to identify duplicated words, as the following example shows.

Imports System.Text.RegularExpressions

Module modMain
   Public Sub Main()
      Dim pattern As String = "\b(\w+?)\s\1\b"
      Dim input As String = "This this is a nice day. What about this? This tastes good. I saw a a dog."
      For Each match As Match In Regex.Matches(input, pattern, RegexOptions.IgnoreCase)
         Console.WriteLine("{0} (duplicates '{1}') at position {2}", _
                           match.Value, match.Groups(1).Value, match.Index)
      Next
   End Sub
End Module
' The example displays the following output:
'       This this (duplicates 'This') at position 0
'       a a (duplicates 'a') at position 66

using System;
using System.Text.RegularExpressions;

public class Class1
{
   public static void Main()
   {
      string pattern = @"\b(\w+?)\s\1\b";
      string input = "This this is a nice day. What about this? This tastes good. I saw a a dog.";
      foreach (Match match in Regex.Matches(input, pattern, RegexOptions.IgnoreCase))
         Console.WriteLine("{0} (duplicates '{1}') at position {2}", 
                           match.Value, match.Groups[1].Value, match.Index);
   }
}
// The example displays the following output:
//       This this (duplicates 'This') at position 0
//       a a (duplicates 'a') at position 66

The regular expression pattern \b(\w+?)\s\1\b can be interpreted as follows:

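\b	Start the match at a word boundary.
(\w+?)	Match one or more word characters, but as few as possible. Together, these characters form a group that can be referred to as \1.
\s	Match a white-space character.
\1	Match a substring that is equal to the group named \1.
\b	Match a word boundary.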

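The Regex.Matches method is called with the regular expression options set to RegexOptions.IgnoreCase. The match operation is therefore case-insensitive, and the example identifies the substring "This this" as a duplication.

The input string also includes the substring "this? This". Because a punctuation mark separates the two words, however, it is not identified as a duplication.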
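To make the effect of RegexOptions.IgnoreCase concrete, the following sketch counts the matches for the same pattern and input with and without the option. The class name is made up for this illustration; the pattern and input are the ones used in the example.

```csharp
using System;
using System.Text.RegularExpressions;

public class IgnoreCaseSketch
{
   public static void Main()
   {
      string pattern = @"\b(\w+?)\s\1\b";
      string input = "This this is a nice day. What about this? This tastes good. I saw a a dog.";

      // Case-sensitive: the backreference \1 must repeat "This" exactly,
      // so "This this" is not a match and only "a a" is found.
      Console.WriteLine(Regex.Matches(input, pattern).Count);                          // 1

      // Case-insensitive: "This this" and "a a" are both found.
      Console.WriteLine(Regex.Matches(input, pattern, RegexOptions.IgnoreCase).Count); // 2
   }
}
```

In neither case is "this? This" reported, because the question mark between the two words cannot be matched by \s.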

Example 3: Dynamically Building a Culture-Sensitive Regular Expression

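The following example shows how the power of regular expressions can be combined with the flexibility of the .NET Framework's globalization features. It uses the NumberFormatInfo object to determine the format of currency values in the system's current culture, and then uses that information to dynamically construct a regular expression that extracts currency values from the text. For each match, it extracts the subgroup that contains only the numeric string, converts it to a Decimal value, and adds it to a running total.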

Imports System.Collections.Generic
Imports System.Globalization
Imports System.Text.RegularExpressions

Public Module Example
   Public Sub Main()
      ' Define text to be parsed.
      Dim input As String = "Office expenses on 2/13/2008:" + vbCrLf + _
                            "Paper (500 sheets)                      $3.95" + vbCrLf + _
                            "Pencils (box of 10)                     $1.00" + vbCrLf + _
                            "Pens (box of 10)                        $4.49" + vbCrLf + _
                            "Erasers                                 $2.19" + vbCrLf + _
                            "Ink jet printer                        $69.95" + vbCrLf + vbCrLf + _
                            "Total Expenses                        $ 81.58" + vbCrLf
      ' Get current culture's NumberFormatInfo object.
      Dim nfi As NumberFormatInfo = CultureInfo.CurrentCulture.NumberFormat
      ' Assign needed property values to variables.
      Dim currencySymbol As String = nfi.CurrencySymbol
      Dim symbolPrecedesIfPositive As Boolean = CBool(nfi.CurrencyPositivePattern Mod 2 = 0)
      Dim groupSeparator As String = nfi.CurrencyGroupSeparator
      Dim decimalSeparator As String = nfi.CurrencyDecimalSeparator

      ' Form regular expression pattern. Culture-supplied strings are escaped so that
      ' characters such as "." or "$" are matched literally.
      Dim pattern As String = Regex.Escape(CStr(IIf(symbolPrecedesIfPositive, currencySymbol, ""))) + _
                              "\s*[-+]?" + "([0-9]{0,3}(" + Regex.Escape(groupSeparator) + "[0-9]{3})*(" + _
                              Regex.Escape(decimalSeparator) + "[0-9]+)?)" + _
                              Regex.Escape(CStr(IIf(Not symbolPrecedesIfPositive, currencySymbol, "")))
      Console.WriteLine("The regular expression pattern is: ")
      Console.WriteLine("   " + pattern)      

      ' Get text that matches regular expression pattern.
      Dim matches As MatchCollection = Regex.Matches(input, pattern, RegexOptions.IgnorePatternWhitespace)               
      Console.WriteLine("Found {0} matches. ", matches.Count)

      ' Get numeric string, convert it to a value, and add it to List object.
      Dim expenses As New List(Of Decimal)

      For Each match As Match In matches
         expenses.Add(Decimal.Parse(match.Groups.Item(1).Value))      
      Next

      ' Determine whether total is present and if present, whether it is correct.
      Dim total As Decimal
      For Each value As Decimal In expenses
         total += value
      Next

      If total / 2 = expenses(expenses.Count - 1) Then
         Console.WriteLine("The expenses total {0:C2}.", expenses(expenses.Count - 1))
      Else
         Console.WriteLine("The expenses total {0:C2}.", total)
      End If   
   End Sub
End Module
' The example displays the following output:
'       The regular expression pattern is:
'          \$\s*[-+]?([0-9]{0,3}(,[0-9]{3})*(\.[0-9]+)?)
'       Found 6 matches.
'       The expenses total $81.58.

using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text.RegularExpressions;

public class Example
{
   public static void Main()
   {
      // Define text to be parsed.
      string input = "Office expenses on 2/13/2008:\n" + 
                     "Paper (500 sheets)                      $3.95\n" + 
                     "Pencils (box of 10)                     $1.00\n" + 
                     "Pens (box of 10)                        $4.49\n" + 
                     "Erasers                                 $2.19\n" + 
                     "Ink jet printer                        $69.95\n\n" + 
                     "Total Expenses                        $ 81.58\n"; 

      // Get current culture's NumberFormatInfo object.
      NumberFormatInfo nfi = CultureInfo.CurrentCulture.NumberFormat;
      // Assign needed property values to variables.
      string currencySymbol = nfi.CurrencySymbol;
      bool symbolPrecedesIfPositive = nfi.CurrencyPositivePattern % 2 == 0;
      string groupSeparator = nfi.CurrencyGroupSeparator;
      string decimalSeparator = nfi.CurrencyDecimalSeparator;

      // Form regular expression pattern. Culture-supplied strings are escaped so that
      // characters such as "." or "$" are matched literally.
      string pattern = Regex.Escape( symbolPrecedesIfPositive ? currencySymbol : "") + 
                       @"\s*[-+]?" + "([0-9]{0,3}(" + Regex.Escape(groupSeparator) + "[0-9]{3})*(" + 
                       Regex.Escape(decimalSeparator) + "[0-9]+)?)" + 
                       Regex.Escape(! symbolPrecedesIfPositive ? currencySymbol : ""); 
      Console.WriteLine( "The regular expression pattern is:");
      Console.WriteLine("   " + pattern);      

      // Get text that matches regular expression pattern.
      MatchCollection matches = Regex.Matches(input, pattern, 
                                              RegexOptions.IgnorePatternWhitespace);               
      Console.WriteLine("Found {0} matches.", matches.Count); 

      // Get numeric string, convert it to a value, and add it to List object.
      List<decimal> expenses = new List<Decimal>();

      foreach (Match match in matches)
         expenses.Add(Decimal.Parse(match.Groups[1].Value));      

      // Determine whether total is present and if present, whether it is correct.
      decimal total = 0;
      foreach (decimal value in expenses)
         total += value;

      if (total / 2 == expenses[expenses.Count - 1]) 
         Console.WriteLine("The expenses total {0:C2}.", expenses[expenses.Count - 1]);
      else
         Console.WriteLine("The expenses total {0:C2}.", total);
   }  
}
// The example displays the following output:
//       The regular expression pattern is:
//          \$\s*[-+]?([0-9]{0,3}(,[0-9]{3})*(\.[0-9]+)?)
//       Found 6 matches.
//       The expenses total $81.58.

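On a computer whose current culture is English - United States (en-US), the example dynamically builds the regular expression \$\s*[-+]?([0-9]{0,3}(,[0-9]{3})*(\.[0-9]+)?). This regular expression pattern can be interpreted as follows: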

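\$	Look for a single occurrence of the dollar symbol ($) in the input string. The backslash in the pattern string indicates that the dollar symbol is to be interpreted literally rather than as a regular expression anchor. (Used alone, the $ symbol is an anchor that matches at the end of the string.) To ensure that the current culture's currency symbol is not misinterpreted as a regular expression symbol, the example calls the Escape method to escape the character.
\s*	Look for zero or more occurrences of a white-space character.
[-+]?	Look for zero or one occurrence of either a positive sign or a negative sign.
([0-9]{0,3}(,[0-9]{3})*(\.[0-9]+)?)	The outer parentheses around this expression define it as a capturing group or subexpression. If a match is found, information about this part of the matching string can be retrieved from the second Group object in the GroupCollection object returned by the Match.Groups property. (The first element in the collection represents the entire match.)
[0-9]{0,3}	Look for zero to three occurrences of the decimal digits 0 through 9.
(,[0-9]{3})*	Look for zero or more occurrences of a group separator followed by three decimal digits.
\.	Look for a single occurrence of the decimal separator.
[0-9]+	Look for one or more decimal digits.
(\.[0-9]+)?	Look for zero or one occurrence of the decimal separator followed by at least one decimal digit.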

If each of these subpatterns is found in the input string, the match succeeds, and a Match object that contains information about the match is added to the MatchCollection object.
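The relationship between the entire match and the captured numeric string can be sketched as follows. To keep the output independent of the current culture, the sketch hard-codes the en-US pattern shown above as a literal instead of building it from NumberFormatInfo. That simplification, the class name, and the shortened input line are assumptions made for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public class CurrencyGroupsSketch
{
   public static void Main()
   {
      // The pattern that the example builds for en-US, written as a literal.
      string pattern = @"\$\s*[-+]?([0-9]{0,3}(,[0-9]{3})*(\.[0-9]+)?)";
      Match match = Regex.Match("Total Expenses    $ 81.58", pattern);

      // Groups[0] is the entire match, including the currency symbol
      // and the white space that follows it.
      Console.WriteLine(match.Groups[0].Value);   // $ 81.58

      // Groups[1] is the outer capturing group: the numeric string only.
      Console.WriteLine(match.Groups[1].Value);   // 81.58
   }
}
```

This is why the example passes the value of the second Group object, rather than the entire match, to Decimal.Parse.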

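Title	Description
Regular Expression Language Elements	Provides detailed information about the set of characters, operators, and constructs that you can use to define regular expressions.
Best Practices for Regular Expressions in the .NET Framework	Provides recommendations for optimizing regular expression performance and for creating robust, reliable regular expression patterns.
The Regular Expression Object Model	Provides information and code examples that illustrate how to use the regular expression classes.
Details of Regular Expression Behavior	Describes the capabilities and behavior of .NET Framework regular expressions.
Regular Expression Examples	Provides code examples that illustrate typical uses of regular expressions.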

Reference

System.Text.RegularExpressions

System.Text.RegularExpressions.Regex