Extract substrings from a string

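This article covers several techniques for extracting parts of a string.

  • If the substrings you want are separated by a known delimiting character (or characters), use the Split method.
  • If the string conforms to a fixed pattern, use regular expressions.
  • If you don't need to extract all of the substrings in a string, use the IndexOf and Substring methods together.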

String.Split method

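String.Split provides a handful of overloads that break a string into a group of substrings based on one or more delimiting characters that you specify. You can choose to limit the total number of substrings in the final result, trim white-space characters from the substrings, or exclude empty substrings.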

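The following examples show three different overloads of String.Split(). The first example calls the Split(Char[]) overload without passing any separator characters. When you don't specify any delimiting characters, String.Split() uses the default delimiters, which are white-space characters, to split the string.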

string s = "You win some. You lose some.";

string[] subs = s.Split();

foreach (string sub in subs)
{
    Console.WriteLine($"Substring: {sub}");
}

// This example produces the following output:
//
// Substring: You
// Substring: win
// Substring: some.
// Substring: You
// Substring: lose
// Substring: some.
Dim s As String = "You win some. You lose some."
Dim subs As String() = s.Split()

For Each substring As String In subs
    Console.WriteLine("Substring: {0}", substring)
Next

' This example produces the following output:
'
' Substring: You
' Substring: win
' Substring: some.
' Substring: You
' Substring: lose
' Substring: some.

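As you can see, the period character (.) is included in two of the substrings. If you want to exclude the periods, you can add the period character as an additional delimiting character. The next example shows how to do this.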

string s = "You win some. You lose some.";

string[] subs = s.Split(' ', '.');

foreach (string sub in subs)
{
    Console.WriteLine($"Substring: {sub}");
}

// This example produces the following output:
//
// Substring: You
// Substring: win
// Substring: some
// Substring:
// Substring: You
// Substring: lose
// Substring: some
// Substring:
Dim s As String = "You win some. You lose some."
Dim subs As String() = s.Split(" "c, "."c)

For Each substring As String In subs
    Console.WriteLine("Substring: {0}", substring)
Next

' This example produces the following output:
'
' Substring: You
' Substring: win
' Substring: some
' Substring:
' Substring: You
' Substring: lose
' Substring: some
' Substring:

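The periods are gone from the substrings, but now two extra empty substrings are included. The first is the empty string between the first period and the space that follows it; the second is the empty string after the final period at the end of the input. To omit empty strings from the resulting array, you can call the Split(Char[], StringSplitOptions) overload and specify StringSplitOptions.RemoveEmptyEntries for the options parameter.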

string s = "You win some. You lose some.";
char[] separators = new char[] { ' ', '.' };

string[] subs = s.Split(separators, StringSplitOptions.RemoveEmptyEntries);

foreach (string sub in subs)
{
    Console.WriteLine($"Substring: {sub}");
}

// This example produces the following output:
//
// Substring: You
// Substring: win
// Substring: some
// Substring: You
// Substring: lose
// Substring: some
Dim s As String = "You win some. You lose some."
Dim separators As Char() = New Char() {" "c, "."c}
Dim subs As String() = s.Split(separators, StringSplitOptions.RemoveEmptyEntries)

For Each substring As String In subs
    Console.WriteLine("Substring: {0}", substring)
Next

' This example produces the following output:
'
' Substring: You
' Substring: win
' Substring: some
' Substring: You
' Substring: lose
' Substring: some
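
The examples above don't show the options for limiting the number of substrings or trimming white space that are mentioned at the start of this section. The following is a minimal sketch of both, assuming the Split(Char, Int32, StringSplitOptions) overload and the StringSplitOptions.TrimEntries value, which is available in .NET 5 and later. The input string is hypothetical and chosen only for illustration.

```csharp
using System;

// Hypothetical input: a key followed by a value that itself contains the separator.
string line = "timeout = 30 = seconds";

// Return at most 2 substrings and trim white space around each one.
// The last substring holds the unsplit remainder of the string.
// StringSplitOptions.TrimEntries is assumed to be available (.NET 5 or later).
string[] parts = line.Split('=', 2, StringSplitOptions.TrimEntries);

foreach (string part in parts)
{
    Console.WriteLine($"Substring: '{part}'");
}

// This sketch produces the following output:
//
// Substring: 'timeout'
// Substring: '30 = seconds'
```

Because the count is 2, the second '=' is not treated as a delimiter, so the value keeps its embedded separator.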

Regular expressions

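If your string conforms to a fixed pattern, you can use a regular expression to extract and handle its elements. For example, if each string takes the form "number operator number", a regular expression with three capturing groups can extract the two operands and the operator. Here's an example: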

String[] expressions = { "16 + 21", "31 * 3", "28 / 3",
                       "42 - 18", "12 * 7",
                       "2, 4, 6, 8" };
String pattern = @"(\d+)\s+([-+*/])\s+(\d+)";

foreach (string expression in expressions)
{
    foreach (System.Text.RegularExpressions.Match m in
    System.Text.RegularExpressions.Regex.Matches(expression, pattern))
    {
        int value1 = Int32.Parse(m.Groups[1].Value);
        int value2 = Int32.Parse(m.Groups[3].Value);
        switch (m.Groups[2].Value)
        {
            case "+":
                Console.WriteLine("{0} = {1}", m.Value, value1 + value2);
                break;
            case "-":
                Console.WriteLine("{0} = {1}", m.Value, value1 - value2);
                break;
            case "*":
                Console.WriteLine("{0} = {1}", m.Value, value1 * value2);
                break;
            case "/":
                Console.WriteLine("{0} = {1:N2}", m.Value, (double)value1 / value2);
                break;
        }
    }
}

// The example displays the following output:
//       16 + 21 = 37
//       31 * 3 = 93
//       28 / 3 = 9.33
//       42 - 18 = 24
//       12 * 7 = 84
Dim expressions() As String = {"16 + 21", "31 * 3", "28 / 3",
                              "42 - 18", "12 * 7",
                              "2, 4, 6, 8"}

Dim pattern As String = "(\d+)\s+([-+*/])\s+(\d+)"
For Each expression In expressions
    For Each m As Match In Regex.Matches(expression, pattern)
        Dim value1 As Integer = Int32.Parse(m.Groups(1).Value)
        Dim value2 As Integer = Int32.Parse(m.Groups(3).Value)
        Select Case m.Groups(2).Value
            Case "+"
                Console.WriteLine("{0} = {1}", m.Value, value1 + value2)
            Case "-"
                Console.WriteLine("{0} = {1}", m.Value, value1 - value2)
            Case "*"
                Console.WriteLine("{0} = {1}", m.Value, value1 * value2)
            Case "/"
                Console.WriteLine("{0} = {1:N2}", m.Value, value1 / value2)
        End Select
    Next
Next

' The example displays the following output:
'       16 + 21 = 37
'       31 * 3 = 93
'       28 / 3 = 9.33
'       42 - 18 = 24
'       12 * 7 = 84

The regular expression pattern (\d+)\s+([-+*/])\s+(\d+) is defined like this:

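Pattern     Description
(\d+)       Match one or more decimal digits. This is the first capturing group.
\s+         Match one or more white-space characters.
([-+*/])    Match an arithmetic operator sign (+, -, *, or /). This is the second capturing group.
\s+         Match one or more white-space characters.
(\d+)       Match one or more decimal digits. This is the third capturing group.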
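
The capturing groups in the table above are addressed by number. As a variation, the groups can be given names so that the extraction code doesn't depend on their order. The following is a minimal sketch; the group names left, op, and right are hypothetical and are not part of the original example.

```csharp
using System;
using System.Text.RegularExpressions;

// Same pattern as above, with names assigned to the three capturing groups.
// The names left, op, and right are illustrative only.
string namedPattern = @"(?<left>\d+)\s+(?<op>[-+*/])\s+(?<right>\d+)";

Match match = Regex.Match("28 / 3", namedPattern);
if (match.Success)
{
    // Read each element by group name instead of by group number.
    int left = Int32.Parse(match.Groups["left"].Value);
    string op = match.Groups["op"].Value;
    int right = Int32.Parse(match.Groups["right"].Value);
    Console.WriteLine($"left={left}, op={op}, right={right}");
}

// This sketch produces the following output:
//
// left=28, op=/, right=3
```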

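You can also use a regular expression to extract substrings from a string based on a pattern rather than a fixed set of characters. This is a common scenario when either of the following conditions applies: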

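  • One or more of the delimiter characters does not always serve as a delimiter in the String instance.

  • The sequence and number of delimiter characters is variable or unknown.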

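For example, the Split method cannot be used to split the following string, because the number of \n (newline) characters is variable, and they don't always serve as delimiters.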

[This is captured\ntext.]\n\n[\n[This is more captured text.]\n]
\n[Some more captured text:\n   Option1\n   Option2][Terse text.]

A regular expression can split this string easily, as the following example shows.

String input = "[This is captured\ntext.]\n\n[\n" +
               "[This is more captured text.]\n]\n" +
               "[Some more captured text:\n   Option1" +
               "\n   Option2][Terse text.]";
String pattern = @"\[([^\[\]]+)\]";
int ctr = 0;

foreach (System.Text.RegularExpressions.Match m in
   System.Text.RegularExpressions.Regex.Matches(input, pattern))
{
    Console.WriteLine("{0}: {1}", ++ctr, m.Groups[1].Value);
}

// The example displays the following output:
//       1: This is captured
//       text.
//       2: This is more captured text.
//       3: Some more captured text:
//          Option1
//          Option2
//       4: Terse text.
Dim input As String = String.Format("[This is captured{0}text.]" +
                                  "{0}{0}[{0}[This is more " +
                                  "captured text.]{0}{0}" +
                                  "[Some more captured text:" +
                                  "{0}   Option1" +
                                  "{0}   Option2][Terse text.]",
                                  vbCrLf)
Dim pattern As String = "\[([^\[\]]+)\]"
Dim ctr As Integer = 0
For Each m As Match In Regex.Matches(input, pattern)
    ctr += 1
    Console.WriteLine("{0}: {1}", ctr, m.Groups(1).Value)
Next

' The example displays the following output:
'       1: This is captured
'       text.
'       2: This is more captured text.
'       3: Some more captured text:
'          Option1
'          Option2
'       4: Terse text.

The regular expression pattern \[([^\[\]]+)\] is defined like this:

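Pattern       Description
\[            Match an opening bracket.
([^\[\]]+)    Match any character that is not an opening or a closing bracket, one or more times. This is the first capturing group.
\]            Match a closing bracket.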

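The Regex.Split method is almost identical to String.Split, except that it splits a string based on a regular expression pattern instead of a fixed character set. For example, the following example uses the Regex.Split method to split a string that contains substrings delimited by various combinations of hyphens and other characters.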

String input = "abacus -- alabaster - * - atrium -+- " +
               "any -*- actual - + - armoire - - alarm";
String pattern = @"\s-\s?[+*]?\s?-\s";
String[] elements = System.Text.RegularExpressions.Regex.Split(input, pattern);

foreach (string element in elements)
    Console.WriteLine(element);

// The example displays the following output:
//       abacus
//       alabaster
//       atrium
//       any
//       actual
//       armoire
//       alarm
Dim input As String = "abacus -- alabaster - * - atrium -+- " +
                    "any -*- actual - + - armoire - - alarm"
Dim pattern As String = "\s-\s?[+*]?\s?-\s"
Dim elements() As String = Regex.Split(input, pattern)
For Each element In elements
    Console.WriteLine(element)
Next

' The example displays the following output:
'       abacus
'       alabaster
'       atrium
'       any
'       actual
'       armoire
'       alarm

The regular expression pattern \s-\s?[+*]?\s?-\s is defined like this:

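Pattern    Description
\s-        Match a white-space character followed by a hyphen.
\s?        Match zero or one white-space character.
[+*]?      Match zero or one occurrence of either the + or the * character.
\s?        Match zero or one white-space character.
-\s        Match a hyphen followed by a white-space character.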

String.IndexOf and String.Substring methods

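If you aren't interested in all of the substrings in a string, you might prefer to work with one of the string search methods that return the index at which a match begins. You can then call the Substring method to extract the substring that you want. These methods include: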

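  • IndexOf, which returns the zero-based index of the first occurrence of a character or string in a string instance.

  • IndexOfAny, which returns the zero-based index in the current string instance of the first occurrence of any character in a character array.

  • LastIndexOf, which returns the zero-based index of the last occurrence of a character or string in a string instance.

  • LastIndexOfAny, which returns the zero-based index in the current string instance of the last occurrence of any character in a character array.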

The following example uses the IndexOf method to find the periods in a string. It then uses the Substring method to return full sentences.

String s = "This is the first sentence in a string. " +
               "More sentences will follow. For example, " +
               "this is the third sentence. This is the " +
               "fourth. And this is the fifth and final " +
               "sentence.";
var sentences = new List<String>();
int start = 0;
int position;

// Extract sentences from the string.
do
{
    position = s.IndexOf('.', start);
    if (position >= 0)
    {
        sentences.Add(s.Substring(start, position - start + 1).Trim());
        start = position + 1;
    }
} while (position >= 0);

// Display the sentences.
foreach (var sentence in sentences)
    Console.WriteLine(sentence);

// The example displays the following output:
//       This is the first sentence in a string.
//       More sentences will follow.
//       For example, this is the third sentence.
//       This is the fourth.
//       And this is the fifth and final sentence.
    Dim input As String = "This is the first sentence in a string. " +
                        "More sentences will follow. For example, " +
                        "this is the third sentence. This is the " +
                        "fourth. And this is the fifth and final " +
                        "sentence."
    Dim sentences As New List(Of String)
    Dim start As Integer = 0
    Dim position As Integer

    ' Extract sentences from the string.
    Do
        position = input.IndexOf("."c, start)
        If position >= 0 Then
            sentences.Add(input.Substring(start, position - start + 1).Trim())
            start = position + 1
        End If
    Loop While position >= 0

    ' Display the sentences.
    For Each sentence In sentences
        Console.WriteLine(sentence)
    Next

' The example displays the following output:
'       This is the first sentence in a string.
'       More sentences will follow.
'       For example, this is the third sentence.
'       This is the fourth.
'       And this is the fifth and final sentence.
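
The other search methods listed earlier in this section combine with Substring in the same way. The following is a minimal sketch that uses LastIndexOf and IndexOfAny; the path string and the variable names are hypothetical and chosen only for illustration.

```csharp
using System;

// Hypothetical input used only for illustration.
string path = "reports/2024/summary.final.txt";

// LastIndexOf locates the last '/' and the last '.', so the file name and
// extension can be cut out without splitting the whole string.
int slash = path.LastIndexOf('/');
int dot = path.LastIndexOf('.');

// If no '/' is found, slash is -1 and the whole string is the file name.
string fileName = path.Substring(slash + 1);

// Only treat the '.' as an extension marker if it comes after the last '/'.
string extension = dot > slash ? path.Substring(dot + 1) : "";

// IndexOfAny returns the index of the first character that matches any
// character in the array, or -1 if none is found.
int firstSeparator = path.IndexOfAny(new[] { '/', '.' });
string firstSegment = firstSeparator >= 0 ? path.Substring(0, firstSeparator) : path;

Console.WriteLine(fileName);
Console.WriteLine(extension);
Console.WriteLine(firstSegment);

// This sketch produces the following output:
//
// summary.final.txt
// txt
// reports
```

Each search method returns -1 when nothing is found, so the sketch checks the index before passing it to Substring where a negative value would otherwise throw.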

See also