StringInfo and TextElementEnumerator are now UAX29-compliant

Prior to this change, System.Globalization.StringInfo and System.Globalization.TextElementEnumerator didn't properly handle all grapheme clusters. Some graphemes were split into their constituent components instead of being kept together. Now, StringInfo and TextElementEnumerator process grapheme clusters according to the latest version of the Unicode Standard.

In addition, the Microsoft.VisualBasic.Strings.StrReverse method, which reverses the characters in a string in Visual Basic, now also follows the Unicode standard for grapheme clusters.

Change description

A grapheme, or extended grapheme cluster, is a single user-perceived character that may be made up of multiple Unicode code points. For example, the string containing the Thai character "kam" (กำ) consists of the following two characters:

  • ก (= '\u0e01') THAI CHARACTER KO KAI
  • ำ (= '\u0e33') THAI CHARACTER SARA AM

When displayed to the user, the operating system combines the two characters to form the single display character (or grapheme) "kam" or กำ. Emoji can also consist of multiple characters that are combined for display in a similar way.
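The difference between characters and graphemes can be checked directly. The following is a minimal sketch, assuming a .NET 5 or later runtime and C# 9 top-level statements. The variable names are illustrative only. It compares the number of char values in "kam" with the number of text elements that StringInfo reports.

```csharp
using System;
using System.Globalization;

// "kam": KO KAI (U+0E01) followed by SARA AM (U+0E33).
string kam = "\u0e01\u0e33";

// String.Length counts UTF-16 code units (char values), not graphemes.
int charCount = kam.Length;

// LengthInTextElements counts text elements (graphemes).
int textElementCount = new StringInfo(kam).LengthInTextElements;

Console.WriteLine($"char count: {charCount}");                 // 2
Console.WriteLine($"text element count: {textElementCount}");  // 1 on .NET 5 and later
```

On .NET Framework and .NET Core 3.x and earlier, the second value would be 2, because those versions split the two characters apart.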

Tip

The .NET documentation sometimes uses the term "text element" when referring to a grapheme.

The StringInfo and TextElementEnumerator classes inspect strings and return information about the graphemes they contain. In .NET Framework (all versions) and .NET Core 3.x and earlier, these two classes use custom logic that handles some combining classes but doesn't fully comply with the Unicode Standard. For example, the StringInfo and TextElementEnumerator classes incorrectly split the single Thai character "kam" back into its constituent components instead of keeping them together. These classes also incorrectly split the emoji character "🤷🏽‍♀️" into four clusters (person shrugging, skin tone modifier, an invisible zero-width joiner, and gender modifier) instead of keeping them together as a single grapheme cluster.

Starting with .NET 5, the StringInfo and TextElementEnumerator classes implement the Unicode standard as defined by Unicode Standard Annex #29, rev. 35, sec. 3. In particular, they now return extended grapheme clusters for all combining classes.
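To see where the cluster boundaries fall, the existing StringInfo.ParseCombiningCharacters and StringInfo.GetNextTextElement methods can be used. The sketch below assumes a .NET 5 or later runtime and C# 9 top-level statements. The emoji is written with escape sequences and is assumed to be the same woman-shrugging sequence used elsewhere in this article.

```csharp
using System;
using System.Globalization;

// "kam" followed by the shrugging emoji (base, skin tone, ZWJ, female sign, variation selector).
string text = "\u0e01\u0e33" + "\U0001F937\U0001F3FD\u200D\u2640\uFE0F";

// Each entry is the char index at which a text element starts.
int[] starts = StringInfo.ParseCombiningCharacters(text);

for (int k = 0; k < starts.Length; k++)
{
    // GetNextTextElement returns the whole text element that begins at the given index.
    int length = StringInfo.GetNextTextElement(text, starts[k]).Length;
    Console.WriteLine($"Text element {k + 1}: starts at index {starts[k]}, {length} char(s)");
}
```

On .NET 5 and later, this reports two text elements, starting at indexes 0 and 2. Earlier versions report more, because they break inside each cluster.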

Consider the following C# code:

using System;
using System.Globalization;

class Program
{
    static void Main(string[] args)
    {
        // Thai "kam": two code points that form one grapheme.
        PrintGraphemes("กำ");
        // Emoji sequence: several code points that form one grapheme.
        PrintGraphemes("🤷🏽‍♀️");
    }

    // Enumerates the text elements (graphemes) of str and prints each one.
    static void PrintGraphemes(string str)
    {
        Console.WriteLine($"Printing graphemes of \"{str}\"...");
        int i = 0;

        TextElementEnumerator enumerator = StringInfo.GetTextElementEnumerator(str);
        while (enumerator.MoveNext())
        {
            Console.WriteLine($"Grapheme {++i}: \"{enumerator.Current}\"");
        }

        Console.WriteLine($"({i} grapheme(s) total.)");
        Console.WriteLine();
    }
}

In .NET Framework and .NET Core 3.x and earlier versions, the graphemes are split up, and the console output is as follows:

Printing graphemes of "กำ"...
Grapheme 1: "ก"
Grapheme 2: "ำ"
(2 grapheme(s) total.)

Printing graphemes of "🤷🏽‍♀️"...
Grapheme 1: "🤷"
Grapheme 2: "🏽"
Grapheme 3: "‍"
Grapheme 4: "♀️"
(4 grapheme(s) total.)

In .NET 5 and later versions, the graphemes are kept together, and the console output is as follows:

Printing graphemes of "กำ"...
Grapheme 1: "กำ"
(1 grapheme(s) total.)

Printing graphemes of "🤷🏽‍♀️"...
Grapheme 1: "🤷🏽‍♀️"
(1 grapheme(s) total.)

In addition, starting in .NET 5, the Microsoft.VisualBasic.Strings.StrReverse method, which reverses the characters in a string in Visual Basic, follows the Unicode standard for grapheme clusters.
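The effect on string reversal can be illustrated in C#. The following sketch is not the StrReverse implementation. ReverseByTextElement is a hypothetical helper that reverses a string one text element at a time, shown next to a naive char-by-char reversal. It assumes a .NET 5 or later runtime and C# 9 top-level statements.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical helper: reverses the order of text elements, keeping each one intact.
static string ReverseByTextElement(string input)
{
    var elements = new List<string>();
    TextElementEnumerator enumerator = StringInfo.GetTextElementEnumerator(input);
    while (enumerator.MoveNext())
    {
        elements.Add(enumerator.GetTextElement());
    }

    elements.Reverse();
    return string.Concat(elements);
}

// "a", then "kam" (U+0E01 U+0E33), then "b".
string original = "a\u0e01\u0e33b";

// Naive reversal swaps the two Thai code points, which breaks the grapheme.
char[] chars = original.ToCharArray();
Array.Reverse(chars);
string naive = new string(chars);

// Grapheme-aware reversal keeps KO KAI before SARA AM.
string reversed = ReverseByTextElement(original);

Console.WriteLine(naive == "b\u0e33\u0e01a");     // True
Console.WriteLine(reversed == "b\u0e01\u0e33a");  // True on .NET 5 and later
```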

These changes are part of a wider set of Unicode and UTF-8 improvements in .NET, including an extended grapheme cluster enumeration API to complement the Unicode scalar-value enumeration APIs that were introduced with the System.Text.Rune type in .NET Core 3.0.
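The two levels of enumeration can be compared side by side. The sketch below assumes a .NET 5 or later runtime and C# 9 top-level statements. It counts the chars, Unicode scalar values (via String.EnumerateRunes), and text elements of the shrugging emoji, which is written with escape sequences and assumed to be the same sequence used earlier in this article.

```csharp
using System;
using System.Globalization;
using System.Text;

// Person shrugging, medium skin tone, zero-width joiner, female sign, variation selector-16.
string shrug = "\U0001F937\U0001F3FD\u200D\u2640\uFE0F";

// Scalar-value enumeration: one Rune per Unicode code point.
int runeCount = 0;
foreach (Rune rune in shrug.EnumerateRunes())
{
    runeCount++;
    Console.WriteLine($"U+{rune.Value:X4}");
}

// Grapheme enumeration: one text element per user-perceived character.
int textElementCount = new StringInfo(shrug).LengthInTextElements;

Console.WriteLine($"chars: {shrug.Length}");               // 7
Console.WriteLine($"runes: {runeCount}");                  // 5
Console.WriteLine($"text elements: {textElementCount}");   // 1 on .NET 5 and later
```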

Version introduced

.NET 5.0

Recommended action

You don't need to take any action. Your apps will automatically behave in a more standards-compliant manner in a variety of globalization-related scenarios.

Affected APIs