System.Text.Rune struct

2024-01-08

This article provides supplementary remarks to the reference documentation for this API.

A Rune instance represents a Unicode scalar value, which means any code point excluding the surrogate range (U+D800..U+DFFF). The type's constructors and conversion operators validate the input, so consumers can call the APIs assuming that the underlying Rune instance is well formed.

If you aren't familiar with the terms Unicode scalar value, code point, surrogate range, and well-formed, see Introduction to character encoding in .NET.

When to use the Rune type

Consider using the Rune type if your code:

Calls APIs that require Unicode scalar values
Explicitly handles surrogate pairs

APIs that require Unicode scalar values

If your code iterates through the char instances in a string or a ReadOnlySpan<char>, some of the char methods won't work correctly on char instances that are in the surrogate range. For example, the following APIs require a scalar value char to work correctly:

The following example shows code that won't work correctly if any of the char instances are surrogate code points:

// THE FOLLOWING METHOD SHOWS INCORRECT CODE.
// DO NOT DO THIS IN A PRODUCTION APPLICATION.
int CountLettersBadExample(string s)
{
    int letterCount = 0;

    foreach (char ch in s)
    {
        if (char.IsLetter(ch))
        { letterCount++; }
    }

    return letterCount;
}

// THE FOLLOWING METHOD SHOWS INCORRECT CODE.
// DO NOT DO THIS IN A PRODUCTION APPLICATION.
let countLettersBadExample (s: string) =
    let mutable letterCount = 0

    for ch in s do
        if Char.IsLetter ch then
            letterCount <- letterCount + 1
    
    letterCount

Here's equivalent code that works with a ReadOnlySpan<char>:

// THE FOLLOWING METHOD SHOWS INCORRECT CODE.
// DO NOT DO THIS IN A PRODUCTION APPLICATION.
static int CountLettersBadExample(ReadOnlySpan<char> span)
{
    int letterCount = 0;

    foreach (char ch in span)
    {
        if (char.IsLetter(ch))
        { letterCount++; }
    }

    return letterCount;
}

The preceding code works correctly with some languages such as English:

CountLettersInString("Hello")
// Returns 5

But it won't work correctly for languages outside the Basic Multilingual Plane, such as Osage:

CountLettersInString("𐓏𐓘𐓻𐓘𐓻𐓟 𐒻𐓟")
// Returns 0

The reason this method returns incorrect results for Osage text is that the char instances for Osage letters are surrogate code points. No single surrogate code point has enough information to determine if it's a letter.

If you change this code to use Rune instead of char, the method works correctly with code points outside the Basic Multilingual Plane:

int CountLetters(string s)
{
    int letterCount = 0;

    foreach (Rune rune in s.EnumerateRunes())
    {
        if (Rune.IsLetter(rune))
        { letterCount++; }
    }

    return letterCount;
}

let countLetters (s: string) =
    let mutable letterCount = 0

    for rune in s.EnumerateRunes() do
        if Rune.IsLetter rune then
            letterCount <- letterCount + 1

    letterCount

Here's equivalent code that works with a ReadOnlySpan<char>:

static int CountLetters(ReadOnlySpan<char> span)
{
    int letterCount = 0;

    foreach (Rune rune in span.EnumerateRunes())
    {
        if (Rune.IsLetter(rune))
        { letterCount++; }
    }

    return letterCount;
}

The preceding code counts Osage letters correctly:

CountLettersInString("𐓏𐓘𐓻𐓘𐓻𐓟 𐒻𐓟")
// Returns 8

Code that explicitly handles surrogate pairs

Consider using the Rune type if your code calls APIs that explicitly operate on surrogate code points, such as the following methods:

For example, the following method has special logic to deal with surrogate char pairs:

static void ProcessStringUseChar(string s)
{
    Console.WriteLine("Using char");

    for (int i = 0; i < s.Length; i++)
    {
        if (!char.IsSurrogate(s[i]))
        {
            Console.WriteLine($"Code point: {(int)(s[i])}");
        }
        else if (i + 1 < s.Length && char.IsSurrogatePair(s[i], s[i + 1]))
        {
            int codePoint = char.ConvertToUtf32(s[i], s[i + 1]);
            Console.WriteLine($"Code point: {codePoint}");
            i++; // so that when the loop iterates it's actually +2
        }
        else
        {
            throw new Exception("String was not well-formed UTF-16.");
        }
    }
}

Such code is simpler if it uses Rune, as in the following example:

static void ProcessStringUseRune(string s)
{
    Console.WriteLine("Using Rune");

    for (int i = 0; i < s.Length;)
    {
        if (!Rune.TryGetRuneAt(s, i, out Rune rune))
        {
            throw new Exception("String was not well-formed UTF-16.");
        }

        Console.WriteLine($"Code point: {rune.Value}");
        i += rune.Utf16SequenceLength; // increment the iterator by the number of chars in this Rune
    }
}

When not to use `Rune`

You don't need to use the Rune type if your code:

Looks for exact char matches
Splits a string on a known char value

Using the Rune type may return incorrect results if your code:

Counts the number of display characters in a string

Look for exact `char` matches

The following code iterates through a string looking for specific characters, returning the index of the first match. There's no need to change this code to use Rune, as the code is looking for characters that are represented by a single char.

int GetIndexOfFirstAToZ(string s)
{
    for (int i = 0; i < s.Length; i++)
    {
        char thisChar = s[i];
        if ('A' <= thisChar && thisChar <= 'Z')
        {
            return i; // found a match
        }
    }

    return -1; // didn't find 'A' - 'Z' in the input string
}

Split a string on a known `char`

It's common to call string.Split and use delimiters such as ' ' (space) or ',' (comma), as in the following example:

string inputString = "🐂, 🐄, 🐆";
string[] splitOnSpace = inputString.Split(' ');
string[] splitOnComma = inputString.Split(',');

There is no need to use Rune here, because the code is looking for characters that are represented by a single char.

Count the number of display characters in a `string`

The number of Rune instances in a string might not match the number of user-perceivable characters shown when displaying the string.

Since Rune instances represent Unicode scalar values, components that follow the Unicode text segmentation guidelines can use Rune as a building block for counting display characters.

The StringInfo type can be used to count display characters, but it doesn't count correctly in all scenarios for .NET implementations other than .NET 5+.

For more information, see Grapheme clusters.

How to instantiate a `Rune`

There are several ways to get a Rune instance. You can use a constructor to create a Rune directly from:

A code point.

Rune a = new Rune(0x0061); // LATIN SMALL LETTER A
Rune b = new Rune(0x10421); // DESERET CAPITAL LETTER ER

A single char.
```
Rune c = new Rune('a');
```

A surrogate char pair.

Rune d = new Rune('\ud83d', '\udd2e'); // U+1F52E CRYSTAL BALL

All of the constructors throw an ArgumentException if the input doesn't represent a valid Unicode scalar value.

There are Rune.TryCreate methods available for callers who don't want exceptions to be thrown on failure.

Rune instances can also be read from existing input sequences. For instance, given a ReadOnlySpan<char> that represents UTF-16 data, the Rune.DecodeFromUtf16 method returns the first Rune instance at the beginning of the input span. The Rune.DecodeFromUtf8 method operates similarly, accepting a ReadOnlySpan<byte> parameter that represents UTF-8 data. There are equivalent methods to read from the end of the span instead of the beginning of the span.

Query properties of a `Rune`

To get the integer code point value of a Rune instance, use the Rune.Value property.

Rune rune = new Rune('\ud83d', '\udd2e'); // U+1F52E CRYSTAL BALL
int codePoint = rune.Value; // = 128302 decimal (= 0x1F52E)

Many of the static APIs available on the char type are also available on the Rune type. For instance, Rune.IsWhiteSpace and Rune.GetUnicodeCategory are equivalents to Char.IsWhiteSpace and Char.GetUnicodeCategory methods. The Rune methods correctly handle surrogate pairs.

The following example code takes a ReadOnlySpan<char> as input and trims from both the start and the end of the span every Rune that isn't a letter or a digit.

static ReadOnlySpan<char> TrimNonLettersAndNonDigits(ReadOnlySpan<char> span)
{
    // First, trim from the front.
    // If any Rune can't be decoded
    // (return value is anything other than "Done"),
    // or if the Rune is a letter or digit,
    // stop trimming from the front and
    // instead work from the end.
    while (Rune.DecodeFromUtf16(span, out Rune rune, out int charsConsumed) == OperationStatus.Done)
    {
        if (Rune.IsLetterOrDigit(rune))
        { break; }
        span = span[charsConsumed..];
    }

    // Next, trim from the end.
    // If any Rune can't be decoded,
    // or if the Rune is a letter or digit,
    // break from the loop, and we're finished.
    while (Rune.DecodeLastFromUtf16(span, out Rune rune, out int charsConsumed) == OperationStatus.Done)
    {
        if (Rune.IsLetterOrDigit(rune))
        { break; }
        span = span[..^charsConsumed];
    }

    return span;
}

There are some API differences between char and Rune. For example:

There is no Rune equivalent to Char.IsSurrogate(Char), since Rune instances by definition can never be surrogate code points.
The Rune.GetUnicodeCategory doesn't always return the same result as Char.GetUnicodeCategory. It does return the same value as CharUnicodeInfo.GetUnicodeCategory. For more information, see the Remarks on Char.GetUnicodeCategory.

Convert a `Rune` to UTF-8 or UTF-16

Since a Rune is a Unicode scalar value, it can be converted to UTF-8, UTF-16, or UTF-32 encoding. The Rune type has built-in support for conversion to UTF-8 and UTF-16.

The Rune.EncodeToUtf16 converts a Rune instance to char instances. To query the number of char instances that would result from converting a Rune instance to UTF-16, use the Rune.Utf16SequenceLength property. Similar methods exist for UTF-8 conversion.

The following example converts a Rune instance to a char array. The code assumes you have a Rune instance in the rune variable:

char[] chars = new char[rune.Utf16SequenceLength];
int numCharsWritten = rune.EncodeToUtf16(chars);

Since a string is a sequence of UTF-16 chars, the following example also converts a Rune instance to UTF-16:

string theString = rune.ToString();

The following example converts a Rune instance to a UTF-8 byte array:

byte[] bytes = new byte[rune.Utf8SequenceLength];
int numBytesWritten = rune.EncodeToUtf8(bytes);

The Rune.EncodeToUtf16 and Rune.EncodeToUtf8 methods return the actual number of elements written. They throw an exception if the destination buffer is too short to contain the result. There are non-throwing TryEncodeToUtf8 and TryEncodeToUtf16 methods as well for callers who want to avoid exceptions.

Rune in .NET vs. other languages

The term "rune" is not defined in the Unicode Standard. The term dates back to the creation of UTF-8. Rob Pike and Ken Thompson were looking for a term to describe what would eventually become known as a code point. They settled on the term "rune", and Rob Pike's later influence over the Go programming language helped popularize the term.

However, the .NET Rune type is not the equivalent of the Go rune type. In Go, the rune type is an alias for int32. A Go rune is intended to represent a Unicode code point, but it can be any 32-bit value, including surrogate code points and values that are not legal Unicode code points.

For similar types in other programming languages, see Rust's primitive char type or Swift's Unicode.Scalar type, both of which represent Unicode scalar values. They provide functionality similar to .NET's Rune type, and they disallow instantiation of values that are not legal Unicode scalar values.

Share via

System.Text.Rune struct

When to use the Rune type

APIs that require Unicode scalar values

Code that explicitly handles surrogate pairs

When not to use Rune

Look for exact char matches

Split a string on a known char

Count the number of display characters in a string

How to instantiate a Rune

Query properties of a Rune

Convert a Rune to UTF-8 or UTF-16