6 Lexical structure

6.1 Programs

A C# program consists of one or more source files, each known formally as a compilation unit (§14.2). Although a compilation unit might have a one-to-one correspondence with a file in a file system, such correspondence is not required.

Conceptually speaking, a program is compiled using three steps:

Transformation, which converts a file from a particular character repertoire and encoding scheme into a sequence of Unicode characters.
Lexical analysis, which translates a stream of Unicode input characters into a stream of tokens.
Syntactic analysis, which translates the stream of tokens into executable code.

Apart from accepting UTF-8 encoded input (as required by §5), a conforming implementation may choose to accept and transform additional character encoding schemes (such as UTF-16, UTF-32, or non-Unicode character mappings).

Note: The handling of the Unicode NULL character (U+0000) is implementation-defined. It is strongly recommended that developers avoid using this character in their source code, for the sake of both portability and readability. When the character is required within a character or string literal, the escape sequences \0 or \u0000 may be used instead. end note

Note: It is beyond the scope of this specification to define how a file using a character representation other than Unicode might be transformed into a sequence of Unicode characters. During such transformation, however, it is recommended that the usual line-separating character (or sequence) in the other character set be translated to the two-character sequence consisting of the Unicode carriage-return character (U+000D) followed by the Unicode line-feed character (U+000A). For the most part this transformation will have no visible effects; however, it will affect the interpretation of verbatim string literal tokens (§6.4.5.6). The purpose of this recommendation is to allow a verbatim string literal to produce the same character sequence when its compilation unit is moved between systems that support differing non-Unicode character sets, in particular, those using differing character sequences for line-separation. end note

6.2 Grammars

6.2.1 General

This specification presents the syntax of the C# programming language using two grammars. The lexical grammar (§6.2.3) defines how Unicode characters are combined to form line terminators, white space, comments, tokens, and pre-processing directives. The syntactic grammar (§6.2.4) defines how the tokens resulting from the lexical grammar are combined to form C# programs.

All terminal characters are to be understood as the appropriate Unicode character from the range U+0020 to U+007F, as opposed to any similar-looking characters from other Unicode character ranges.

6.2.2 Grammar notation

The lexical and syntactic grammars are presented in the ANTLR grammar tool’s Extended Backus-Naur form.

While the ANTLR notation is used, this specification does not present a complete ANTLR-ready “reference grammar” for C#; writing a lexer and parser, either by hand or using a tool such as ANTLR, is outside the scope of a language specification. With that qualification, this specification attempts to minimize the gap between the specified grammar and that required to build a lexer and parser in ANTLR.

ANTLR distinguishes lexical grammars from syntactic grammars (which ANTLR terms parser grammars) in its notation: lexical rules start with an uppercase letter and parser rules start with a lowercase letter.

Note: The C# lexical grammar (§6.2.3) and syntactic grammar (§6.2.4) are not in exact correspondence with the ANTLR division into lexical and parser grammars. This small mismatch means that some ANTLR parser rules are used when specifying the C# lexical grammar. end note

6.2.3 Lexical grammar

The lexical grammar of C# is presented in §6.3, §6.4, and §6.5. The terminal symbols of the lexical grammar are the characters of the Unicode character set, and the lexical grammar specifies how characters are combined to form tokens (§6.4), white space (§6.3.4), comments (§6.3.3), and pre-processing directives (§6.5).

Many of the terminal symbols of the syntactic grammar are not defined explicitly as tokens in the lexical grammar. Rather, advantage is taken of the ANTLR behavior that literal strings in the grammar are extracted as implicit lexical tokens; this allows keywords, operators, etc. to be represented in the grammar by their literal representation rather than a token name.

Every compilation unit in a C# program shall conform to the input production of the lexical grammar (§6.3.1).

6.2.4 Syntactic grammar

The syntactic grammar of C# is presented in the clauses, subclauses, and annexes that follow this subclause. The terminal symbols of the syntactic grammar are the tokens defined explicitly by the lexical grammar and implicitly by literal strings in the grammar itself (§6.2.3). The syntactic grammar specifies how tokens are combined to form C# programs.

Every compilation unit in a C# program shall conform to the compilation_unit production (§14.2) of the syntactic grammar.

6.2.5 Grammar ambiguities

The productions for:

simple_name (§12.8.4),
member_access (§12.8.7),
null_conditional_member_access (§12.8.8),
dependent_access (§12.8.8),
base_access (§12.8.15) and
pointer_member_access (§24.6.3);

(the “disambiguated productions”) can give rise to ambiguities in the grammar for expressions.

These productions occur in contexts where a value can occur in an expression, and have one or more alternatives that end with the grammar “identifier type_argument_list?”. It is the optional type_argument_list which results in the possible ambiguity.

Example: The statement:
F(G<A, B>(7));
could be interpreted as a call to F with two arguments, G < A and B > (7). Alternatively, it could be interpreted as a call to F with one argument, which is a call to a generic method G with two type arguments and one regular argument.

end example

If a sequence of tokens can be parsed, in context, as one of the disambiguated productions including an optional type_argument_list (§8.4.2), then the token immediately following the closing > token shall be examined and if it is:

one of ( ) ] } : ; , . ? == != | ^ && || & [ =>; or
one of the relational operators < <= >= is as; or
a contextual query keyword appearing inside a query expression.
In certain contexts, identifier is treated as a disambiguating token. Those contexts are where the sequence of tokens being disambiguated is immediately preceded by one of the keywords is, case or out, or arises while parsing the first element of a tuple literal (in which case the tokens are preceded by ( or : and the identifier is followed by a ,) or a subsequent element of a tuple literal.

then the type_argument_list shall be retained as part of the disambiguated production and any other possible parse of the sequence of tokens discarded. Otherwise, the tokens parsed as a type_argument_list shall not be considered to be part of the disambiguated production, even if there is no other possible parse of those tokens.

Note: These disambiguation rules shall not be applied when parsing other productions even if they similarly end in “identifier type_argument_list?”; such productions shall be parsed as normal. Examples include: namespace_or_type_name (§7.8); named_entity (§12.8.23); null_conditional_projection_initializer (§12.8.8); and qualified_alias_member (§14.9.1). end note

Example: The statement:
F(G<A, B>(7));
will, according to this rule, be interpreted as a call to F with one argument, which is a call to a generic method G with two type arguments and one regular argument. The statements
F(G<A, B>7);
F(G<A, B>>7);
will each be interpreted as a call to F with two arguments. The statement
x = F<A> + y;
will be interpreted as a less-than operator, greater-than operator and unary-plus operator, as if the statement had been written x = (F < A) > (+y), instead of as a simple_name with a type_argument_list followed by a binary-plus operator. In the statement
x = y is C<T> && z;
the tokens C<T> are interpreted as a namespace_or_type_name with a type_argument_list due to the presence of the disambiguating token && after the type_argument_list.

The expression (A < B, C > D) is a tuple with two elements, each a comparison.

The expression (A<B,C> D, E) is a tuple with two elements, the first of which is a declaration expression.

The invocation M(A < B, C > D, E) has three arguments.

The invocation M(out A<B,C> D, E) has two arguments, the first of which is an out declaration.

The expression e is A<B> C uses a declaration pattern.

The case label case A<B> C: uses a declaration pattern.

end example
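
Note: As an informal illustration (not part of the normative text), the two readings of the first example can be made to compile side by side. The names AmbiguityDemo, GenericCall, and TwoComparisons, and the overloads of F, are hypothetical and chosen only so that each reading binds successfully.

```csharp
static class AmbiguityDemo
{
    // Hypothetical generic method used by the one-argument reading.
    static int G<T1, T2>(int x) => x;

    static string F(int value) => "one argument";
    static string F(bool left, bool right) => "two arguments";

    public static string GenericCall()
    {
        // "(" follows the closing ">", so <int, string> is retained as a
        // type_argument_list and G<int, string>(7) is a generic method call.
        return F(G<int, string>(7));
    }

    public static string TwoComparisons()
    {
        // Locals that hide the method G inside this body.
        int G = 1, A = 2, B = 9;
        // "7" is not a disambiguating token, so this is F(G < A, B > 7).
        return F(G<A, B>7);
    }
}
```

Under these assumptions GenericCall selects the single-parameter overload and TwoComparisons selects the two-parameter overload, because the token after the closing > differs. end note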

6.3 Lexical analysis

6.3.1 General

For convenience, the lexical grammar defines and references the following named lexer tokens:

DEFAULT  : 'default' ;
NULL     : 'null' ;
TRUE     : 'true' ;
FALSE    : 'false' ;
ASTERISK : '*' ;
SLASH    : '/' ;

Although these are lexer rules, these names are spelled in all-uppercase letters to distinguish them from ordinary lexer rule names.

Note: These convenience rules are exceptions to the usual practice of not providing explicit token names for tokens defined by literal strings. end note

The input production defines the lexical structure of a C# compilation unit.

input
    : input_section?
    ;

input_section
    : input_section_part+
    ;

input_section_part
    : input_element* New_Line
    | PP_Directive
    ;

input_element
    : Whitespace
    | Comment
    | token
    ;

Note: The above grammar is described by ANTLR parsing rules; it defines the lexical structure of a C# compilation unit and not lexical tokens. end note

Five basic elements make up the lexical structure of a C# compilation unit: Line terminators (§6.3.2), white space (§6.3.4), comments (§6.3.3), tokens (§6.4), and pre-processing directives (§6.5). Of these basic elements, only tokens are significant in the syntactic grammar of a C# program (§6.2.4).

The lexical processing of a C# compilation unit consists of reducing the file into a sequence of tokens that becomes the input to the syntactic analysis. Line terminators, white space, and comments can serve to separate tokens, and pre-processing directives can cause sections of the compilation unit to be skipped, but otherwise these lexical elements have no impact on the syntactic structure of a C# program.

When several lexical grammar productions match a sequence of characters in a compilation unit, the lexical processing always forms the longest possible lexical element.

Example: The character sequence // is processed as the beginning of a single-line comment because that lexical element is longer than a single / token. end example
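
Note: The same longest-match behavior applies to operator tokens. The following informal sketch (the names LongestMatchDemo and Run are hypothetical) relies on the character sequence +++ being tokenized as ++ followed by +, rather than + followed by ++.

```csharp
static class LongestMatchDemo
{
    public static (int Sum, int X) Run()
    {
        int x = 1, y = 10;
        // "+++" forms the tokens "++" and "+", so this is (x++) + y.
        int sum = x+++y;
        return (sum, x);
    }
}
```

With these values Run returns (11, 2): the postfix increment yields the old value of x for the addition and then leaves x equal to 2. end note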

Some tokens are defined by a set of lexical rules: a main rule and one or more sub-rules. The latter are marked in the grammar by fragment to indicate the rule defines part of another token. Fragment rules are not considered in the top-to-bottom ordering of lexical rules.

Note: In ANTLR, fragment is a keyword that produces the same behavior as defined here. end note

6.3.2 Line terminators

Line terminators divide the characters of a C# compilation unit into lines.

New_Line
    : New_Line_Character
    | '\u000D\u000A'    // carriage return, line feed
    ;

For compatibility with source code editing tools that add end-of-file markers, and to enable a compilation unit to be viewed as a sequence of properly terminated lines, the following transformations are applied, in order, to every compilation unit in a C# program:

If the last character of the compilation unit is a Control-Z character (U+001A), this character is deleted.
A carriage-return character (U+000D) is added to the end of the compilation unit if that compilation unit is non-empty and if the last character of the compilation unit is not a carriage return (U+000D), a line feed (U+000A), a next line character (U+0085), a line separator (U+2028), or a paragraph separator (U+2029).

Note: The additional carriage-return allows a program to end in a PP_Directive (§6.5) that does not have a terminating New_Line. end note
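
Note: The two transformations above can be sketched as a single function over the text of a compilation unit. This is an informal illustration only; the names EndOfUnit, Normalize, and IsNewLineCharacter are hypothetical, and an implementation is free to apply the transformations in any manner that gives the same result.

```csharp
static class EndOfUnit
{
    // Mirrors the New_Line_Character fragment of the lexical grammar.
    static bool IsNewLineCharacter(char c) =>
        c == '\u000D' || c == '\u000A' || c == '\u0085' ||
        c == '\u2028' || c == '\u2029';

    public static string Normalize(string unit)
    {
        // Step 1: delete a trailing Control-Z character (U+001A).
        if (unit.Length > 0 && unit[unit.Length - 1] == '\u001A')
        {
            unit = unit.Substring(0, unit.Length - 1);
        }

        // Step 2: append a carriage return (U+000D) to a non-empty unit
        // that does not already end in a New_Line_Character.
        if (unit.Length > 0 && !IsNewLineCharacter(unit[unit.Length - 1]))
        {
            unit += '\u000D';
        }

        return unit;
    }
}
```

For instance, a unit consisting of #endif followed by Control-Z becomes #endif followed by a carriage return, while an empty unit and a unit that already ends in a line feed are returned unchanged. end note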

6.3.3 Comments

Two forms of comments are supported: delimited comments and single-line comments.

A delimited comment begins with the characters /* and ends with the characters */. Delimited comments can occupy a portion of a line, a single line, or multiple lines.

Example: The example

/* Hello, world program
   This program writes "hello, world" to the console
*/
class Hello
{
    static void Main()
    {
        System.Console.WriteLine("hello, world");
    }
}

includes a delimited comment.

end example

A single-line comment begins with the characters // and extends to the end of the line.

Example: The example

// Hello, world program
// This program writes "hello, world" to the console
//
class Hello // any name will do for this class
{
    static void Main() // this method must be named "Main"
    {
        System.Console.WriteLine("hello, world");
    }
}

shows several single-line comments.

end example

Comment
    : Single_Line_Comment
    | Delimited_Comment
    ;

fragment Single_Line_Comment
    : '//' Input_Character*
    ;

fragment Input_Character
    // anything but New_Line_Character
    : ~('\u000D' | '\u000A'   | '\u0085' | '\u2028' | '\u2029')
    ;

fragment New_Line_Character
    : '\u000D'  // carriage return
    | '\u000A'  // line feed
    | '\u0085'  // next line
    | '\u2028'  // line separator
    | '\u2029'  // paragraph separator
    ;

fragment Delimited_Comment
    : '/*' Delimited_Comment_Section* ASTERISK+ '/'
    ;

fragment Delimited_Comment_Section
    : SLASH
    | ASTERISK* Not_Slash_Or_Asterisk
    ;

fragment Not_Slash_Or_Asterisk
    : ~('/' | '*')    // Any except SLASH or ASTERISK
    ;

Comments do not nest. The character sequences /* and */ have no special meaning within a single-line comment, and the character sequences // and /* have no special meaning within a delimited comment.

Comments are not processed within character and string literals.
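
Note: As an informal illustration (the names CommentsInLiterals, Text, and Slash are hypothetical), the comment-like character sequences inside the literals below are ordinary literal content; only the text after the statement's terminating semicolon is a comment.

```csharp
static class CommentsInLiterals
{
    // Both "//" and "/* ... */" inside the string are literal characters.
    public static string Text() => "http://example.com /* kept */"; // a real comment

    // A lone '/' in a character literal does not begin a comment either.
    public static char Slash() => '/';
}
```

end note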

Note: These rules must be interpreted carefully. For instance, in the example below, the delimited comment that begins before A ends between B and C(). The reason is that
// B */ C();
is not actually a single-line comment, since // has no special meaning within a delimited comment, and so */ does have its usual special meaning in that line.

Likewise, the delimited comment starting before D ends before E. The reason is that "D */ " is not actually a string literal, since the initial double quote character appears inside a delimited comment.

A useful consequence of /* and */ having no special meaning within a single-line comment is that a block of source code lines can be commented out by putting // at the beginning of each line. In general, it does not work to put /* before those lines and */ after them, as this does not properly encapsulate delimited comments in the block, and in general may completely change the structure of such delimited comments.

Example code:
static void Main()
{
    /* A
    // B */ C();
    Console.WriteLine(/* "D */ "E");
}
end note

Single_Line_Comments and Delimited_Comments having particular formats can be used as documentation comments, as described in §D.

6.3.4 White space

White space is defined as any character with Unicode class Zs (which includes the space character) as well as the horizontal tab character, the vertical tab character, and the form feed character.

Whitespace
    : [\p{Zs}]  // any character with Unicode class Zs
    | '\u0009'  // horizontal tab
    | '\u000B'  // vertical tab
    | '\u000C'  // form feed
    ;
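
Note: The Whitespace rule can be approximated with the .NET character classification API. The sketch below is informal; the names WhitespaceRule and IsWhitespace are hypothetical, and the result for a given character depends on the Unicode data shipped with the runtime in use. It differs from char.IsWhiteSpace, which also accepts line terminators.

```csharp
using System.Globalization;

static class WhitespaceRule
{
    public static bool IsWhitespace(char c) =>
        char.GetUnicodeCategory(c) == UnicodeCategory.SpaceSeparator // class Zs
        || c == '\u0009'   // horizontal tab
        || c == '\u000B'   // vertical tab
        || c == '\u000C';  // form feed
}
```

end note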

6.4 Tokens

6.4.1 General

There are several kinds of tokens: identifiers, keywords, literals, operators, and punctuators. White space and comments are not tokens, though they act as separators for tokens.

token
    : identifier
    | keyword
    | Integer_Literal
    | Real_Literal
    | Character_Literal
    | String_Literal
    | operator_or_punctuator
    ;

Note: This is an ANTLR parser rule; it does not define a lexical token but rather the collection of token kinds. end note

6.4.2 Unicode character escape sequences

A Unicode character escape sequence represents a Unicode code point. Unicode character escape sequences are processed in identifiers (§6.4.3), character literals (§6.4.5.5), regular string literals (§6.4.5.6), and interpolated regular string expressions (§12.8.3). A Unicode character escape sequence is not processed in any other location (for example, to form an operator, punctuator, or keyword).

fragment Unicode_Escape_Sequence
    : '\\u' Hex_Digit Hex_Digit Hex_Digit Hex_Digit
    | '\\U' Hex_Digit Hex_Digit Hex_Digit Hex_Digit
            Hex_Digit Hex_Digit Hex_Digit Hex_Digit
    ;

A Unicode_Escape_Sequence represents the Unicode code point whose value is the hexadecimal number following the “\u” or “\U” characters. Unicode code points in the range U+10000 to U+10FFFF require two UTF-16 surrogate code units; as C# char values are represented using a single UTF-16 code unit (§8.3.6), Unicode code points above U+FFFF are not permitted in character literals. Unicode code points above U+10FFFF are invalid and are not supported.

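Multiple translations are not performed. For instance, the string literal "\u005Cu005C" denotes the six characters \u005C (the escape sequence yields a single “\”, and the following u005C is ordinary text); the result is not processed a second time to yield the single character “\”.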

Note: The Unicode value \u005C is the character “\”. end note
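
Note: The single-pass behavior can be observed informally as follows (the names EscapeOnce and Value are hypothetical). The literal contains one escape sequence followed by five ordinary characters, so the resulting string has six characters.

```csharp
static class EscapeOnce
{
    // \u005C is translated once to a backslash; the characters u005C that
    // follow are ordinary text and are not combined with it afterwards.
    public static string Value() => "\u005Cu005C";
}
```

The value equals the verbatim string @"\u005C", not a string holding a single backslash. end note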

Example: The example

class Class1
{
    static void Test(bool \u0066)
    {
        char c = '\u0066';
        if (\u0066)
        {
            System.Console.WriteLine(c.ToString());
        }
    }
}

shows several uses of \u0066, which is the escape sequence for the letter “f”. The program is equivalent to

class Class1
{
    static void Test(bool f)
    {
        char c = 'f';
        if (f)
        {
            System.Console.WriteLine(c.ToString());
        }
    }
}

end example

6.4.3 Identifiers

The rules for identifiers given in this subclause correspond exactly to those recommended by the Unicode Standard Annex 15 except that underscore is allowed as an initial character (as is traditional in the C programming language), Unicode escape sequences are permitted in identifiers, and the “@” character is allowed as a prefix to enable keywords to be used as identifiers.

identifier
    : Simple_Identifier
    | contextual_keyword
    | discard_token
    ;

discard_token
    : '_'
    ;

Simple_Identifier
    : Available_Identifier
    | Escaped_Identifier
    ;

fragment Available_Identifier
    // excluding keywords or contextual keywords, see note below
    : Basic_Identifier
    ;

fragment Escaped_Identifier
    // Includes keywords and contextual keywords prefixed by '@'.
    // See note below.
    : '@' Basic_Identifier
    ;

fragment Basic_Identifier
    : Identifier_Start_Character Identifier_Part_Character*
    ;

fragment Identifier_Start_Character
    : Letter_Character
    | Underscore_Character
    ;

fragment Underscore_Character
    : '_'               // underscore
    | '\\u005' [fF]     // Unicode_Escape_Sequence for underscore
    | '\\U0000005' [fF] // Unicode_Escape_Sequence for underscore
    ;

fragment Identifier_Part_Character
    : Letter_Character
    | Decimal_Digit_Character
    | Connecting_Character
    | Combining_Character
    | Formatting_Character
    ;

fragment Letter_Character
    // Category Letter, all subcategories; category Number, subcategory letter.
    : [\p{L}\p{Nl}]
    // Only escapes for categories L & Nl allowed. See note below.
    | Unicode_Escape_Sequence
    ;

fragment Combining_Character
    // Category Mark, subcategories non-spacing and spacing combining.
    : [\p{Mn}\p{Mc}]
    // Only escapes for categories Mn & Mc allowed. See note below.
    | Unicode_Escape_Sequence
    ;

fragment Decimal_Digit_Character
    // Category Number, subcategory decimal digit.
    : [\p{Nd}]
    // Only escapes for category Nd allowed. See note below.
    | Unicode_Escape_Sequence
    ;

fragment Connecting_Character
    // Category Punctuation, subcategory connector.
    : [\p{Pc}]
    // Only escapes for category Pc allowed. See note below.
    | Unicode_Escape_Sequence
    ;

fragment Formatting_Character
    // Category Other, subcategory format.
    : [\p{Cf}]
    // Only escapes for category Cf allowed, see note below.
    | Unicode_Escape_Sequence
    ;

Note:

For information on the Unicode character classes mentioned above, see The Unicode Standard.

The fragment Available_Identifier requires the exclusion of keywords and contextual keywords. If the grammar in this specification is processed with ANTLR then this exclusion is handled automatically by the semantics of ANTLR:

Keywords and contextual keywords occur in the grammar as literal strings.

ANTLR creates implicit lexical token rules for these literal strings.

ANTLR considers these implicit rules before the explicit lexical rules in the grammar.

Therefore fragment Available_Identifier will not match keywords or contextual keywords as the lexical rules for those precede it.

Fragment Escaped_Identifier includes escaped keywords and contextual keywords as they are part of the longer token starting with an @ and lexical processing always forms the longest possible lexical element (§6.3.1).

How an implementation enforces the restrictions on the allowable Unicode_Escape_Sequence values is an implementation issue.

end note

Example: Examples of valid identifiers are identifier1, _identifier2, and @if. end example

An identifier in a conforming program shall be in the canonical format defined by Unicode Normalization Form C, as defined by Unicode Standard Annex 15. The behavior when encountering an identifier not in Normalization Form C is implementation-defined; however, a diagnostic is not required.

The prefix “@” enables the use of keywords as identifiers, which is useful when interfacing with other programming languages. The character @ is not actually part of the identifier, so the identifier might be seen in other languages as a normal identifier, without the prefix. An identifier with an @ prefix is called a verbatim identifier.

Note: Use of the @ prefix for identifiers that are not keywords is permitted, but strongly discouraged as a matter of style. end note

Example: The example:
class @class
{
    public static void @static(bool @bool)
    {
        if (@bool)
        {
            System.Console.WriteLine("true");
        }
        else
        {
            System.Console.WriteLine("false");
        }
    }
}

class Class1
{
    static void M()
    {
        cl\u0061ss.st\u0061tic(true);
    }
}
defines a class named “class” with a static method named “static” that takes a parameter named “bool”. Note that since Unicode escapes are not permitted in keywords, the token “cl\u0061ss” is an identifier, and is the same identifier as “@class”.

end example

Two identifiers are considered the same if they are identical after the following transformations are applied, in order:

The prefix “@”, if used, is removed.
Each Unicode_Escape_Sequence is transformed into its corresponding Unicode character.
Any Formatting_Characters are removed.
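
Note: The three transformations can be sketched as a function that maps the source spelling of an identifier to the form used for comparison. This is an informal illustration under stated assumptions: the names IdentifierComparison and Canonicalize are hypothetical, the input is assumed to be a well-formed identifier, and the use of a regular expression is merely one convenient way to locate escape sequences.

```csharp
using System;
using System.Globalization;
using System.Text;
using System.Text.RegularExpressions;

static class IdentifierComparison
{
    public static string Canonicalize(string identifier)
    {
        // 1. Remove the "@" prefix, if used.
        string text = identifier.StartsWith("@", StringComparison.Ordinal)
            ? identifier.Substring(1)
            : identifier;

        // 2. Replace each \uXXXX or \UXXXXXXXX with the character it denotes.
        text = Regex.Replace(
            text,
            @"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})",
            m => char.ConvertFromUtf32(Convert.ToInt32(
                m.Groups[1].Success ? m.Groups[1].Value : m.Groups[2].Value, 16)));

        // 3. Remove formatting characters (Unicode class Cf), stepping over
        //    surrogate pairs so supplementary characters are classified whole.
        var result = new StringBuilder();
        for (int i = 0; i < text.Length; )
        {
            int width = char.IsSurrogatePair(text, i) ? 2 : 1;
            if (CharUnicodeInfo.GetUnicodeCategory(text, i) != UnicodeCategory.Format)
            {
                result.Append(text, i, width);
            }
            i += width;
        }
        return result.ToString();
    }
}
```

Under this sketch the spellings @class and cl\u0061ss both canonicalize to class, which is why they denote the same identifier. end note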

The semantics of an identifier named _ depends on the context in which it appears:

It can denote a named program element, such as a variable, class, or method, or
It can denote a discard (§9.2.9.2).

Identifiers containing two consecutive underscore characters (U+005F) are reserved for use by the implementation; however, no diagnostic is required if such an identifier is defined.

Note: For example, an implementation might provide extended keywords that begin with two underscores. end note

6.4.4 Keywords

A keyword is an identifier-like sequence of characters that is reserved, and cannot be used as an identifier except when prefaced by the @ character.

keyword
    : 'abstract' | 'as'       | 'base'       | 'bool'      | 'break'
    | 'byte'     | 'case'     | 'catch'      | 'char'      | 'checked'
    | 'class'    | 'const'    | 'continue'   | 'decimal'   | DEFAULT
    | 'delegate' | 'do'       | 'double'     | 'else'      | 'enum'
    | 'event'    | 'explicit' | 'extern'     | FALSE       | 'file'      | 'finally'
    | 'fixed'    | 'float'    | 'for'        | 'foreach'   | 'goto'
    | 'if'       | 'implicit' | 'in'         | 'int'       | 'interface'
    | 'internal' | 'is'       | 'lock'       | 'long'      | 'namespace'
    | 'new'      | NULL       | 'object'     | 'operator'  | 'out'
    | 'override' | 'params'   | 'private'    | 'protected' | 'public'
    | 'readonly' | 'ref'      | 'return'     | 'sbyte'     | 'sealed'
    | 'short'    | 'sizeof'   | 'stackalloc' | 'static'    | 'string'
    | 'struct'   | 'switch'   | 'this'       | 'throw'     | TRUE
    | 'try'      | 'typeof'   | 'uint'       | 'ulong'     | 'unchecked'
    | 'unsafe'   | 'ushort'   | 'using'      | 'virtual'   | 'void'
    | 'volatile' | 'while'
    ;

A contextual keyword is an identifier-like sequence of characters that has special meaning in certain contexts, but is not reserved, and can be used as an identifier outside of those contexts as well as when prefaced by the @ character.

contextual_keyword
    : 'add'      | 'alias'      | 'and'        | 'ascending' | 'async'
    | 'await'    | 'by'         | 'Cdecl'      | 'descending'| 'dynamic'
    | 'equals'   | 'Fastcall'   | 'from'       | 'get'       | 'global'
    | 'group'    | 'init'       | 'into'       | 'join'      | 'let'
    | 'managed'  | 'nameof'     | 'nint'       | 'not'       | 'notnull'
    | 'nuint'    | 'on'         | 'or'         | 'orderby'   | 'partial'
    | 'record'   | 'remove'     | 'required'   | 'scoped'    | 'select'    | 'set'       | 'Stdcall'
    | 'Thiscall' | 'unmanaged'  | 'value'      | 'var'       | 'when'
    | 'where'    | 'yield'
    ;

Note: The rules keyword and contextual_keyword are parser rules as they do not introduce new token kinds. All keywords and contextual keywords are defined by implicit lexical rules as they occur as literal strings in the grammar (§6.2.3). end note

In most cases, the syntactic location of contextual keywords is such that they can never be confused with ordinary identifier usage. For example, within a property declaration, the get, init, and set identifiers have special meaning (§15.7.3). An identifier other than get, init, or set is never permitted in these locations, so this use does not conflict with a use of these words as identifiers.

In certain cases the grammar is not enough to distinguish contextual keyword usage from identifiers. In all such cases it will be specified how to disambiguate between the two. For example, the contextual keyword var in implicitly typed local variable declarations (§13.6.2) might conflict with a declared type called var, in which case the declared name takes precedence over the use of the identifier as a contextual keyword.

Another example of such disambiguation is the contextual keyword await (§12.9.9.1), which is considered a keyword only when inside a method declared async, but can be used as an identifier elsewhere.

Just as with keywords, contextual keywords can be used as ordinary identifiers by prefixing them with the @ character.

Note: When used as contextual keywords, these identifiers cannot contain Unicode_Escape_Sequences. end note
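
Note: As an informal illustration (the names ContextualKeywordDemo and Sum are hypothetical), the following method uses several contextual keywords as ordinary local variable names. None of them appears in a context where it has special meaning, so no @ prefix is required, although one may be used.

```csharp
static class ContextualKeywordDemo
{
    public static int Sum()
    {
        // var, value, partial, and yield are plain identifiers here.
        int var = 1, value = 2, partial = 3, yield = 4;

        // @yield refers to the same variable as yield.
        return var + value + partial + @yield;
    }
}
```

end note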

6.4.5 Literals

6.4.5.1 General

A literal (§12.8.2) is a source-code representation of a value.

literal
    : boolean_literal
    | Integer_Literal
    | Real_Literal
    | Character_Literal
    | String_Literal
    | null_literal
    ;

Note: literal is a parser rule as it groups other token kinds and does not introduce a new token kind. end note

6.4.5.2 Boolean literals

There are two Boolean literal values: true and false.

boolean_literal
    : TRUE
    | FALSE
    ;

Note: boolean_literal is a parser rule as it groups other token kinds and does not introduce a new token kind. end note

The type of a boolean_literal is bool.

6.4.5.3 Integer literals

Integer literals are used to write values of types int, uint, long, and ulong. Integer literals have three possible forms: decimal, hexadecimal, and binary.

Note: There is no way to write literal values of type nint and nuint. Instead, implicit or explicit casts of other integral constant values may be used. end note

Integer_Literal
    : Decimal_Integer_Literal
    | Hexadecimal_Integer_Literal
    | Binary_Integer_Literal
    ;

fragment Decimal_Integer_Literal
    : Decimal_Digit Decorated_Decimal_Digit* Integer_Type_Suffix?
    ;

fragment Decorated_Decimal_Digit
    : '_'* Decimal_Digit
    ;

fragment Decimal_Digit
    : '0'..'9'
    ;

fragment Integer_Type_Suffix
    : 'U' | 'u' | 'L' | 'l' |
      'UL' | 'Ul' | 'uL' | 'ul' | 'LU' | 'Lu' | 'lU' | 'lu'
    ;

fragment Hexadecimal_Integer_Literal
    : ('0x' | '0X') Decorated_Hex_Digit+ Integer_Type_Suffix?
    ;

fragment Decorated_Hex_Digit
    : '_'* Hex_Digit
    ;

fragment Hex_Digit
    : '0'..'9' | 'A'..'F' | 'a'..'f'
    ;

fragment Binary_Integer_Literal
    : ('0b' | '0B') Decorated_Binary_Digit+ Integer_Type_Suffix?
    ;

fragment Decorated_Binary_Digit
    : '_'* Binary_Digit
    ;

fragment Binary_Digit
    : '0' | '1'
    ;

The type of an integer literal is determined as follows:

If the literal has no suffix, it has the first of these types in which its value can be represented: int, uint, long, ulong.
If the literal is suffixed by U or u, it has the first of these types in which its value can be represented: uint, ulong.
If the literal is suffixed by L or l, it has the first of these types in which its value can be represented: long, ulong.
If the literal is suffixed by UL, Ul, uL, ul, LU, Lu, lU, or lu, it is of type ulong.

If the value represented by an integer literal is outside the range of the ulong type, a compile-time error occurs.
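
Note: The type selection rules can be observed informally by letting generic type inference report the compile-time type of each literal. The names IntegerLiteralTypes, StaticType, and Samples are hypothetical.

```csharp
using System;

static class IntegerLiteralTypes
{
    // Reports the compile-time type inferred for the argument expression.
    public static Type StaticType<T>(T value) => typeof(T);

    public static Type[] Samples() => new[]
    {
        StaticType(2147483647),          // no suffix, fits int    -> int
        StaticType(2147483648),          // no suffix, too big     -> uint
        StaticType(4294967296),          // no suffix, too big     -> long
        StaticType(9223372036854775808), // no suffix, too big     -> ulong
        StaticType(1u),                  // suffix u, fits uint    -> uint
        StaticType(4294967296u),         // suffix u, too big      -> ulong
        StaticType(1L),                  // suffix L, fits long    -> long
        StaticType(1UL),                 // suffix UL              -> ulong
        StaticType(0x7FFFFFFF),          // hex follows same rules -> int
        StaticType(0xFFFFFFFF),          // hex follows same rules -> uint
    };
}
```

end note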

Note: As a matter of style, it is suggested that “L” be used instead of “l” when writing literals of type long, since it is easy to confuse the letter “l” with the digit “1”. end note

To permit the smallest possible int and long values to be written as integer literals, the following two rules exist:

When an Integer_Literal representing the value 2147483648 (2³¹) and no Integer_Type_Suffix appears as the token immediately following a unary minus operator token (§12.9.3), the result (of both tokens) is a constant of type int with the value −2147483648 (−2³¹). In all other situations, such an Integer_Literal is of type uint.
When an Integer_Literal representing the value 9223372036854775808 (2⁶³) and no Integer_Type_Suffix or the Integer_Type_Suffix L or l appears as the token immediately following a unary minus operator token (§12.9.3), the result (of both tokens) is a constant of type long with the value −9223372036854775808 (−2⁶³). In all other situations, such an Integer_Literal is of type ulong.
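
Note: The effect of these two rules can be observed informally as follows. The names SmallestValues, StaticType, and Samples are hypothetical. The parenthesized form is included to show that the rule applies only when the literal is the token immediately following the unary minus operator token.

```csharp
using System;

static class SmallestValues
{
    // Reports the compile-time type inferred for the argument expression.
    public static Type StaticType<T>(T value) => typeof(T);

    public static Type[] Samples() => new[]
    {
        // The literal immediately follows the minus token: constant of type int.
        StaticType(-2147483648),
        // "(" follows the minus token instead, so the literal is a uint,
        // and negating a uint operand produces a long.
        StaticType(-(2147483648)),
        // Without a preceding minus the literal is simply a uint.
        StaticType(2147483648),
        // The literal immediately follows the minus token: constant of type long.
        StaticType(-9223372036854775808),
    };
}
```

end note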

Example:

123                  // decimal, int
10_543_765Lu         // decimal, ulong
1_2__3___4____5      // decimal, int
_123                 // not a numeric literal; identifier due to leading _
123_                 // invalid; no trailing _ allowed

0xFf                 // hex, int
0X1b_a0_44_fEL       // hex, long
0x1ade_3FE1_29AaUL   // hex, ulong
0x_abc               // hex, int
_0x123               // not a numeric literal; identifier due to leading _
0xabc_               // invalid; no trailing _ allowed

0b101                // binary, int
0B1001_1010u         // binary, uint
0b1111_1111_0000UL   // binary, ulong
0B__111              // binary, int
__0B111              // not a numeric literal; identifier due to leading _
0B111__              // invalid; no trailing _ allowed

end example

6.4.5.4 Real literals

Real literals are used to write values of types float, double, and decimal.

Real_Literal
    : Decimal_Digit Decorated_Decimal_Digit* '.'
      Decimal_Digit Decorated_Decimal_Digit* Exponent_Part? Real_Type_Suffix?
    | '.' Decimal_Digit Decorated_Decimal_Digit* Exponent_Part? Real_Type_Suffix?
    | Decimal_Digit Decorated_Decimal_Digit* Exponent_Part Real_Type_Suffix?
    | Decimal_Digit Decorated_Decimal_Digit* Real_Type_Suffix
    ;

fragment Exponent_Part
    : ('e' | 'E') Sign? Decimal_Digit Decorated_Decimal_Digit*
    ;

fragment Sign
    : '+' | '-'
    ;

fragment Real_Type_Suffix
    : 'F' | 'f' | 'D' | 'd' | 'M' | 'm'
    ;

If no Real_Type_Suffix is specified, the type of the Real_Literal is double. Otherwise, the Real_Type_Suffix determines the type of the real literal, as follows:

A real literal suffixed by F or f is of type float.

Example: The literals 1f, 1.5f, 1e10f, and 123.456F are all of type float. end example
A real literal suffixed by D or d is of type double.

Example: The literals 1d, 1.5d, 1e10d, and 123.456D are all of type double. end example
A real literal suffixed by M or m is of type decimal.

Example: The literals 1m, 1.5m, 1e10m, and 123.456M are all of type decimal. end example This literal is converted to a decimal value by taking the exact value, and, if necessary, rounding to the nearest representable value using banker’s rounding (§8.3.8). Any scale apparent in the literal is preserved unless the value is rounded. Note: Hence, the literal 2.900m will be parsed to form the decimal with sign 0, coefficient 2900, and scale 3. end note

If the magnitude of the specified literal is too large to be represented in the indicated type, a compile-time error occurs.

Note: In particular, a Real_Literal will never produce a floating-point infinity. A non-zero Real_Literal may, however, be rounded to zero. end note

The value of a real literal of type float or double is determined by using the IEC 60559 “round to nearest” mode with ties broken to “even” (a value with the least-significant-bit zero), and all digits considered significant.

Note: In a real literal, decimal digits are always required after the decimal point. For example, 1.3F is a real literal but 1.F is not. end note

Example:

1.234_567      // double
.3e5f          // float
2_345E-2_0     // double
15D            // double
19.73M         // decimal
1.F            // parsed as a member access of F due to non-digit after .
1_.2F          // invalid; no trailing _ allowed in integer part
1._234         // parsed as a member access of _234 due to non-digit after .
1.234_         // invalid; no trailing _ allowed in fraction
.3e_5F         // invalid; no leading _ allowed in exponent
.3e5_F         // invalid; no trailing _ allowed in exponent

end example
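
Example: The following sketch is illustrative only. It prints the run-time types of a few real literals and uses decimal.GetBits to show that the scale written in 2.900m is preserved; the variable names are hypothetical.

```csharp
using System;
using System.Globalization;

Console.WriteLine((1.5f).GetType());    // System.Single
Console.WriteLine((1e10).GetType());    // System.Double (no suffix)
Console.WriteLine((15D).GetType());     // System.Double
Console.WriteLine((19.73M).GetType());  // System.Decimal

// decimal.GetBits returns four ints: three for the 96-bit coefficient,
// and one holding the sign (bit 31) and the scale (bits 16 to 23).
int[] bits = decimal.GetBits(2.900m);
int coefficient = bits[0];
int scale = (bits[3] >> 16) & 0xFF;
Console.WriteLine($"{coefficient} {scale}");                       // 2900 3
Console.WriteLine(2.900m.ToString(CultureInfo.InvariantCulture));  // 2.900
```

end example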

6.4.5.5 Character literals

A character literal represents a single character as a UTF-16 code unit, and consists of a character or Unicode_Escape_Sequence in quotes, as in 'a', '\u0061', or '\U00000061'.

Character_Literal
    : '\'' Character '\''
    ;

fragment Character
    : Single_Character
    | Simple_Escape_Sequence
    | Hexadecimal_Escape_Sequence
    | Unicode_Escape_Sequence
    ;

fragment Single_Character
    // anything but ', \, and New_Line_Character
    : ~['\\\u000D\u000A\u0085\u2028\u2029]
    ;

fragment Simple_Escape_Sequence
    : '\\\'' | '\\"' | '\\\\' | '\\0' | '\\a' | '\\b' |
      '\\f' | '\\n' | '\\r' | '\\t' | '\\v'
    ;

fragment Hexadecimal_Escape_Sequence
    : '\\x' Hex_Digit Hex_Digit? Hex_Digit? Hex_Digit?
    ;

Note: A character that follows a backslash character (\) in a Character must be one of the following characters: ', ", \, 0, a, b, f, n, r, t, u, U, x, v. Otherwise, a compile-time error occurs. end note

Note: The use of the \x Hexadecimal_Escape_Sequence production can be error-prone and hard to read due to the variable number of hexadecimal digits following the \x. For example, in the code:
string good = "\x9Good text";
string bad = "\x9Bad text";
it might appear at first that the leading character is the same (U+0009, a tab character) in both strings. In fact the second string starts with U+9BAD as all three letters in the word “Bad” are valid hexadecimal digits. As a matter of style, it is recommended that \x is avoided in favour of either specific escape sequences (\t in this example) or the fixed-length \u escape sequence.

end note

A hexadecimal escape sequence represents a UTF-16 code unit, with the value formed by the hexadecimal number following “\x”.

If the value represented by a character literal is greater than U+FFFF, a compile-time error occurs.

A Unicode escape sequence (§6.4.2) in a character literal shall be in the range U+0000 to U+FFFF.

A simple escape sequence represents a Unicode character, as described in the table below.

Escape sequence	Character name	Unicode code point
`\'`	Single quote	U+0027
`\"`	Double quote	U+0022
`\\`	Backslash	U+005C
`\0`	Null	U+0000
`\a`	Alert	U+0007
`\b`	Backspace	U+0008
`\f`	Form feed	U+000C
`\n`	New line	U+000A
`\r`	Carriage return	U+000D
`\t`	Horizontal tab	U+0009
`\v`	Vertical tab	U+000B

The type of a Character_Literal is char.
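
Example: The following sketch is illustrative only. It compares equivalent spellings of the same character and reproduces the \x pitfall described in the note above; the variable names are hypothetical.

```csharp
using System;

char plain = 'a';
char unicode4 = '\u0061';
char unicode8 = '\U00000061';
char hex = '\x61';
Console.WriteLine(plain == unicode4 && plain == unicode8 && plain == hex);  // True

Console.WriteLine((int)'\t');  // 9
Console.WriteLine((int)'\0');  // 0

// \x consumes up to four hexadecimal digits, so "Bad" is absorbed into the escape.
string good = "\x9Good text";
string bad = "\x9Bad text";
Console.WriteLine((int)good[0]);                  // 9
Console.WriteLine(((int)bad[0]).ToString("X4"));  // 9BAD
Console.WriteLine($"{good.Length} {bad.Length}"); // 10 6
```

end example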

6.4.5.6 String literals

C# supports a number of forms of string literals: regular string literals, verbatim string literals, and raw string literals. A regular string literal consists of zero or more characters enclosed in double quotes, as in "hello", and can include both simple escape sequences (such as \t for the tab character), and hexadecimal and Unicode escape sequences.

A verbatim string literal consists of an @ character followed by a double-quote character, zero or more characters, and a closing double-quote character.

Example: A simple example is @"hello". end example

In a verbatim string literal, the characters between the delimiters are interpreted verbatim, with the only exception being a Quote_Escape_Sequence, which represents one double-quote character. In particular, simple escape sequences, and hexadecimal and Unicode escape sequences are not processed in verbatim string literals. A verbatim string literal may span multiple lines.

A raw string literal consists of arbitrary text and newlines between multi-"-sequence delimiters (which better supports the readability of XML, JSON, and other forms of text that have some visually pleasing structure). A raw string literal may span multiple lines.

All string literal forms may optionally have a trailing Utf8_Suffix. The representation of each form is discussed below.

String_Literal
    : Regular_String_Literal
    | Verbatim_String_Literal
    | Raw_String_Literal
    ;

fragment Regular_String_Literal
    : '"' Regular_String_Literal_Character* '"' Utf8_Suffix?
    ;

fragment Regular_String_Literal_Character
    : Single_Regular_String_Literal_Character
    | Simple_Escape_Sequence
    | Hexadecimal_Escape_Sequence
    | Unicode_Escape_Sequence
    ;

fragment Single_Regular_String_Literal_Character
    // anything but ", \, and New_Line_Character
    : ~["\\\u000D\u000A\u0085\u2028\u2029]
    ;

fragment Verbatim_String_Literal
    : '@"' Verbatim_String_Literal_Character* '"' Utf8_Suffix?
    ;

fragment Verbatim_String_Literal_Character
    : Single_Verbatim_String_Literal_Character
    | Quote_Escape_Sequence
    ;

fragment Single_Verbatim_String_Literal_Character
    : ~["]     // anything but quotation mark (U+0022)
    ;

fragment Quote_Escape_Sequence
    : '""'
    ;

fragment Raw_String_Literal
    : Single_Line_Raw_String_Literal
    | Multi_Line_Raw_String_Literal
    ;

fragment Single_Line_Raw_String_Literal
    : Raw_String_Literal_Delimiter Raw_String_Literal_Content
      Raw_String_Literal_Delimiter
    ;

fragment Raw_String_Literal_Delimiter
    : '"""'  '"'*
    ;

fragment Raw_String_Literal_Content
    // anything except New_Line
    : ~( '\u000D\u000A' | '\u000D' | '\u000A' | '\u0085' | '\u2028' | '\u2029')
    ;

fragment Multi_Line_Raw_String_Literal
    : Raw_String_Literal_Delimiter Whitespace* New_Line
      (Raw_String_Literal_Content | New_Line)* New_Line
      Whitespace* Raw_String_Literal_Delimiter
    ;

fragment Utf8_Suffix
    : 'u8' | 'U8'
    ;

For brevity, a Raw_String_Literal_Delimiter is referred to as a “delimiter,” the start Raw_String_Literal_Delimiter is referred to as the “start delimiter,” and the end Raw_String_Literal_Delimiter is referred to as the “end delimiter.”

For any Raw_String_Literal:

A delimiter shall be the longest set of contiguous " characters found at the start or end. The number of " characters in a delimiter is called the raw string literal delimiter length.

Example: The string """ """ is well-formed; it has 3-character start and end delimiters, and its content is a single space. However, the string """""" is ill-formed, as it is seen as a 6-character start delimiter, with no content, and no end delimiter, not as 3-character start and end delimiters and empty content. end example
The beginning and end delimiters shall have the same raw string literal delimiter length.

Example: The string """"X"""" is well-formed; it has 4-character start and end delimiters. However, the strings """X"""" and """"X""" are ill-formed, as the start and end delimiters in each pair do not have the same length. end example
A Raw_String_Literal_Content shall not contain a set of contiguous " characters whose length is equal to or greater than the raw string literal delimiter length.

Example: The strings """" """ """" and """"""" """""" """"" """" """ """"""" are well-formed. However, the strings """ """ """ and """ """" """ are ill-formed. end example
As text sequences that have the form of Comments are not processed within string literals (§6.3.3), they appear verbatim in their corresponding Raw_String_Literal_Content.

For a Single_Line_Raw_String_Literal only:

A Single_Line_Raw_String_Literal cannot be empty; it must contain at least one character.
A Raw_String_Literal_Content cannot begin with ", as such a character is considered to belong to the preceding start delimiter. Similarly, a Raw_String_Literal_Content cannot end with ", as such a character is considered to belong to the following end delimiter.
The value of the literal is Raw_String_Literal_Content, which can contain leading, embedded, and trailing horizontal whitespace (as in """x x x""" and """ xxx """, the latter having a leading space and trailing tabs).
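
Example: The following sketch is illustrative only. It shows single-line raw string literals with 3-character and 4-character delimiters, and assumes a compiler that supports raw string literals; the variable names are hypothetical.

```csharp
using System;

string space = """ """;                         // content is a single space
string four = """"X"""";                        // 4-character delimiters, content X
string quoted = """He said "hi" then left."""; // lone " characters need no escaping
string triple = """" """ """";                  // longer delimiters allow a run of three "

Console.WriteLine(space.Length);                          // 1
Console.WriteLine(four == "X");                           // True
Console.WriteLine(quoted == "He said \"hi\" then left."); // True
Console.WriteLine(triple == " \"\"\" ");                  // True
```

end example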

For a Multi_Line_Raw_String_Literal only:

If Whitespace precedes the end delimiter on the same line, the same number and kind of whitespace characters (e.g., spaces vs. tabs) shall appear at the beginning of each Raw_String_Literal_Content, and that leading whitespace shall be discarded from those Raw_String_Literal_Contents.
A Raw_String_Literal_Content shall not appear on the same line as a start or end delimiter.
A Multi_Line_Raw_String_Literal can be empty (by having no Raw_String_Literal_Contents and one or more New_Lines).
A Raw_String_Literal_Content can begin or end with ".
The value of the literal is the lexical concatenation of all of its Raw_String_Literal_Contents and New_Lines after any whitespace at the beginning of each Raw_String_Literal_Content has been discarded based on the whitespace preceding the end delimiter. Whitespace following the start delimiter and whitespace preceding the end delimiter is not included.

Example: The example
string a = "Happy birthday, Joel"; // Happy birthday, Joel
string b = @"Happy birthday, Joel"; // Happy birthday, Joel
string c = "hello \t world"; // hello world
string d = @"hello \t world"; // hello \t world
string e = "Joe said \"Hello\" to me"; // Joe said "Hello" to me
string f = @"Joe said ""Hello"" to me"; // Joe said "Hello" to me
string g = "\\\\server\\share\\file.txt"; // \\server\share\file.txt
string h = @"\\server\share\file.txt"; // \\server\share\file.txt
string i = "one\r\ntwo\r\nthree";
string j = @"one
two
three";
shows a variety of string literals. The last string literal, j, is a verbatim string literal that spans multiple lines. The characters between the quotation marks, including white space such as new line characters, are preserved verbatim, and each pair of double-quote characters is replaced by one such character.

end example
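
Example: The following sketch is illustrative only. It checks that corresponding regular and verbatim literals produce equal strings, and that escape sequences are processed only in the regular form; the variable names are hypothetical.

```csharp
using System;

string regularPath = "\\\\server\\share\\file.txt";
string verbatimPath = @"\\server\share\file.txt";
Console.WriteLine(regularPath == verbatimPath);    // True

string regularQuote = "Joe said \"Hello\" to me";
string verbatimQuote = @"Joe said ""Hello"" to me";
Console.WriteLine(regularQuote == verbatimQuote);  // True

// \t is one tab character in a regular literal, but two characters in a verbatim one.
Console.WriteLine("a\tb".Length);   // 3
Console.WriteLine(@"a\tb".Length);  // 4
```

end example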

Example: Consider the following multi-line string literals:
var xml1 = """
        <element attr="content">
          <body>
          </body>
        </element>
        """;
Console.WriteLine(xml1);

var xml2 = """
        <element attr="content">
          <body>
          </body>
        </element>
    """;
Console.WriteLine(xml2);

var xml3 = """
        <element attr="content">
          <body>
          </body>
        </element>
""";
Console.WriteLine(xml3);
which produces the output
<element attr="content">
  <body>
  </body>
</element>
    <element attr="content">
      <body>
      </body>
    </element>
        <element attr="content">
          <body>
          </body>
        </element>
In the case of xml1, the end delimiter has 8 leading spaces, so that is the amount of leading whitespace removed from each content line. With xml2, 4 leading spaces are removed, and with xml3, no leading spaces are removed. end example
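
Example: The following sketch is illustrative only. It compares a multi-line raw string literal with the equivalent regular string literal. The line breaks inside a raw string literal come from the source file, so the sketch normalizes them with string.ReplaceLineEndings (assumed to be available, as in .NET 6 and later); the variable names are hypothetical.

```csharp
using System;

// The end delimiter is preceded by 4 spaces, so 4 spaces are removed from each content line.
string json = """
    {
      "name": "demo"
    }
    """;

// Neither the line break after the start delimiter nor the one before
// the end delimiter is part of the value.
string expected = "{\n  \"name\": \"demo\"\n}";
Console.WriteLine(json.ReplaceLineEndings("\n") == expected);  // True
```

end example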

Note: Any line breaks within verbatim string literals are part of the resulting string. If the exact characters used to form line breaks are semantically relevant to an application, any tools that translate line breaks in source code to different formats (between “\n” and “\r\n”, for example) will change application behavior. Developers should be careful in such situations. end note

Note: Since a hexadecimal escape sequence can have a variable number of hex digits, the string literal "\x123" contains a single character with hex value 123. To create a string containing the character with hex value 12 followed by the character 3, one could write "\x00123" or "\x12" + "3" instead. end note

A String_Literal that does not contain a Utf8_Suffix is a UTF-16 string literal, whose type is string.

A String_Literal that contains a Utf8_Suffix is a UTF-8 string literal, whose type is System.ReadOnlySpan<byte> (an indexable collection type), and whose value contains a UTF-8 byte representation of the string. A null terminator (a byte with value zero) is placed beyond the last byte in memory (and outside the length of the ReadOnlySpan<byte>) in order to support scenarios that expect null-terminated byte strings. A UTF-8 string literal is not a constant. A UTF-8 string literal without its Utf8_Suffix shall be valid UTF-16. (For example, "\uDC00\uDD00"u8 is ill-formed as one low surrogate cannot be followed by another.)

Note: While every UTF-8 string literal is a ReadOnlySpan<byte>, not every ReadOnlySpan<byte> represents a UTF-8 string literal. See the description of UTF-8 string concatenation in §12.13.5. end note

Note: As ReadOnlySpan<byte> is a ref struct type, a UTF-8 string literal cannot be converted to object or used as a type parameter (§16.2.3). end note
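
Example: The following sketch is illustrative only. It inspects the bytes of two UTF-8 string literals and decodes one of them with System.Text.Encoding.UTF8; it assumes a compiler and runtime that support the u8 suffix, and the variable names are hypothetical.

```csharp
using System;
using System.Text;

ReadOnlySpan<byte> hello = "Hello"u8;
Console.WriteLine(hello.Length);  // 5
Console.WriteLine(hello[0]);      // 72, the UTF-8 encoding of 'H'

// U+00E9 is one UTF-16 code unit but two UTF-8 bytes (0xC3 0xA9).
ReadOnlySpan<byte> accented = "\u00E9"u8;
Console.WriteLine(accented.Length);  // 2

// The null terminator lies outside the span, so it is not part of the decoded text.
string roundTrip = Encoding.UTF8.GetString(hello);
Console.WriteLine(roundTrip == "Hello");  // True
```

end example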

Example: Here are examples of each form of string literal:

Encoding	Type	Regular String Literal	Verbatim String Literal	Raw String Literal
UTF-16	`string`	`"Hello"`	`@"Hello"`	`"""Hello"""`
UTF-8	`ReadOnlySpan<byte>`	`"Hello"u8`	`@"Hello"u8`	`"""Hello"""u8`

end example

Each string literal does not necessarily result in a new string instance. When two or more string literals that are equivalent according to the string equality operator (§12.15.8) appear in the same assembly, these string literals refer to the same string instance.

Example: For instance, the output produced by

class Test
{
    static void Main()
    {
        object a = "hello";
        object b = "hello";
        object c = @"hello";
        object d = """hello""";
        object e = """
          hello
          """;

        System.Console.WriteLine(a == b);
        System.Console.WriteLine(a == c);
        System.Console.WriteLine(a == d);
        System.Console.WriteLine(a == e);
    }
}

is all True because the five literals refer to the same string instance.

end example
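
Example: The following sketch is illustrative only. It contrasts equivalent literals, which share one instance, with a string built at run time, which is equal in value but is a separate instance; the variable names are hypothetical.

```csharp
using System;

string literal = "hello";
string verbatim = @"hello";
// A string constructed at run time is a new instance, not one of the literals above.
string built = new string(new[] { 'h', 'e', 'l', 'l', 'o' });

Console.WriteLine(object.ReferenceEquals(literal, verbatim));  // True
Console.WriteLine(literal == built);                           // True (value equality)
Console.WriteLine(object.ReferenceEquals(literal, built));     // False
```

end example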

6.4.5.7 The null literal

null_literal
    : NULL
    ;

Note: null_literal is a parser rule as it does not introduce a new token kind. end note

A null_literal represents a null value. It does not have a type, but can be converted to any reference type or nullable value type through a null literal conversion (§10.2.7).
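
Example: The following sketch is illustrative only. It converts the null literal to a reference type and to a nullable value type, and lists in comments two declarations that are expected to be rejected; the variable names are hypothetical.

```csharp
#nullable enable
using System;

string? text = null;   // converted to a reference type
object? boxed = null;  // converted to a reference type
int? number = null;    // converted to a nullable value type

// var untyped = null;  // error: the null literal has no type to infer
// int plain = null;    // error: int is a non-nullable value type

Console.WriteLine(text == null && boxed == null);  // True
Console.WriteLine(number.HasValue);                // False
```

end example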

6.4.6 Operators and punctuators

There are several kinds of operators and punctuators. Operators are used in expressions to describe operations involving one or more operands.

Example: The expression a + b uses the + operator to add the two operands a and b. end example

Punctuators are for grouping and separating.

operator_or_punctuator
    : '{'  | '}'  | '['  | ']'  | '('   | ')'  | '.'   | ','  | ':'  | ';'
    | '+'  | '-'  | '*'  | '/'  | '%'   | '&'  | '|'   | '^'  | '!'  | '~'
    | '='  | '<'  | '>'  | '?'  | '??'  | '::' | '++'  | '--' | '&&' | '||'
    | '->' | '==' | '!=' | '<=' | '>='  | '+=' | '-='  | '*=' | '/=' | '%='
    | '&=' | '|=' | '^=' | '<<' | '<<=' | '=>' | '??=' | '..'
    ;

right_shift
    : '>'  '>'
    ;

right_shift_assignment
    : '>' '>='
    ;

unsigned_right_shift
    : '>'  '>'  '>'
    ;

unsigned_right_shift_assignment
    : '>'  '>'  '>='
    ;

Note: right_shift and right_shift_assignment are parser rules as they do not introduce a new token kind but represent a sequence of two tokens. Similarly, unsigned_right_shift and unsigned_right_shift_assignment are parser rules as they do not introduce a new token kind but represent a sequence of three tokens. The operator_or_punctuator rule exists for descriptive purposes only and is not used elsewhere in the grammar. end note

right_shift is made up of the two tokens > and >, and unsigned_right_shift is made up of the three tokens >, >, and >. Similarly, right_shift_assignment is made up of the two tokens > and >=, and unsigned_right_shift_assignment is made up of the three tokens >, >, and >=. Unlike other productions in the syntactic grammar, no characters of any kind (not even whitespace) are allowed between the tokens in each of these productions. These productions are treated specially in order to enable the correct handling of type_parameter_lists (§15.2.3).

Note: Prior to the addition of generics to C#, >> and >>= were both single tokens. However, the syntax for generics uses the < and > characters to delimit type parameters and type arguments. It is often desirable to use nested constructed types, such as List<Dictionary<string, int>>. Rather than requiring the programmer to separate the > and > by a space, the definition of the two operator_or_punctuators was changed. end note
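
Example: The following sketch is illustrative only. It uses adjacent > tokens both to close a nested constructed type and as shift operators; it assumes a compiler that supports the >>> operator, and the variable names are hypothetical.

```csharp
using System;
using System.Collections.Generic;

// The two > tokens close the nested type argument lists; no space is needed.
var index = new List<Dictionary<string, int>>();
index.Add(new Dictionary<string, int> { ["a"] = 1 });

int value = -8;
Console.WriteLine(value >> 1);   // -4 (arithmetic shift keeps the sign)
Console.WriteLine(value >>> 1);  // 2147483644 (unsigned shift fills with zero)
value >>= 2;
Console.WriteLine(value);        // -2

// Writing "value > > 1" is not a shift: the tokens must be adjacent.
```

end example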

6.5 Pre-processing directives

6.5.1 General

The pre-processing directives provide the ability to conditionally skip sections of compilation units, to report error and warning conditions, to delineate distinct regions of source code, and to set the nullable context.

Note: The term “pre-processing directives” is used only for consistency with the C and C++ programming languages. In C#, there is no separate pre-processing step; pre-processing directives are processed as part of the lexical analysis phase. end note

PP_Directive
    : PP_Start PP_Kind PP_New_Line
    ;

fragment PP_Kind
    : PP_Declaration
    | PP_Conditional
    | PP_Line
    | PP_Diagnostic
    | PP_Region
    | PP_Pragma
    | PP_Nullable
    ;

// Only recognised at the beginning of a line
fragment PP_Start
    // See note below.
    : { getCharPositionInLine() == 0 }? PP_Whitespace? '#' PP_Whitespace?
    ;

fragment PP_Whitespace
    : ( [\p{Zs}]  // any character with Unicode class Zs
      | '\u0009'  // horizontal tab
      | '\u000B'  // vertical tab
      | '\u000C'  // form feed
      )+
    ;

fragment PP_New_Line
    : PP_Whitespace? Single_Line_Comment? New_Line
    ;

Note:

The pre-processor grammar defines a single lexical token PP_Directive used for all pre-processing directives. The semantics of each of the pre-processing directives are defined in this language specification but not how to implement them.

The PP_Start fragment must only be recognised at the start of a line. The getCharPositionInLine() == 0 ANTLR lexical predicate above suggests one way in which this may be achieved and is informative only; an implementation may use a different strategy.

end note

The following pre-processing directives are available:

#define and #undef, which are used to define and undefine, respectively, conditional compilation symbols (§6.5.4).
#if, #elif, #else, and #endif, which are used to conditionally skip sections of source code (§6.5.5).
#line, which is used to control line numbers and source file mapping emitted for errors and warnings (§6.5.8).
#error and #warning, which are used to issue errors and warnings, respectively (§6.5.6).
#region and #endregion, which are used to explicitly mark sections of source code (§6.5.7).
#nullable, which is used to specify the nullable context (§6.5.9).
#pragma, which is used to specify optional contextual information to a compiler (§6.5.10).

A pre-processing directive always occupies a separate line of source code and always begins with a # character and a pre-processing directive name. White space may occur before the # character and between the # character and the directive name.

A source line containing a #define, #undef, #if, #elif, #else, #endif, #line, #endregion, or #nullable directive can end with a single-line comment. Delimited comments (the /* */ style of comments) are not permitted on source lines containing pre-processing directives.

Pre-processing directives are not part of the syntactic grammar of C#. However, pre-processing directives can be used to include or exclude sequences of tokens and can in that way affect the meaning of a C# program.

Example: When compiled, the program
#define A
#undef B
class C
{
#if A
    void F() {}
#else
    void G() {}
#endif
#if B
    void H() {}
#else
    void I() {}
#endif
}
results in the exact same sequence of tokens as the program
class C
{
    void F() {}
    void I() {}
}
Thus, whereas lexically, the two programs are quite different, syntactically, they are identical.

end example

6.5.2 Conditional compilation symbols

The conditional compilation functionality provided by the #if, #elif, #else, and #endif directives is controlled through pre-processing expressions (§6.5.3) and conditional compilation symbols.

fragment PP_Conditional_Symbol
    // Must not be equal to tokens TRUE or FALSE. See note below.
    : Basic_Identifier
    ;

Note: How an implementation enforces the restriction on the allowable Basic_Identifier values is an implementation issue. end note

Two conditional compilation symbols are considered the same if they are identical after the following transformations are applied, in order:

Each Unicode_Escape_Sequence is transformed into its corresponding Unicode character.
Any Formatting_Characters are removed.

A conditional compilation symbol has two possible states: defined or undefined. At the beginning of the lexical processing of a compilation unit, a conditional compilation symbol is undefined unless it has been explicitly defined by an external mechanism (such as a command-line compiler option). When a #define directive is processed, the conditional compilation symbol named in that directive becomes defined in that compilation unit. The symbol remains defined until a #undef directive for that same symbol is processed, or until the end of the compilation unit is reached. An implication of this is that #define and #undef directives in one compilation unit have no effect on other compilation units in the same program.

When referenced in a pre-processing expression (§6.5.3), a defined conditional compilation symbol has the Boolean value true, and an undefined conditional compilation symbol has the Boolean value false. There is no requirement that conditional compilation symbols be explicitly declared before they are referenced in pre-processing expressions. Instead, undeclared symbols are simply undefined and thus have the value false.

The namespace for conditional compilation symbols is distinct and separate from all other named entities in a C# program. Conditional compilation symbols can only be referenced in #define and #undef directives and in pre-processing expressions.

6.5.3 Pre-processing expressions

Pre-processing expressions can occur in #if and #elif directives. The operators ! (prefix logical negation only), ==, !=, &&, and || are permitted in pre-processing expressions, and parentheses may be used for grouping.

fragment PP_Expression
    : PP_Whitespace? PP_Or_Expression PP_Whitespace?
    ;

fragment PP_Or_Expression
    : PP_And_Expression (PP_Whitespace? '||' PP_Whitespace? PP_And_Expression)*
    ;

fragment PP_And_Expression
    : PP_Equality_Expression (PP_Whitespace? '&&' PP_Whitespace?
      PP_Equality_Expression)*
    ;

fragment PP_Equality_Expression
    : PP_Unary_Expression (PP_Whitespace? ('==' | '!=') PP_Whitespace?
      PP_Unary_Expression)*
    ;

fragment PP_Unary_Expression
    : PP_Primary_Expression
    | '!' PP_Whitespace? PP_Unary_Expression
    ;

fragment PP_Primary_Expression
    : TRUE
    | FALSE
    | PP_Conditional_Symbol
    | '(' PP_Whitespace? PP_Expression PP_Whitespace? ')'
    ;

When referenced in a pre-processing expression, a defined conditional compilation symbol has the Boolean value true, and an undefined conditional compilation symbol has the Boolean value false.

Evaluation of a pre-processing expression always yields a Boolean value. The rules of evaluation for a pre-processing expression are the same as those for a constant expression (§12.26), except that the only user-defined entities that can be referenced are conditional compilation symbols.
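
Example: The following sketch is illustrative only. It evaluates pre-processing expressions over two hypothetical symbols, FEATURE_X and FEATURE_Y, and assumes that neither is defined in the source or by a compiler option, so both have the value false.

```csharp
using System;

#if FEATURE_X
string first = "FEATURE_X is defined";
#else
string first = "FEATURE_X is undefined";
#endif

// false == false yields true, and !(false || false) yields true.
#if (FEATURE_X == FEATURE_Y) && !(FEATURE_X || false)
string second = "selected";
#else
string second = "skipped";
#endif

Console.WriteLine(first);   // FEATURE_X is undefined
Console.WriteLine(second);  // selected
```

end example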

6.5.4 Definition directives

The definition directives are used to define or undefine conditional compilation symbols.

fragment PP_Declaration
    : 'define' PP_Whitespace PP_Conditional_Symbol
    | 'undef' PP_Whitespace PP_Conditional_Symbol
    ;

The processing of a #define directive causes the given conditional compilation symbol to become defined, starting with the source line that follows the directive. Likewise, the processing of a #undef directive causes the given conditional compilation symbol to become undefined, starting with the source line that follows the directive.

Any #define and #undef directives in a compilation unit shall occur before the first token (§6.4) in the compilation unit; otherwise a compile-time error occurs. In intuitive terms, #define and #undef directives shall precede any “real code” in the compilation unit.

Example: The example:
#define Enterprise
#if Professional || Enterprise
#define Advanced
#endif
namespace Megacorp.Data
{
#if Advanced
    class PivotTable {...}
#endif
}
is valid because the #define directives precede the first token (the namespace keyword) in the compilation unit.

end example

Example: The following example results in a compile-time error because a #define follows real code:
#define A
namespace N
{
#define B
#if B
    class Class1 {}
#endif
}
end example

A #define may define a conditional compilation symbol that is already defined, without there being any intervening #undef for that symbol.

Example: The example below defines a conditional compilation symbol A and then defines it again.
#define A
#define A
For compilers that allow conditional compilation symbols to be defined as compilation options, an alternative way for such redefinition to occur is to define the symbol as a compiler option as well as in the source.

end example

A #undef may “undefine” a conditional compilation symbol that is not defined.

Example: The example below defines a conditional compilation symbol A and then undefines it twice; although the second #undef has no effect, it is still valid.
#define A
#undef A
#undef A
end example
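
Example: The following sketch is illustrative only. It is a complete compilation unit in which all definition directives precede the first token (the using keyword), and it combines redefinition, undefining an undefined symbol, and a conditional #define. The symbol names follow the earlier example; the variable name is hypothetical.

```csharp
#define Enterprise    // defines Enterprise
#define Enterprise    // redefining an already defined symbol is allowed
#undef Professional   // undefining an undefined symbol is allowed
#if Professional || Enterprise
#define Advanced      // still before the first token, so this is permitted
#endif

using System;

#if Advanced
string edition = "advanced";
#else
string edition = "basic";
#endif

Console.WriteLine(edition);  // advanced
```

end example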

6.5.5 Conditional compilation directives

The conditional compilation directives are used to conditionally include or exclude portions of a compilation unit.

fragment PP_Conditional
    : PP_If_Section
    | PP_Elif_Section
    | PP_Else_Section
    | PP_Endif
    ;

fragment PP_If_Section
    : 'if' PP_Whitespace PP_Expression
    ;

fragment PP_Elif_Section
    : 'elif' PP_Whitespace PP_Expression
    ;

fragment PP_Else_Section
    : 'else'
    ;

fragment PP_Endif
    : 'endif'
    ;

Conditional compilation directives shall be written in groups consisting of, in order, a #if directive, zero or more #elif directives, zero or one #else directive, and a #endif directive. Between the directives are conditional sections of source code. Each section is controlled by the immediately preceding directive. A conditional section may itself contain nested conditional compilation directives provided these directives form complete groups.

At most one of the contained conditional sections is selected for normal lexical processing:

The PP_Expressions of the #if and #elif directives are evaluated in order until one yields true. If an expression yields true, the conditional section following the corresponding directive is selected.
If all PP_Expressions yield false, and if a #else directive is present, the conditional section following the #else directive is selected.
Otherwise, no conditional section is selected.

The selected conditional section, if any, is processed as a normal input_section: the source code contained in the section shall adhere to the lexical grammar; tokens are generated from the source code in the section; and pre-processing directives in the section have the prescribed effects.

Any remaining conditional sections are skipped and no tokens, except those for pre-processing directives, are generated from the source code. Therefore skipped source code, except pre-processing directives, may be lexically incorrect. Skipped pre-processing directives shall be lexically correct but are not otherwise processed. Within a conditional section that is being skipped any nested conditional sections (contained in nested #if...#endif constructs) are also skipped.

Note: The above grammar does not capture the allowance that the conditional sections between the pre-processing directives may be malformed lexically. Therefore the grammar is not ANTLR-ready as it only supports lexically correct input. end note

Example: The following example illustrates how conditional compilation directives can nest:
#define Debug // Debugging on
#undef Trace // Tracing off
class PurchaseTransaction
{
    void Commit()
    {
#if Debug
        CheckConsistency();
    #if Trace
        WriteToLog(this.ToString());
    #endif
#endif
        CommitHelper();
    }
    ...
}
Except for pre-processing directives, skipped source code is not subject to lexical analysis. For example, the following is valid despite the unterminated comment in the #else section:
#define Debug // Debugging on
class PurchaseTransaction
{
    void Commit()
    {
#if Debug
        CheckConsistency();
#else
        /* Do something else
#endif
    }
    ...
}
Note, however, that pre-processing directives are required to be lexically correct even in skipped sections of source code.

Pre-processing directives are not processed when they appear inside multi-line input elements. For example, the program:
class Hello
{
    static void Main()
    {
        System.Console.WriteLine(@"hello,
#if Debug
        world
#else
        Nebraska
#endif
        ");
    }
}
results in the output:
hello,
#if Debug
        world
#else
        Nebraska
#endif
In peculiar cases, the set of pre-processing directives that is processed might depend on the evaluation of the PP_Expression. The example:
#if X
    /*
#else
    /* */ class Q { }
#endif
always produces the same token stream (class Q { }), regardless of whether or not X is defined. If X is defined, the only processed directives are #if and #endif, due to the multi-line comment. If X is undefined, then three directives (#if, #else, #endif) are part of the directive set.

end example
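
Example: The following sketch is illustrative only. It shows a complete #if, #elif, #else, #endif group with a nested group, and a skipped section containing text that would not be lexically valid if it were processed. The symbols Production, Staging, and Verbose and the variable name are hypothetical.

```csharp
#define Staging
using System;

#if Production
string target = "production";
#elif Staging
string target = "staging";
  #if Verbose
target += " (verbose)";
  #endif
#else
string target = "development";
' " this line is skipped, so it need not be lexically valid
#endif

Console.WriteLine(target);  // staging
```

end example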

6.5.6 Diagnostic directives

The diagnostic directives are used to explicitly generate error and warning messages that are reported in the same way as other compile-time errors and warnings.

fragment PP_Diagnostic
    : 'error' PP_Message?
    | 'warning' PP_Message?
    ;

fragment PP_Message
    : PP_Whitespace Input_Character*
    ;

Example: The example
#if Debug && Retail
    #error A build can't be both debug and retail
#endif
class Test {...}
produces a compile-time error (“A build can’t be both debug and retail”) if the conditional compilation symbols Debug and Retail are both defined. Note that a PP_Message can contain arbitrary text; specifically, it need not contain well-formed tokens, as shown by the single quote in the word can't.

end example
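As a further illustration, the following sketch combines #error and #warning in one conditional chain. The symbols LegacyCodec and FastCodec and the class Codec are hypothetical names chosen for this sketch; they do not appear elsewhere in this specification.

```csharp
static class Codec
{
    public static string Describe()
    {
        // LegacyCodec and FastCodec are hypothetical conditional compilation symbols.
#if LegacyCodec && FastCodec
    #error LegacyCodec and FastCodec can't be combined
#elif LegacyCodec
    #warning LegacyCodec is defined; it's slated for removal
        return "legacy";
#else
        // No diagnostic directive lies in this section.
        return "default";
#endif
    }
}
```

When neither symbol is defined, both diagnostic directives lie in skipped sections, so no message is reported and Describe returns "default". The directives are nevertheless required to be lexically correct, which they are, since a PP_Message accepts arbitrary text.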

6.5.7 Region directives

The region directives are used to explicitly mark regions of source code.

fragment PP_Region
    : PP_Start_Region
    | PP_End_Region
    ;

fragment PP_Start_Region
    : 'region' PP_Message?
    ;

fragment PP_End_Region
    : 'endregion' PP_Message?
    ;

No semantic meaning is attached to a region; regions are intended for use by the programmer or by automated tools to mark a section of source code. There shall be one #endregion directive matching every #region directive. The message specified in a #region or #endregion directive likewise has no semantic meaning; it merely serves to identify the region. Matching #region and #endregion directives may have different PP_Messages.

The lexical processing of a region:

#region
...
#endregion

corresponds exactly to the lexical processing of a conditional compilation directive of the form:

#if true
...
#endif

Note: This means that a region can include one or more #if/.../#endif, or be contained within a conditional section of a #if/.../#endif; but a region cannot overlap with just part of a #if/.../#endif, or start and end in different conditional sections. end note
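The nesting rules for regions can be sketched as follows. The class Shapes and the symbol Metric are hypothetical names used only for illustration.

```csharp
static class Shapes
{
    // A region may enclose a complete #if/.../#endif.
    // The messages on #region and #endregion need not match.
    #region Area helpers
    public static double SquareArea(double side) => side * side;

#if Metric
    public static string Unit() => "m^2";
#else
    public static string Unit() => "ft^2";
#endif
    #endregion Helpers for areas

    // A region may also lie entirely within one conditional section.
#if Metric
    #region Metric-only members
    public static double ToSquareFeet(double squareMetres) => squareMetres * 10.7639;
    #endregion
#endif

    // Not valid: this region would start inside a conditional section
    // and end outside it.
    //
    // #if Metric
    //     #region Broken
    // #endif
    //     #endregion
}
```

With Metric undefined, the second region is skipped along with its enclosing conditional section, and Unit returns "ft^2".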

6.5.8 Line directives

Line directives may be used to alter the line numbers and compilation unit names that are reported by a compiler in output such as warnings and errors. These values are also used by caller-info attributes (§23.5.6).

Note: Line directives are most commonly used in meta-programming tools that generate C# source code from some other text input. The information these directives provide might be used for debugging purposes and error handling, by allowing mapping of errors back to the original source file, rather than to the generated intermediate file. end note

fragment PP_Line
    : 'line' PP_Whitespace PP_Line_Indicator
    ;

fragment PP_Line_Indicator
    : Decimal_Digit+ PP_Whitespace PP_Compilation_Unit_Name
    | Decimal_Digit+
    | DEFAULT
    | 'hidden'
    | PP_Start_Line_Character PP_Whitespace? '-' PP_Whitespace? PP_End_Line_Character 
      PP_Whitespace (PP_Character_Offset PP_Whitespace)? PP_Compilation_Unit_Name
    ;

fragment PP_Compilation_Unit_Name
    : '"' PP_Compilation_Unit_Name_Character* '"'
    ;

fragment PP_Compilation_Unit_Name_Character
    // Any Input_Character except "
    : ~('\u000D' | '\u000A'   | '\u0085' | '\u2028' | '\u2029' | '"')
    ;

fragment PP_Start_Line_Character
    : '(' PP_Whitespace? PP_Start_Line PP_Whitespace? ',' PP_Whitespace?
      PP_Start_Character PP_Whitespace? ')'
    ;
fragment PP_End_Line_Character
    : '(' PP_Whitespace? PP_End_Line PP_Whitespace? ',' PP_Whitespace?
      PP_End_Character PP_Whitespace? ')'
    ;

fragment PP_Start_Line
    : Decimal_Digit+
    ;

fragment PP_End_Line
    : Decimal_Digit+
    ;

fragment PP_Start_Character
    : Decimal_Digit+
    ;

fragment PP_End_Character
    : Decimal_Digit+
    ;

fragment PP_Character_Offset
    : Decimal_Digit+
    ;

When no #line directives are present, a compiler reports true line numbers and compilation unit names in its output. When processing a #line directive that includes a PP_Line_Indicator that is not default, a compiler treats the line after the directive as having the given line number (and compilation unit name, if specified).

The maximum value allowed for Decimal_Digit+ is implementation-defined.

A #line default directive undoes the effect of all preceding #line directives. A compiler reports true line information for subsequent lines, precisely as if no #line directives had been processed.

A #line hidden directive has no effect on the source location information reported in error messages, or produced by use of CallerLineNumberAttribute (§23.5.6.2). It is intended to affect source-level debugging tools so that, when debugging, all lines between a #line hidden directive and the subsequent #line directive (that is not #line hidden) have no line number information, and are skipped entirely when stepping through code.

Note: Although a PP_Compilation_Unit_Name might contain text that looks like an escape sequence, such text is not an escape sequence; in this context a ‘\’ character simply designates an ordinary backslash character. end note
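The effect of the basic forms on caller-info attributes can be sketched as follows. The class LineDemo, its helper Here, and the name template.dsl are hypothetical and serve only as illustration.

```csharp
using System.Runtime.CompilerServices;

static class LineDemo
{
    // Hypothetical helper: returns the line number recorded for its call site.
    static int Here([CallerLineNumber] int line = 0) => line;

    public static int[] Sample()
    {
#line 200 "template.dsl"
        int first = Here();   // line after the directive: treated as line 200
        int second = Here();  // following line: treated as line 201
#line default
        int third = Here();   // true line number within this compilation unit
        return new[] { first, second, third };
    }
}
```

The first two elements of the result are 200 and 201 regardless of where Sample physically appears in its compilation unit; the third depends on the actual position of the call, since #line default undoes the mapping.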

Together, the tokens PP_Start_Line_Character ‘-’ PP_End_Line_Character specify a span of characters in the source file identified by PP_Compilation_Unit_Name (the mapped file).

PP_Start_Line_Character represents a start position in the mapped file, specified as a line (PP_Start_Line) and column (PP_Start_Character) pair; for example, (1,1). This position corresponds to the first character on the line following the directive in the generated code.

PP_End_Line_Character represents an end position in the mapped file, specified as a line (PP_End_Line) and column (PP_End_Character) pair; for example, (3,10).

PP_Start_Line and PP_End_Line are positive integers that specify line numbers. PP_Start_Character and PP_End_Character are positive integers that specify character numbers. All four of these numbers are 1-based, meaning that the first line of the mapped file and the first character on each line are each assigned the number 1.

The end position shall not precede the start position: PP_End_Line shall be greater than or equal to PP_Start_Line, and when PP_Start_Line equals PP_End_Line, PP_End_Character shall be greater than or equal to PP_Start_Character.

By default, the mapped text starts at the first character on the line following the #line directive. However, this can be adjusted using PP_Character_Offset. If PP_Character_Offset is omitted, it defaults to 0; otherwise, it specifies the number of characters to skip in that next line. That number shall be non-negative and less than the length of the line following the #line directive.

The mapping specified by a PP_Line is in scope until the following #line directive or the end of the compilation unit, whichever comes first.

Example: A code generator that produces C# from a template file can use the enhanced form of the #line directive to map diagnostics back to positions in the original source. If line 5, columns 3 through 17 of template.dsl contains Greet("Hello"), the generator might emit:
#line (5,3)-(5,17) "template.dsl"
output.Add(Greet("Hello"));
#line default
Any diagnostic produced for the output.Add(Greet("Hello")); statement would be reported at line 5, columns 3–17 in template.dsl.

The PP_Character_Offset form allows the generator to account for an inserted prefix. With a character offset of 11 for the prefix output.Add(:
#line (5,3)-(5,17) 11 "template.dsl"
output.Add(Greet("Hello"));
#line default
the offset causes the mapped span to begin at Greet("Hello") rather than at output.Add(, mapping Greet("Hello") to line 5, column 3 in template.dsl.

end example
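A code generator might assemble such directives with a small helper along the following lines. This is a minimal sketch, not part of any library: the class LineSpanDirective and its Format method are hypothetical. The helper checks only that positions are 1-based, that the end line does not precede the start line, that the offset is non-negative, and that the name contains no excluded character; the same-line character comparison and the offset's upper bound are left to the caller.

```csharp
using System;

static class LineSpanDirective
{
    // Characters excluded by PP_Compilation_Unit_Name_Character.
    static readonly char[] Excluded = { '\r', '\n', '\u0085', '\u2028', '\u2029', '"' };

    public static string Format(int startLine, int startChar, int endLine, int endChar,
                                string unitName, int? offset = null)
    {
        if (startLine < 1 || startChar < 1 || endLine < 1 || endChar < 1)
            throw new ArgumentOutOfRangeException(nameof(startLine), "Positions are 1-based.");
        if (endLine < startLine)
            throw new ArgumentException("The end line precedes the start line.");
        if (offset < 0)
            throw new ArgumentOutOfRangeException(nameof(offset), "The offset is non-negative.");
        if (unitName.IndexOfAny(Excluded) >= 0)
            throw new ArgumentException("Character not permitted in a unit name.", nameof(unitName));

        string span = $"({startLine},{startChar})-({endLine},{endChar})";
        string skip = offset is null ? "" : $" {offset}";   // optional PP_Character_Offset
        return $"#line {span}{skip} \"{unitName}\"";
    }
}
```

Under these assumptions, Format(5, 3, 5, 17, "template.dsl") yields the first directive of the example above, and passing 11 as the offset yields the second.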

6.5.9 Nullable directive

The nullable directive controls the nullable context, as described below.

fragment PP_Nullable
    : 'nullable' PP_Whitespace PP_Nullable_Action
      (PP_Whitespace PP_Nullable_Target)?
    ;
fragment PP_Nullable_Action
    : 'disable'
    | 'enable'
    | 'restore'
    ;
fragment PP_Nullable_Target
    : 'warnings'
    | 'annotations'
    ;

A nullable directive sets the available flags for subsequent lines of code, until another nullable directive overrides it, or until the end of the compilation unit is reached. The nullable context contains two flags: annotations and warnings. The effect of each form of nullable directive is as follows:

#nullable disable: Disables both nullable annotations and nullable warnings flags.
#nullable enable: Enables both nullable annotations and nullable warnings flags.
#nullable restore: Restores both the annotations and warnings flags to the state specified by the external mechanism, if any.
#nullable disable annotations: Disables the nullable annotations flag. The nullable warnings flag is unaffected.
#nullable enable annotations: Enables the nullable annotations flag. The nullable warnings flag is unaffected.
#nullable restore annotations: Restores the nullable annotations flag to the state specified by the external mechanism, if any. The nullable warnings flag is unaffected.
#nullable disable warnings: Disables the nullable warnings flag. The nullable annotations flag is unaffected.
#nullable enable warnings: Enables the nullable warnings flag. The nullable annotations flag is unaffected.
#nullable restore warnings: Restores the nullable warnings flag to the state specified by the external mechanism, if any. The nullable annotations flag is unaffected.

The nullable state of expressions is tracked at all times. The state of the annotations flag and the presence or absence of a nullable annotation, ?, determines the initial null state of a variable declaration. Warnings are only issued when the warnings flag is enabled.

Example: The example
#nullable disable
string x = null;
string y = "";
#nullable enable
Console.WriteLine(x.Length); // Warning
Console.WriteLine(y.Length);
produces a compile-time warning (“as x is null”). The nullable state of x is tracked everywhere. A warning is issued when the warnings flag is enabled.

end example
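The forms that name a target can be sketched in the same way. The class NullableDemo and its members are hypothetical names used only for illustration.

```csharp
static class NullableDemo
{
#nullable enable
    // Both flags are enabled: string? declares a parameter that may be null,
    // and dereferencing text without the null test would produce a warning.
    public static int LengthOrZero(string? text) => text is null ? 0 : text.Length;

#nullable disable warnings
    // The annotations flag stays enabled, so string? is still meaningful,
    // but the dereference below is not diagnosed because warnings are off.
    public static int UncheckedLength(string? text) => text.Length;

#nullable restore
    // Both flags return to the state specified by the external mechanism, if any.
}
```

Disabling the warnings flag changes only which diagnostics are reported; UncheckedLength still fails at run time if it is passed null.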

6.5.10 Pragma directives

The #pragma preprocessing directive is used to specify contextual information to a compiler.

Note: For example, a compiler might provide #pragma directives that

Enable or disable particular warning messages when compiling subsequent code.

Specify which optimizations to apply to subsequent code.

Specify information to be used by a debugger.

end note

fragment PP_Pragma
    : 'pragma' PP_Pragma_Text?
    ;

fragment PP_Pragma_Text
    : PP_Whitespace Input_Character*
    ;

The Input_Characters in the PP_Pragma_Text are interpreted by a compiler in an implementation-defined manner. The information supplied in a #pragma directive shall not change program semantics. A #pragma directive shall only change compiler behavior that is outside the scope of this language specification. If a compiler cannot interpret the Input_Characters, it can produce a warning; however, it shall not produce a compile-time error.

Note: PP_Pragma_Text can contain arbitrary text; specifically, it need not contain well-formed tokens. end note
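As an illustration of an implementation-defined pragma, the sketch below assumes a compiler that accepts the forms #pragma warning disable and #pragma warning restore, as Microsoft's C# compiler does, and that reports an unused local variable under the identifier CS0168. Both the pragma forms and the identifier are properties of that implementation, not requirements of this specification; the class PragmaDemo is hypothetical.

```csharp
static class PragmaDemo
{
    public static int Answer()
    {
        // Assumed implementation-defined pragma: suppress the
        // "variable is declared but never used" warning for one declaration.
#pragma warning disable CS0168
        int unused;
#pragma warning restore CS0168
        return 42;
    }
}
```

Consistent with the rule that a #pragma directive shall not change program semantics, Answer returns 42 whether or not a compiler recognizes the pragma; at most, an unrecognized pragma produces a warning.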

Last updated on 2026-05-19
