Lexical Structure
Documents
An M document is an ordered sequence of Unicode characters. M allows different classes of Unicode characters in different parts of an M document. For information on Unicode character classes, see The Unicode Standard, Version 3.0, section 4.5.
A document consists either of exactly one expression or of groups of definitions organized into sections. Sections are described in detail in Chapter 10. Conceptually speaking, the following steps are used to read an expression from a document:
The document is decoded according to its character encoding scheme into a sequence of Unicode characters.
Lexical analysis is performed, thereby translating the stream of Unicode characters into a stream of tokens. The remaining subsections of this section cover lexical analysis.
Syntactic analysis is performed, thereby translating the stream of tokens into a form that can be evaluated. This process is covered in subsequent sections.
Grammar conventions
The lexical and syntactic grammars are presented using grammar productions. Each grammar production defines a non-terminal symbol and the possible expansions of that non-terminal symbol into sequences of non-terminal or terminal symbols. In grammar productions, non-terminal symbols are shown in italic type, and terminal symbols are shown in a fixed-width font.
The first line of a grammar production is the name of the non-terminal symbol being defined, followed by a colon. Each successive indented line contains a possible expansion of the non-terminal given as a sequence of non-terminal or terminal symbols. For example, the production:
if-expression:
if if-condition then true-expression else false-expression
defines an if-expression to consist of the token if, followed by an if-condition, followed by the token then, followed by a true-expression, followed by the token else, followed by a false-expression.
When there is more than one possible expansion of a non-terminal symbol, the alternatives are listed on separate lines. For example, the production:
variable-list:
variable
variable-list , variable
defines a variable-list to consist of either a variable or a variable-list followed by a comma and a variable. In other words, the definition is recursive and specifies that a variable list consists of one or more variables, separated by commas.
A subscripted suffix "opt" is used to indicate an optional symbol. The production:
field-specification:
optionalopt field-name = field-type
is shorthand for:
field-specification:
field-name = field-type
optional field-name = field-type
and defines a field-specification to optionally begin with the terminal symbol optional followed by a field-name, the terminal symbol =, and a field-type.
Alternatives are normally listed on separate lines, though in cases where there are many alternatives, the phrase "one of" may precede a list of expansions given on a single line. This is simply shorthand for listing each of the alternatives on a separate line. For example, the production:
decimal-digit: one of
0 1 2 3 4 5 6 7 8 9
is shorthand for:
decimal-digit:
0
1
2
3
4
5
6
7
8
9
Lexical Analysis
The lexical-unit production defines the lexical grammar for an M document. Every valid M document conforms to this grammar.
lexical-unit:
lexical-elementsopt
lexical-elements:
lexical-element
lexical-element lexical-elements
lexical-element:
whitespace
token
comment
At the lexical level, an M document consists of a stream of whitespace, comment, and token elements. Each of these productions is covered in the following sections. Only token elements are significant in the syntactic grammar.
Whitespace
Whitespace is used to separate comments and tokens within an M document. Whitespace includes the space character (which is part of Unicode class Zs), as well as horizontal and vertical tab, form feed, and newline character sequences. Newline character sequences include carriage return, line feed, carriage return followed by line feed, next line, line separator, and paragraph separator characters.
whitespace:
Any character with Unicode class Zs
Horizontal tab character (U+0009)
Vertical tab character (U+000B)
Form feed character (U+000C)
Carriage return character (U+000D) followed by line feed character (U+000A)
new-line-character
new-line-character:
Carriage return character (U+000D)
Line feed character (U+000A)
Next line character (U+0085)
Line separator character (U+2028)
Paragraph separator character (U+2029)
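The whitespace and new-line-character productions can be read as a simple character classifier. The following Python sketch is illustrative only: the function name is_m_whitespace is hypothetical, and the sketch assumes that Python's unicodedata module reports the same Zs membership as the Unicode version the M implementation uses.

```python
import unicodedata

# Code points named explicitly by the new-line-character production.
NEW_LINE_CHARACTERS = {"\u000D", "\u000A", "\u0085", "\u2028", "\u2029"}
# Tab, vertical tab, and form feed from the whitespace production.
OTHER_WHITESPACE = {"\u0009", "\u000B", "\u000C"}


def is_m_whitespace(ch):
    """Return True if the single character ch is whitespace under the productions above."""
    if ch in NEW_LINE_CHARACTERS or ch in OTHER_WHITESPACE:
        return True
    # Unicode class Zs (space separators) includes U+0020 and U+00A0, among others.
    return unicodedata.category(ch) == "Zs"
```

The carriage return followed by line feed alternative needs no special handling in a per-character check, because both characters are already new-line characters on their own.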
For compatibility with source code editing tools that add end-of-file markers, and to enable a document to be viewed as a sequence of properly terminated lines, the following transformations are applied, in order, to an M document:
If the last character of the document is a Control-Z character (U+001A), this character is deleted.
A carriage-return character (U+000D) is added to the end of the document if that document is non-empty and if the last character of the document is not a carriage return (U+000D), a line feed (U+000A), a line separator (U+2028), or a paragraph separator (U+2029).
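The two transformations above can be sketched as a small Python function. The name normalize_document_end is hypothetical; the sketch only mirrors the rules as stated and is not taken from any M implementation.

```python
def normalize_document_end(text):
    """Apply the two end-of-document transformations, in order."""
    # Step 1: delete a trailing Control-Z (U+001A) end-of-file marker.
    if text.endswith("\u001A"):
        text = text[:-1]
    # Step 2: add a carriage return to a non-empty document whose last
    # character is not one of the four listed line terminators.
    if text and text[-1] not in ("\u000D", "\u000A", "\u2028", "\u2029"):
        text += "\u000D"
    return text
```

The order matters: a document that consists only of a Control-Z character becomes empty after the first step, so the second step adds nothing.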
Comments
Two forms of comments are supported: single-line comments and delimited comments. Single-line comments start with the characters // and extend to the end of the source line. Delimited comments start with the characters /* and end with the characters */. Delimited comments may span multiple lines.
comment:
single-line-comment
delimited-comment
single-line-comment:
// single-line-comment-charactersopt
single-line-comment-characters:
single-line-comment-character single-line-comment-charactersopt
single-line-comment-character:
Any Unicode character except a new-line-character
delimited-comment:
/* delimited-comment-textopt asterisks /
delimited-comment-text:
delimited-comment-section delimited-comment-textopt
delimited-comment-section:
/
asterisksopt not-slash-or-asterisk
asterisks:
* asterisksopt
not-slash-or-asterisk:
Any Unicode character except * or /
Comments do not nest. The character sequences /* and */ have no special meaning within a single-line comment, and the character sequences // and /* have no special meaning within a delimited comment.
Comments are not processed within text literals. The example
/* Hello, world
*/
"Hello, world"
includes a delimited comment.
The example
// Hello, world
//
"Hello, world" // This is an example of a text literal
shows several single-line comments.
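The comment rules can be approximated with a regular expression. The Python sketch below is a simplification under stated assumptions: the name find_comments is hypothetical, and a text literal is recognized only as a double-quoted run in which "" stands for a quote, which is enough to keep comment markers inside text from being treated as comments.

```python
import re

# Alternation order matters: a text literal is tried first so that comment
# markers inside it are left alone.
_PATTERN = re.compile(
    r'''
    (?P<text>"(?:[^"]|"")*")                      # text literal; "" escapes a quote
  | (?P<delimited>/\*.*?\*/)                      # delimited comment; ends at the first */
  | (?P<single>//[^\r\n\u0085\u2028\u2029]*)      # single-line comment; stops at a new-line character
    ''',
    re.VERBOSE | re.DOTALL,
)


def find_comments(source):
    """Return the comments found in source, in order, skipping text literals."""
    return [m.group(0) for m in _PATTERN.finditer(source) if m.lastgroup != "text"]
```

Because the delimited pattern stops at the first */, an input such as /* a /* b */ c */ yields the comment /* a /* b */ and leaves the rest as ordinary source, which matches the rule that comments do not nest.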
Tokens
A token is an identifier, keyword, literal, operator, or punctuator. Whitespace and comments are used to separate tokens, but are not considered tokens.
token:
identifier
keyword
literal
operator-or-punctuator
Character Escape Sequences
M text values can contain arbitrary Unicode characters. Text literals, however, are limited to graphic characters and require the use of escape sequences for non-graphic characters. For example, to include a carriage-return, line-feed, or tab character in a text literal, the #(cr), #(lf), and #(tab) escape sequences can be used, respectively. To embed the escape-sequence start characters #( in a text literal, the # itself needs to be escaped:
#(#)(
Escape sequences can also contain short (four hex digits) or long (eight hex digits) Unicode code-point values. The following three escape sequences are therefore equivalent:
#(000D) // short Unicode hexadecimal value
#(0000000D) // long Unicode hexadecimal value
#(cr) // compact escape shorthand for carriage return
Multiple escape codes can be included in a single escape sequence, separated by commas; the following two sequences are thus equivalent:
#(cr,lf)
#(cr)#(lf)
The following describes the standard mechanism of character escaping in an M document.
character-escape-sequence:
#( escape-sequence-list )
escape-sequence-list:
single-escape-sequence
single-escape-sequence , escape-sequence-list
single-escape-sequence:
long-unicode-escape-sequence
short-unicode-escape-sequence
control-character-escape-sequence
escape-escape
long-unicode-escape-sequence:
hex-digit hex-digit hex-digit hex-digit hex-digit hex-digit hex-digit hex-digit
short-unicode-escape-sequence:
hex-digit hex-digit hex-digit hex-digit
control-character-escape-sequence:
control-character
control-character:
cr
lf
tab
escape-escape:
#
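As an illustration of the character-escape-sequence grammar, the following Python sketch decodes a single escape sequence into the characters it denotes. The function names are hypothetical, and the sketch assumes that no whitespace is allowed inside the parentheses, since the grammar lists none.

```python
import re

# control-character alternatives from the grammar above.
CONTROL_CHARACTERS = {"cr": "\r", "lf": "\n", "tab": "\t"}
# Exactly four (short) or eight (long) hexadecimal digits.
_HEX_CODE = re.compile(r"[0-9A-Fa-f]{4}(?:[0-9A-Fa-f]{4})?")


def decode_single_escape(code):
    """Decode one single-escape-sequence, for example 'cr', '000D', or '#'."""
    if code == "#":  # escape-escape
        return "#"
    if code in CONTROL_CHARACTERS:
        return CONTROL_CHARACTERS[code]
    if _HEX_CODE.fullmatch(code):
        return chr(int(code, 16))
    raise ValueError("invalid escape code: %r" % code)


def decode_escape_sequence(sequence):
    """Decode a full character-escape-sequence such as '#(cr,lf)'."""
    if not (sequence.startswith("#(") and sequence.endswith(")")):
        raise ValueError("not an escape sequence: %r" % sequence)
    return "".join(decode_single_escape(c) for c in sequence[2:-1].split(","))
```

With this sketch, decode_escape_sequence("#(cr,lf)") produces the same two characters as decoding "#(cr)" and "#(lf)" separately, which is the equivalence described above.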
Literals
A literal is a source code representation of a value.
literal:
logical-literal
number-literal
text-literal
null-literal
verbatim-literal
Null literals
The null literal is used to write the null value. The null value represents an absent value.
null-literal:
null
Logical literals
A logical literal is used to write the values true and false and produces a logical value.
logical-literal:
true
false
Number literals
A number literal is used to write a numeric value and produces a number value.
number-literal:
decimal-number-literal
hexadecimal-number-literal
decimal-number-literal:
decimal-digits . decimal-digits exponent-partopt
. decimal-digits exponent-partopt
decimal-digits exponent-partopt
decimal-digits:
decimal-digit decimal-digitsopt
decimal-digit: one of
0 1 2 3 4 5 6 7 8 9
exponent-part:
e signopt decimal-digits
E signopt decimal-digits
sign: one of
+ -
hexadecimal-number-literal:
0x hex-digits
0X hex-digits
hex-digits:
hex-digit hex-digitsopt
hex-digit: one of
0 1 2 3 4 5 6 7 8 9 A B C D E F a b c d e f
A number can be specified in hexadecimal format by preceding the hex-digits with the characters 0x or 0X. For example:
0xff // 255
Note that if a decimal point is included in a number literal, then it must have at least one digit following it. For example, 1.3 is a number literal but 1. and 1.e3 are not.
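The number-literal grammar maps directly onto a regular expression. The Python sketch below is one possible transcription of the productions; the names NUMBER_LITERAL and is_number_literal are hypothetical and are not part of M.

```python
import re

# decimal-number-literal: digits.digits, .digits, or digits, each with an optional exponent.
_DECIMAL = r"(?:[0-9]+\.[0-9]+|\.[0-9]+|[0-9]+)(?:[eE][+-]?[0-9]+)?"
# hexadecimal-number-literal: 0x or 0X followed by at least one hex digit.
_HEXADECIMAL = r"0[xX][0-9A-Fa-f]+"

NUMBER_LITERAL = re.compile(r"(?:%s|%s)" % (_HEXADECIMAL, _DECIMAL))


def is_number_literal(text):
    """Return True if the whole of text matches the number-literal production."""
    return NUMBER_LITERAL.fullmatch(text) is not None
```

The sketch accepts 1.3, .5, 1e-3, and 0xff, and rejects 1. and 1.e3 because no digit follows the decimal point.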
Text literals
A text literal is used to write a sequence of Unicode characters and produces a text value.
text-literal:
" text-literal-charactersopt "
text-literal-characters:
text-literal-character text-literal-charactersopt
text-literal-character:
single-text-character
character-escape-sequence
double-quote-escape-sequence
single-text-character:
Any character except " (U+0022) or # (U+0023) followed by ( (U+0028)
double-quote-escape-sequence:
"" (U+0022, U+0022)
To include quotes in a text value, the quote mark is repeated, as follows:
"The ""quoted"" text" // The "quoted" text
The character-escape-sequence production can be used to write characters in text values without having to directly encode them as Unicode characters in the document. For example, a carriage return and line feed can be written in a text value as:
"Hello world#(cr,lf)"
Verbatim literals
A verbatim literal is used to store a sequence of Unicode characters that were entered by a user as code, but which cannot be correctly parsed as code. At runtime, it produces an error value.
verbatim-literal:
#!" text-literal-charactersopt "
Identifiers
An identifier is a name used to refer to a value. Identifiers can either be regular identifiers or quoted identifiers.
identifier:
regular-identifier
quoted-identifier
regular-identifier:
available-identifier
available-identifier dot-character regular-identifier
available-identifier:
A keyword-or-identifier that is not a keyword
keyword-or-identifier:
identifier-start-character identifier-part-charactersopt
identifier-start-character:
letter-character
underscore-character
identifier-part-characters:
identifier-part-character identifier-part-charactersopt
identifier-part-character:
letter-character
decimal-digit-character
underscore-character
connecting-character
combining-character
formatting-character
dot-character:
. (U+002E)
underscore-character:
_ (U+005F)
letter-character:
A Unicode character of classes Lu, Ll, Lt, Lm, Lo, or Nl
combining-character:
A Unicode character of classes Mn or Mc
decimal-digit-character:
A Unicode character of the class Nd
connecting-character:
A Unicode character of the class Pc
formatting-character:
A Unicode character of the class Cf
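The regular-identifier productions can be checked with Unicode general categories. The Python sketch below is illustrative: the function names are hypothetical, the keyword set is copied from the Keywords section of this chapter (only the keywords that do not start with #, since those can never match keyword-or-identifier), and the sketch assumes that Python's unicodedata categories agree with the Unicode version the M implementation uses.

```python
import unicodedata

# Keywords that could otherwise be mistaken for identifiers.
KEYWORDS = {
    "and", "as", "each", "else", "error", "false", "if", "in", "is", "let",
    "meta", "not", "null", "or", "otherwise", "section", "shared", "then",
    "true", "try", "type",
}

# letter-character classes.
_START_CLASSES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
# identifier-part-character adds digits, connectors, combining marks, and
# formatting characters; the underscore itself is in class Pc.
_PART_CLASSES = _START_CLASSES | {"Nd", "Pc", "Mn", "Mc", "Cf"}


def is_keyword_or_identifier(name):
    """Check the keyword-or-identifier production."""
    if not name:
        return False
    start_ok = name[0] == "_" or unicodedata.category(name[0]) in _START_CLASSES
    return start_ok and all(unicodedata.category(c) in _PART_CLASSES for c in name[1:])


def is_regular_identifier(name):
    """Check regular-identifier: dot-separated segments, none of which is a keyword."""
    return all(
        is_keyword_or_identifier(part) and part not in KEYWORDS
        for part in name.split(".")
    )
```

Under this sketch, Table.AddColumn and _x1 are regular identifiers, while if, 1998 Sales, and A + B are not.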
A quoted-identifier can be used to allow any sequence of zero or more Unicode characters to be used as an identifier, including keywords, whitespace, comments, operators and punctuators.
quoted-identifier:
#" text-literal-charactersopt "
Note that escape sequences and double-quotes to escape quotes can be used in a quoted identifier, just as in a text-literal.
The following example uses identifier quoting for names containing a space character:
[
#"1998 Sales" = 1000,
#"1999 Sales" = 1100,
#"Total Sales" = #"1998 Sales" + #"1999 Sales"
]
The following example uses identifier quoting to include the + operator in an identifier:
[
#"A + B" = A + B,
A = 1,
B = 2
]
Generalized Identifiers
There are two places in M where no ambiguities are introduced by identifiers that contain blanks or that are otherwise keywords or number literals. These places are the names of record fields in a record literal and in a field access operator ([ ]). There, M allows such identifiers without having to use quoted identifiers.
[
Data = [ Base Line = 100, Rate = 1.8 ],
Progression = Data[Base Line] * Data[Rate]
]
The identifiers used to name and access fields are referred to as generalized identifiers and defined as follows:
generalized-identifier:
generalized-identifier-part
generalized-identifier separated only by blanks (U+0020) generalized-identifier-part
generalized-identifier-part:
generalized-identifier-segment
decimal-digit-character generalized-identifier-segment
generalized-identifier-segment:
keyword-or-identifier
keyword-or-identifier dot-character keyword-or-identifier
Keywords
A keyword is an identifier-like sequence of characters that is reserved, and cannot be used as an identifier except when using the identifier-quoting mechanism or where a generalized identifier is allowed.
keyword: one of
and as each else error false if in is let meta not null or otherwise
section shared then true try type #binary #date #datetime
#datetimezone #duration #infinity #nan #sections #shared #table #time
Operators and punctuators
There are several kinds of operators and punctuators. Operators are used in expressions to describe operations involving one or more operands. For example, the expression a + b uses the + operator to add the two operands a and b. Punctuators are for grouping and separating.
operator-or-punctuator: one of
, ; = < <= > >= <> + - * / & ( ) [ ] { } @ ! ? ?? => .. ...
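Several operators and punctuators share a prefix, for example <, <=, and <>, or .. and .... This section does not say how a lexer chooses between them; the Python sketch below assumes the longest-match rule that lexers commonly use. Both that rule and the name match_operator are assumptions made for illustration.

```python
# The operator-or-punctuator alternatives, copied from the production above.
OPERATORS_AND_PUNCTUATORS = (
    ", ; = < <= > >= <> + - * / & ( ) [ ] { } @ ! ? ?? => .. ..."
).split()

# Try longer candidates first so that "<=" wins over "<" (assumed longest match).
_BY_LENGTH = sorted(OPERATORS_AND_PUNCTUATORS, key=len, reverse=True)


def match_operator(source, pos=0):
    """Return the longest operator or punctuator starting at pos, or None."""
    for candidate in _BY_LENGTH:
        if source.startswith(candidate, pos):
            return candidate
    return None
```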