Share via


Data Manipulation Using Expression

Note: as ML.NET hasn't implement Expression transform, this notebook is disabled. ##TODO: waiting for ML.NET to add expression transform.

Just like pandas dataframe, users can use (lambda) Expression transform to modify existing column values or produce new columns. In this tutorial, we show a few examples to demonstrate the usage.

The Expression transform is a very powerful function for data manipulation. The syntax follows the same pattern as the "lambda" expression for python. For instance, x: x/12 indicates that the input value x will generate an output with the value of x/12. The syntax for Expression transform is as following:

import pandas as pd
import numpy as np
from mlnet.preprocessing.expression import Expression

data = pd.DataFrame({'x1':[1,2,3,4],'y1':[0.1,0.2,0.3,0.4],'x2':[-1,-2,-3,-4],'y2':[-0.1,-0.2,-0.3,-0.4]})
xf = Expression(
                    columns = {'sum_1': ['x1', 'y1'], 'sum_2': ['x2','y2']},
                    expression = '(x,y) : x + y'
               )

xf.fit_transform(data)
sum_1 sum_2 x1 x2 y1 y2
0 1.1 -1.1 1 -1 0.1 -0.1
1 2.2 -2.2 2 -2 0.2 -0.2
2 3.3 -3.3 3 -3 0.3 -0.3
3 4.4 -4.4 4 -4 0.4 -0.4

In the above example:

  1. The expression will be applied to all the input/output columns specified in the columns argument.
    In the above example, the same expression x + y will be used to operate on columns ['x1','y1'] and ['x2','y2'] respectively to create two output columns, 'sum_1' and 'sum_2'. The values are the sum of the corresponding two columns.

  2. The input-output mapping can be one column to one column or many columns to one column according to the expression.
    However, the number of variables specified in the expression, e.g. 2 in the above case, should be consistent with the number of input columns.

  3. The input columns can contain at most one VectorType column.
    In that case, all the slots will be operated respectively as the representation of the corresponding input column and the output will be of VectorType with each slots corresponding to the results for the origin slot from the input column.

The variables indicated in the expression before the ":" correspond to the first, second, ..., of the input column as indciated in the columns argument. The syntax after ":" defines the operation. In the following section, we will discuss more details about the syntax for the expression.

1 Operators

In this section, we discuss the operators, such as +, -, *, etc. that is accepted.

1.1 Basic Operators

Almost all the basic operators are accepted such as +, -, *, /, %, etc.. For example,

xf = Expression(
                    columns = {'result': ['x1', 'x2', 'y1', 'y2']},
                    expression = '(a,b,c,d) : (a + 2 * c) * 10 * c % (-b)' # (1 + 2 * 0.1) * 10 * 0.1 % (1) = 0.2 ==> 1st row
               )

xf.fit_transform(data)
result x1 x2 y1 y2
0 0.2 1 -1 0.1 -0.1
1 0.8 2 -2 0.2 -0.2
2 1.8 3 -3 0.3 -0.3
3 3.2 4 -4 0.4 -0.4

Unless otherwise noted, binary operators are left associative and propagate NA values (if either operand value is NA, the result is NA). Generally, overflow of integer values produces NA, while overflow of floating point values produces infinity.

For the full list of the operators please refer to the appendix at the end of this tutorial.

1.2 Condition Operators

The users can also specify an expression using condition, together with ? and : operator. For instance,

                                    condition ? value_1 : value_2
                

returns value_1 if condition is true, else returns values_2. In the condition expression, most of the commonly seen condition operators can be used, such as ==, &&, >=, etc.. For example,

xf = Expression(
                    columns = {'result': ['x1', 'y1']},
                    expression = '(a,b) : !(a + b > 3 || a < 2)? 1:0' # if not {(1 + 0.1) > 3 or 1 < 2}, return 1, else 0 ==> 1st row
               )

xf.fit_transform(data)
result x1 x2 y1 y2
0 0 1 -1 0.1 -0.1
1 1 2 -2 0.2 -0.2
2 0 3 -3 0.3 -0.3
3 0 4 -4 0.4 -0.4

For the full list of the operators please refer to the appendix at the end of this tutorial.

1.3 Missing Value Handler

Another convenient way to handle missing values is to use ?? in the expression. x??y indicates that if x is NaN, return y, else return x. For example,

data = pd.DataFrame({'x1':[1,2,np.nan,4],'y1':[0.1,0.2,0.3,0.4],'x2':[-1,-2,-3,-4],'y2':[-0.1,-0.2,-0.3,-0.4]})
xf = Expression(
                    columns = {'filled_x1': ['x1', 'y1']},
                    expression = '(x,y) : x ?? y'
               )

xf.fit_transform(data)
filled_x1 x1 x2 y1 y2
0 1.0 1.0 -1 0.1 -0.1
1 2.0 2.0 -2 0.2 -0.2
2 0.3 NaN -3 0.3 -0.3
3 4.0 4.0 -4 0.4 -0.4

We can also fill-in with constant.

xf = Expression(
                    columns = {'filled_x1': ['x1', 'y1']},
                    expression = '(x,y) : x ?? -1'
               )

xf.fit_transform(data)
filled_x1 x1 x2 y1 y2
0 1.0 1.0 -1 0.1 -0.1
1 2.0 2.0 -2 0.2 -0.2
2 -1.0 NaN -3 0.3 -0.3
3 4.0 4.0 -4 0.4 -0.4

2 Supported Functions

A number of functions are also supported in the expression, such as abs(), isna(), min(), etc.. For example,

xf = Expression(
                    columns = {'filled_x1': ['x1', 'y2']},
                    expression = '(x,y) : min(x + abs(y), 2) ?? 999' # min{1 + abs(-0.1),2}, if NaN, return 999 ==> 1st row
               )

xf.fit_transform(data)
filled_x1 x1 x2 y1 y2
0 1.1 1.0 -1 0.1 -0.1
1 2.0 2.0 -2 0.2 -0.2
2 999.0 NaN -3 0.3 -0.3
3 2.0 4.0 -4 0.4 -0.4

For the full list of the supported functions please refer to the Appendix II: Functions at the end of this tutorial.

3 Temporary Variables

In the expression, users can create temperary variables to facilitate the calculation. The syntax for creating those temporary variables is as following:

                           with(definition_for_temporary_variables ; expression)
                           

For example,

xf = Expression(
                    columns = {'filled_x1': ['x1', 'y1']},
                    expression = '(x,y) : with(z = x + y; z > 3? "L": "S")' # z = 1 + 0.1, if z > 2, return "L", else "S" ==> 1st row
               )

xf.fit_transform(data)
filled_x1 x1 x2 y1 y2
0 S 1.0 -1 0.1 -0.1
1 S 2.0 -2 0.2 -0.2
2 None NaN -3 0.3 -0.3
3 L 4.0 -4 0.4 -0.4

Now we can write a relatively complicated example, guess the output!

xf = Expression(
                    columns = {'filled_x1': ['x1']},
                    expression = '(x) : with(z = min(x,3); (z < 1.5 ? "S": z == 3? "Capped":"Other") ?? "Missing")' 
               )

xf.fit_transform(data)
filled_x1 x1 x2 y1 y2
0 S 1.0 -1 0.1 -0.1
1 Other 2.0 -2 0.2 -0.2
2 Missing NaN -3 0.3 -0.3
3 Capped 4.0 -4 0.4 -0.4

Appendix I: Operators

Numeric

Operators Meaning Arity Comments
+, -, *, /, %, ^ numeric Binary Numeric operations with NA propagation. For integer operands, overflow or divide by zero results in NA.
() parenthetical grouping Unary Standard meaning

Conditional

Operators Meaning Arity Comments
? : conditional Ternary The expression condition ? value1 : value2 resolves to value1 if condition is true, to value2 if condition is false, and to NA if condition is NA. The condition must be boolean, while value1 and value2 must be of compatible type.
logical or Binary The operands and result are boolean. If one operand is true, the result is true, even when the other operand is NA. Otherwise, an NA operand value results in NA.
&& logical and Binary The operands and result are boolean. If one operand is false, the result is false, even when the other operand is NA. Otherwise, an NA operand value results in NA.
==,=,!=,<>,<,<=,>,>= logical Multiple i). The comparison operators are multi-arity, meaning they can be applied to two or more operands. For example, a == b == c results in true if a, b, and c have the same value. The not equal operator requires that all of the operands be distinct, so 1 != 2 != 1 is false. To test whether x is non-negative but less than 10, use 0 <= x < 10. There is no need to write 0 <= x && x < 10, and doing so will not perform as well. Operators listed on the same line can be combined in a single expression, so a > b >= c is legal, but a < b >= c is not.
ii). Equals and not equals apply to any operand type, while the ordered operators require numeric operands.
iii). These short circuit, so 8 < 7 < NA will produce false. Once an NA is encountered in the chain, the result is NA, so 9 < NA < 8 < 7 is NA, not false.

Appendix II: Functions

The expression transform supports many useful functions and will support more in the future (please request any that you would find useful). General unary functions that can accept an operand of any type are listed in the following table.

To operate on missing values, we have:

Name Meaning Comments
isna() test for na Returns a boolean value indicating whether the operand is an NA value.
na() fill na Returns the NA value of the same type as the operand. For example, if x is a text value and you wish to map "bad" to a text NA value, use x == "bad" ? na(x) : x. Note that this does not evaluate the operand, it only uses the operand to determine the type of NA to return, and that determination happens at compile time.

To conver types, we have:

Name Meaning Comments
bool() to BL The operand must be TX (text) or BL (boolean).
int() to I4 The input may be BL, I4, I8, R4, R8, or TX.
long() to I8 The input may be BL, I4, I8, R4, R8, or TX.
float() to R4 The input may be BL, I4, I8, R4, R8, or TX.
double() to R8 The input may be BL, I4, I8, R4, R8, or TX.
text() to BL The input may be BL, I4, I8, R4, R8, or TX. This produces a default text representation. In the future, we'll add functions to control the formatting.

The unary functions that require a numeric (I4, I8, R4, or R8) operand are listed in the following table. The result type is the same as the operand type. An NA operand value produces NA.

Name Meaning Comments
abs() absolute value Produces the absolute value of the operand.
sign() sign (-1, 0, 1) Produces -1, 0, or 1 depending on whether the operand is negative, zero, or positive.

The unary functions that require a floating point (R4 or R8) operand are listed in the following table. The result type is the same as the operand type. Overflow produces infinity. Invalid input values produce NA.

Name Meaning Comments
sqrt() square root Produces the minimum/maximum of the operands.
trunc() truncate to an integer Rounds toward zero to the nearest integer value.
floor() floor Rounds toward negative infinity to the nearest integer value.
ceil() ceiling Rounds toward positive infinity to the nearest integer value.
round() unbiased rounding Rounds to the nearest integer value. When the operand is half way between two integer values, this produces the even integer.
exp() exponential Raises e to the operand.
ln(), log() logarithm Produces the natural (base e) logarithm. There is also a two operand version of log for using a different base.
deg() radians to degrees Maps from radians to degrees.
rad() degrees to radians Maps from degrees to radians.
sin() sine Takes the sine of an angle. The sin function assumes the operand is in radians, while the sind function assumes that the operand is in degrees.
cos() cosine Takes the cosine of an angle. The cos function assumes the operand is in radians, while the cosd function assumes that the operand is in degrees.
tan() tangent Takes the tangent of an angle. The tan function assumes the operand is in radians, while the tand function assumes that the operand is in degrees.
sinh() hyperbolic sine Takes the hyperbolic sine of its operand.
cosh() hyperbolic cosine Takes the hyperbolic cosine of its operand.
tanh() hyperbolic tangent Takes the hyperbolic tangent of its operand.
asin() inverse sine Takes the inverse sine of its operand.
acos() inverse cosine Takes the inverse cosine of its operand.
atan() inverse tangent Takes the inverse tangent of its operand.

The binary functions that require floating point (R4 or R8) operands are listed in the following table. When the operand types aren't the same, the operands are promoted to an appropriate type. The result type is the same as the promoted operand type. An NA operand value produces NA.

Name Meaning Comments
log(x,y) logarithm with given base The second operand is the base. The first is the value to take the logarithm of.
atan2, atanyx determine angle Determines the angle between -pi and pi from the given y and x values. Note that y is the first operand.

The text functions are listed in the following table.

Name Meaning Comments
len(x) length of text The operand must be text. The result is an I4 indicating the length of the operand. If the operand is NA, the result is NA.
lower(x), upper(x) lower or upper case Maps the text to lower or upper case.
left(x, k), right(x, k) substring The first operand must be text and the second operand must be I4. If the second operand is negative it is treated as an offset from the end of the text. This adjusted index is then clamped to 0 to len(x). The result is the characters to the left or right of the resulting position.
mid(x, a, b) substring The first operand must be text and the other two operands must be I4. The indices are transformed in the same way as for the left and right functions: negative values are treated as offsets from the end of the text; these adjusted indices are clamped to 0 to len(x). The second clamped index is also clamped below to the first clamped index. The result is the characters between these two clamped indices.
concat(x1, x2, ..., xn) concatenation This accepts an arbitrary number of operands (including zero). All operands must be text. If any operand is NA, the result is NA. Otherwise the result is the concatenation of all the operands, in order.