Split a dataset using a relative expression
Important
Support for Machine Learning Studio (classic) will end on 31 August 2024. We recommend you transition to Azure Machine Learning by that date.
Beginning 1 December 2021, you will not be able to create new Machine Learning Studio (classic) resources. Through 31 August 2024, you can continue to use the existing Machine Learning Studio (classic) resources.
- See information on moving machine learning projects from ML Studio (classic) to Azure Machine Learning.
- Learn more about Azure Machine Learning.
ML Studio (classic) documentation is being retired and may not be updated in the future.
This article describes how to use the Relative Expression Split option in the Split Data module of Machine Learning Studio (classic). This option is helpful when you need to divide a dataset into training and testing datasets using a numerical expression. For example:
- Age greater than 40 vs. 40 or younger
- Test score of 60 or higher vs. less than 60
- Rank value of 1 vs. all other values
Note
Applies to: Machine Learning Studio (classic) only
Similar drag-and-drop modules are available in Azure Machine Learning designer.
To divide your data, you choose a single numeric column in your data, and define an expression to use in evaluating each row. The relative expression must include the column name, the value, and an operator such as greater than and less than, equal and not equals.
This option divides the dataset into two groups.
For general information about data partitioning for machine learning experiments, see Split Data and Partition and Split.
Related tasks
Other options in the Split Data module:
Split data using regular expressions: Apply a regular expression to a single text column, and divide the dataset based on the results
Split recommender datasets: Divide datasets that are used in recommendation models. The dataset should have three columns: items, users, and ratings
Use a relative expression to divide a dataset
Add the Split Data module to your experiment in Stuio, and connect it as input to the dataset you want to split.
For Splitting mode, select relative expression split.
In the Relational expression text box, type an expression that performs a numeric comparison operation, on a single column:
The column contains numbers of any numeric data type, including date/time data types.
The expression can reference a maximum of one column name.
Use the ampersand character (&) for the AND operation and use the pipe character (|) for the OR operation.
The following operators are supported:
<
,>
,<=
,>=
,==
,!=
You cannot group operations by using
(
and)
.
For ideas, see the Examples section.
Run the experiment, or right-click the module and select Run selected.
The expression divides the dataset into two sets of rows: rows with values that meet the condition, and all remaining rows.
If you need to perform additional split operations, you can either add a second instance of *Split Data, or use the Apply SQL Transformation module and define a CASE statement.
Examples of relatve expressions
The following examples demonstrate how to divide a dataset using the Relative Expression option in the Split Data module:
Using calendar year
A common scenario is to divide a dataset by years. The following expression selects all rows where the values in the column Year
are greater than 2010
.
\"Year" > 2010
The date expression must account for all date parts that are included in the data column, and the format of dates in the data column must be consistent.
For example, in a date column using the format mmddyyyy
, the expression should be something like this:
\"Date" > 1/1/2010
Using column indices
The following expression demonstrates how you can use the column index to select all rows in the first column of the dataset that contain values less than or equal to 30, but not equal to 20.
(\0)<=30 & !=20
Compound operation on time values using multiple splits
Suppose you want to split a table of log data, to group queries that run too long. You could use the following relative expression on the column, Elapsed
, to get the queries that ran over 1 minute.
\"Elapsed" >00:01:00
To get the queries with response times under one minute but more than 30 seconds, add another instance of Split Data on the right-hand output, and use an expression like this:
\"Elapsed" <:00:01:00 & >00:00:30
Split dataset on date values
The following relative expression divides the dataset by using the date values in the column dt1
.
\"dt1" > 10-08-2015
Rows with a date greater than 10-08-2015 are added to the first (left) output dataset.
Rows with a date of 10-08-2015 or earlier are added to the second (right) output dataset.
Technical notes
This section contains implementation details, tips, and answers to frequently asked questions.
Restrictions
The following restrictions apply to relative expressions on a dataset:
- Relative expressions can be applied only to numeric data types and date/time data types.
- Relative expressions can reference a maximum of one column name.
- Use the ampersand character (&) for the AND operation and the pipe character (|) for the OR operation.
- The following operators are allowed for relative expressions:
<
,>
,<=
,>=
,==
,!=
- Grouping operations with parentheses is not supported.