kmeans_fl()
The function kmeans_fl()
is a UDF (user-defined function) that clusterizes a dataset using the k-means algorithm.
Prerequisites
- The Python plugin must be enabled on the cluster. This is required for the inline Python used in the function.
- The Python plugin must be enabled on the database. This is required for the inline Python used in the function.
Syntax
T | invoke kmeans_fl(
k,
features_cols,
cluster_col)
Learn more about syntax conventions.
Parameters
Name | Type | Required | Description |
---|---|---|---|
k | int |
✔️ | The number of clusters. |
features_cols | dynamic |
✔️ | An array containing the names of the features columns to use for clustering. |
cluster_col | string |
✔️ | The name of the column to store the output cluster ID for each record. |
Function definition
You can define the function by either embedding its code as a query-defined function, or creating it as a stored function in your database, as follows:
Define the function using the following let statement. No permissions are required.
Important
A let statement can't run on its own. It must be followed by a tabular expression statement. To run a working example of kmeans_fl()
, see Examples.
let kmeans_fl=(tbl:(*), k:int, features:dynamic, cluster_col:string)
{
let kwargs = bag_pack('k', k, 'features', features, 'cluster_col', cluster_col);
let code = ```if 1:
from sklearn.cluster import KMeans
k = kargs["k"]
features = kargs["features"]
cluster_col = kargs["cluster_col"]
km = KMeans(n_clusters=k)
df1 = df[features]
km.fit(df1)
result = df
result[cluster_col] = km.labels_
```;
tbl
| evaluate python(typeof(*), code, kwargs)
};
// Write your query to use the function here.
Examples
The following examples use the invoke operator to run the function.
Clusterize room occupancy from sensors measurements
To use a query-defined function, invoke it after the embedded function definition.
let kmeans_fl=(tbl:(*), k:int, features:dynamic, cluster_col:string)
{
let kwargs = bag_pack('k', k, 'features', features, 'cluster_col', cluster_col);
let code = ```if 1:
from sklearn.cluster import KMeans
k = kargs["k"]
features = kargs["features"]
cluster_col = kargs["cluster_col"]
km = KMeans(n_clusters=k)
df1 = df[features]
km.fit(df1)
result = df
result[cluster_col] = km.labels_
```;
tbl
| evaluate python(typeof(*), code, kwargs)
};
//
// Clusterize room occupancy from sensors measurements.
//
// Occupancy Detection is an open dataset from UCI Repository at https://archive.ics.uci.edu/ml/datasets/Occupancy+Detection+
// It contains experimental data for binary classification of room occupancy from Temperature, Humidity, Light, and CO2.
//
OccupancyDetection
| extend cluster_id=int(null)
| invoke kmeans_fl(5, pack_array("Temperature", "Humidity", "Light", "CO2", "HumidityRatio"), "cluster_id")
| sample 10
Output
Timestamp | Temperature | Humidity | Light | CO2 | HumidityRatio | Occupancy | Test | cluster_id |
---|---|---|---|---|---|---|---|---|
2015-02-02 14:38:00.0000000 | 23.64 | 27.1 | 473 | 908.8 | 0.00489763 | TRUE | TRUE | 1 |
2015-02-03 01:47:00.0000000 | 20.575 | 22.125 | 0 | 446 | 0.00330878 | FALSE | TRUE | 0 |
2015-02-10 08:47:00.0000000 | 20.42666667 | 33.56 | 405 | 494.3333333 | 0.004986493 | TRUE | FALSE | 4 |
2015-02-10 09:15:00.0000000 | 20.85666667 | 35.09666667 | 433 | 665.3333333 | 0.005358055 | TRUE | FALSE | 4 |
2015-02-11 16:13:00.0000000 | 21.89 | 30.0225 | 429 | 771.75 | 0.004879358 | TRUE | TRUE | 4 |
2015-02-13 14:06:00.0000000 | 23.4175 | 26.5225 | 608 | 599.75 | 0.004728116 | TRUE | TRUE | 4 |
2015-02-13 23:09:00.0000000 | 20.13333333 | 32.2 | 0 | 502.6666667 | 0.004696278 | FALSE | TRUE | 0 |
2015-02-15 18:30:00.0000000 | 20.5 | 32.79 | 0 | 666.5 | 0.004893459 | FALSE | TRUE | 3 |
2015-02-17 13:43:00.0000000 | 21.7 | 33.9 | 454 | 1167 | 0.005450924 | TRUE | TRUE | 1 |
2015-02-17 18:17:00.0000000 | 22.025 | 34.2225 | 0 | 1538.25 | 0.005614538 | FALSE | TRUE | 2 |
Extract the centroids and size of each cluster
To use a query-defined function, invoke it after the embedded function definition.
let kmeans_fl=(tbl:(*), k:int, features:dynamic, cluster_col:string)
{
let kwargs = bag_pack('k', k, 'features', features, 'cluster_col', cluster_col);
let code = ```if 1:
from sklearn.cluster import KMeans
k = kargs["k"]
features = kargs["features"]
cluster_col = kargs["cluster_col"]
km = KMeans(n_clusters=k)
df1 = df[features]
km.fit(df1)
result = df
result[cluster_col] = km.labels_
```;
tbl
| evaluate python(typeof(*), code, kwargs)
};
OccupancyDetection
| extend cluster_id=int(null)
| invoke kmeans_fl(5, pack_array("Temperature", "Humidity", "Light", "CO2", "HumidityRatio"), "cluster_id")
| summarize Temperature=avg(Temperature), Humidity=avg(Humidity), Light=avg(Light), CO2=avg(CO2), HumidityRatio=avg(HumidityRatio), num=count() by cluster_id
| order by num
Output
cluster_id | Temperature | Humidity | Light | CO2 | HumidityRatio | num |
---|---|---|---|---|---|---|
0 | 20.3507186863278 | 27.1521395395781 | 10.1995789883291 | 486.804272186974 | 0.00400132147662714 | 11124 |
4 | 20.9247315268427 | 28.7971160082823 | 20.7311894656536 | 748.965771574469 | 0.00440412568299058 | 3063 |
1 | 22.0284137970445 | 27.8953334469889 | 481.872136037748 | 1020.70779349773 | 0.00456692559904535 | 2514 |
3 | 22.0344177115763 | 25.1151053429273 | 462.358969056434 | 656.310608696507 | 0.00411782436443015 | 2176 |
2 | 21.4091216082295 | 31.8363099552939 | 174.614816229606 | 1482.05062388414 | 0.00504573022875817 | 1683 |
This feature isn't supported.
Feedback
https://aka.ms/ContentUserFeedback.
Coming soon: Throughout 2024 we will be phasing out GitHub Issues as the feedback mechanism for content and replacing it with a new feedback system. For more information see:Submit and view feedback for