kmeans_fl()

The function kmeans_fl() is a UDF (user-defined function) that clusters a dataset using the k-means algorithm.

Prerequisites

  • The Python plugin must be enabled on the cluster. This is required for the inline Python used in the function.

Syntax

T | invoke kmeans_fl(k, features_cols, cluster_col)

Learn more about syntax conventions.

Parameters

Name           Type     Required  Description
k              int      ✔️        The number of clusters.
features_cols  dynamic  ✔️        An array containing the names of the feature columns to use for clustering.
cluster_col    string   ✔️        The name of the column to store the output cluster ID for each record.

Function definition

You can define the function either by embedding its code as a query-defined function or by creating it as a stored function in your database. The query-defined form is shown here.

Define the function using the following let statement. No permissions are required.

Important

A let statement can't run on its own. It must be followed by a tabular expression statement. To run a working example of kmeans_fl(), see Examples.

let kmeans_fl=(tbl:(*), k:int, features:dynamic, cluster_col:string)
{
    let kwargs = bag_pack('k', k, 'features', features, 'cluster_col', cluster_col);
    let code = ```if 1:

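        from sklearn.cluster import KMeans

        # kargs holds the property bag passed as the plugin's script parameters.
        k = kargs["k"]
        features = kargs["features"]
        cluster_col = kargs["cluster_col"]

        # Fit k-means on the selected feature columns only.
        km = KMeans(n_clusters=k)
        df1 = df[features]
        km.fit(df1)
        # Return the input table with the cluster ID of each record filled in.
        result = df
        result[cluster_col] = km.labels_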
    ```;
    tbl
    | evaluate python(typeof(*), code, kwargs)
};
// Write your query to use the function here.
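
The embedded Python passes the selected feature columns to scikit-learn's KMeans, which does the actual clustering. As a rough illustration of what a k-means fit involves, the following is a minimal plain-Python sketch of the Lloyd iteration: assign each point to its nearest centroid, then move each centroid to the mean of its members, and repeat until the assignments stop changing. It is not scikit-learn's implementation. The function name kmeans, the random choice of starting centroids, the iterations cap, and the seed parameter are assumptions made for this example only.

```python
import random


def kmeans(points, k, iterations=100, seed=0):
    """Plain-Python Lloyd iteration (illustrative sketch). Returns (labels, centroids)."""
    rng = random.Random(seed)
    # Assumed initialization: k distinct input points picked at random.
    # scikit-learn's KMeans defaults to k-means++ initialization instead.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [-1] * len(points)
    for _ in range(iterations):
        # Assignment step: each point takes the label of its nearest centroid
        # (squared Euclidean distance).
        new_labels = [
            min(
                range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])),
            )
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids would not move
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical toy data: two well-separated groups of 2-D points.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centroids = kmeans(points, k=2)
```

The returned labels play the same role as km.labels_ in the function definition: one cluster ID per input record, in input order. The numeric IDs themselves are arbitrary and can differ between runs with different starting centroids.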

Examples

The following examples use the invoke operator to run the function.

Cluster room occupancy from sensor measurements

To use a query-defined function, invoke it after the embedded function definition.

let kmeans_fl=(tbl:(*), k:int, features:dynamic, cluster_col:string)
{
    let kwargs = bag_pack('k', k, 'features', features, 'cluster_col', cluster_col);
    let code = ```if 1:

        from sklearn.cluster import KMeans

        k = kargs["k"]
        features = kargs["features"]
        cluster_col = kargs["cluster_col"]

        km = KMeans(n_clusters=k)
        df1 = df[features]
        km.fit(df1)
        result = df
        result[cluster_col] = km.labels_
    ```;
    tbl
    | evaluate python(typeof(*), code, kwargs)
};
//
// Cluster room occupancy from sensor measurements.
//
// Occupancy Detection is an open dataset from the UCI Repository at https://archive.ics.uci.edu/ml/datasets/Occupancy+Detection+
// It contains experimental data for binary classification of room occupancy from Temperature, Humidity, Light, and CO2.
//
OccupancyDetection
| extend cluster_id=int(null)
| invoke kmeans_fl(5, pack_array("Temperature", "Humidity", "Light", "CO2", "HumidityRatio"), "cluster_id")
| sample 10

Output

Timestamp Temperature Humidity Light CO2 HumidityRatio Occupancy Test cluster_id
2015-02-02 14:38:00.0000000 23.64 27.1 473 908.8 0.00489763 TRUE TRUE 1
2015-02-03 01:47:00.0000000 20.575 22.125 0 446 0.00330878 FALSE TRUE 0
2015-02-10 08:47:00.0000000 20.42666667 33.56 405 494.3333333 0.004986493 TRUE FALSE 4
2015-02-10 09:15:00.0000000 20.85666667 35.09666667 433 665.3333333 0.005358055 TRUE FALSE 4
2015-02-11 16:13:00.0000000 21.89 30.0225 429 771.75 0.004879358 TRUE TRUE 4
2015-02-13 14:06:00.0000000 23.4175 26.5225 608 599.75 0.004728116 TRUE TRUE 4
2015-02-13 23:09:00.0000000 20.13333333 32.2 0 502.6666667 0.004696278 FALSE TRUE 0
2015-02-15 18:30:00.0000000 20.5 32.79 0 666.5 0.004893459 FALSE TRUE 3
2015-02-17 13:43:00.0000000 21.7 33.9 454 1167 0.005450924 TRUE TRUE 1
2015-02-17 18:17:00.0000000 22.025 34.2225 0 1538.25 0.005614538 FALSE TRUE 2

Extract the centroids and size of each cluster

To use a query-defined function, invoke it after the embedded function definition.

let kmeans_fl=(tbl:(*), k:int, features:dynamic, cluster_col:string)
{
    let kwargs = bag_pack('k', k, 'features', features, 'cluster_col', cluster_col);
    let code = ```if 1:

        from sklearn.cluster import KMeans

        k = kargs["k"]
        features = kargs["features"]
        cluster_col = kargs["cluster_col"]

        km = KMeans(n_clusters=k)
        df1 = df[features]
        km.fit(df1)
        result = df
        result[cluster_col] = km.labels_
    ```;
    tbl
    | evaluate python(typeof(*), code, kwargs)
};
OccupancyDetection
| extend cluster_id=int(null)
| invoke kmeans_fl(5, pack_array("Temperature", "Humidity", "Light", "CO2", "HumidityRatio"), "cluster_id")
| summarize Temperature=avg(Temperature), Humidity=avg(Humidity), Light=avg(Light), CO2=avg(CO2), HumidityRatio=avg(HumidityRatio), num=count() by cluster_id
| order by num

Output

cluster_id Temperature Humidity Light CO2 HumidityRatio num
0 20.3507186863278 27.1521395395781 10.1995789883291 486.804272186974 0.00400132147662714 11124
4 20.9247315268427 28.7971160082823 20.7311894656536 748.965771574469 0.00440412568299058 3063
1 22.0284137970445 27.8953334469889 481.872136037748 1020.70779349773 0.00456692559904535 2514
3 22.0344177115763 25.1151053429273 462.358969056434 656.310608696507 0.00411782436443015 2176
2 21.4091216082295 31.8363099552939 174.614816229606 1482.05062388414 0.00504573022875817 1683
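
The query above reports, for each cluster, the mean of every feature over the cluster's members together with the member count, largest cluster first. The same aggregation can be sketched in plain Python as follows. The function name summarize_clusters and the sample rows are hypothetical and are not part of kmeans_fl().

```python
from collections import defaultdict


def summarize_clusters(rows, labels):
    """Return (cluster_id, feature_means, size) tuples, largest cluster first."""
    members = defaultdict(list)
    for row, label in zip(rows, labels):
        members[label].append(row)
    summary = [
        # Column-wise mean of the members, plus the member count.
        (label, [sum(col) / len(group) for col in zip(*group)], len(group))
        for label, group in members.items()
    ]
    # Mirrors "order by num", which sorts in descending order by default.
    summary.sort(key=lambda item: item[2], reverse=True)
    return summary


# Hypothetical feature rows and the cluster ID assigned to each of them.
rows = [(1.0, 2.0), (3.0, 4.0), (10.0, 10.0)]
labels = [0, 0, 1]
summary = summarize_clusters(rows, labels)
```

Each entry pairs a cluster ID with its per-feature means and its size, which corresponds to one row of the output table above.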
