kmeans_fl()

The function kmeans_fl() is a UDF (user-defined function) that clusters a dataset using the k-means algorithm.

Prerequisites

  • The Python plugin must be enabled on the cluster. This is required for the inline Python used in the function.

Syntax

T | invoke kmeans_fl(k, features, cluster_col)

Learn more about syntax conventions.

Parameters

| Name | Type | Required | Description |
|--|--|--|--|
| k | int | ✔️ | The number of clusters. |
| features | dynamic | ✔️ | An array containing the names of the feature columns to use for clustering. |
| cluster_col | string | ✔️ | The name of the column to store the output cluster ID for each record. |
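
As a rough illustration of how the three parameters interact, the sketch below runs a toy k-means (Lloyd's algorithm) over a list of records in plain Python. It is not the scikit-learn implementation that the function calls, which by default uses a more careful initialization (k-means++). The helper name kmeans_rows, the iterations and seed arguments, and the sample records are all hypothetical.

```python
import random

def kmeans_rows(rows, k, features, cluster_col, iterations=20, seed=0):
    """Toy Lloyd's algorithm over a list of dict records (illustrative only)."""
    rng = random.Random(seed)
    # Project each record onto the chosen feature columns.
    points = [[float(row[name]) for name in features] for row in rows]
    # Assumption: pick k distinct records as the initial centroids.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    # Write the cluster ID into the requested output column of a copy of each record.
    return [{**row, cluster_col: label} for row, label in zip(rows, labels)]

# Hypothetical sample: two tight groups that are far apart.
rows = [
    {"x": 0.0, "y": 0.1}, {"x": 0.2, "y": 0.0},
    {"x": 9.8, "y": 10.0}, {"x": 10.0, "y": 9.9},
]
clustered = kmeans_rows(rows, k=2, features=["x", "y"], cluster_col="cluster_id")
```

Here k sets the number of centroids, features selects which columns form the points, and cluster_col names the field that receives the label. The numeric cluster IDs are arbitrary: only the grouping of records is meaningful.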

Function definition

You can define the function by either embedding its code as a query-defined function, or creating it as a stored function in your database, as follows:

Define the function using the following let statement. No permissions are required.

Important

A let statement can't run on its own. It must be followed by a tabular expression statement. To run a working example of kmeans_fl(), see Example.

let kmeans_fl=(tbl:(*), k:int, features:dynamic, cluster_col:string)
{
    let kwargs = bag_pack('k', k, 'features', features, 'cluster_col', cluster_col);
    let code = ```if 1:

        from sklearn.cluster import KMeans

        k = kargs["k"]
        features = kargs["features"]
        cluster_col = kargs["cluster_col"]

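        # Fit k-means on the selected feature columns only.
        df1 = df[features]
        km = KMeans(n_clusters=k, random_state=0)
        km.fit(df1)
        # Return the input table with the cluster ID written to cluster_col.
        result = df
        result[cluster_col] = km.labels_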
    ```;
    tbl
    | evaluate python(typeof(*), code, kwargs)
};
// Write your query to use the function here.

Example

The following example uses the invoke operator to run the function.

Clustering of artificial dataset with three clusters

To use a query-defined function, invoke it after the embedded function definition.

let kmeans_fl=(tbl:(*), k:int, features:dynamic, cluster_col:string)
{
    let kwargs = bag_pack('k', k, 'features', features, 'cluster_col', cluster_col);
    let code = ```if 1:

        from sklearn.cluster import KMeans

        k = kargs["k"]
        features = kargs["features"]
        cluster_col = kargs["cluster_col"]

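        # Fit k-means on the selected feature columns only.
        df1 = df[features]
        km = KMeans(n_clusters=k, random_state=0)
        km.fit(df1)
        # Return the input table with the cluster ID written to cluster_col.
        result = df
        result[cluster_col] = km.labels_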
    ```;
    tbl
    | evaluate python(typeof(*), code, kwargs)
};
union 
(range x from 1 to 100 step 1 | extend x=rand()+3, y=rand()+2),
(range x from 101 to 200 step 1 | extend x=rand()+1, y=rand()+4),
(range x from 201 to 300 step 1 | extend x=rand()+2, y=rand()+6)
| extend cluster_id=int(null)
| invoke kmeans_fl(3, pack_array("x", "y"), "cluster_id")
| render scatterchart with(series=cluster_id)

Screenshot of scatterchart of K-Means clustering of artificial dataset with three clusters.
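
The example data is built so that three clusters are the natural answer: each range block shifts rand(), which returns values in [0, 1), into its own unit square. The following sketch uses Python's random module as a stand-in for rand(), rebuilds the three groups with the same offsets, and checks that their bounding boxes never overlap. The helper names are hypothetical and the code is not part of the function.

```python
import random

# Offsets copied from the query: each group is rand() shifted by (dx, dy).
OFFSETS = [(3, 2), (1, 4), (2, 6)]

def make_group(dx, dy, rng, n=100):
    # random() draws from [0, 1), standing in for rand() in the query.
    return [(rng.random() + dx, rng.random() + dy) for _ in range(n)]

def bounding_box(points):
    xs, ys = zip(*points)
    return (min(xs), max(xs), min(ys), max(ys))

def boxes_overlap(a, b):
    ax0, ax1, ay0, ay1 = a
    bx0, bx1, by0, by1 = b
    return ax0 <= bx1 and bx0 <= ax1 and ay0 <= by1 and by0 <= ay1

rng = random.Random(0)
groups = [make_group(dx, dy, rng) for dx, dy in OFFSETS]
boxes = [bounding_box(g) for g in groups]
disjoint = all(
    not boxes_overlap(boxes[i], boxes[j])
    for i in range(3) for j in range(i + 1, 3)
)
print(disjoint)  # True: the y ranges [2, 3), [4, 5), and [6, 7) never meet
```

Because the groups are this well separated, running the function with k set to 3 should give each group its own cluster ID, although which ID goes to which group is arbitrary.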
