kmeans_fl()

Article
07/25/2024

The function kmeans_fl() is a UDF (user-defined function) that clusterizes a dataset using the k-means algorithm.

Prerequisites

The Python plugin must be enabled on the cluster. This is required for the inline Python used in the function.

The Python plugin must be enabled on the database. This is required for the inline Python used in the function.

Syntax

T | invoke kmeans_fl(k, features, cluster_col)

Learn more about syntax conventions.

Parameters

Name	Type	Required	Description
k	`int`	✔️	The number of clusters.
features	`dynamic`	✔️	An array containing the names of the features columns to use for clustering.
cluster_col	`string`	✔️	The name of the column to store the output cluster ID for each record.

Function definition

You can define the function by either embedding its code as a query-defined function, or creating it as a stored function in your database, as follows:

Query-defined
Stored

Define the function using the following let statement. No permissions are required.

Important

A let statement can't run on its own. It must be followed by a tabular expression statement. To run a working example of kmeans_fl(), see example.

let kmeans_fl=(tbl:(*), k:int, features:dynamic, cluster_col:string)
{
    let kwargs = bag_pack('k', k, 'features', features, 'cluster_col', cluster_col);
    let code = ```if 1:

        from sklearn.cluster import KMeans

        k = kargs["k"]
        features = kargs["features"]
        cluster_col = kargs["cluster_col"]

        df1 = df[features]
        km = KMeans(n_clusters=k, random_state=0)
        km.fit(df1)
        result = df
        result[cluster_col] = km.labels_
    ```;
    tbl
    | evaluate python(typeof(*), code, kwargs)
};
// Write your query to use the function here.

Define the stored function once using the following .create function. Database User permissions are required.

Important

You must run this code to create the function before you can use the function as shown in the example.

.create-or-alter function with (folder = "Packages\\ML", docstring = "K-Means clustering")
kmeans_fl(tbl:(*), k:int, features:dynamic, cluster_col:string)
{
    let kwargs = bag_pack('k', k, 'features', features, 'cluster_col', cluster_col);
    let code = ```if 1:

        from sklearn.cluster import KMeans

        k = kargs["k"]
        features = kargs["features"]
        cluster_col = kargs["cluster_col"]

        df1 = df[features]
        km = KMeans(n_clusters=k, random_state=0)
        km.fit(df1)
        result = df
        result[cluster_col] = km.labels_
    ```;
    tbl
    | evaluate python(typeof(*), code, kwargs)
}

Example

The following example uses the invoke operator to run the function.

Clustering of artificial dataset with three clusters

Query-defined
Stored

To use a query-defined function, invoke it after the embedded function definition.

let kmeans_fl=(tbl:(*), k:int, features:dynamic, cluster_col:string)
{
    let kwargs = bag_pack('k', k, 'features', features, 'cluster_col', cluster_col);
    let code = ```if 1:

        from sklearn.cluster import KMeans

        k = kargs["k"]
        features = kargs["features"]
        cluster_col = kargs["cluster_col"]

        df1 = df[features]
        km = KMeans(n_clusters=k, random_state=0)
        km.fit(df1)
        result = df
        result[cluster_col] = km.labels_
    ```;
    tbl
    | evaluate python(typeof(*), code, kwargs)
};
union 
(range x from 1 to 100 step 1 | extend x=rand()+3, y=rand()+2),
(range x from 101 to 200 step 1 | extend x=rand()+1, y=rand()+4),
(range x from 201 to 300 step 1 | extend x=rand()+2, y=rand()+6)
| extend cluster_id=int(null)
| invoke kmeans_fl(3, bag_pack("x", "y"), "cluster_id")
| render scatterchart with(series=cluster_id)

Important

For this example to run successfully, you must first run the Function definition code to store the function.

union 
(range x from 1 to 100 step 1 | extend x=rand()+3, y=rand()+2),
(range x from 101 to 200 step 1 | extend x=rand()+1, y=rand()+4),
(range x from 201 to 300 step 1 | extend x=rand()+2, y=rand()+6)
| extend cluster_id=int(null)
| invoke kmeans_fl(3, bag_pack("x", "y"), "cluster_id")
| render scatterchart with(series=cluster_id)

Screenshot of scatterchart of K-Means clustering of artificial dataset with three clusters.

This feature isn't supported.

Share via

kmeans_fl()

Prerequisites

Syntax

Parameters

Function definition

Example

Clustering of artificial dataset with three clusters

Feedback

Feedback

Additional resources