quantize_fl()

Article
08/15/2023

The function quantize_fl() is a user-defined function (UDF) that bins metric columns. It quantizes metric columns to categorical labels, based on the K-Means algorithm.

Prerequisites

The Python plugin must be enabled on the cluster. This is required for the inline Python used in the function.

The Python plugin must be enabled on the database. This is required for the inline Python used in the function.

Syntax

T | invoke quantize_fl(num_bins, in_cols, out_cols [, labels ])

Learn more about syntax conventions.

Parameters

Name	Type	Required	Description
num_bins	`int`	✔️	The required number of bins.
in_cols	`dynamic`	✔️	An array containing the names of the columns to quantize.
out_cols	`dynamic`	✔️	An array containing the names of the respective output columns for the binned values.
labels	`dynamic`		An array containing the label names. If unspecified, bin ranges will be used.

Function definition

You can define the function by either embedding its code as a query-defined function, or creating it as a stored function in your database, as follows:

Query-defined
Stored

Define the function using the following let statement. No permissions are required.

Important

A let statement can't run on its own. It must be followed by a tabular expression statement. To run a working example of quantize_fl(), see Example.

let quantize_fl=(tbl:(*), num_bins:int, in_cols:dynamic, out_cols:dynamic, labels:dynamic=dynamic(null))
{
    let kwargs = bag_pack('num_bins', num_bins, 'in_cols', in_cols, 'out_cols', out_cols, 'labels', labels);
    let code = ```if 1:

        from sklearn.preprocessing import KBinsDiscretizer

        num_bins = kargs["num_bins"]
        in_cols = kargs["in_cols"]
        out_cols = kargs["out_cols"]
        labels = kargs["labels"]

        result = df
        binner = KBinsDiscretizer(n_bins=num_bins, encode="ordinal", strategy="kmeans")
        df_in = df[in_cols]
        bdata = binner.fit_transform(df_in)
        if labels is None:
            for i in range(len(out_cols)):    # loop on each column and convert it to binned labels
                ii = np.round(binner.bin_edges_[i], 3)
                labels = [str(ii[j-1]) + '-' + str(ii[j]) for j in range(1, num_bins+1)]
                result.loc[:,out_cols[i]] = np.take(labels, bdata[:, i].astype(int))
        else:
            result[out_cols] = np.take(labels, bdata.astype(int))
    ```;
    tbl
    | evaluate python(typeof(*), code, kwargs)
};
// Write your query to use the function here.

Define the stored function once using the following .create function. Database User permissions are required.

Important

You must run this code to create the function before you can use the function as shown in the Example.

.create function with (folder = "Packages\\ML", docstring = "Binning metric columns")
quantize_fl(tbl:(*), num_bins:int, in_cols:dynamic, out_cols:dynamic, labels:dynamic)
{
    let kwargs = bag_pack('num_bins', num_bins, 'in_cols', in_cols, 'out_cols', out_cols, 'labels', labels);
    let code = ```if 1:

        from sklearn.preprocessing import KBinsDiscretizer

        num_bins = kargs["num_bins"]
        in_cols = kargs["in_cols"]
        out_cols = kargs["out_cols"]
        labels = kargs["labels"]

        result = df
        binner = KBinsDiscretizer(n_bins=num_bins, encode="ordinal", strategy="kmeans")
        df_in = df[in_cols]
        bdata = binner.fit_transform(df_in)
        if labels is None:
            for i in range(len(out_cols)):    # loop on each column and convert it to binned labels
                ii = np.round(binner.bin_edges_[i], 3)
                labels = [str(ii[j-1]) + '-' + str(ii[j]) for j in range(1, num_bins+1)]
                result.loc[:,out_cols[i]] = np.take(labels, bdata[:, i].astype(int))
        else:
            result[out_cols] = np.take(labels, bdata.astype(int))
    ```;
    tbl
    | evaluate python(typeof(*), code, kwargs)
}

Example

The following example uses the invoke operator to run the function.

Query-defined
Stored

To use a query-defined function, invoke it after the embedded function definition.

let quantize_fl=(tbl:(*), num_bins:int, in_cols:dynamic, out_cols:dynamic, labels:dynamic=dynamic(null))
{
    let kwargs = bag_pack('num_bins', num_bins, 'in_cols', in_cols, 'out_cols', out_cols, 'labels', labels);
    let code = ```if 1:

        from sklearn.preprocessing import KBinsDiscretizer

        num_bins = kargs["num_bins"]
        in_cols = kargs["in_cols"]
        out_cols = kargs["out_cols"]
        labels = kargs["labels"]

        result = df
        binner = KBinsDiscretizer(n_bins=num_bins, encode="ordinal", strategy="kmeans")
        df_in = df[in_cols]
        bdata = binner.fit_transform(df_in)
        if labels is None:
            for i in range(len(out_cols)):    # loop on each column and convert it to binned labels
                ii = np.round(binner.bin_edges_[i], 3)
                labels = [str(ii[j-1]) + '-' + str(ii[j]) for j in range(1, num_bins+1)]
                result.loc[:,out_cols[i]] = np.take(labels, bdata[:, i].astype(int))
        else:
            result[out_cols] = np.take(labels, bdata.astype(int))
    ```;
    tbl
    | evaluate python(typeof(*), code, kwargs)
};
//
union 
(range x from 1 to 5 step 1),
(range x from 10 to 15 step 1),
(range x from 20 to 25 step 1)
| extend x_label='', x_bin=''
| invoke quantize_fl(3, pack_array('x'), pack_array('x_label'), pack_array('Low', 'Med', 'High'))
| invoke quantize_fl(3, pack_array('x'), pack_array('x_bin'), dynamic(null))

Important

For this example to run successfully, you must first run the Function definition code to store the function.

union 
(range x from 1 to 5 step 1),
(range x from 10 to 15 step 1),
(range x from 20 to 25 step 1)
| extend x_label='', x_bin=''
| invoke quantize_fl(3, pack_array('x'), pack_array('x_label'), pack_array('Low', 'Med', 'High'))
| invoke quantize_fl(3, pack_array('x'), pack_array('x_bin'), dynamic(null))

Output

x	x_label	x_bin
1	Low	1.0-7.75
2	Low	1.0-7.75
3	Low	1.0-7.75
4	Low	1.0-7.75
5	Low	1.0-7.75
20	High	17.5-25.0
21	High	17.5-25.0
22	High	17.5-25.0
23	High	17.5-25.0
24	High	17.5-25.0
25	High	17.5-25.0
10	Med	7.75-17.5
11	Med	7.75-17.5
12	Med	7.75-17.5
13	Med	7.75-17.5
14	Med	7.75-17.5
15	Med	7.75-17.5

This feature isn't supported.

Share via

quantize_fl()

Prerequisites

Syntax

Parameters

Function definition

Example

Feedback

Feedback

Additional resources