pairwise_dist_fl()

[アーティクル]
01/18/2024

複数の名義変数と数値変数に基づいて、エンティティ間のペアワイズ距離を計算します。

関数 pairwise_dist_fl() は UDF (ユーザー定義関数) であり、同じパーティションに属するデータポイント間の多変量距離を計算します。これは、名目変数と数値変数を考慮します。

エンティティ名とパーティション名以外のすべての文字列フィールドは、名義変数と見なされます。値が異なる場合の距離は 1 に等しく、値が同じである場合の距離は 0 に等しくなります。
すべての数値フィールドは、数値変数と見なされます。それらは、z スコアに変換することで正規化され、その距離は差の絶対値として計算されます。データポイント間の多変量距離の合計は、変数間の距離の平均として計算されます。

0 に近い距離はエンティティが似ていることを意味し、1 より上の距離は異なっていることを意味します。同様に、平均距離が 1 以上のエンティティは、パーティション内の他の多くのエンティティと異なっていることを示し、外れ値の可能性を示します。

この関数の出力は、同じパーティション内のエンティティ間のペアワイズ距離です。これは、同様のペアや異なるペアを探すためにそのまま使用できます。距離が最小のエンティティは、多くの特徴を共有しています。また、これは距離行列に簡単に変換することができ (使用例を参照)、クラスタリングや外れ値の検出アルゴリズムの入力としても使用できます。

構文

pairwise_dist_fl(エンティティ、 パーティション)

構文規則について詳しく知る。

パラメーター

名前	型	必須	説明
エンティティ	`string`	✔️	距離の計算対象となるエンティティの名前または ID を含む入力テーブル列の名前。
partition	`string`	✔️	パーティションまたはスコープを含む入力テーブル列の名前。同じパーティション内のすべてのエンティティのペアに関して距離が計算されます。

関数の定義

関数を定義するには、次のようにコードをクエリ定義関数として埋め込むか、データベースに格納された関数として作成します。

クエリ定義
保管済み

次の let ステートメントを使用して関数を定義します。権限は必要ありません。

重要

let ステートメントを単独で実行することはできません。その後に表形式の式ステートメントを指定する必要があります。の動作例 pairwise_dist_fl()を実行するには、「例」を参照してください。

let pairwise_dist_fl = (tbl:(*), id_col:string, partition_col:string)
{
    let generic_dist = (value1:dynamic, value2:dynamic) 
    {
        // Calculates the distance between two values; treats all strings as nominal values and numbers as numerical,
        // can be extended to other data types or tweaked by adding weights or changing formulas.
            iff(gettype(value1[0]) == "string", todouble(tostring(value1[0]) != tostring(value2[0])), abs(todouble(value1[0]) - todouble(value2[0])))
    };
    let T = (tbl | extend _entity = column_ifexists(id_col, ''), _partition = column_ifexists(partition_col, '') | project-reorder _entity, _partition);
    let sum_data = (
        // Calculates summary statistics to be used for normalization.
        T
        | project-reorder _entity
        | project _partition, p = pack_array(*)
        | mv-expand with_itemindex=idx p
        | summarize count(), avg(todouble(p)), stdev(todouble(p)) by _partition, idx
        | sort by _partition, idx asc
        | summarize make_list(avg_p), make_list(stdev_p) by _partition
    );
    let normalized_data = (
        // Performs normalization on numerical variables by substrcting mean and scaling by standard deviation. Other normalization techniques can be used
        // by adding metrics to previous function and using here.
        T
        | project _partition, p = pack_array(*)
        | join kind = leftouter (sum_data) on _partition
        | mv-apply p, list_avg_p, list_stdev_p on (
            extend normalized = iff((not(isnan(todouble(list_avg_p))) and (list_stdev_p > 0)), pack_array((todouble(p) - todouble(list_avg_p))/todouble(list_stdev_p)), p)
            | summarize a = make_list(normalized) by _partition
        )
        | project _partition, a
    );
    let dist_data = (
        // Calculates distances of included variables and sums them up to get a multivariate distance between all entities under the same partition.
        normalized_data
        | join kind = inner (normalized_data) on _partition
        | project entity = tostring(a[0]), entity1 = tostring(a1[0]), a = array_slice(a, 1, -1), a1 = array_slice(a1, 1, -1), _partition
        | mv-apply a, a1 on 
        (
            project d = generic_dist(pack_array(a), pack_array(a1))
            | summarize d = make_list(d)
        )
        | extend dist = bin((1.0*array_sum(d)-1.0)/array_length(d), 0.0001) // -1 cancels the artifact distance calculated between entity names appearing in the bag and normalizes by number of features        
        | project-away d
        | where entity != entity1
        | sort by _partition asc, entity asc, dist asc
    );
    dist_data
};
// Write your query to use the function here.

次 .create functionのを使用して、ストアド関数を 1 回定義します。データベースユーザーのアクセス許可が必要です。

重要

このコードを実行して関数を作成してから、例に示すように関数を使用する必要があります。

.create-or-alter function with (folder = "Packages\\Stats", docstring = "Calculate distances between pairs of entites based on multiple nominal and numerical variables")
pairwise_dist_fl(tbl:(*), id_col:string, partition_col:string)
{
    let generic_dist = (value1:dynamic, value2:dynamic) 
    {
        // Calculates the distance between two values; treats all strings as nominal values and numbers as numerical,
        // can be extended to other data types or tweaked by adding weights or changing formulas.
            iff(gettype(value1[0]) == "string", todouble(tostring(value1[0]) != tostring(value2[0])), abs(todouble(value1[0]) - todouble(value2[0])))
    };
    let T = (tbl | extend _entity = column_ifexists(id_col, ''), _partition = column_ifexists(partition_col, '') | project-reorder _entity, _partition);
    let sum_data = (
        // Calculates summary statistics to be used for normalization.
        T
        | project-reorder _entity
        | project _partition, p = pack_array(*)
        | mv-expand with_itemindex=idx p
        | summarize count(), avg(todouble(p)), stdev(todouble(p)) by _partition, idx
        | sort by _partition, idx asc
        | summarize make_list(avg_p), make_list(stdev_p) by _partition
    );
    let normalized_data = (
        // Performs normalization on numerical variables by substrcting mean and scaling by standard deviation. Other normalization techniques can be used
        // by adding metrics to previous function and using here.
        T
        | project _partition, p = pack_array(*)
        | join kind = leftouter (sum_data) on _partition
        | mv-apply p, list_avg_p, list_stdev_p on (
            extend normalized = iff((not(isnan(todouble(list_avg_p))) and (list_stdev_p > 0)), pack_array((todouble(p) - todouble(list_avg_p))/todouble(list_stdev_p)), p)
            | summarize a = make_list(normalized) by _partition
        )
        | project _partition, a
    );
    let dist_data = (
        // Calculates distances of included variables and sums them up to get a multivariate distance between all entities under the same partition.
        normalized_data
        | join kind = inner (normalized_data) on _partition
        | project entity = tostring(a[0]), entity1 = tostring(a1[0]), a = array_slice(a, 1, -1), a1 = array_slice(a1, 1, -1), _partition
        | mv-apply a, a1 on 
        (
            project d = generic_dist(pack_array(a), pack_array(a1))
            | summarize d = make_list(d)
        )
        | extend dist = bin((1.0*array_sum(d)-1.0)/array_length(d), 0.0001) // -1 cancels the artifact distance calculated between entity names appearing in the bag and normalizes by number of features        
        | project-away d
        | where entity != entity1
        | sort by _partition asc, entity asc, dist asc
    );
    dist_data
}

例

次の例では、 invoke 演算子を使用して関数を実行します。

クエリ定義
保管済み

クエリ定義関数を使用するには、埋め込み関数定義の後で呼び出します。

クエリを実行する

let pairwise_dist_fl = (tbl:(*), id_col:string, partition_col:string)
{
    let generic_dist = (value1:dynamic, value2:dynamic) 
    {
        // Calculates the distance between two values; treats all strings as nominal values and numbers as numerical,
        // can be extended to other data types or tweaked by adding weights or changing formulas.
            iff(gettype(value1[0]) == "string", todouble(tostring(value1[0]) != tostring(value2[0])), abs(todouble(value1[0]) - todouble(value2[0])))
    };
    let T = (tbl | extend _entity = column_ifexists(id_col, ''), _partition = column_ifexists(partition_col, '') | project-reorder _entity, _partition);
    let sum_data = (
        // Calculates summary statistics to be used for normalization.
        T
        | project-reorder _entity
        | project _partition, p = pack_array(*)
        | mv-expand with_itemindex=idx p
        | summarize count(), avg(todouble(p)), stdev(todouble(p)) by _partition, idx
        | sort by _partition, idx asc
        | summarize make_list(avg_p), make_list(stdev_p) by _partition
    );
    let normalized_data = (
        // Performs normalization on numerical variables by substrcting mean and scaling by standard deviation. Other normalization techniques can be used
        // by adding metrics to previous function and using here.
        T
        | project _partition, p = pack_array(*)
        | join kind = leftouter (sum_data) on _partition
        | mv-apply p, list_avg_p, list_stdev_p on (
            extend normalized = iff((not(isnan(todouble(list_avg_p))) and (list_stdev_p > 0)), pack_array((todouble(p) - todouble(list_avg_p))/todouble(list_stdev_p)), p)
            | summarize a = make_list(normalized) by _partition
        )
        | project _partition, a
    );
    let dist_data = (
        // Calculates distances of included variables and sums them up to get a multivariate distance between all entities under the same partition.
        normalized_data
        | join kind = inner (normalized_data) on _partition
        | project entity = tostring(a[0]), entity1 = tostring(a1[0]), a = array_slice(a, 1, -1), a1 = array_slice(a1, 1, -1), _partition
        | mv-apply a, a1 on 
        (
            project d = generic_dist(pack_array(a), pack_array(a1))
            | summarize d = make_list(d)
        )
        | extend dist = bin((1.0*array_sum(d)-1.0)/array_length(d), 0.0001) // -1 cancels the artifact distance calculated between entity names appearing in the bag and normalizes by number of features        
        | project-away d
        | where entity != entity1
        | sort by _partition asc, entity asc, dist asc
    );
    dist_data
};
//
let raw_data = datatable(name:string, gender: string, height:int, weight:int, limbs:int, accessory:string, type:string)[
    'Andy',     'M',    160,    80,     4,  'Hat',      'Person',
    'Betsy',    'F',    170,    70,     4,  'Bag',      'Person',
    'Cindy',    'F',    130,    30,     4,  'Hat',      'Person',
    'Dan',      'M',    190,    105,    4,  'Hat',      'Person',
    'Elmie',    'M',    110,    30,     4,  'Toy',      'Person',
    'Franny',   'F',    170,    65,     4,  'Bag',      'Person',
    'Godzilla', '?',    260,    210,    5,  'Tail',     'Person',
    'Hannie',   'F',    112,    28,     4,  'Toy',      'Person',
    'Ivie',     'F',    105,    20,     4,  'Toy',      'Person',
    'Johnnie',  'M',    107,    21,     4,  'Toy',      'Person',
    'Kyle',     'M',    175,    76,     4,  'Hat',      'Person',
    'Laura',    'F',    180,    70,     4,  'Bag',      'Person',
    'Mary',     'F',    160,    60,     4,  'Bag',      'Person',
    'Noah',     'M',    178,    90,     4,  'Hat',      'Person',
    'Odelia',   'F',    186,    76,     4,  'Bag',      'Person',
    'Paul',     'M',    158,    69,     4,  'Bag',      'Person',
    'Qui',      'F',    168,    62,     4,  'Bag',      'Person',
    'Ronnie',   'M',    108,    26,     4,  'Toy',      'Person',
    'Sonic',    'F',    52,     20,     6,  'Tail',     'Pet',
    'Tweety',   'F',    52,     20,     6,  'Tail',     'Pet' ,
    'Ulfie',    'M',    39,     29,     4,  'Wings',    'Pet',
    'Vinnie',   'F',    53,     22,     1,  'Tail',     'Pet',
    'Waldo',    'F',    51,     21,     4,  'Tail',     'Pet',
    'Xander',   'M',    50,     24,     4,  'Tail',     'Pet'
];
raw_data
| invoke pairwise_dist_fl('name', 'type')
| where _partition == 'Person' | sort by entity asc, entity1 asc
| evaluate pivot (entity, max(dist), entity1) | sort by entity1 asc

重要

この例を正常に実行するには、まず関数定義コードを実行して関数を格納する必要があります。

let raw_data = datatable(name:string, gender: string, height:int, weight:int, limbs:int, accessory:string, type:string)[
    'Andy',     'M',    160,    80,     4,  'Hat',      'Person',
    'Betsy',    'F',    170,    70,     4,  'Bag',      'Person',
    'Cindy',    'F',    130,    30,     4,  'Hat',      'Person',
    'Dan',      'M',    190,    105,    4,  'Hat',      'Person',
    'Elmie',    'M',    110,    30,     4,  'Toy',      'Person',
    'Franny',   'F',    170,    65,     4,  'Bag',      'Person',
    'Godzilla', '?',    260,    210,    5,  'Tail',     'Person',
    'Hannie',   'F',    112,    28,     4,  'Toy',      'Person',
    'Ivie',     'F',    105,    20,     4,  'Toy',      'Person',
    'Johnnie',  'M',    107,    21,     4,  'Toy',      'Person',
    'Kyle',     'M',    175,    76,     4,  'Hat',      'Person',
    'Laura',    'F',    180,    70,     4,  'Bag',      'Person',
    'Mary',     'F',    160,    60,     4,  'Bag',      'Person',
    'Noah',     'M',    178,    90,     4,  'Hat',      'Person',
    'Odelia',   'F',    186,    76,     4,  'Bag',      'Person',
    'Paul',     'M',    158,    69,     4,  'Bag',      'Person',
    'Qui',      'F',    168,    62,     4,  'Bag',      'Person',
    'Ronnie',   'M',    108,    26,     4,  'Toy',      'Person',
    'Sonic',    'F',    52,     20,     6,  'Tail',     'Pet',
    'Tweety',   'F',    52,     20,     6,  'Tail',     'Pet' ,
    'Ulfie',    'M',    39,     29,     4,  'Wings',    'Pet',
    'Vinnie',   'F',    53,     22,     1,  'Tail',     'Pet',
    'Woody',    'F',    51,     21,     4,  'Tail',     'Pet',
    'Xander',   'M',    50,     24,     4,  'Tail',     'Pet'
];
raw_data
| invoke pairwise_dist_fl('name', 'type')
| where _partition == 'Person' | sort by entity asc, entity1 asc
| evaluate pivot (entity, max(dist), entity1) | sort by entity1 asc

出力

entity1	アンディ	Betsy	シンディ	Dan	Elmie	フラニー	ゴジラ	Hannie	...
アンディ		0.354	0.4125	0.1887	0.4843	0.3702	1.2087	0.6265	...
Betsy	0.354		0.416	0.4708	0.6307	0.0161	1.2051	0.4872	...
シンディ	0.4125	0.416		0.6012	0.3575	0.3998	1.4783	0.214	...
Dan	0.1887	0.4708	0.6012		0.673	0.487	1.0199	0.8152	...
Elmie	0.4843	0.6307	0.3575	0.673		0.6145	1.5502	0.1565	...
フラニー	0.3702	0.0161	0.3998	0.487	0.6145		1.2213	0.471	...
ゴジラ	1.2087	1.2051	1.4783	1.0199	1.5502	1.2213		1.5495	...
Hannie	0.6265	0.4872	0.214	0.8152	0.1565	0.471	1.5495		...
...	...	...	...	...	...	...	...	...	...

2 つの異なる型のエンティティを見ていて、同じ型に属するエンティティ間の距離を、名義変数 (性別や好みのアクセサリなど) と数値変数 (手足の数、身長、体重など) の両方を考慮に入れて計算したいと考えたとします。数値変数は尺度がそれぞれ異なっており、一元化してスケーリングする必要があります。これは自動的に行われます。出力は、計算された多変量距離を持つ同じパーティション内のエンティティのペアです。これは、直接分析したり、距離行列または散布図として視覚化したり、外れ値検出アルゴリズムの入力データとして使用したりできます。このアルゴリズムではエンティティあたりの平均距離が計算され、高い値のエンティティがグローバル外れ値であることを示唆します。たとえば、距離行列を使用してオプションの視覚化を追加する場合、サンプルで示した方法でテーブルを取得します。このサンプルから、次の点を確認できます。

エンティティのペア (Betsy と Franny) の中には、類似していることを示す低距離値 (0 に近い) があります。
エンティティの一部のペア (Godzilla と Elmie) の距離が高い値 (1 以上) は、それらが異なっていることを示します。

個の出力をさらに使用して、エンティティあたりの平均距離を計算できます。平均距離の大きさは、グローバル外れ値を示している可能性があります。たとえば、Godzilla は他との距離が平均的に大きく、Godzilla がグローバル外れ値の可能性を示唆しています。

次の方法で共有

pairwise_dist_fl()

構文

パラメーター

関数の定義

例

フィードバック

フィードバック

その他のリソース