pairwise_dist_fl()

Article
01/18/2024

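Calculates pairwise distances between entities based on multiple nominal and numerical variables.

The function pairwise_dist_fl() is a UDF (user-defined function) that calculates the multivariate distance between data points belonging to the same partition, taking into account nominal and numerical variables.

All string fields, other than the entity and partition names, are treated as nominal variables. The distance is 1 if the values differ and 0 if they are the same.
All numerical fields are treated as numerical variables. They are normalized by transforming them to z-scores, and the distance is calculated as the absolute value of the difference. The total multivariate distance between data points is calculated as the average of the distances between variables.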
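
The per-variable rules above can be sketched on a small scale in KQL. This is a simplified illustration, not the function's actual implementation: the table name people, the intermediate column names, and the restriction to two variables (gender and height) are assumptions made for this sketch. The three rows are taken from the sample data in the Example section. Because the mean and standard deviation are computed over only these three rows, and no partition column is involved, the resulting distances will differ from the values the function returns for the full dataset.

```kusto
// Hypothetical three-row sample; names introduced here are illustrative only.
let people = datatable(name:string, gender:string, height:real)
[
    'Andy',  'M', 160.0,
    'Betsy', 'F', 170.0,
    'Cindy', 'F', 130.0
];
// Mean and sample standard deviation used for the z-score.
let avg_h = toscalar(people | summarize avg(height));
let sd_h = toscalar(people | summarize stdev(height));
// Add the z-score and a constant key k that is used to pair every row with every other row.
let scored = people
    | extend z_height = (height - avg_h) / sd_h, k = 1;
scored
| join kind=inner (
    scored
    | project name1 = name, gender1 = gender, z_height1 = z_height, k
  ) on k
| where name != name1
// Nominal variable: 1 if the values differ, 0 if they are equal.
// Numerical variable: absolute difference of the z-scores.
| extend d_gender = todouble(gender != gender1),
         d_height = abs(z_height - z_height1)
// Multivariate distance: average of the per-variable distances.
| project name, name1, dist = (d_gender + d_height) / 2
| sort by name asc, dist asc
```

The cross join on a constant key is only practical for small inputs; it is used here to keep the sketch short.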

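A distance close to 0 means that the entities are similar, and a distance of 1 or above means that they are different. Similarly, an entity with an average distance of 1 or above differs from many other entities in the partition, which indicates a potential outlier.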

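The output of the function is the pairwise distances between entities under the same partition. It can be used as is to look for similar or different pairs; for example, entities with minimal distance share many common features. It can also be easily transformed into a distance matrix (see the usage example) or used as input for clustering or outlier detection algorithms.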
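
As one way of using the output as is, the following sketch finds each entity's nearest neighbor. It assumes the output columns entity, entity1, and dist that the function produces; the table name pairs is hypothetical, and the rows are a small subset of the distances shown in the Output section, entered by hand so that the snippet runs without the function.

```kusto
// Hand-entered subset of pairwise distances (values copied from the Output table).
let pairs = datatable(entity:string, entity1:string, dist:real)
[
    'Andy',   'Betsy',  0.354,
    'Andy',   'Franny', 0.3702,
    'Betsy',  'Andy',   0.354,
    'Betsy',  'Franny', 0.0161,
    'Franny', 'Andy',   0.3702,
    'Franny', 'Betsy',  0.0161
];
pairs
// For each entity, keep the row with the smallest distance and report the matching neighbor.
| summarize arg_min(dist, entity1) by entity
```

In this subset, Betsy and Franny are returned as each other's nearest neighbor. With real function output, grouping by _partition as well would keep the partitions separate.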

Syntax

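pairwise_dist_fl(entity, partition)

Learn more about syntax conventions.

Parameters

Name	Type	Required	Description
entity	`string`	✔️	The name of the input table column containing the names or IDs of the entities for which the distances are calculated.
partition	`string`	✔️	The name of the input table column containing the partition or scope, so that the distances are calculated for all pairs of entities under the same partition.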

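Function definition

You can define the function either by embedding its code as a query-defined function or by creating it as a stored function in your database, as follows.

Query-defined

Define the function using the following let statement. No permissions are required.

Important

A let statement can't run on its own. It must be followed by a tabular expression statement. To run a working example of pairwise_dist_fl(), see Example.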

let pairwise_dist_fl = (tbl:(*), id_col:string, partition_col:string)
{
    let generic_dist = (value1:dynamic, value2:dynamic) 
    {
        // Calculates the distance between two values; treats all strings as nominal values and numbers as numerical,
        // can be extended to other data types or tweaked by adding weights or changing formulas.
        iff(gettype(value1[0]) == "string", todouble(tostring(value1[0]) != tostring(value2[0])), abs(todouble(value1[0]) - todouble(value2[0])))
    };
    let T = (tbl | extend _entity = column_ifexists(id_col, ''), _partition = column_ifexists(partition_col, '') | project-reorder _entity, _partition);
    let sum_data = (
        // Calculates summary statistics to be used for normalization.
        T
        | project-reorder _entity
        | project _partition, p = pack_array(*)
        | mv-expand with_itemindex=idx p
        | summarize count(), avg(todouble(p)), stdev(todouble(p)) by _partition, idx
        | sort by _partition, idx asc
        | summarize make_list(avg_p), make_list(stdev_p) by _partition
    );
    let normalized_data = (
        // Performs normalization on numerical variables by subtracting mean and scaling by standard deviation. Other normalization techniques can be used
        // by adding metrics to previous function and using here.
        T
        | project _partition, p = pack_array(*)
        | join kind = leftouter (sum_data) on _partition
        | mv-apply p, list_avg_p, list_stdev_p on (
            extend normalized = iff((not(isnan(todouble(list_avg_p))) and (list_stdev_p > 0)), pack_array((todouble(p) - todouble(list_avg_p))/todouble(list_stdev_p)), p)
            | summarize a = make_list(normalized) by _partition
        )
        | project _partition, a
    );
    let dist_data = (
        // Calculates distances of included variables and sums them up to get a multivariate distance between all entities under the same partition.
        normalized_data
        | join kind = inner (normalized_data) on _partition
        | project entity = tostring(a[0]), entity1 = tostring(a1[0]), a = array_slice(a, 1, -1), a1 = array_slice(a1, 1, -1), _partition
        | mv-apply a, a1 on 
        (
            project d = generic_dist(pack_array(a), pack_array(a1))
            | summarize d = make_list(d)
        )
        | extend dist = bin((1.0*array_sum(d)-1.0)/array_length(d), 0.0001) // -1 cancels the artifact distance calculated between entity names appearing in the bag and normalizes by number of features        
        | project-away d
        | where entity != entity1
        | sort by _partition asc, entity asc, dist asc
    );
    dist_data
};
// Write your query to use the function here.

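Stored

Define the stored function once using the following .create function command. Database User permissions are required.

Important

You must run this code to create the function before you can use the function as shown in the Example.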

.create-or-alter function with (folder = "Packages\\Stats", docstring = "Calculate distances between pairs of entities based on multiple nominal and numerical variables")
pairwise_dist_fl (tbl:(*), id_col:string, partition_col:string)
{
    let generic_dist = (value1:dynamic, value2:dynamic) 
    {
        // Calculates the distance between two values; treats all strings as nominal values and numbers as numerical,
        // can be extended to other data types or tweaked by adding weights or changing formulas.
        iff(gettype(value1[0]) == "string", todouble(tostring(value1[0]) != tostring(value2[0])), abs(todouble(value1[0]) - todouble(value2[0])))
    };
    let T = (tbl | extend _entity = column_ifexists(id_col, ''), _partition = column_ifexists(partition_col, '') | project-reorder _entity, _partition);
    let sum_data = (
        // Calculates summary statistics to be used for normalization.
        T
        | project-reorder _entity
        | project _partition, p = pack_array(*)
        | mv-expand with_itemindex=idx p
        | summarize count(), avg(todouble(p)), stdev(todouble(p)) by _partition, idx
        | sort by _partition, idx asc
        | summarize make_list(avg_p), make_list(stdev_p) by _partition
    );
    let normalized_data = (
        // Performs normalization on numerical variables by subtracting mean and scaling by standard deviation. Other normalization techniques can be used
        // by adding metrics to previous function and using here.
        T
        | project _partition, p = pack_array(*)
        | join kind = leftouter (sum_data) on _partition
        | mv-apply p, list_avg_p, list_stdev_p on (
            extend normalized = iff((not(isnan(todouble(list_avg_p))) and (list_stdev_p > 0)), pack_array((todouble(p) - todouble(list_avg_p))/todouble(list_stdev_p)), p)
            | summarize a = make_list(normalized) by _partition
        )
        | project _partition, a
    );
    let dist_data = (
        // Calculates distances of included variables and sums them up to get a multivariate distance between all entities under the same partition.
        normalized_data
        | join kind = inner (normalized_data) on _partition
        | project entity = tostring(a[0]), entity1 = tostring(a1[0]), a = array_slice(a, 1, -1), a1 = array_slice(a1, 1, -1), _partition
        | mv-apply a, a1 on 
        (
            project d = generic_dist(pack_array(a), pack_array(a1))
            | summarize d = make_list(d)
        )
        | extend dist = bin((1.0*array_sum(d)-1.0)/array_length(d), 0.0001) // -1 cancels the artifact distance calculated between entity names appearing in the bag and normalizes by number of features
        | project-away d
        | where entity != entity1
        | sort by _partition asc, entity asc, dist asc
    );
    dist_data
}

Example

The following example uses the invoke operator to run the function.

Query-defined

To use a query-defined function, invoke it after the embedded function definition.

let pairwise_dist_fl = (tbl:(*), id_col:string, partition_col:string)
{
    let generic_dist = (value1:dynamic, value2:dynamic) 
    {
        // Calculates the distance between two values; treats all strings as nominal values and numbers as numerical,
        // can be extended to other data types or tweaked by adding weights or changing formulas.
        iff(gettype(value1[0]) == "string", todouble(tostring(value1[0]) != tostring(value2[0])), abs(todouble(value1[0]) - todouble(value2[0])))
    };
    let T = (tbl | extend _entity = column_ifexists(id_col, ''), _partition = column_ifexists(partition_col, '') | project-reorder _entity, _partition);
    let sum_data = (
        // Calculates summary statistics to be used for normalization.
        T
        | project-reorder _entity
        | project _partition, p = pack_array(*)
        | mv-expand with_itemindex=idx p
        | summarize count(), avg(todouble(p)), stdev(todouble(p)) by _partition, idx
        | sort by _partition, idx asc
        | summarize make_list(avg_p), make_list(stdev_p) by _partition
    );
    let normalized_data = (
        // Performs normalization on numerical variables by subtracting mean and scaling by standard deviation. Other normalization techniques can be used
        // by adding metrics to previous function and using here.
        T
        | project _partition, p = pack_array(*)
        | join kind = leftouter (sum_data) on _partition
        | mv-apply p, list_avg_p, list_stdev_p on (
            extend normalized = iff((not(isnan(todouble(list_avg_p))) and (list_stdev_p > 0)), pack_array((todouble(p) - todouble(list_avg_p))/todouble(list_stdev_p)), p)
            | summarize a = make_list(normalized) by _partition
        )
        | project _partition, a
    );
    let dist_data = (
        // Calculates distances of included variables and sums them up to get a multivariate distance between all entities under the same partition.
        normalized_data
        | join kind = inner (normalized_data) on _partition
        | project entity = tostring(a[0]), entity1 = tostring(a1[0]), a = array_slice(a, 1, -1), a1 = array_slice(a1, 1, -1), _partition
        | mv-apply a, a1 on 
        (
            project d = generic_dist(pack_array(a), pack_array(a1))
            | summarize d = make_list(d)
        )
        | extend dist = bin((1.0*array_sum(d)-1.0)/array_length(d), 0.0001) // -1 cancels the artifact distance calculated between entity names appearing in the bag and normalizes by number of features        
        | project-away d
        | where entity != entity1
        | sort by _partition asc, entity asc, dist asc
    );
    dist_data
};
//
let raw_data = datatable(name:string, gender: string, height:int, weight:int, limbs:int, accessory:string, type:string)[
    'Andy',     'M',    160,    80,     4,  'Hat',      'Person',
    'Betsy',    'F',    170,    70,     4,  'Bag',      'Person',
    'Cindy',    'F',    130,    30,     4,  'Hat',      'Person',
    'Dan',      'M',    190,    105,    4,  'Hat',      'Person',
    'Elmie',    'M',    110,    30,     4,  'Toy',      'Person',
    'Franny',   'F',    170,    65,     4,  'Bag',      'Person',
    'Godzilla', '?',    260,    210,    5,  'Tail',     'Person',
    'Hannie',   'F',    112,    28,     4,  'Toy',      'Person',
    'Ivie',     'F',    105,    20,     4,  'Toy',      'Person',
    'Johnnie',  'M',    107,    21,     4,  'Toy',      'Person',
    'Kyle',     'M',    175,    76,     4,  'Hat',      'Person',
    'Laura',    'F',    180,    70,     4,  'Bag',      'Person',
    'Mary',     'F',    160,    60,     4,  'Bag',      'Person',
    'Noah',     'M',    178,    90,     4,  'Hat',      'Person',
    'Odelia',   'F',    186,    76,     4,  'Bag',      'Person',
    'Paul',     'M',    158,    69,     4,  'Bag',      'Person',
    'Qui',      'F',    168,    62,     4,  'Bag',      'Person',
    'Ronnie',   'M',    108,    26,     4,  'Toy',      'Person',
    'Sonic',    'F',    52,     20,     6,  'Tail',     'Pet',
    'Tweety',   'F',    52,     20,     6,  'Tail',     'Pet' ,
    'Ulfie',    'M',    39,     29,     4,  'Wings',    'Pet',
    'Vinnie',   'F',    53,     22,     1,  'Tail',     'Pet',
    'Waldo',    'F',    51,     21,     4,  'Tail',     'Pet',
    'Xander',   'M',    50,     24,     4,  'Tail',     'Pet'
];
raw_data
| invoke pairwise_dist_fl('name', 'type')
| where _partition == 'Person' | sort by entity asc, entity1 asc
| evaluate pivot (entity, max(dist), entity1) | sort by entity1 asc

Stored

Important

For this example to run successfully, you must first run the Function definition code to store the function.

let raw_data = datatable(name:string, gender: string, height:int, weight:int, limbs:int, accessory:string, type:string)[
    'Andy',     'M',    160,    80,     4,  'Hat',      'Person',
    'Betsy',    'F',    170,    70,     4,  'Bag',      'Person',
    'Cindy',    'F',    130,    30,     4,  'Hat',      'Person',
    'Dan',      'M',    190,    105,    4,  'Hat',      'Person',
    'Elmie',    'M',    110,    30,     4,  'Toy',      'Person',
    'Franny',   'F',    170,    65,     4,  'Bag',      'Person',
    'Godzilla', '?',    260,    210,    5,  'Tail',     'Person',
    'Hannie',   'F',    112,    28,     4,  'Toy',      'Person',
    'Ivie',     'F',    105,    20,     4,  'Toy',      'Person',
    'Johnnie',  'M',    107,    21,     4,  'Toy',      'Person',
    'Kyle',     'M',    175,    76,     4,  'Hat',      'Person',
    'Laura',    'F',    180,    70,     4,  'Bag',      'Person',
    'Mary',     'F',    160,    60,     4,  'Bag',      'Person',
    'Noah',     'M',    178,    90,     4,  'Hat',      'Person',
    'Odelia',   'F',    186,    76,     4,  'Bag',      'Person',
    'Paul',     'M',    158,    69,     4,  'Bag',      'Person',
    'Qui',      'F',    168,    62,     4,  'Bag',      'Person',
    'Ronnie',   'M',    108,    26,     4,  'Toy',      'Person',
    'Sonic',    'F',    52,     20,     6,  'Tail',     'Pet',
    'Tweety',   'F',    52,     20,     6,  'Tail',     'Pet' ,
    'Ulfie',    'M',    39,     29,     4,  'Wings',    'Pet',
    'Vinnie',   'F',    53,     22,     1,  'Tail',     'Pet',
    'Woody',    'F',    51,     21,     4,  'Tail',     'Pet',
    'Xander',   'M',    50,     24,     4,  'Tail',     'Pet'
];
raw_data
| invoke pairwise_dist_fl('name', 'type')
| where _partition == 'Person' | sort by entity asc, entity1 asc
| evaluate pivot (entity, max(dist), entity1) | sort by entity1 asc

Output

entity1	Andy	Betsy	Cindy	Dan	Elmie	Franny	Godzilla	Hannie	...
Andy		0.354	0.4125	0.1887	0.4843	0.3702	1.2087	0.6265	...
Betsy	0.354		0.416	0.4708	0.6307	0.0161	1.2051	0.4872	...
Cindy	0.4125	0.416		0.6012	0.3575	0.3998	1.4783	0.214	...
Dan	0.1887	0.4708	0.6012		0.673	0.487	1.0199	0.8152	...
Elmie	0.4843	0.6307	0.3575	0.673		0.6145	1.5502	0.1565	...
Franny	0.3702	0.0161	0.3998	0.487	0.6145		1.2213	0.471	...
Godzilla	1.2087	1.2051	1.4783	1.0199	1.5502	1.2213		1.5495	...
Hannie	0.6265	0.4872	0.214	0.8152	0.1565	0.471	1.5495		...
...	...	...	...	...	...	...	...	...	...

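Looking at entities of two different types, we want to calculate the distance between entities belonging to the same type, taking into account both nominal variables (such as gender or preferred accessory) and numerical variables (such as the number of limbs, height, and weight). The numerical variables are on different scales and must be centered and scaled, which is done automatically. The output is pairs of entities under the same partition with the calculated multivariate distance. It can be analyzed directly, visualized as a distance matrix or scatterplot, or used as input data for an outlier detection algorithm by calculating the mean distance per entity, where entities with high values indicate global outliers. For example, when you add an optional visualization using a distance matrix, you get a table as shown in the sample. From the sample, you can see that:

Some pairs of entities (Betsy and Franny) have a low distance value (close to 0), indicating that they are similar.
Some pairs of entities (Godzilla and Elmie) have a high distance value (1 or above), indicating that they are different.

The output can also be used to calculate the average distance per entity. A high average distance might indicate a global outlier. For example, Godzilla has a high average distance from the other entities, which suggests that it is a probable global outlier.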
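
The average distance per entity described above can be computed with a short aggregation over the function's output. The following is a minimal sketch under stated assumptions: the table name pairs is hypothetical, and the rows are a hand-entered subset of the Output table (three entities from the Person partition) so that the snippet runs without the function. With real output, the same summarize step would follow the invoke call.

```kusto
// Hand-entered subset of pairwise distances (values copied from the Output table).
let pairs = datatable(_partition:string, entity:string, entity1:string, dist:real)
[
    'Person', 'Andy',     'Betsy',    0.354,
    'Person', 'Andy',     'Godzilla', 1.2087,
    'Person', 'Betsy',    'Andy',     0.354,
    'Person', 'Betsy',    'Godzilla', 1.2051,
    'Person', 'Godzilla', 'Andy',     1.2087,
    'Person', 'Godzilla', 'Betsy',    1.2051
];
pairs
// Average distance from each entity to the other entities in its partition.
| summarize avg_dist = avg(dist) by _partition, entity
// Entities with the highest average distance are listed first as outlier candidates.
| sort by avg_dist desc
```

In this subset, Godzilla is listed first, which is consistent with the observation above. Any cutoff for flagging an outlier (for example, an average distance of 1 or above) is a judgment call that depends on the data.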
