Extending U-SQL expressions with user code

08/29/2023

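Summary

One of U-SQL's main strengths is how easy it is to add user-specific code written in C#. Because U-SQL's type system is based on C#, and its scalar expression language over instances of those types is the C# expression language, using the power of C# in U-SQL is straightforward.

There are several ways to extend U-SQL expressions with C# code: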

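Inline C# expressions
Inline C# expressions are often a good fit when you only need to apply a small set of C# methods to a scalar value, for example string formatting methods or math functions.
User-defined aggregates
Write a user-defined aggregate in a C# assembly and reference it in the U-SQL script. User-defined aggregates let you plug custom aggregation logic into U-SQL's aggregation processing with the GROUP BY clause.
User-defined functions
Writing a user-defined function in a C# assembly and referencing it in the U-SQL script is preferred for more complex functions, that is, when the function logic needs the full power of C# beyond its expression language, such as procedural logic or recursion.
User-defined operators
Write a user-defined operator in a C# assembly and reference it in the U-SQL script. User-defined operators (UDOs) are U-SQL's custom-coded rowset operators. They are written in C# and provide the ability to generate, process, and consume rowsets.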
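
To make the distinction between an inline expression and a user-defined function concrete, here is a minimal sketch of a function whose logic needs statements rather than a single expression. FiscalHelpers, FiscalQuarter, and the assumption that the fiscal year starts on July 1 are hypothetical and chosen only for illustration; this is not the GetFiscalPeriod sample listed later in this article.

```csharp
using System;

public static class FiscalHelpers
{
    // Hypothetical helper. Assumes the fiscal year starts on July 1,
    // so July is fiscal month 1 and June is fiscal month 12.
    public static string FiscalQuarter(DateTime dt)
    {
        int fiscalMonth;
        if (dt.Month < 7)
        {
            fiscalMonth = dt.Month + 6;   // Jan..Jun -> 7..12
        }
        else
        {
            fiscalMonth = dt.Month - 6;   // Jul..Dec -> 1..6
        }

        int quarter = (fiscalMonth - 1) / 3 + 1;
        return "Q" + quarter;
    }
}
```

Under that assumption, a start date of 2014-09-14 falls in Q1 and 2012-05-31 falls in Q4. Placed in a Code-Behind file or a registered assembly, such a function would be called from a SELECT clause by its fully qualified name, in the same way as the functions shown later in this article.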

User-defined code examples

The examples can be run in Visual Studio with the Azure Data Lake Tools plug-in.
The scripts can be run locally. An Azure subscription and an Azure Data Lake Analytics account aren't needed for local runs.

The following examples show how to implement user-defined code.
Using assemblies: Code-Behind and assembly registration walkthrough
Inline C# expressions: Using Round, Contains
User-defined aggregates: SampleAggregate, genericAggregator
User-defined functions: dt_TryParse_USQL, GetFiscalPeriod, ReadStringMap/WriteQuotedStringMap, HasOfficePhone
User-defined operators:
    Extractors: SampleExtractor, DriverExtractor, FlexExtractor
    Outputters: HTMLOutputter, DriverOutputter
    Appliers: ParserApplier, IntegerRangeApplier
    Processors: FullAddressProcessor, CountryName, NameProcessor
    Reducers: RangeReducer, SalesReducer
    Combiners: CombinerEX

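Using assemblies

Code-Behind and assembly registration walkthrough

For user-defined functions, aggregates, and operators, the C# assembly must be loaded into the U-SQL metadata catalog through either Code-Behind or assembly registration. The main advantage of Code-Behind is that the tooling registers the assembly file and adds the REFERENCE ASSEMBLY statement automatically. Its drawbacks are that the code is uploaded with every script submission and that the functionality can't be shared with others. For more information on Code-Behind and assembly registration, see How to register U-SQL assemblies in your U-SQL Catalog and U-SQL programmability guide: Using Code-behind.

The following walkthrough uses a simple function with both Code-Behind and assembly registration. It assumes you have an existing U-SQL project.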

Setup

Both methods share the dataset and function defined below.

Dataset

CREATE DATABASE IF NOT EXISTS TestReferenceDB;
USE DATABASE TestReferenceDB;

DROP TABLE IF EXISTS dbo.simpleTable;
CREATE TABLE dbo.simpleTable
(
    EmpID int,
    EmpName string,
    DeptID int,
    Salary int,
    StartDate DateTime,
    PhoneNumbers string,
    INDEX clx_EmpID CLUSTERED(EmpID)
    DISTRIBUTED BY HASH (EmpID) 
);

INSERT dbo.simpleTable
VALUES
(1, "Noah",   100, 10000, new DateTime(2012,05,31), "cell:030-0074321,office:030-0076545"),
(3, "Liam",   100, 30000, new DateTime(2014,09,14), "cell:(5) 555-3932"),
(6, "Emma",   200, 8000,  new DateTime(2014,03,08), (string)null),
(7, "Jacob",  200, 8000,  new DateTime(2014,09,02), ""),
(8, "Olivia", 200, 8000,  new DateTime(2013,12,11), "office:88.60.15.32"),
(9, "Mason",  300, 50000, new DateTime(2016,01,01), "cell:(91) 555 22 82,office:(91) 555 91 99, home:(425) 555-2819");

Function

namespace myFirstNamespace
{
    public class myFirstClass
    {
        public static string myFirstFunction(string s)
        {
            return s + s;
        }
    }
};

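Method 1. Use Code-Behind

In Visual Studio, add a new U-SQL script to your existing U-SQL project. In Solution Explorer, open the new usql.cs file associated with the new U-SQL script. Replace its entire contents with the function defined above, then close the usql.cs file. Add the code below to the new U-SQL script to call the function.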

@result =
    SELECT EmpName,
           myFirstNamespace.myFirstClass.myFirstFunction(EmpName) AS myFirstFunction_CB
    FROM TestReferenceDB.dbo.simpleTable;

OUTPUT @result
TO "/Output/ReferenceGuide/DDL/Assemblies/myFirstFunction_CB.txt"
USING Outputters.Tsv();

Method 2. Register an assembly

A. Compile the assembly

In Visual Studio, add a new Class Library (For U-SQL Application) to your existing solution and name it myFirstAssembly. Class1.cs should open once myFirstAssembly is created. Replace the entire contents of Class1.cs with the function defined above, then close the file. In Solution Explorer, right-click myFirstAssembly and select Build.

B. Register the assembly

In Visual Studio, add a new U-SQL script to your existing U-SQL project and run the code below to register the assembly myFirstAssembly.dll.

USE DATABASE TestReferenceDB;
DROP ASSEMBLY IF EXISTS myFirstAssembly;

// modify with your actual path to myFirstAssembly.dll
CREATE ASSEMBLY myFirstAssembly
FROM @"<your path>\myFirstAssembly\bin\Debug\myFirstAssembly.dll";

C. Reference the assembly

Use REFERENCE ASSEMBLY to reference the new assembly. The code below shows three ways to call the function myFirstFunction.

USE DATABASE TestReferenceDB;

/************* Method 1 *************/
REFERENCE ASSEMBLY myFirstAssembly;

@result =
    SELECT "Method 1" AS Method,
            EmpName,
            myFirstNamespace.myFirstClass.myFirstFunction(EmpName) AS myFirstFunction_AR
    FROM TestReferenceDB.dbo.simpleTable;

OUTPUT @result
TO "/Output/ReferenceGuide/DDL/Assemblies/myFirstFunction_AR1.txt"
USING Outputters.Tsv();


/************* Method 2 *************/
REFERENCE ASSEMBLY myFirstAssembly;
USING  myFirstNamespace;

@result =
    SELECT "Method 2" AS Method,
            EmpName,
            myFirstClass.myFirstFunction(EmpName) AS myFirstFunction_AR
    FROM TestReferenceDB.dbo.simpleTable;

OUTPUT @result
TO "/Output/ReferenceGuide/DDL/Assemblies/myFirstFunction_AR2.txt"
USING Outputters.Tsv();


/************* Method 3 *************/
REFERENCE ASSEMBLY myFirstAssembly;
USING xx = myFirstNamespace.myFirstClass;

@result =
    SELECT "Method 3" AS Method,
            EmpName,
            xx.myFirstFunction(EmpName) AS myFirstFunction_AR
    FROM TestReferenceDB.dbo.simpleTable;

OUTPUT @result
TO "/Output/ReferenceGuide/DDL/Assemblies/myFirstFunction_AR3.txt"
USING Outputters.Tsv();

Inline C# expression - Using Round
An example that uses Round. See also C# Functions and Operators (U-SQL).

@departments = 
    SELECT * FROM 
        ( VALUES
        ("Newton",  23.00m),
        ("Susan",   25.1234m),
        ("Emma",    25.9999m),
        ("Bradley", 25.9900m)
        ) AS T(Customer, Balance);

@result =
    SELECT Customer,
            Math.Round(Balance, 2) AS Balance
    FROM @departments;

OUTPUT @result
TO "/Output/ReferenceGuide/BuiltInFunctions/CSharpFunctions/MathMethods/Round.txt"
USING Outputters.Tsv();
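
One detail worth knowing about the call above: for decimal values, Math.Round rounds exact midpoints to the nearest even digit unless a MidpointRounding mode is passed. None of the sample balances sits on a midpoint, so the default is enough there. The plain C# sketch below, written outside U-SQL, shows where the two modes differ; RoundingDemo and its method names are hypothetical.

```csharp
using System;

public static class RoundingDemo
{
    // Same call as the inline expression above: Math.Round(Balance, 2).
    // Midpoints are rounded to the nearest even digit by default.
    public static decimal RoundDefault(decimal balance)
    {
        return Math.Round(balance, 2);
    }

    // Rounds midpoints away from zero, which is often what is expected
    // for currency amounts.
    public static decimal RoundAwayFromZero(decimal balance)
    {
        return Math.Round(balance, 2, MidpointRounding.AwayFromZero);
    }
}
```

With these helpers, 25.1234 rounds to 25.12 and 25.9999 rounds to 26.00 in both modes, while the midpoint 2.345 becomes 2.34 with the default and 2.35 with MidpointRounding.AwayFromZero.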

User-defined aggregate - SampleAggregate
A slightly modified example from Azure/usql/Examples/Extensibility-Simple-Examples. The C# code is placed in the associated Code-Behind .cs file. See the next section for usage.

using Microsoft.Analytics.Interfaces;
using Microsoft.Analytics.Types.Sql;
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;

namespace ReferenceGuide_Examples
{
    // User-defined aggregate that calculates the total balance by adding credits and subtracting debits
    public class SampleAggregate : IAggregate<string, int, int>
    {
        int balance;

        public override void Init()
        {
            balance = 0;
        }

        public override void Accumulate(string transaction, int amount)
        {
            if (transaction == "Credit")
            {
                balance += amount;
            }
            if (transaction == "Debit")
            {
                balance -= amount;
            }
        }

        public override int Terminate()
        {
            return balance;
        }
    }
}

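Using the user-defined aggregate - SampleAggregate
The user-defined aggregate calculates a balance: it subtracts the amount when the transaction type is "Debit" and adds it when the transaction type is "Credit". It uses the Code-Behind above. The advantage of Code-Behind is that the tooling registers the assembly file and adds the REFERENCE ASSEMBLY statement automatically.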

@transactions =
    SELECT * FROM 
        ( VALUES
        ("Bob",     "Credit", 2000),
        ("Olivia",  "Credit", 5000),
        ("Bob",     "Debit",  30),
        ("Olivia",  "Debit",  50),
        ("Bob",     "Debit",  20)           
        ) AS T(customer, transaction, amount);

@balance =
    SELECT  customer,
            AGG<ReferenceGuide_Examples.SampleAggregate>(transaction, amount) AS balance
    FROM @transactions
    GROUP BY customer;

OUTPUT @balance
TO "/Output/ReferenceGuide/Concepts/UserCode/AggregatorA.txt"
USING Outputters.Csv();
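
To check what the script above should produce, the aggregate's per-group behavior can be reproduced in plain C# without the U-SQL runtime. This is only a sketch of the same arithmetic, not of how U-SQL invokes IAggregate; BalanceSketch and Balances are hypothetical names.

```csharp
using System.Collections.Generic;

public static class BalanceSketch
{
    // Mirrors SampleAggregate per customer: start at 0, add credits,
    // subtract debits, and report the running total at the end.
    public static Dictionary<string, int> Balances(
        IEnumerable<(string Customer, string Transaction, int Amount)> rows)
    {
        var balances = new Dictionary<string, int>();
        foreach (var row in rows)
        {
            if (!balances.ContainsKey(row.Customer))
            {
                balances[row.Customer] = 0;             // corresponds to Init
            }
            if (row.Transaction == "Credit")
            {
                balances[row.Customer] += row.Amount;   // corresponds to Accumulate
            }
            if (row.Transaction == "Debit")
            {
                balances[row.Customer] -= row.Amount;
            }
        }
        return balances;                                // corresponds to Terminate
    }
}
```

For the five sample rows, this gives 1950 for Bob (2000 - 30 - 20) and 4950 for Olivia (5000 - 50). As in SampleAggregate, any transaction type other than "Credit" or "Debit" leaves the balance unchanged.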

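User-defined aggregate - genericAggregator
A basic aggregate that uses generic names to show that the parameter names don't need to match the names of the values passed in. The C# code is placed in the associated Code-Behind .cs file. See the next section for usage.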

using Microsoft.Analytics.Interfaces;

namespace ReferenceGuide_Examples
{
    public class genericAggregator : IAggregate<string, string, string>
    {
        string AggregatedValue;

        public override void Init()
        {
            AggregatedValue = "";
        }

        public override void Accumulate(string ValueToAgg, string GroupByValue)
        {
            AggregatedValue += ValueToAgg + ",";
        }

        public override string Terminate()
        {
            // remove last comma
            return AggregatedValue.Substring(0, AggregatedValue.Length - 1);
        }
    }
}
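
A possible variation, shown here as a plain C# class rather than an IAggregate implementation: collecting the values in a list and joining them once in Terminate removes the trailing-comma cleanup, and it avoids the exception that Substring(0, Length - 1) would throw if Terminate were ever called with nothing accumulated. JoinAggregatorSketch is a hypothetical name, and the class only mirrors the Init/Accumulate/Terminate shape.

```csharp
using System.Collections.Generic;

// Plain C# class with the same Init/Accumulate/Terminate shape as
// genericAggregator. It does not derive from IAggregate.
public class JoinAggregatorSketch
{
    private List<string> values;

    public void Init()
    {
        values = new List<string>();
    }

    public void Accumulate(string valueToAgg, string groupByValue)
    {
        // groupByValue is unused, as in genericAggregator.
        values.Add(valueToAgg);
    }

    public string Terminate()
    {
        // string.Join puts separators only between items, so there is no
        // trailing comma to strip, and an empty list yields "".
        return string.Join(",", values);
    }
}
```

Calling Init, then Accumulate with "cell: 030-0074321" and "office: 030-0076545", then Terminate returns "cell: 030-0074321,office: 030-0076545", the same string the original class produces for those two values.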

Using the user-defined aggregate - genericAggregator A
PhoneType and PhoneNumber are aggregated for each EmpName. This example provides an alternative solution to the example in MAP_AGG (U-SQL). It uses the Code-Behind above.

@employees = 
    SELECT * FROM 
        ( VALUES
        ("Noah",   "cell",   "030-0074321"),
        ("Noah",   "office", "030-0076545"),
        ("Sophia", "cell",   "(5) 555-4729"),
        ("Sophia", "office", "(5) 555-3745"),
        ("Liam",   "cell",   "(5) 555-3932"),
        ("Amy",    "cell",   "(171) 555-7788"),
        ("Amy",    "office", "(171) 555-6750"), 
        ("Amy",    "home",   "(425) 555-6238"),
        ("Justin", "cell",   "0921-12 34 65"),
        ("Justin", "office", "0921-12 34 67"),
        ("Emma",   (string)null, (string)null),
        ("Jacob",  "", ""),
        ("Olivia", "cell",   "88.60.15.31"),
        ("Olivia", "office", "88.60.15.32"),
        ("Mason",  "cell",   "(91) 555 22 82"),
        ("Mason",  "office", "(91) 555 91 99"), 
        ("Mason",  "home",   "(425) 555-2819"),
        ("Ava",    "cell",   "91.24.45.40"),
        ("Ava",    "office", "91.24.45.41"),
        ("Ethan",  "cell",   "(604) 555-4729"),
        ("Ethan",  "office", "(604) 555-3745"),
        ("David",  "cell",   "(171) 555-1212"),
        ("Andrew", "cell",   "(1) 135-5555"),
        ("Andrew", "office", "(1) 135-4892"),
        ("Jennie", "cell",   "(5) 555-3392"),
        ("Jennie", "office", "(5) 555-7293")
        ) AS T(EmpName, PhoneType, PhoneNumber);

@result =
    SELECT  EmpName, 
            AGG<ReferenceGuide_Examples.genericAggregator>(PhoneType + ": " + PhoneNumber, EmpName) AS aggregatedList
    FROM @employees
    WHERE !string.IsNullOrWhiteSpace(PhoneType)
    GROUP BY EmpName;

OUTPUT @result 
TO "/Output/ReferenceGuide/Concepts/UserCode/UDA/genericAggregatorA.txt"
ORDER BY EmpName ASC
USING Outputters.Text();

Using the user-defined aggregate - genericAggregator B
Producer is aggregated for each Title. This example provides an alternative solution to the example in ARRAY_AGG (U-SQL). It uses the Code-Behind above.

@films = 
    SELECT * FROM 
        ( VALUES
        (1, "A Good Year"),
        (2, "American Gangster"),
        (3, "Robin Hood"),
        (4, "The Counselor")
        ) AS T(FilmID, Title);

@producers = 
    SELECT * FROM 
        ( VALUES
        (1, "Ridley Scott"),
        (2, "Brian Grazer"),
        (3, "Russell Crowe"),
        (4, "Nick Wechsler"),
        (5, "Steve Schwartz"),
        (6, "Paula Mae Schwartz")
        ) AS T(ProducerID, Producer);

@films_producers = 
    SELECT * FROM 
        ( VALUES
        (1, 1),
        (2, 1),
        (2, 2),
        (3, 1),
        (3, 2),
        (3, 3),
        (4, 1),
        (4, 4),
        (4, 5),
        (4, 6)
        ) AS T(FilmID, ProducerID);

@result =
    SELECT f.Title,
           COUNT( * ) AS ProducerCount,        
           AGG<ReferenceGuide_Examples.genericAggregator>(p.Producer, f.Title) AS aggregatedList
    FROM @films AS f
         JOIN
             @films_producers AS fp
         ON f.FilmID == fp.FilmID
         JOIN
             @producers AS p
         ON p.ProducerID == fp.ProducerID
    GROUP BY  f.Title;

OUTPUT @result
TO "/Output/ReferenceGuide/Concepts/UserCode/UDA/genericAggregatorB.csv"
USING Outputters.Csv();

User-defined function - HasOfficePhone
The C# code is placed in the associated Code-Behind .cs file. See also U-SQL functions. See the next section for usage.

namespace ReferenceGuide_Examples
{
    public class SampleFunction
    {
        public static bool HasOfficePhone(string phonenumbers)
        {
            if (string.IsNullOrEmpty(phonenumbers))
            {
                return false;
            }
            else
            { 
                return phonenumbers.Contains("office:");
            }
        }
    }
}

Using the user-defined function - HasOfficePhone
A user-defined function that checks whether an employee has an office phone. It uses the Code-Behind above.

@employees = 
    SELECT * FROM 
        ( VALUES
        (1, "Noah",   100, (int?)10000, new DateTime(2012,05,31), "cell:030-0074321,office:030-0076545"),
        (3, "Liam",   100, (int?)30000, new DateTime(2014,09,14), "cell:(5) 555-3932"),
        (6, "Emma",   200, (int?)8000,  new DateTime(2014,03,08), (string)null),
        (7, "Jacob",  200, (int?)8000,  new DateTime(2014,09,02), ""),
        (8, "Olivia", 200, (int?)8000,  new DateTime(2013,12,11), "office:88.60.15.32"),
        (9, "Mason",  300, (int?)50000, new DateTime(2016,01,01), "cell:(91) 555 22 82,office:(91) 555 91 99, home:(425) 555-2819")
        ) AS T(EmpID, EmpName, DeptID, Salary, StartDate, PhoneNumbers);

@has_office_phone =
    SELECT EmpName,
           ReferenceGuide_Examples.SampleFunction.HasOfficePhone(PhoneNumbers) AS has_office_phone
    FROM @employees;

OUTPUT @has_office_phone
TO "/Output/ReferenceGuide/DDL/Functions/has_office_phone.txt"
USING Outputters.Csv();
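
The function's behavior on the sample rows can be checked in plain C#. The sketch below is an equivalent single-expression form of the same logic; OfficePhoneSketch is a hypothetical name and is not part of the sample.

```csharp
public static class OfficePhoneSketch
{
    // Same logic as SampleFunction.HasOfficePhone, written as one expression:
    // null and empty strings give false, otherwise look for "office:".
    public static bool HasOfficePhone(string phoneNumbers)
    {
        return !string.IsNullOrEmpty(phoneNumbers)
            && phoneNumbers.Contains("office:");
    }
}
```

For the sample data, this returns true for Noah, Olivia, and Mason, and false for Liam, Emma (null), and Jacob (empty string). Because string.Contains is case-sensitive, a value such as "Office:555-0100" would return false.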

User-defined operator - NameProcessor
The C# code is placed in the associated Code-Behind .cs file. See the next section for usage.

using Microsoft.Analytics.Interfaces;
using Microsoft.Analytics.Types.Sql;
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;

namespace ReferenceGuide_Examples
{
    //Sample Processor to generate First Initial and last name
    [SqlUserDefinedProcessor]
    public class NameProcessor : IProcessor
    {
        // IRow Process(IRow input, IUpdatableRow output)
        // 
        // Actual implementation of the user-defined processor. Overrides the Process method of IProcessor.
        public override IRow Process(IRow input, IUpdatableRow output)
        {
            string first_name = input.Get<string>("first_name");
            string last_name = input.Get<string>("last_name");
            string name = first_name.Substring(0, 1) + "." + last_name;
            output.Set<string>("name", name);
            return output.AsReadOnly();
        }
    }
}

Using the user-defined operator - NameProcessor
The processor transforms first_name and last_name into the format first_name_Initial.last_name. It uses the Code-Behind above.

@drivers = 
    SELECT * FROM 
        ( VALUES
        (1, "Maria",     "Anders",   "12209",    "Germany"),
        (3, "Antonio",   "Moreno",   "5023",     "Mexico"),
        (4, "Thomas",    "Hardy",    "WA1 1DP",  "UK"),
        (5, "Christina", "Berglund", "S-958 22", "Sweden"),
        (8, "Martín",    "Sommer",   "28023",    "Spain")
        ) AS T(id, first_name, last_name, zipcode, country);

@drivers_processed =
    PROCESS @drivers
    PRODUCE name string,
            id int,
            zipcode string,
            country string
    READONLY id, zipcode, country
    REQUIRED first_name, last_name
    USING new ReferenceGuide_Examples.NameProcessor();

OUTPUT @drivers_processed
TO "/Output/ReferenceGuide/StatementsAndExpressions/PrimaryRowsetExpressions/Process/drivers_processed.txt"
USING Outputters.Tsv(quoting:false);
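
The name formatting inside Process can be tried on its own in plain C#. The sketch below repeats the same string logic and adds one guard that the sample does not have: in NameProcessor, Substring(0, 1) throws if first_name is empty, and the call fails if it is null. NameFormatSketch and Format are hypothetical names, and returning only the last name in that case is an assumption about the desired behavior.

```csharp
public static class NameFormatSketch
{
    // Same formatting as NameProcessor.Process: first initial, a period,
    // then the last name. A null or empty first name returns the last
    // name alone instead of throwing (an assumed fallback).
    public static string Format(string firstName, string lastName)
    {
        if (string.IsNullOrEmpty(firstName))
        {
            return lastName;
        }
        return firstName.Substring(0, 1) + "." + lastName;
    }
}
```

For the sample drivers, this yields M.Anders, A.Moreno, T.Hardy, C.Berglund, and M.Sommer, which are the values the name column should contain in the processed rowset.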