Access Azure Data Lake Storage Gen1 from Azure Databricks

03/01/2024

Microsoft has announced the planned retirement of Azure Data Lake Storage Gen1 (formerly Azure Data Lake Store, also referred to as ADLS) and recommends that all users migrate to Azure Data Lake Storage Gen2. Databricks recommends upgrading to Azure Data Lake Storage Gen2 for best performance and new features.

There are two ways to access Azure Data Lake Storage Gen1:

- Pass your Microsoft Entra ID (formerly Azure Active Directory) credentials, also known as credential passthrough.
- Use a service principal directly.

Access automatically with your Microsoft Entra ID (formerly Azure Active Directory) credentials

You can authenticate automatically to Azure Data Lake Storage Gen1 from Azure Databricks clusters using the same Microsoft Entra ID identity that you use to log in to Azure Databricks. When you enable a cluster for Microsoft Entra ID credential passthrough, commands that you run on that cluster can read and write data in Azure Data Lake Storage Gen1 without requiring you to configure service principal credentials for access to storage.

For complete setup and usage instructions, see Access Azure Data Lake Storage using Microsoft Entra ID (formerly Azure Active Directory) credential passthrough (legacy).

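Create and grant permissions to a service principal

If your selected access method requires a service principal with adequate permissions and you do not have one, follow these steps:

1. Create a Microsoft Entra ID (formerly Azure Active Directory) application and service principal that can access resources. Note the following properties:
   - application-id: An ID that uniquely identifies the client application.
   - directory-id: An ID that uniquely identifies the Microsoft Entra ID instance.
   - service-credential: A string that the application uses to prove its identity.
2. Register the service principal, granting it the correct role assignment, such as Contributor, on the Azure Data Lake Storage Gen1 account.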
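
The three properties noted above are the only inputs that the configuration examples in this article need. As a minimal sketch, they could be kept together in a small Python structure. The class name `ServicePrincipal` and its field names are hypothetical and chosen only for illustration; the token endpoint uses the same form as the `fs.adl.oauth2.refresh.url` values in this article.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ServicePrincipal:
    """Hypothetical container for the three values noted when creating the service principal."""

    application_id: str  # <application-id>
    directory_id: str    # <directory-id>
    # <service-credential>; excluded from repr() so it is not printed by accident.
    service_credential: str = field(repr=False)

    @property
    def refresh_url(self) -> str:
        # OAuth 2.0 token endpoint for the Microsoft Entra ID instance (tenant).
        return f"https://login.microsoftonline.com/{self.directory_id}/oauth2/token"
```

In practice the credential would be read from a secret scope rather than written into a notebook.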

Access directly with Spark APIs using a service principal and OAuth 2.0

To read from your Azure Data Lake Storage Gen1 account, you can configure Spark to use service credentials with the following snippet in your notebook:

spark.conf.set("fs.adl.oauth2.access.token.provider.type", "ClientCredential")
spark.conf.set("fs.adl.oauth2.client.id", "<application-id>")
spark.conf.set("fs.adl.oauth2.credential", dbutils.secrets.get(scope = "<scope-name>", key = "<key-name-for-service-credential>"))
spark.conf.set("fs.adl.oauth2.refresh.url", "https://login.microsoftonline.com/<directory-id>/oauth2/token")

where

dbutils.secrets.get(scope = "<scope-name>", key = "<key-name-for-service-credential>") retrieves the service credential that has been stored as a secret in a secret scope.
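
The four settings above can also be assembled in one place, which makes it easier to reuse them for a mount or a cluster configuration. The following is a minimal sketch, not a Databricks API: the helper names `adl_oauth_configs` and `apply_configs` are hypothetical, and `apply_configs` only assumes an object with a `set(key, value)` method, such as `spark.conf`.

```python
def adl_oauth_configs(application_id: str, directory_id: str, service_credential: str) -> dict:
    """Return the four fs.adl.oauth2.* settings used in this article as a dict."""
    return {
        "fs.adl.oauth2.access.token.provider.type": "ClientCredential",
        "fs.adl.oauth2.client.id": application_id,
        "fs.adl.oauth2.credential": service_credential,
        "fs.adl.oauth2.refresh.url": f"https://login.microsoftonline.com/{directory_id}/oauth2/token",
    }


def apply_configs(spark_conf, configs: dict) -> None:
    """Apply each setting through an object that exposes set(key, value)."""
    for key, value in configs.items():
        spark_conf.set(key, value)


# In a notebook, where `spark` and `dbutils` are predefined:
# apply_configs(spark.conf, adl_oauth_configs(
#     "<application-id>",
#     "<directory-id>",
#     dbutils.secrets.get(scope="<scope-name>", key="<key-name-for-service-credential>")))
```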

After you have set up your credentials, you can use standard Spark and Databricks APIs to access the resources. For example:

val df = spark.read.format("parquet").load("adl://<storage-resource>.azuredatalakestore.net/<directory-name>")

dbutils.fs.ls("adl://<storage-resource>.azuredatalakestore.net/<directory-name>")

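Azure Data Lake Storage Gen1 provides directory-level access control, so the service principal must have access to the directories that you want to read from as well as to the Azure Data Lake Storage Gen1 resource.

Access through the metastore

To access adl:// locations specified in the metastore, you must specify the Hadoop credential configuration options as Spark options when you create the cluster. Add the spark.hadoop. prefix to the corresponding Hadoop configuration keys so that they propagate to the Hadoop configurations used by the metastore: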

spark.hadoop.fs.adl.oauth2.access.token.provider.type ClientCredential
spark.hadoop.fs.adl.oauth2.client.id <application-id>
spark.hadoop.fs.adl.oauth2.credential <service-credential>
spark.hadoop.fs.adl.oauth2.refresh.url https://login.microsoftonline.com/<directory-id>/oauth2/token

Warning

These credentials are available to all users who access the cluster.
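
The prefixing rule described above is mechanical, so the cluster-level lines can be derived from the notebook-level keys. A minimal sketch follows; the function name `to_cluster_spark_config` is hypothetical, and it only formats text to paste into the cluster's Spark configuration.

```python
def to_cluster_spark_config(hadoop_configs: dict) -> str:
    """Render Hadoop settings as cluster Spark config lines ("key value", one per line)."""
    lines = []
    for key, value in hadoop_configs.items():
        # The spark.hadoop. prefix propagates the key to the Hadoop configuration.
        lines.append(f"spark.hadoop.{key} {value}")
    return "\n".join(lines)


example = {
    "fs.adl.oauth2.access.token.provider.type": "ClientCredential",
    "fs.adl.oauth2.client.id": "<application-id>",
}
print(to_cluster_spark_config(example))
# spark.hadoop.fs.adl.oauth2.access.token.provider.type ClientCredential
# spark.hadoop.fs.adl.oauth2.client.id <application-id>
```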

Mount an Azure Data Lake Storage Gen1 resource or folder

To mount an Azure Data Lake Storage Gen1 resource or a folder inside it, use the following commands:

Python

configs = {"fs.adl.oauth2.access.token.provider.type": "ClientCredential",
          "fs.adl.oauth2.client.id": "<application-id>",
          "fs.adl.oauth2.credential": dbutils.secrets.get(scope = "<scope-name>", key = "<key-name-for-service-credential>"),
          "fs.adl.oauth2.refresh.url": "https://login.microsoftonline.com/<directory-id>/oauth2/token"}

# Optionally, you can add <directory-name> to the source URI of your mount point.
dbutils.fs.mount(
  source = "adl://<storage-resource>.azuredatalakestore.net/<directory-name>",
  mount_point = "/mnt/<mount-name>",
  extra_configs = configs)

Scala

val configs = Map(
  "fs.adl.oauth2.access.token.provider.type" -> "ClientCredential",
  "fs.adl.oauth2.client.id" -> "<application-id>",
  "fs.adl.oauth2.credential" -> dbutils.secrets.get(scope = "<scope-name>", key = "<key-name-for-service-credential>"),
  "fs.adl.oauth2.refresh.url" -> "https://login.microsoftonline.com/<directory-id>/oauth2/token")

// Optionally, you can add <directory-name> to the source URI of your mount point.
dbutils.fs.mount(
  source = "adl://<storage-resource>.azuredatalakestore.net/<directory-name>",
  mountPoint = "/mnt/<mount-name>",
  extraConfigs = configs)

where

<mount-name> is a DBFS path that represents where the Azure Data Lake Storage Gen1 account or a folder inside it (specified in source) will be mounted in DBFS.
dbutils.secrets.get(scope = "<scope-name>", key = "<key-name-for-service-credential>") retrieves the service credential that has been stored as a secret in a secret scope.
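
Running the mount command a second time for a mount point that already exists fails, so notebooks that are rerun often check first. The following is a minimal sketch under that assumption. The helper name `mount_if_absent` is hypothetical; it takes `dbutils` as a parameter and relies on `dbutils.fs.mounts()`, which lists the existing mounts with a `mountPoint` attribute on each entry.

```python
def mount_if_absent(dbutils, source: str, mount_point: str, configs: dict) -> bool:
    """Mount `source` at `mount_point` unless something is already mounted there.

    Returns True if a new mount was created, False if the mount point already existed.
    """
    existing = {m.mountPoint for m in dbutils.fs.mounts()}
    if mount_point in existing:
        return False
    dbutils.fs.mount(source=source, mount_point=mount_point, extra_configs=configs)
    return True


# In a notebook, where `dbutils` is predefined and `configs` is the dict shown above:
# mount_if_absent(
#     dbutils,
#     "adl://<storage-resource>.azuredatalakestore.net/<directory-name>",
#     "/mnt/<mount-name>",
#     configs)
```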

Access files in the mounted resource as if they were local files. For example:

Python

df = spark.read.format("text").load("/mnt/<mount-name>/....")
df = spark.read.format("text").load("dbfs:/mnt/<mount-name>/....")

Scala

val df = spark.read.format("text").load("/mnt/<mount-name>/....")
val df = spark.read.format("text").load("dbfs:/mnt/<mount-name>/....")

Set up service credentials for multiple accounts

You can set up service credentials for multiple Azure Data Lake Storage Gen1 accounts for use within a single Spark session by adding account.<account-name> to the configuration keys. For example, to set up credentials for both of the accounts behind adl://example1.azuredatalakestore.net and adl://example2.azuredatalakestore.net, you can do the following:

spark.conf.set("fs.adl.oauth2.access.token.provider.type", "ClientCredential")

spark.conf.set("fs.adl.account.example1.oauth2.client.id", "<application-id-example1>")
spark.conf.set("fs.adl.account.example1.oauth2.credential", dbutils.secrets.get(scope = "<scope-name>", key = "<key-name-for-service-credential-example1>"))
spark.conf.set("fs.adl.account.example1.oauth2.refresh.url", "https://login.microsoftonline.com/<directory-id-example1>/oauth2/token")

spark.conf.set("fs.adl.account.example2.oauth2.client.id", "<application-id-example2>")
spark.conf.set("fs.adl.account.example2.oauth2.credential", dbutils.secrets.get(scope = "<scope-name>", key = "<key-name-for-service-credential-example2>"))
spark.conf.set("fs.adl.account.example2.oauth2.refresh.url", "https://login.microsoftonline.com/<directory-id-example2>/oauth2/token")

This also works for the cluster Spark configuration:

spark.hadoop.fs.adl.oauth2.access.token.provider.type ClientCredential

spark.hadoop.fs.adl.account.example1.oauth2.client.id <application-id-example1>
spark.hadoop.fs.adl.account.example1.oauth2.credential <service-credential-example1>
spark.hadoop.fs.adl.account.example1.oauth2.refresh.url https://login.microsoftonline.com/<directory-id-example1>/oauth2/token

spark.hadoop.fs.adl.account.example2.oauth2.client.id <application-id-example2>
spark.hadoop.fs.adl.account.example2.oauth2.credential <service-credential-example2>
spark.hadoop.fs.adl.account.example2.oauth2.refresh.url https://login.microsoftonline.com/<directory-id-example2>/oauth2/token
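
The account-scoped key pattern above can be generated for any number of accounts. A minimal sketch follows; the function name `multi_account_configs` and the input keys `application_id`, `directory_id`, and `service_credential` are hypothetical. The resulting keys can be applied with spark.conf.set, or prefixed with spark.hadoop. for the cluster Spark configuration.

```python
def multi_account_configs(accounts: dict) -> dict:
    """Build account-scoped fs.adl.* settings.

    `accounts` maps an account name (for example "example1") to a dict with the
    hypothetical keys "application_id", "directory_id", and "service_credential".
    """
    # The token provider type is set once and shared by all accounts.
    configs = {"fs.adl.oauth2.access.token.provider.type": "ClientCredential"}
    for name, sp in accounts.items():
        prefix = f"fs.adl.account.{name}.oauth2"
        configs[f"{prefix}.client.id"] = sp["application_id"]
        configs[f"{prefix}.credential"] = sp["service_credential"]
        configs[f"{prefix}.refresh.url"] = (
            f"https://login.microsoftonline.com/{sp['directory_id']}/oauth2/token"
        )
    return configs
```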

The following notebook demonstrates how to access Azure Data Lake Storage Gen1 directly and with a mount.

ADLS Gen1 service principal notebook
