Accessing Azure Data Lake Storage Gen1 from Azure Databricks

03/01/2024

Microsoft has announced the planned retirement of Azure Data Lake Storage Gen1 (formerly Azure Data Lake Store, also known as ADLS) and recommends that all users migrate to Azure Data Lake Storage Gen2. Databricks recommends upgrading to Azure Data Lake Storage Gen2 for the best performance and new features.

There are two ways to access Azure Data Lake Storage Gen1:

- Pass your Microsoft Entra ID (formerly Azure Active Directory) credentials, also known as credential passthrough.
- Use a service principal directly.

Access automatically with your Microsoft Entra ID (formerly Azure Active Directory) credentials

You can authenticate automatically to Azure Data Lake Storage Gen1 from Azure Databricks clusters using the same Microsoft Entra ID identity that you use to log in to Azure Databricks. When you enable a cluster for Microsoft Entra ID credential passthrough, commands that you run on that cluster can read and write data in Azure Data Lake Storage Gen1 without requiring you to configure service principal credentials for access to storage.

For complete setup and usage instructions, see "Access Azure Data Lake Storage using Microsoft Entra ID (formerly Azure Active Directory) credential passthrough (legacy)".

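Create a service principal and grant permissions

If your selected access method requires a service principal with adequate permissions and you do not have one, follow these steps:

1. Create a Microsoft Entra ID (formerly Azure Active Directory) application and service principal that can access resources. Note the following properties:
   - application-id: An ID that uniquely identifies the client application.
   - directory-id: An ID that uniquely identifies the Microsoft Entra ID instance.
   - service-credential: A string that the application uses to prove its identity.
2. Register the service principal on the Azure Data Lake Storage Gen1 account, granting the correct role assignment, such as Contributor.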
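
The three properties noted above feed the fs.adl.oauth2.* settings that appear in every later snippet of this article. As a minimal sketch, the helper below assembles them into a dictionary. The name build_adl_oauth_configs is hypothetical and not part of any Databricks API; in a notebook, the service_credential argument would normally come from dbutils.secrets.get rather than a literal string.

```python
def build_adl_oauth_configs(application_id, directory_id, service_credential):
    """Return the fs.adl.oauth2.* settings for one service principal.

    Hypothetical helper for illustration only. The keys and the token
    URL format are the ones used in this article.
    """
    return {
        "fs.adl.oauth2.access.token.provider.type": "ClientCredential",
        "fs.adl.oauth2.client.id": application_id,
        "fs.adl.oauth2.credential": service_credential,
        "fs.adl.oauth2.refresh.url":
            f"https://login.microsoftonline.com/{directory_id}/oauth2/token",
    }


# Placeholders stand in for real values, as elsewhere in this article.
configs = build_adl_oauth_configs(
    "<application-id>", "<directory-id>", "<service-credential>"
)

for key in configs:
    # In a Databricks notebook you would call spark.conf.set(key, configs[key]) here.
    print(key)
```

Keeping the settings in one dictionary means the same values can be applied with spark.conf.set or passed to a mount call without retyping the keys.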

Access directly with Spark APIs using a service principal and OAuth 2.0

To read from your Azure Data Lake Storage Gen1 account, you can configure Spark to use service credentials with the following snippet in your notebook:

spark.conf.set("fs.adl.oauth2.access.token.provider.type", "ClientCredential")
spark.conf.set("fs.adl.oauth2.client.id", "<application-id>")
spark.conf.set("fs.adl.oauth2.credential", dbutils.secrets.get(scope = "<scope-name>", key = "<key-name-for-service-credential>"))
spark.conf.set("fs.adl.oauth2.refresh.url", "https://login.microsoftonline.com/<directory-id>/oauth2/token")

where

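dbutils.secrets.get(scope = "<scope-name>", key = "<key-name-for-service-credential>") retrieves the service credential that has been stored as a secret in a secret scope.

After you have set up your credentials, you can use standard Spark and Databricks APIs to access the resources. For example: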

val df = spark.read.format("parquet").load("adl://<storage-resource>.azuredatalakestore.net/<directory-name>")

dbutils.fs.ls("adl://<storage-resource>.azuredatalakestore.net/<directory-name>")

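Azure Data Lake Storage Gen1 provides directory-level access control, so the service principal must have access to the directories that you want to read from as well as to the Azure Data Lake Storage Gen1 resource.

Access through metastore

To access adl:// locations specified in the metastore, you must specify the Hadoop credential configuration options as Spark options when you create the cluster. Add the spark.hadoop. prefix to the corresponding Hadoop configuration keys so that they propagate to the Hadoop configurations used by the metastore: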

spark.hadoop.fs.adl.oauth2.access.token.provider.type ClientCredential
spark.hadoop.fs.adl.oauth2.client.id <application-id>
spark.hadoop.fs.adl.oauth2.credential <service-credential>
spark.hadoop.fs.adl.oauth2.refresh.url https://login.microsoftonline.com/<directory-id>/oauth2/token

Warning

These credentials are available to all users who access the cluster.
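
The prefix rule described above is mechanical, so it can be sketched in a few lines of Python. The function name to_cluster_spark_config is hypothetical; it only renders text in the "key value" form shown in the cluster configuration above and does not call any Databricks API.

```python
def to_cluster_spark_config(hadoop_configs):
    """Render Hadoop settings as cluster Spark config lines.

    Each Hadoop key gets the spark.hadoop. prefix and is written as
    "key value", one setting per line.
    """
    return "\n".join(
        f"spark.hadoop.{key} {value}" for key, value in hadoop_configs.items()
    )


# Placeholder values, as elsewhere in this article.
hadoop_configs = {
    "fs.adl.oauth2.access.token.provider.type": "ClientCredential",
    "fs.adl.oauth2.client.id": "<application-id>",
}

rendered = to_cluster_spark_config(hadoop_configs)
print(rendered)
```

Because anything written into the cluster configuration is visible to every user of that cluster, treat the rendered text with the same care as the credential itself.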

Mount an Azure Data Lake Storage Gen1 resource or folder

To mount an Azure Data Lake Storage Gen1 resource or a folder inside it, use the following command:

Python

configs = {"fs.adl.oauth2.access.token.provider.type": "ClientCredential",
          "fs.adl.oauth2.client.id": "<application-id>",
          "fs.adl.oauth2.credential": dbutils.secrets.get(scope = "<scope-name>", key = "<key-name-for-service-credential>"),
          "fs.adl.oauth2.refresh.url": "https://login.microsoftonline.com/<directory-id>/oauth2/token"}

# Optionally, you can add <directory-name> to the source URI of your mount point.
dbutils.fs.mount(
  source = "adl://<storage-resource>.azuredatalakestore.net/<directory-name>",
  mount_point = "/mnt/<mount-name>",
  extra_configs = configs)

Scala

val configs = Map(
  "fs.adl.oauth2.access.token.provider.type" -> "ClientCredential",
  "fs.adl.oauth2.client.id" -> "<application-id>",
  "fs.adl.oauth2.credential" -> dbutils.secrets.get(scope = "<scope-name>", key = "<key-name-for-service-credential>"),
  "fs.adl.oauth2.refresh.url" -> "https://login.microsoftonline.com/<directory-id>/oauth2/token")

// Optionally, you can add <directory-name> to the source URI of your mount point.
dbutils.fs.mount(
  source = "adl://<storage-resource>.azuredatalakestore.net/<directory-name>",
  mountPoint = "/mnt/<mount-name>",
  extraConfigs = configs)

where

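<mount-name> is a DBFS path that represents where the Azure Data Lake Storage Gen1 account or a folder inside it (specified in source) will be mounted in DBFS.
dbutils.secrets.get(scope = "<scope-name>", key = "<key-name-for-service-credential>") retrieves the service credential that has been stored as a secret in a secret scope.

Access files in the mounted location as if they were local files. For example: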

Python

df = spark.read.format("text").load("/mnt/<mount-name>/....")
df = spark.read.format("text").load("dbfs:/mnt/<mount-name>/....")

Scala

val df = spark.read.format("text").load("/mnt/<mount-name>/....")
val df = spark.read.format("text").load("dbfs:/mnt/<mount-name>/....")
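
Calling dbutils.fs.mount on a mount point that already exists fails, so notebooks that may run more than once often check the current mounts first. The sketch below shows one way to do that check. The is_mounted helper is hypothetical, and the MountInfo tuple is only a stand-in for the entries returned by dbutils.fs.mounts() so that the example runs outside Databricks; it assumes each entry exposes a mountPoint attribute.

```python
from collections import namedtuple


def is_mounted(mount_point, mounts):
    """Return True if mount_point appears in the given mount listing."""
    return any(m.mountPoint == mount_point for m in mounts)


# Stand-in for the entries returned by dbutils.fs.mounts().
MountInfo = namedtuple("MountInfo", ["mountPoint", "source"])
existing = [
    MountInfo(
        "/mnt/<mount-name>",
        "adl://<storage-resource>.azuredatalakestore.net/<directory-name>",
    )
]

already_there = is_mounted("/mnt/<mount-name>", existing)
missing = is_mounted("/mnt/<other-mount-name>", existing)
print(already_there, missing)
```

In a notebook, you would pass dbutils.fs.mounts() as the second argument and call dbutils.fs.mount only when the function returns False.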

Set up service credentials for multiple accounts

You can set up service credentials for multiple Azure Data Lake Storage Gen1 accounts for use within a single Spark session by adding account.<account-name> to the configuration keys. For example, to set up credentials for accessing both adl://example1.azuredatalakestore.net and adl://example2.azuredatalakestore.net, do the following:

spark.conf.set("fs.adl.oauth2.access.token.provider.type", "ClientCredential")

spark.conf.set("fs.adl.account.example1.oauth2.client.id", "<application-id-example1>")
spark.conf.set("fs.adl.account.example1.oauth2.credential", dbutils.secrets.get(scope = "<scope-name>", key = "<key-name-for-service-credential-example1>"))
spark.conf.set("fs.adl.account.example1.oauth2.refresh.url", "https://login.microsoftonline.com/<directory-id-example1>/oauth2/token")

spark.conf.set("fs.adl.account.example2.oauth2.client.id", "<application-id-example2>")
spark.conf.set("fs.adl.account.example2.oauth2.credential", dbutils.secrets.get(scope = "<scope-name>", key = "<key-name-for-service-credential-example2>"))
spark.conf.set("fs.adl.account.example2.oauth2.refresh.url", "https://login.microsoftonline.com/<directory-id-example2>/oauth2/token")

This also works for the cluster Spark configuration:

spark.hadoop.fs.adl.oauth2.access.token.provider.type ClientCredential

spark.hadoop.fs.adl.account.example1.oauth2.client.id <application-id-example1>
spark.hadoop.fs.adl.account.example1.oauth2.credential <service-credential-example1>
spark.hadoop.fs.adl.account.example1.oauth2.refresh.url https://login.microsoftonline.com/<directory-id-example1>/oauth2/token

spark.hadoop.fs.adl.account.example2.oauth2.client.id <application-id-example2>
spark.hadoop.fs.adl.account.example2.oauth2.credential <service-credential-example2>
spark.hadoop.fs.adl.account.example2.oauth2.refresh.url https://login.microsoftonline.com/<directory-id-example2>/oauth2/token
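
The per-account keys follow a fixed pattern, which makes them easy to generate instead of typing each one by hand. The following is a minimal sketch under that assumption; account_oauth_configs is a hypothetical helper, and the placeholder strings stand in for real values that would normally come from a secret scope.

```python
def account_oauth_configs(account_name, application_id, directory_id, service_credential):
    """Return the account-scoped fs.adl.account.<account-name>.oauth2.* settings."""
    prefix = f"fs.adl.account.{account_name}.oauth2"
    return {
        f"{prefix}.client.id": application_id,
        f"{prefix}.credential": service_credential,
        f"{prefix}.refresh.url":
            f"https://login.microsoftonline.com/{directory_id}/oauth2/token",
    }


# The provider type is set once, without an account name, as in the snippets above.
configs = {"fs.adl.oauth2.access.token.provider.type": "ClientCredential"}

for name in ("example1", "example2"):
    configs.update(
        account_oauth_configs(
            name,
            f"<application-id-{name}>",
            f"<directory-id-{name}>",
            f"<service-credential-{name}>",
        )
    )

for key in sorted(configs):
    # In a Databricks notebook you would call spark.conf.set(key, configs[key]) here.
    print(key)
```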

The following notebook demonstrates how to access Azure Data Lake Storage Gen1 directly and with a mount.

ADLS Gen1 service principal notebook

