Azure Synapse: the same notebook runs fine on its own but fails when run in a pipeline

Chris Zhang 36 Reputation points
2022-03-23T02:43:27.597+00:00

Trying to read data from an Azure Data Lake Storage Gen1 account added through 'Connect to external data':

df = spark.read.load('adl://udc-c09.azuredatalakestore.net/Demo.tsv', format='csv', delimiter='\t')

The above code works fine when I run it inside the notebook, but as soon as I run the same notebook inside a pipeline it fails with the error below.
My suspicion is that when running directly from the notebook, it runs under my user account; when run from a pipeline, the job runs under some other identity that isn't authorized. Is there a way to fix this authorization issue?
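
One way to test this suspicion is to print the identity the session is running under. A minimal sketch, assuming the Synapse runtime's mssparkutils helper is available:

# Print the user the Spark session runs as. Interactively this is typically
# your own AAD account; in a pipeline run it reflects whichever identity
# the job executes under instead.
from notebookutils import mssparkutils
print(mssparkutils.env.getUserName())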

Error
{
"errorCode": "6002",
"message": "Py4JJavaError: An error occurred while calling o228.load.\n: org.apache.hadoop.security.AccessControlException: GETFILESTATUS failed with error 0x83090aa2 (Forbidden. ACL verification failed. Either the resource does not exist or the user is not authorized to perform the requested operation.). [43dd9afd-e3d4-4c01-a9be-23d93767b5f7] failed with error 0x83090aa2 (Forbidden. ACL verification failed. Either the resource does not exist or the user is not authorized to perform the requested operation.). [43dd9afd-e3d4-4c01-a9be-23d93767b5f7][2022-03-22T19:26:29.2544901-07:00] [ServerRequestId:43dd9afd-e3d4-4c01-a9be-23d93767b5f7]\n\tat sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)\n\tat sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)\n\tat sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)\n\tat java.lang.reflect.Constructor.newInstance(Constructor.java:423)\n\tat com.microsoft.azure.datalake.store.ADLStoreClient.getRemoteException(ADLStoreClient.java:1299)\n\tat com.microsoft.azure.datalake.store.ADLStoreClient.getExceptionFromResponse(ADLStoreClient.java:1264)\n\tat com.microsoft.azure.datalake.store.ADLStoreClient.getDirectoryEntry(ADLStoreClient.java:815)\n\tat org.apache.hadoop.fs.adl.AdlFileSystem.getFileStatus(AdlFileSystem.java:504)\n\tat org.apache.hadoop.fs.FileSystem.isDirectory(FileSystem.java:1713)\n\tat org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:47)\n\tat org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:377)\n\tat org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:325)\n\tat org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:308)\n\tat scala.Option.getOrElse(Option.scala:189)\n\tat org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:308)\n\tat org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:240)\n\tat sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)\n\tat sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)\n\tat sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\n\tat java.lang.reflect.Method.invoke(Method.java:498)\n\tat py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)\n\tat py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)\n\tat py4j.Gateway.invoke(Gateway.java:282)\n\tat py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)\n\tat py4j.commands.CallCommand.execute(CallCommand.java:79)\n\tat py4j.GatewayConnection.run(GatewayConnection.java:238)\n\tat java.lang.Thread.run(Thread.java:748)\n\nTraceback (most recent call last):\n\n File \"/opt/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py\", line 204, in load\n return self._df(self._jreader.load(path))\n\n File \"/home/trusted-service-user/cluster-env/env/lib/python3.8/site-packages/py4j/java_gateway.py\", line 1304, in call\n return_value = get_return_value(\n\n File \"/opt/spark/python/lib/pyspark.zip/pyspark/sql/utils.py\", line 111, in deco\n return f(*a, **kw)\n\n File \"/home/trusted-service-user/cluster-env/env/lib/python3.8/site-packages/py4j/protocol.py\", line 326, in get_return_value\n raise Py4JJavaError(\n\npy4j.protocol.Py4JJavaError: An error occurred while calling o228.load.\n: org.apache.hadoop.security.AccessControlException: GETFILESTATUS failed with error 0x83090aa2 (Forbidden. ACL verification failed. 
Either the resource does not exist or the user is not authorized to perform the requested operation.). [43dd9afd-e3d4-4c01-a9be-23d93767b5f7] failed with error 0x83090aa2 (Forbidden. ACL verification failed. Either the resource does not exist or the user is not authorized to perform the requested operation.). [43dd9afd-e3d4-4c01-a9be-23d93767b5f7][2022-03-22T19:26:29.2544901-07:00] [ServerRequestId:43dd9afd-e3d4-4c01-a9be-23d93767b5f7]\n\tat sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)\n\tat sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)\n\tat sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)\n\tat java.lang.reflect.Constructor.newInstance(Constructor.java:423)\n\tat com.microsoft.azure.datalake.store.ADLStoreClient.getRemoteException(ADLStoreClient.java:1299)\n\tat com.microsoft.azure.datalake.store.ADLStoreClient.getExceptionFromResponse(ADLStoreClient.java:1264)\n\tat com.microsoft.azure.datalake.store.ADLStoreClient.getDirectoryEntry(ADLStoreClient.java:815)\n\tat org.apache.hadoop.fs.adl.AdlFileSystem.getFileStatus(AdlFileSystem.java:504)\n\tat org.apache.hadoop.fs.FileSystem.isDirectory(FileSystem.java:1713)\n\tat org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:47)\n\tat org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:377)\n\tat org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:325)\n\tat org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:308)\n\tat scala.Option.getOrElse(Option.scala:189)\n\tat org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:308)\n\tat org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:240)\n\tat sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)\n\tat sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)\n\tat sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\n\tat java.lang.reflect.Method.invoke(Method.java:498)\n\tat py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)\n\tat py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)\n\tat py4j.Gateway.invoke(Gateway.java:282)\n\tat py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)\n\tat py4j.commands.CallCommand.execute(CallCommand.java:79)\n\tat py4j.GatewayConnection.run(GatewayConnection.java:238)\n\tat java.lang.Thread.run(Thread.java:748)\n\n",
"failureType": "UserError",
"target": "Runbook",
"details": []
}

Azure Synapse Analytics

Accepted answer
  1. PRADEEPCHEEKATLA-MSFT 77,751 Reputation points Microsoft Employee
    2022-03-25T05:49:06.677+00:00

    Hello @Chris Zhang ,

    Thanks for the question and using MS Q&A platform.

    You may follow the steps below to grant the service principal or the managed identity the proper permissions on the ADLS Gen1 storage account.

    [Animation: granting ACL permissions in Data explorer > Access (synapse-adlsgen1acl.gif)]

    • As source: In Data explorer > Access, grant at least Execute permission on ALL upstream folders, including the root, along with Read permission on the files to copy. You can choose to add the entry to This folder and all children for recursive application, and add it as both an access permission and a default permission entry. There is no requirement on account-level access control (IAM).
    • As sink: In Data explorer > Access, grant at least Execute permission on ALL upstream folders, including the root, along with Write permission on the sink folder. You can choose to add the entry to This folder and all children for recursive application, and add it as both an access permission and a default permission entry.
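
    Once the ACLs are granted, the notebook can authenticate as the service principal explicitly instead of relying on the interactive user. A minimal sketch using the Hadoop adl connector's OAuth2 client-credential settings; the <...> values are placeholders, not values from this thread:

    # Configure the adl:// filesystem to authenticate with a service principal,
    # so the read behaves the same in interactive and pipeline runs.
    spark.conf.set("fs.adl.oauth2.access.token.provider.type", "ClientCredential")
    spark.conf.set("fs.adl.oauth2.client.id", "<service-principal-application-id>")
    spark.conf.set("fs.adl.oauth2.credential", "<service-principal-secret>")
    spark.conf.set("fs.adl.oauth2.refresh.url",
                   "https://login.microsoftonline.com/<tenant-id>/oauth2/token")

    df = spark.read.load('adl://udc-c09.azuredatalakestore.net/Demo.tsv',
                         format='csv', delimiter='\t')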


    For more details, refer to Use service principal authentication in ADLS Gen1 accounts.

    Hope this helps. Please let us know if you have any further queries.


1 additional answer

  1. Pratik Somaiya 4,201 Reputation points
    2022-03-23T04:07:46.947+00:00

    Hello @Chris Zhang

    You are right, this is an access issue, and you need to resolve it by granting access to your ADLS account.

    The error shows the GETFILESTATUS call was forbidden, which means the identity running the job doesn't have Read access to your files.

    You need to grant that identity access to the ADLS account under Access Control (IAM), assigning an appropriate role, as below:

    [Screenshot: role assignment under Access Control (IAM) (185830-image.png)]
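
    Once access is granted, a quick sanity check from the notebook helps confirm it before re-running the pipeline. A sketch, assuming mssparkutils is available in the Synapse runtime:

    # Listing the folder fails fast with a clear Forbidden error if the
    # grant is still missing, instead of a long Spark stack trace on read.
    from notebookutils import mssparkutils
    mssparkutils.fs.ls('adl://udc-c09.azuredatalakestore.net/')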

    1 person found this answer helpful.