Azure Synapse: the same notebook runs fine on its own but fails when run in a pipeline

Chris Zhang 36 Reputation points
2022-03-23T02:43:27.597+00:00

Trying to read data from an Azure Data Lake Storage Gen1 account added through 'Connect to external data':

df = spark.read.load('adl://udc-c09.azuredatalakestore.net/Demo.tsv', format='csv', delimiter='\t')

The above code works fine when I run it inside the notebook, but as soon as I run the same notebook inside a pipeline it fails with the error below.
My suspicion is that when running directly from the notebook, it runs under my user account; when run from a pipeline, the job runs under some other identity that isn't authorized. Is there a way to fix this authorization issue?
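
One way to test this suspicion is to print the identity the session is running under. A minimal sketch, assuming the Synapse runtime's mssparkutils helper is available:

# Print the user the Spark session runs as. Interactively this is typically
# your own AAD account; in a pipeline run it reflects whichever identity
# the job executes under instead.
from notebookutils import mssparkutils
print(mssparkutils.env.getUserName())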

Error
{
"errorCode": "6002",
"message": "Py4JJavaError: An error occurred while calling o228.load.\n: org.apache.hadoop.security.AccessControlException: GETFILESTATUS failed with error 0x83090aa2 (Forbidden. ACL verification failed. Either the resource does not exist or the user is not authorized to perform the requested operation.). [43dd9afd-e3d4-4c01-a9be-23d93767b5f7] failed with error 0x83090aa2 (Forbidden. ACL verification failed. Either the resource does not exist or the user is not authorized to perform the requested operation.). [43dd9afd-e3d4-4c01-a9be-23d93767b5f7][2022-03-22T19:26:29.2544901-07:00] [ServerRequestId:43dd9afd-e3d4-4c01-a9be-23d93767b5f7]\n\tat sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)\n\tat sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)\n\tat sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)\n\tat java.lang.reflect.Constructor.newInstance(Constructor.java:423)\n\tat com.microsoft.azure.datalake.store.ADLStoreClient.getRemoteException(ADLStoreClient.java:1299)\n\tat com.microsoft.azure.datalake.store.ADLStoreClient.getExceptionFromResponse(ADLStoreClient.java:1264)\n\tat com.microsoft.azure.datalake.store.ADLStoreClient.getDirectoryEntry(ADLStoreClient.java:815)\n\tat org.apache.hadoop.fs.adl.AdlFileSystem.getFileStatus(AdlFileSystem.java:504)\n\tat org.apache.hadoop.fs.FileSystem.isDirectory(FileSystem.java:1713)\n\tat org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:47)\n\tat org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:377)\n\tat org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:325)\n\tat org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:308)\n\tat scala.Option.getOrElse(Option.scala:189)\n\tat org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:308)\n\tat org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:240)\n\tat sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)\n\tat sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)\n\tat sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\n\tat java.lang.reflect.Method.invoke(Method.java:498)\n\tat py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)\n\tat py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)\n\tat py4j.Gateway.invoke(Gateway.java:282)\n\tat py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)\n\tat py4j.commands.CallCommand.execute(CallCommand.java:79)\n\tat py4j.GatewayConnection.run(GatewayConnection.java:238)\n\tat java.lang.Thread.run(Thread.java:748)\n\nTraceback (most recent call last):\n\n File \"/opt/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py\", line 204, in load\n return self._df(self._jreader.load(path))\n\n File \"/home/trusted-service-user/cluster-env/env/lib/python3.8/site-packages/py4j/java_gateway.py\", line 1304, in call\n return_value = get_return_value(\n\n File \"/opt/spark/python/lib/pyspark.zip/pyspark/sql/utils.py\", line 111, in deco\n return f(*a, **kw)\n\n File \"/home/trusted-service-user/cluster-env/env/lib/python3.8/site-packages/py4j/protocol.py\", line 326, in get_return_value\n raise Py4JJavaError(\n\npy4j.protocol.Py4JJavaError: An error occurred while calling o228.load.\n: org.apache.hadoop.security.AccessControlException: GETFILESTATUS failed with error 0x83090aa2 (Forbidden. ACL verification failed. 
Either the resource does not exist or the user is not authorized to perform the requested operation.). [43dd9afd-e3d4-4c01-a9be-23d93767b5f7] failed with error 0x83090aa2 (Forbidden. ACL verification failed. Either the resource does not exist or the user is not authorized to perform the requested operation.). [43dd9afd-e3d4-4c01-a9be-23d93767b5f7][2022-03-22T19:26:29.2544901-07:00] [ServerRequestId:43dd9afd-e3d4-4c01-a9be-23d93767b5f7]\n\tat sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)\n\tat sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)\n\tat sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)\n\tat java.lang.reflect.Constructor.newInstance(Constructor.java:423)\n\tat com.microsoft.azure.datalake.store.ADLStoreClient.getRemoteException(ADLStoreClient.java:1299)\n\tat com.microsoft.azure.datalake.store.ADLStoreClient.getExceptionFromResponse(ADLStoreClient.java:1264)\n\tat com.microsoft.azure.datalake.store.ADLStoreClient.getDirectoryEntry(ADLStoreClient.java:815)\n\tat org.apache.hadoop.fs.adl.AdlFileSystem.getFileStatus(AdlFileSystem.java:504)\n\tat org.apache.hadoop.fs.FileSystem.isDirectory(FileSystem.java:1713)\n\tat org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:47)\n\tat org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:377)\n\tat org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:325)\n\tat org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:308)\n\tat scala.Option.getOrElse(Option.scala:189)\n\tat org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:308)\n\tat org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:240)\n\tat sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)\n\tat sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)\n\tat sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\n\tat java.lang.reflect.Method.invoke(Method.java:498)\n\tat py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)\n\tat py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)\n\tat py4j.Gateway.invoke(Gateway.java:282)\n\tat py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)\n\tat py4j.commands.CallCommand.execute(CallCommand.java:79)\n\tat py4j.GatewayConnection.run(GatewayConnection.java:238)\n\tat java.lang.Thread.run(Thread.java:748)\n\n",
"failureType": "UserError",
"target": "Runbook",
"details": []
}

Azure Synapse Analytics

Accepted answer
  1. PRADEEPCHEEKATLA-MSFT 77,751 Reputation points Microsoft Employee
    2022-03-25T05:49:06.677+00:00

    Hello @Chris Zhang ,

    Thanks for the question and using MS Q&A platform.

    You may follow the steps below to grant the service principal or the managed identity the proper permissions on the ADLS Gen1 storage account.

    [Animation: granting ACL permissions in Data explorer > Access (synapse-adlsgen1acl.gif)]

    • As source: In Data explorer > Access, grant at least Execute permission on ALL upstream folders, including the root, along with Read permission on the files to copy. You can choose to add the entry to This folder and all children for recursive application, and add it as both an access permission and a default permission entry. There is no requirement on account-level access control (IAM).
    • As sink: In Data explorer > Access, grant at least Execute permission on ALL upstream folders, including the root, along with Write permission on the sink folder. You can choose to add the entry to This folder and all children for recursive application, and add it as both an access permission and a default permission entry.
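
    Once the ACLs are granted, the notebook can authenticate as the service principal explicitly instead of relying on the interactive user. A minimal sketch using the Hadoop adl connector's OAuth2 client-credential settings; the <...> values are placeholders, not values from this thread:

    # Configure the adl:// filesystem to authenticate with a service principal,
    # so the read behaves the same in interactive and pipeline runs.
    spark.conf.set("fs.adl.oauth2.access.token.provider.type", "ClientCredential")
    spark.conf.set("fs.adl.oauth2.client.id", "<service-principal-application-id>")
    spark.conf.set("fs.adl.oauth2.credential", "<service-principal-secret>")
    spark.conf.set("fs.adl.oauth2.refresh.url",
                   "https://login.microsoftonline.com/<tenant-id>/oauth2/token")

    df = spark.read.load('adl://udc-c09.azuredatalakestore.net/Demo.tsv',
                         format='csv', delimiter='\t')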


    For more details, refer to Use service principal authentication in ADLS Gen1 accounts.

    Hope this helps. Please let us know if you have any further queries.


1 additional answer

  1. Pratik Somaiya 4,201 Reputation points
    2022-03-23T04:07:46.947+00:00

    Hello @Chris Zhang

    You are right, this is an access issue, and you need to resolve it by granting access to your ADLS account.

    The error shows the GETFILESTATUS call was forbidden, which means the identity running the job doesn't have Read access to your files.

    You need to grant that identity access to the ADLS account under Access Control (IAM), assigning an appropriate role, as below:

    [Screenshot: role assignment under Access Control (IAM) (185830-image.png)]
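
    Once access is granted, a quick sanity check from the notebook helps confirm it before re-running the pipeline. A sketch, assuming mssparkutils is available in the Synapse runtime:

    # Listing the folder fails fast with a clear Forbidden error if the
    # grant is still missing, instead of a long Spark stack trace on read.
    from notebookutils import mssparkutils
    mssparkutils.fs.ls('adl://udc-c09.azuredatalakestore.net/')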

    1 person found this answer helpful.