Hi @Binhan Xi
I'm glad you were able to resolve your issue, and thank you for posting your solution so that others experiencing the same thing can easily reference it! Since the Microsoft Q&A community has a policy that "The question author cannot accept their own answer. They can only accept answers by others," I'll repost your solution below in case you'd like to accept the answer.
Ask: I have a Synapse workspace notebook that reads data from ADLS Gen2. I created a linked service in the Synapse workspace to ADLS Gen2 using a service principal (SPI) + certificate. However, when I tried to authenticate in my notebook following the documentation in this link, the notebook ran for a long time and finally failed with a Py4JJavaError. The error appears to be an HTTP read timeout, so I suspect it is related to the network or an HTTP timeout configuration, but I'm not sure where those can be configured.
Could someone help me understand and resolve this issue?
Py4JJavaError: An error occurred while calling o4168.load.
: java.util.concurrent.ExecutionException: Status code: -1 error code: null error message: Auth failure: HTTP Error -1 CustomTokenProvider getAccessToken threw java.io.IOException : Read timed out
org.apache.hadoop.fs.azurebfs.oauth2.AzureADAuthenticator$HttpException: HTTP Error -1 CustomTokenProvider getAccessToken threw java.io.IOException : Read timed out
    at com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:306)
    ...
    at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
    ... 37 more
Caused by: org.apache.hadoop.fs.azurebfs.oauth2.AzureADAuthenticator$HttpException: HTTP Error -1 CustomTokenProvider getAccessToken threw java.io.IOException : Read timed out
    at ...
    ... 136 more
Here is the doc
Here is the code that I am using:
from delta.tables import DeltaTable

# Storage account endpoint and the Synapse linked service (SPI + cert) that should supply the token
input_storage_account_name = "mssalesfdlakeprod.dfs.core.windows.net"
spark.conf.set(f"spark.storage.synapse.{input_storage_account_name}.linkedServiceName", "MSSalesFDLProd")

# Point the ABFS driver at the linked-service-based token provider for this account
sc._jsc.hadoopConfiguration().set(f"fs.azure.account.oauth.provider.type.{input_storage_account_name}", "com.microsoft.azure.synapse.tokenlibrary.LinkedServiceBasedTokenProvider")

# Read the Delta table from ADLS Gen2
input_path = "abfss://securezone@mssalesfdlakeprod.dfs.core.windows.net/Domain/dbo.BillingStatus/"
df = spark.read.format("delta").load(input_path)
display(df)
Solution: Unfortunately, increasing the timeout settings did not work for me. However, I found a way to temporarily mitigate the issue: every time I start a new Spark session, I modify my linked service (for example, switch the authentication from SPI + cert to SPI + secret), publish it, change it back, and publish it again. After that, the Spark notebook runs successfully.
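For reference, "increasing the timeout settings" here refers to the ABFS OAuth token-fetch retry/backoff knobs, which look roughly like this. This is only a sketch: the exact hadoop-azure configuration keys available (and whether they help) depend on the Synapse Spark runtime version, and the values shown are illustrative.

# Sketch only: raise the OAuth token-fetch retry/backoff limits used by the ABFS driver.
# These keys come from hadoop-azure; availability depends on the runtime version.
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.azure.oauth.token.fetch.retry.max.retries", "10")
hadoop_conf.set("fs.azure.oauth.token.fetch.retry.max.backoff.interval", "60000")  # milliseconds
hadoop_conf.set("fs.azure.io.retry.max.retries", "30")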
This workaround feels very strange to me, and it cannot be a good long-term solution, because eventually the notebook will run automatically as part of a scheduled pipeline that moves data, and my current workaround does not work in that situation.
BTW, the fact that the notebook runs normally after I toggle the linked service suggests that the linked service itself, using SPI + cert, is not the problem.
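As a side note (not part of the original solution), one way to sanity-check the linked service independently of the ABFS read path is to ask the Synapse credentials utility for the linked service's credentials directly in the notebook. This is only a suggested diagnostic, and the exact helper available depends on your runtime:

from notebookutils import mssparkutils

# If this call returns quickly, the linked service (SPI + cert) is likely healthy
# and the timeout is happening further down, in the ABFS token-provider call.
creds = mssparkutils.credentials.getConnectionStringOrCreds("MSSalesFDLProd")
print(type(creds))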
If I missed anything please let me know and I'd be happy to add it to my answer, or feel free to comment below with any additional information.
If you have any other questions, please let me know. Thank you again for your time and patience throughout this issue.
Please don't forget to Accept Answer and mark Yes for "was this answer helpful" wherever the information provided helps you, as this can be beneficial to other community members.