Timeout error while reading Parquet files in an Azure Synapse notebook

Daniel Adaji 5 Reputation points Microsoft Employee
2023-02-27T13:05:13.51+00:00

I am getting the error below while reading Parquet files:

An error occurred while calling o34901.parquet.
: Status code: -1 error code: null error message: InvalidAbfsRestOperationExceptionjava.net.SocketTimeoutException: Read timed out
	at org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.executeHttpOperation(AbfsRestOperation.java:278)
	at org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.execute(AbfsRestOperation.java:186)
	at org.apache.hadoop.fs.azurebfs.services.AbfsClient.listPath(AbfsClient.java:229)
	at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.listStatus(AzureBlobFileSystemStore.java:944)
	at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.listStatus(AzureBlobFileSystemStore.java:914)
	at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.listStatus(AzureBlobFileSystemStore.java:896)
	at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.listStatus(AzureBlobFileSystem.java:405)
	at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1972)
	at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:2014)
	at org.apache.spark.util.BlobListingFileSystemWrapper.listStatus(BlobListingFileSystemWrapper.scala:56)
	at org.apache.spark.util.BlobListingFileSystemWrapper.listStatus(BlobListingFileSystemWrapper.scala:70)
	at org.apache.spark.util.HadoopFSUtils$.listLeafFiles(HadoopFSUtils.scala:225)
	at org.apache.spark.util.HadoopFSUtils$.$anonfun$parallelListLeafFilesInternal$1(HadoopFSUtils.scala:95)
	at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at scala.collection.TraversableLike.map(TraversableLike.scala:286)
	at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
	at scala.collection.AbstractTraversable.map(Traversable.scala:108)
	at org.apache.spark.util.HadoopFSUtils$.parallelListLeafFilesInternal(HadoopFSUtils.scala:85)
	at org.apache.spark.util.HadoopFSUtils$.parallelListLeafFiles(HadoopFSUtils.scala:69)
	at org.apache.spark.sql.execution.datasources.InMemoryFileIndex$.bulkListLeafFiles(InMemoryFileIndex.scala:158)
	at org.apache.spark.sql.execution.datasources.InMemoryFileIndex.listLeafFiles(InMemoryFileIndex.scala:131)
	at org.apache.spark.sql.execution.datasources.InMemoryFileIndex.refresh0(InMemoryFileIndex.scala:94)
	at org.apache.spark.sql.execution.datasources.InMemoryFileIndex.<init>(InMemoryFileIndex.scala:66)
	at org.apache.spark.sql.execution.datasources.DataSource.createInMemoryFileIndex(DataSource.scala:565)
	at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:409)
	at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:281)
	at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:253)
	at scala.Option.getOrElse(Option.scala:189)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:253)
	at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:608)
	at sun.reflect.GeneratedMethodAccessor111.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:750)
Caused by: java.net.SocketTimeoutException: Read timed out
	at java.net.SocketInputStream.socketRead0(Native Method)
	at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
	at java.net.SocketInputStream.read(SocketInputStream.java:171)
	at java.net.SocketInputStream.read(SocketInputStream.java:141)
	at org.wildfly.openssl.OpenSSLSocket.read(OpenSSLSocket.java:423)
	at org.wildfly.openssl.OpenSSLInputStream.read(OpenSSLInputStream.java:41)
	at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
	at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
	at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
	at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:743)
	at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:678)
	at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1595)
	at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1500)
	at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:480)
	at sun.net.www.protocol.https.HttpsURLConnectionImpl.getResponseCode(HttpsURLConnectionImpl.java:352)
	at org.apache.hadoop.fs.azurebfs.services.AbfsHttpOperation.processResponse(AbfsHttpOperation.java:330)
	at org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.executeHttpOperation(AbfsRestOperation.java:249)
	... 42 more
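
For context, the read is a plain PySpark call along these lines (the actual storage account, container, and path are replaced with placeholders here):

    # Read Parquet files from ADLS Gen2 over the ABFS driver.
    # <container>, <storage-account>, and the path are placeholders.
    df = spark.read.parquet(
        "abfss://<container>@<storage-account>.dfs.core.windows.net/path/to/data/"
    )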

1 answer

  1. BhargavaGunnam-MSFT 26,136 Reputation points Microsoft Employee
    2023-03-06T22:45:41.2366667+00:00

    Hello Daniel Adaji,

    Welcome to the MS Q&A platform.

    Per the error message "InvalidAbfsRestOperationExceptionjava.net.SocketTimeoutException: Read timed out", the read operation against Azure Storage timed out. In this trace the timeout occurred in AbfsClient.listPath, i.e., while Spark was listing the files in the directory before reading them. This can be caused by a slow network connection or by a large amount of data (for a listing, a large number of files) being transferred.

    You can try increasing the timeout for the read operation by setting the spark.hadoop.fs.azure.read.request.timeout configuration property in Synapse. The default value is 120 seconds; you can raise it to a higher value, such as 300 seconds, by adding the following line to your Spark code:

    spark.conf.set("spark.hadoop.fs.azure.read.request.timeout", "300")
    

    You can also try increasing the number of retries for the read operation by setting the spark.hadoop.fs.azure.read.request.maxRetries configuration property. The default value is 4; you can raise it to a higher value, such as 8, by adding the following line to your Spark code:

    spark.conf.set("spark.hadoop.fs.azure.read.request.maxRetries", "8")
    
    

    I hope this helps. Please let us know if you have any further questions.
