Synapse Spark v3.4 CREATE DATABASE testdb LOCATION '<>' Not working on Spark 3.4

Question

Arthur Steijn 56

Executing the following statement on the Apache Spark 3.4 runtime results in an error. I tried this on several vanilla Spark pool configurations in different tenants in West Europe.

database_name = 'test_sppool34'
target_location = "abfss://******@XXXXXXXX.dfs.core.windows.net/"

stmnt = f"CREATE DATABASE IF NOT EXISTS {database_name} LOCATION '{target_location}'"
spark.sql(stmnt)

---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
Cell In[23], line 5
      2 target_location = "abfss://******@XXXXXXXXX.dfs.core.windows.net/"
      4 stmnt = f"CREATE DATABASE IF NOT EXISTS {database_name} LOCATION '{target_location}';"
----> 5 spark.sql(stmnt)

File /opt/spark/python/lib/pyspark.zip/pyspark/sql/session.py:1440, in SparkSession.sql(self, sqlQuery, args, **kwargs)
   1438 try:
   1439     litArgs = {k: _to_java_column(lit(v)) for k, v in (args or {}).items()}
-> 1440     return DataFrame(self._jsparkSession.sql(sqlQuery, litArgs), self)
   1441 finally:
   1442     if len(kwargs) > 0:

File ~/cluster-env/env/lib/python3.10/site-packages/py4j/java_gateway.py:1322, in JavaMember.__call__(self, *args)
   1316 command = proto.CALL_COMMAND_NAME +\
   1317     self.command_header +\
   1318     args_command +\
   1319     proto.END_COMMAND_PART
   1321 answer = self.gateway_client.send_command(command)
-> 1322 return_value = get_return_value(
   1323     answer, self.gateway_client, self.target_id, self.name)
   1325 for temp_arg in temp_args:
   1326     if hasattr(temp_arg, "_detach"):

File /opt/spark/python/lib/pyspark.zip/pyspark/errors/exceptions/captured.py:169, in capture_sql_exception.<locals>.deco(*a, **kw)
    167 def deco(*a: Any, **kw: Any) -> Any:
    168     try:
--> 169         return f(*a, **kw)
    170     except Py4JJavaError as e:
    171         converted = convert_exception(e.java_exception)

File ~/cluster-env/env/lib/python3.10/site-packages/py4j/protocol.py:326, in get_return_value(answer, gateway_client, target_id, name)
    324 value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
    325 if answer[1] == REFERENCE_TYPE:
--> 326     raise Py4JJavaError(
    327         "An error occurred while calling {0}{1}{2}.\n".
    328         format(target_id, ".", name), value)
    329 else:
    330     raise Py4JError(
    331         "An error occurred while calling {0}{1}{2}. Trace:\n{3}\n".
    332         format(target_id, ".", name, value))

Py4JJavaError: An error occurred while calling o277.sql.
: java.lang.NoClassDefFoundError: Could not initialize class org.json4s.jackson.Serialization$
	at com.microsoft.azure.synapse.TokenServiceClient.invokePostApi(TokenServiceClient.scala:93)
	at com.microsoft.azure.synapse.TokenServiceClient.callTokenApi(TokenServiceClient.scala:152)
	at com.microsoft.azure.synapse.tokenlibrary.TokenLibraryInternal.tokenServiceCall$1(TokenLibrary.scala:115)
	at com.microsoft.azure.synapse.tokenlibrary.TokenLibraryInternal.$anonfun$getAccessToken$4(TokenLibrary.scala:124)
	at com.microsoft.azure.synapse.tokenlibrary.TokenLibraryInternal.getFromCacheOrCallTokenService(TokenLibrary.scala:73)
	at com.microsoft.azure.synapse.tokenlibrary.TokenLibraryInternal.getAccessToken(TokenLibrary.scala:124)
	at com.microsoft.azure.synapse.tokenlibrary.TokenLibrary$.getAccessToken(TokenLibrary.scala:468)
	at com.microsoft.azure.synapse.tokenlibrary.SessionTokenBasedTokenProvider.$anonfun$getAccessToken$1(SessionTokenBasedTokenProvider.scala:128)
	at scala.util.Try$.apply(Try.scala:213)
	at com.microsoft.azure.synapse.tokenlibrary.SessionTokenBasedTokenProvider.getAccessToken(SessionTokenBasedTokenProvider.scala:126)
	at org.apache.hadoop.fs.azurebfs.oauth2.CustomTokenProviderAdapter.refreshToken(CustomTokenProviderAdapter.java:74)
	at org.apache.hadoop.fs.azurebfs.oauth2.AccessTokenProvider.getToken(AccessTokenProvider.java:50)
	at org.apache.hadoop.fs.azurebfs.services.AbfsClient.getAccessToken(AbfsClient.java:1055)
	at org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.executeHttpOperation(AbfsRestOperation.java:256)
	at org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.completeExecute(AbfsRestOperation.java:217)
	at org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.lambda$execute$0(AbfsRestOperation.java:191)
	at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.trackDurationOfInvocation(IOStatisticsBinding.java:464)
	at org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.execute(AbfsRestOperation.java:189)
	at org.apache.hadoop.fs.azurebfs.services.AbfsClient.getAclStatus(AbfsClient.java:911)
	at org.apache.hadoop.fs.azurebfs.services.AbfsClient.getAclStatus(AbfsClient.java:892)
	at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.getIsNamespaceEnabled(AzureBlobFileSystemStore.java:421)
	at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.getFileStatus(AzureBlobFileSystemStore.java:1036)
	at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.getFileStatus(AzureBlobFileSystem.java:650)
	at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.getFileStatus(AzureBlobFileSystem.java:640)
	at org.apache.hadoop.hive.metastore.Warehouse.isDir(Warehouse.java:520)
	at com.microsoft.catalog.metastore.metastoreclient.HiveMetastoreClientImp.makeDirs(HiveMetastoreClientImp.java:260)
	at com.microsoft.catalog.metastore.metastoreclient.HiveMetastoreClientImp.createDatabase(HiveMetastoreClientImp.java:158)
	at com.microsoft.catalog.metastore.metastoreclient.HiveMetastoreClient.createDatabase(HiveMetastoreClient.java:762)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at com.microsoft.catalog.metastore.metastoreclient.PerformanceTelemetryHiveMetastoreClientInvoker.invoke(PerformanceTelemetryHiveMetastoreClientInvoker.java:26)
	at com.sun.proxy.$Proxy122.createDatabase(Unknown Source)
	at org.apache.hadoop.hive.ql.metadata.Hive.createDatabase(Hive.java:430)
	at org.apache.spark.sql.hive.client.Shim_v0_12.createDatabase(HiveShim.scala:574)
	at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$createDatabase$1(HiveClientImpl.scala:349)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
	at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:304)
	at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:234)
	at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:233)
	at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:283)
	at org.apache.spark.sql.hive.client.HiveClientImpl.createDatabase(HiveClientImpl.scala:346)
	at org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$createDatabase$1(HiveExternalCatalog.scala:193)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
	at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:102)
	at org.apache.spark.sql.hive.HiveExternalCatalog.createDatabase(HiveExternalCatalog.scala:193)
	at org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.createDatabase(ExternalCatalogWithListener.scala:47)
	at org.apache.spark.sql.catalyst.catalog.SessionCatalog.createDatabase(SessionCatalog.scala:317)
	at org.apache.spark.sql.execution.datasources.v2.V2SessionCatalog.createNamespace(V2SessionCatalog.scala:307)
	at org.apache.spark.sql.connector.catalog.DelegatingCatalogExtension.createNamespace(DelegatingCatalogExtension.java:163)
	at org.apache.spark.sql.execution.datasources.v2.CreateNamespaceExec.run(CreateNamespaceExec.scala:47)
	at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result$lzycompute(V2CommandExec.scala:43)
	at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result(V2CommandExec.scala:43)
	at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.executeCollect(V2CommandExec.scala:49)
	at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:152)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:120)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:209)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:105)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:827)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:67)
	at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:152)
	at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:145)
	at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:512)
	at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:104)
	at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:512)
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:32)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:32)
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:32)
	at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:488)
	at org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:145)
	at org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:129)
	at org.apache.spark.sql.execution.QueryExecution.commandExecuted(QueryExecution.scala:123)
	at org.apache.spark.sql.Dataset.<init>(Dataset.scala:230)
	at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:100)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:827)
	at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:97)
	at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:640)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:827)
	at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:630)
	at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:662)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.base/java.lang.Thread.run(Thread.java:829)
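
For readers skimming the trace above: the root exception is a NoClassDefFoundError for org.json4s.jackson.Serialization$, and the top frames are in the Synapse token library, reached from the ABFS client while the metastore checks the database directory. That suggests the failure occurs while acquiring a storage access token rather than in SQL parsing. A minimal sketch of extracting those two facts from a pasted Py4J trace follows; `summarize_java_trace` is a hypothetical helper written for this illustration, not part of PySpark or Py4J.

```python
import re


def summarize_java_trace(trace: str, max_frames: int = 3) -> dict:
    """Return the root Java exception and the first few stack frames."""
    exception = None
    frames = []
    for line in trace.splitlines():
        stripped = line.strip()
        # Py4J prints the Java exception as ": <class>: <message>".
        match = re.match(r"^: ([\w.$]+): (.*)$", stripped)
        if match and exception is None:
            exception = {"type": match.group(1), "message": match.group(2)}
        elif stripped.startswith("at ") and len(frames) < max_frames:
            # Drop the "(File.scala:93)" suffix to keep only class.method.
            frames.append(stripped[3:].split("(")[0])
    return {"exception": exception, "frames": frames}


# Shortened copy of the trace from the question.
sample = """Py4JJavaError: An error occurred while calling o277.sql.
: java.lang.NoClassDefFoundError: Could not initialize class org.json4s.jackson.Serialization$
\tat com.microsoft.azure.synapse.TokenServiceClient.invokePostApi(TokenServiceClient.scala:93)
\tat com.microsoft.azure.synapse.TokenServiceClient.callTokenApi(TokenServiceClient.scala:152)
"""

summary = summarize_java_trace(sample)
print(summary["exception"]["type"])
print(summary["frames"][0])
```

Including this kind of summary in a support case may make it easier to match the report against the known issue.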

Bhargava-MSFT 31,261 Reputation points Microsoft Employee Moderator

2024-04-12T20:38:41.3733333+00:00

Hello Arthur Steijn,

Can you please try creating the database with a different name? (I am checking to see if this error is related to a duplicate name.)

If target_location is not in the default Synapse storage account, could you please try writing to the default storage account and see whether the error persists?

Arthur Steijn 56 Reputation points

2024-04-15T07:13:16.34+00:00

The code above worked fine on v3.3. Running it on the new Apache Spark version 3.4 results in the error. I tried this on different storage accounts, with different database names, and with different settings on the newly created Spark pool.

Bhargava-MSFT 31,261 Reputation points Microsoft Employee Moderator

2024-04-15T18:48:16.89+00:00

Hello Arthur Steijn,

I was able to reproduce the issue with Spark 3.4 when writing to a container other than the primary one (the database was created successfully on the primary container).

I was able to create a database with Spark 3.3 on any container.

This seems to be a bug in Spark 3.4.

Please confirm the same from your end so that I can check with my internal team and get back to you.
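
The reproduction described above (the primary container succeeds, any other container fails on Spark 3.4) could be checked with a sketch like the following. `build_create_db_stmt` and `try_locations` are hypothetical helpers, and the database, container, and account names are placeholders, not values from this thread. Inside a Synapse notebook, `spark.sql` would be passed as `run_sql`.

```python
from typing import Callable, Dict


def build_create_db_stmt(database_name: str, location: str) -> str:
    """Build the CREATE DATABASE statement used in the question."""
    return f"CREATE DATABASE IF NOT EXISTS {database_name} LOCATION '{location}'"


def try_locations(
    run_sql: Callable[[str], object], locations: Dict[str, str]
) -> Dict[str, str]:
    """Run the statement once per location and record 'ok' or the error type."""
    results = {}
    for database_name, location in locations.items():
        try:
            run_sql(build_create_db_stmt(database_name, location))
            results[database_name] = "ok"
        except Exception as exc:  # e.g. Py4JJavaError in a notebook
            results[database_name] = f"failed: {type(exc).__name__}"
    return results


# Placeholder names (assumptions); replace with real containers and accounts.
locations = {
    "test_primary": "abfss://<primary-container>@<account>.dfs.core.windows.net/",
    "test_secondary": "abfss://<other-container>@<account>.dfs.core.windows.net/",
}

# In a Synapse notebook: print(try_locations(spark.sql, locations))
```

If the behaviour matches this thread, only the entry pointing at the primary container would be expected to report "ok" on an affected Spark 3.4 pool.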
Arthur Steijn 56 Reputation points

2024-04-16T06:17:34.6466667+00:00

Hi, @Bhargava-MSFT

This is correct. The primary filesystem and container is the only place where the statement works!

Bhargava-MSFT 31,261 Reputation points Microsoft Employee Moderator

2024-04-16T17:25:33.04+00:00

Hello Arthur Steijn,

Thanks for the confirmation. I have reached out to the product group (PG). I will get back to you as soon as I hear from them.

Bhargava-MSFT 31,261 Reputation points Microsoft Employee Moderator

2024-04-19T19:35:08.53+00:00

Hello Arthur Steijn,

Just an update.

We have created an ICM, and the PG is working on the fix. I will update you once the fix is deployed.

Arthur Steijn 56 Reputation points

2024-04-30T05:14:20.72+00:00

Hello @Bhargava-MSFT ,

Any updates?

Bhargava-MSFT 31,261 Reputation points Microsoft Employee Moderator

2024-04-30T16:57:00.9366667+00:00

Hello Arthur Steijn,

The ETA for the fix is currently the end of May.

As a workaround, the PG is temporarily mitigating the issue for customers by pinning older VHDs to workspaces and Spark pools. The pinned VHDs will remain valid until their expiration date.

Could you please provide me with your Synapse workspace name and Spark pool details so I can work with the PG to unblock you?

Dimitrios Papadoulis 0 Reputation points

2024-05-01T13:18:49.9666667+00:00

@Bhargava-MSFT hi there!

Quick question: how are we supposed to save to a specific location under Spark 3.4 until this is properly fixed?

Martin B 126 Reputation points

2024-05-07T20:15:10.6966667+00:00

We are facing the same issue. Our support case has been open since 2024-04-08, and there is still no solution.

Bhargava-MSFT 31,261 Reputation points Microsoft Employee Moderator

2024-05-07T21:18:46.3366667+00:00

Hello Martin B,

The PG has confirmed that this issue is due to a bug. Until the fix is deployed, please ask the support engineer to apply the workaround: pinning older VHDs to workspaces and Spark pools.

Martin B 126 Reputation points

2024-05-08T05:56:47.0466667+00:00

Hi @Bhargava-MSFT,
Thanks!
It is basically a "downgrade" to a prior version of the VHD, right?
Will this workaround have any other effects on functionality or security?

Bhargava-MSFT 31,261 Reputation points Microsoft Employee Moderator

2024-05-08T15:05:13.31+00:00

Hello Martin B,

Yes, you are correct. This is essentially a 'downgrade' to a previous version of the VHD.

I don't believe this change will affect any other functionality, but please confirm with the support engineer.

Accepted answer

Answer 1

Hello Arthur Steijn,

I see that the support team has resolved your issue. I am sharing the workaround they used so that it can help other community users facing the same issue.

Workaround:

We have an active ICM that the PG is currently addressing. As a temporary mitigation, the PG is pinning older Virtual Hard Disks (VHDs) to customers' workspaces and Spark pools. These pinned VHDs will remain valid until their expiration date.

If you have a support plan, please proceed to submit a support case. Otherwise, kindly provide me with your Synapse workspace name and Spark pool details so I can collaborate with PG to resolve the issue for you.
