Unable to perform spark.sql in Synapse notebook

Austin Schafer 96 Reputation points
2021-07-21T17:53:22.367+00:00

Hello,

I am unable to run a simple spark.sql() query (e.g. df = spark.sql("SELECT * FROM table1")) in Synapse notebooks. I can load and view the files without using SQL, but any spark.sql() call fails for every file type I have tried, including CSV and Parquet.
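For reference, a minimal version of what I am running looks roughly like this (the ADLS path and table name are placeholders):

# spark is the session Synapse provides in the notebook.
# Reading and displaying the file directly works fine:
df = spark.read.parquet("abfss://container@account.dfs.core.windows.net/data/table1")
df.show()

# But any spark.sql() call fails with the error below:
df.createOrReplaceTempView("table1")
result = spark.sql("SELECT * FROM table1")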

I have tried different-sized clusters, restarting the cluster, different Spark versions, and switching the language from PySpark to Scala. My workspace has permission to access my data in ADLS Gen2. Apologies if this question has already been answered elsewhere. Below is the error I am receiving.

AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.NullPointerException;
Traceback (most recent call last):

File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/session.py", line 767, in sql
return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)

File "/opt/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in call
answer, self.gateway_client, self.target_id, self.name)

File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 75, in deco
raise AnalysisException(s.split(': ', 1)[1], stackTrace)

pyspark.sql.utils.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.NullPointerException;

Thanks


Accepted answer
  1. Austin Schafer 96 Reputation points
    2021-08-04T17:52:51.997+00:00

    Posting the solution I was given after contacting support:

    This is a bug that sometimes occurs when the workspace is created. After I created a new workspace and ran the same commands, the code worked great.
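
    For anyone hitting the same thing, a quick sanity check that exercises only the metastore (no table or file access) is:

    # This touches only the Hive metastore; in an affected workspace even
    # this can fail with the same HiveException / NullPointerException.
    spark.sql("SHOW DATABASES").show()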


2 additional answers

  1. Ryan Abbey 1,181 Reputation points
    2021-07-21T21:55:11.467+00:00

    How are you loading the data in PySpark? Via forPath? Did you do a "saveAsTable" on creation, or run any subsequent table-creation command?
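
    For context, spark.sql only resolves tables that are registered in the metastore; reading by path alone does not register anything. A rough sketch of the difference (paths are illustrative):

    from delta.tables import DeltaTable

    # Path-based access: no catalog entry is created, so
    # spark.sql("SELECT * FROM table1") will not find the table.
    dt = DeltaTable.forPath(spark, "abfss://container@account.dfs.core.windows.net/delta/table1")

    # saveAsTable registers the table in the catalog, making it
    # queryable by name through spark.sql afterwards.
    df = spark.read.format("delta").load("abfss://container@account.dfs.core.windows.net/delta/table1")
    df.write.format("delta").saveAsTable("table1")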


  2. Computer Mike 86 Reputation points
    2022-02-04T17:44:54.443+00:00

    I changed the code in cell 33:

    # Write data to a new managed catalog table.

    # Old:
    # data.write.format("delta").saveAsTable("ManagedDeltaTable")

    # New:
    data.write.format("delta").mode("overwrite").saveAsTable("ManagedDeltaTable")
    
