GeoSpatial with SparkSQL/Python in Synapse Spark Pool using apache-sedona?

BjornD Jensen 51 Reputation points
2022-05-09T12:49:22.04+00:00

I would like to run spatial queries on large data sets, for which e.g. geopandas would be too slow. I found inspiration here: https://anant-sharma.medium.com/apache-sedona-geospark-using-pyspark-e60485318fbe
But I have trouble registering the spatial functions I would like to use in SparkSQL (or PySpark).

In Spark Pool of Synapse Analytics I prepared (via Azure Portal):
Apache Spark Pool / Settings / Packages / Requirement files / requirement.txt: apache-sedona

Apache Spark Pool / Settings / Packages / Workspace packages:
geotools-wrapper-geotools-24.1.jar
sedona-sql-3.0_2.12-1.2.0-incubating.jar

Apache Spark Pool / Settings / Packages / Spark configuration / config.txt:
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.kryo.registrator org.apache.sedona.core.serde.SedonaKryoRegistrator

Pyspark Notebook:

print(spark.version)
print(spark.conf.get("spark.kryo.registrator"))
print(spark.conf.get("spark.serializer"))

Print output from notebook:
3.1.2.5.0-58001107
org.apache.sedona.core.serde.SedonaKryoRegistrator
org.apache.spark.serializer.KryoSerializer

Then trying:

from pyspark.sql import SparkSession
from sedona.register import SedonaRegistrator  
from sedona.utils import SedonaKryoRegistrator, KryoSerializer
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("Sedona App")
    .config("spark.serializer", KryoSerializer.getName)
    .config("spark.kryo.registrator", SedonaKryoRegistrator.getName)
    .getOrCreate()
)
SedonaRegistrator.registerAll(spark)

But it failed:

Py4JJavaError: An error occurred while calling o636.count. : org.apache.spark.SparkException: Job aborted due to stage failure: Task serialization failed: org.apache.spark.SparkException: Failed to register classes with Kryo

If everything were installed correctly, a simple check like this would probably work:

%%sql
SELECT ST_Point(0,0);

Please help me get the spatial functions registered in PySpark running in a Synapse notebook!

Azure Synapse Analytics

Accepted answer
  1. PRADEEPCHEEKATLA-MSFT 84,371 Reputation points Microsoft Employee
    2022-05-10T13:04:16.913+00:00

    Hello @BjornD Jensen ,

    Thanks for the question and using MS Q&A platform.

Reproducing this on my end, I was able to run the above commands without any issue.

    I just installed the requirement.txt file and downloaded the two jar files below:

    • sedona-python-adapter-3.0_2.12-1.0.0-incubating.jar
    • geotools-wrapper-geotools-24.0.jar

    Note: the config.txt file is not required.

    (screenshot of the working notebook omitted)

    If you are still facing the same error, please share the complete stack trace of the error message you are experiencing.

    Hope this will help. Please let us know if any further queries.


1 additional answer

  1. BjornD Jensen 51 Reputation points
    2022-05-10T17:54:53.853+00:00

    Turns out I used the wrong jar... :-0
    I can continue now. Thanks for helping!
    But let me know if you have a hint about how to get the total list of available spatial functions in the particular Spark session.

    Here is a refined version that seems to work (-:

    Uploading the workspace packages (2 jars) in Synapse Studio / Manage / Configuration + libraries / Workspace packages:
    geotools-wrapper-geotools-24.1.jar (downloaded from https://mvnrepository.com/artifact/org.datasyslab/geotools-wrapper/geotools-24.1 )
    sedona-python-adapter-3.0_2.12-1.0.0-incubating.jar (downloaded from https://search.maven.org/artifact/org.apache.sedona/sedona-python-adapter-3.0_2.12/1.0.0-incubating/jar )

    Then in Apache Spark Pool / Settings / Packages / Workspace packages: selecting the above workspace packages.

    Uploading the requirements file:
    Apache Spark Pool / Settings / Packages / Requirement files / requirements.txt : apache-sedona

    Further uploading config.txt:
    Apache Spark Pool / Settings / Packages / Spark configuration / config.txt:
    spark.serializer org.apache.spark.serializer.KryoSerializer
    spark.kryo.registrator org.apache.sedona.core.serde.SedonaKryoRegistrator

    The above configuration allows me to write fewer lines:

    from sedona.register import SedonaRegistrator  
    from sedona.utils import SedonaKryoRegistrator, KryoSerializer
    SedonaRegistrator.registerAll(spark)
    print(spark.version)
    print(spark.conf.get("spark.kryo.registrator"))
    print(spark.conf.get("spark.serializer"))
    

    And now the fun starts:

    %%sql
    SELECT st_point(0.0,0.0);
    