PySpark Synapse pool write performance

WZIAFP 157 Reputation points
2022-09-15T15:47:23.487+00:00

I am writing a pandas DataFrame to my Synapse dedicated SQL pool, using SQLAlchemy with the code below:

from sqlalchemy import event

# engine was created earlier with a pyodbc connection string for the dedicated pool
@event.listens_for(engine, "before_cursor_execute")
def receive_before_cursor_execute(conn, cursor, statement, params, context, executemany):
    # Enable pyodbc's fast_executemany so batched INSERTs go out in one round trip
    if executemany:
        cursor.fast_executemany = True

df3.to_sql("test", engine, index=False, if_exists="append", schema="dbo")
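
For reference, newer SQLAlchemy releases (1.3+) can pass pyodbc's fast_executemany flag directly to create_engine, which achieves the same effect without the event listener; the connection URL below is a placeholder, not my real connection string:

from sqlalchemy import create_engine

# fast_executemany is forwarded to pyodbc by the mssql+pyodbc dialect;
# the URL is a placeholder for the dedicated pool's connection string.
engine = create_engine(
    "mssql+pyodbc://<user>:<password>@<workspace>.sql.azuresynapse.net/"
    "<dedicated_pool>?driver=ODBC+Driver+17+for+SQL+Server",
    fast_executemany=True,
)

df3.to_sql("test", engine, index=False, if_exists="append", schema="dbo")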

The DataFrame df3 is 24,138 rows x 12 columns, yet writing it to the Synapse table takes 41 minutes.

I've monitored my SQL pool during the write, and the metrics show a peak of only 3 DWUs, so I don't see scaling up the pool as a solution.
I've also scaled up the Spark pool, and performance has not changed.

What other factors could be causing such poor write performance?


Accepted answer
  ShaikMaheer-MSFT 37,896 Reputation points Microsoft Employee
  2022-09-16T14:33:58.773+00:00

    Hi @WZIAFP ,

    Thank you for posting query in Microsoft Q&A Platform.

    I have seen some articles online which say that writing with SQLAlchemy can be slow.

    There are several Python SQL drivers available; however, Microsoft places its testing efforts and its confidence in the pyodbc driver. See the Microsoft documentation on Python SQL drivers to learn more.

    Could you please try the pyodbc driver and see how it behaves? The video below gives a high-level overview of the pyodbc library:
    Query SQL Server DB from Spark notebook in Azure Synapse Analytics | pyodbc to query SQL server
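
    As a minimal sketch of the pyodbc route: the server, database, and credential values below are placeholders you would replace with your workspace details, and the INSERT assumes the 12 columns mentioned in the question:

    import pyodbc

    # Placeholder connection details for the dedicated SQL pool (assumptions).
    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};"
        "SERVER=<workspace>.sql.azuresynapse.net;"
        "DATABASE=<dedicated_pool>;"
        "UID=<user>;PWD=<password>"
    )
    cursor = conn.cursor()
    cursor.fast_executemany = True  # batch parameter sets into one round trip

    # One executemany call for all rows instead of a round trip per row.
    rows = list(df3.itertuples(index=False, name=None))
    cursor.executemany(
        "INSERT INTO dbo.test VALUES (?,?,?,?,?,?,?,?,?,?,?,?)",
        rows,
    )
    conn.commit()
    cursor.close()
    conn.close()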

    Hope this helps. Please let me know how it goes.

