PySpark Synapse pool write performance

WZIAFP 157 Reputation points
2022-09-15T15:47:23.487+00:00

I am writing a pandas DataFrame to my Synapse dedicated SQL pool, using SQLAlchemy with the code below:

from sqlalchemy import event

# engine was created earlier with a pyodbc connection string for the dedicated pool
@event.listens_for(engine, "before_cursor_execute")
def receive_before_cursor_execute(conn, cursor, statement, params, context, executemany):
    # Enable pyodbc's fast_executemany so batched INSERTs go out in one round trip
    if executemany:
        cursor.fast_executemany = True

df3.to_sql("test", engine, index=False, if_exists="append", schema="dbo")
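
For reference, newer SQLAlchemy releases (1.3+) can pass pyodbc's fast_executemany flag directly to create_engine, which achieves the same effect without the event listener; the connection URL below is a placeholder, not my real connection string:

from sqlalchemy import create_engine

# fast_executemany is forwarded to pyodbc by the mssql+pyodbc dialect;
# the URL is a placeholder for the dedicated pool's connection string.
engine = create_engine(
    "mssql+pyodbc://<user>:<password>@<workspace>.sql.azuresynapse.net/"
    "<dedicated_pool>?driver=ODBC+Driver+17+for+SQL+Server",
    fast_executemany=True,
)

df3.to_sql("test", engine, index=False, if_exists="append", schema="dbo")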

The DataFrame df3 is 24,138 rows x 12 columns, yet writing it to the Synapse table takes 41 minutes.

I've monitored my SQL pool during the write, and the metrics show a peak of only 3 DWUs, so I don't see scaling up the pool as a solution.
I've also scaled up the Spark pool, and performance has not changed.

What other factors could be causing such poor write performance?


Accepted answer
  ShaikMaheer-MSFT 37,896 Reputation points Microsoft Employee
  2022-09-16T14:33:58.773+00:00

    Hi @WZIAFP ,

    Thank you for posting query in Microsoft Q&A Platform.

    I have seen some articles online which say that writing with SQLAlchemy can be slow.

    There are several Python SQL drivers available; however, Microsoft places its testing efforts and its confidence in the pyodbc driver. See the Microsoft documentation on Python SQL drivers to learn more.

    Could you please try the pyodbc driver and see how it behaves? The video below gives a high-level overview of the pyodbc library:
    Query SQL Server DB from Spark notebook in Azure Synapse Analytics | pyodbc to query SQL server
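
    As a minimal sketch of the pyodbc route: the server, database, and credential values below are placeholders you would replace with your workspace details, and the INSERT assumes the 12 columns mentioned in the question:

    import pyodbc

    # Placeholder connection details for the dedicated SQL pool (assumptions).
    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};"
        "SERVER=<workspace>.sql.azuresynapse.net;"
        "DATABASE=<dedicated_pool>;"
        "UID=<user>;PWD=<password>"
    )
    cursor = conn.cursor()
    cursor.fast_executemany = True  # batch parameter sets into one round trip

    # One executemany call for all rows instead of a round trip per row.
    rows = list(df3.itertuples(index=False, name=None))
    cursor.executemany(
        "INSERT INTO dbo.test VALUES (?,?,?,?,?,?,?,?,?,?,?,?)",
        rows,
    )
    conn.commit()
    cursor.close()
    conn.close()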

    Hope this helps. Please let me know how it goes.

