Identifying and resolving row size limit issue in Synapse ingestion with PySpark

Gabriel25 525 Reputation points
2024-03-01T11:36:29.83+00:00

We loaded a Parquet file with PySpark and wrote it to Synapse. However, some records in our dataframe exceed the 1 MB row size limit that Synapse (PolyBase) enforces. Our Databricks script fails with this error:

'The size of the schema/row at ordinal 'n' exceeds the maximum allowed row size of 1000000 bytes.'

I'm trying to work out which row in my dataframe triggers this error, but I can't identify it. I managed to print the length of each column, but how can I print the size of each record? Can someone please help?

Azure Synapse Analytics
An Azure analytics service that brings together data integration, enterprise data warehousing, and big data analytics. Previously known as Azure SQL Data Warehouse.

Accepted answer
  1. Smaran Thoomu 24,110 Reputation points Microsoft External Staff Moderator
    2024-03-01T12:12:12.7166667+00:00

    Hi @Vikranth-AI

    Thank you for using Microsoft Q&A platform and thanks for your question.

    I understand that you are facing an issue ingesting data into Synapse with PySpark. The error occurs because Synapse (PolyBase) limits each row to 1 MB.

    Use the code below to get the size of each row.

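    # collect() pulls every row to the driver, so use this only when the data fits in driver memory.
    rows = df.collect()
    for i, rw in enumerate(rows):
        # Sum the UTF-8 byte length of each non-null value; str() handles non-string columns.
        size = sum(len(str(v).encode("utf-8")) for v in rw if v is not None)
        print(f"row {i}: {size} bytes")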

    This prints the approximate size of each row in bytes.
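For a dataframe too large to collect, the same per-row measurement can be pushed to the executors so that only the offending rows come back to the driver. The following is a minimal sketch, not a definitive implementation: `row_size_bytes`, `find_oversized_rows`, and `LIMIT_BYTES` are hypothetical names introduced here, `df` is assumed to be the PySpark DataFrame from the question, and the size is an estimate based on the UTF-8 length of each value. PolyBase evaluates the row against the target table's column definitions, so the numbers may not match its calculation exactly.

```python
LIMIT_BYTES = 1_000_000  # limit quoted in the error message (assumed threshold)


def row_size_bytes(row):
    """Estimate a row's size as the total byte length of its non-null values."""
    total = 0
    for value in row:
        if value is None:
            continue
        if isinstance(value, (bytes, bytearray)):
            # Binary columns already have a byte length.
            total += len(value)
        else:
            # Everything else is measured by its UTF-8 string form.
            total += len(str(value).encode("utf-8"))
    return total


def find_oversized_rows(df, limit=LIMIT_BYTES):
    """Return (estimated_size, row) pairs for rows whose estimate exceeds `limit`.

    The map and filter run on the executors through the DataFrame's RDD,
    so only rows over the limit are collected to the driver.
    """
    return (
        df.rdd
        .map(lambda row: (row_size_bytes(row), row))
        .filter(lambda pair: pair[0] > limit)
        .collect()
    )
```

Calling `find_oversized_rows(df)` returns a list that can be printed or inspected. Because a PySpark `Row` behaves like a tuple, `row_size_bytes` can also be tried on plain tuples without a Spark session.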

    Hope this helps. Do let us know if you have any further queries.

    If this answers your query, please click Accept Answer and Yes for "Was this answer helpful".

