Identifying and resolving row size limit issue in Synapse ingestion with PySpark

Question

Gabriel25 525

We read a Parquet file with PySpark and loaded it into Synapse. However, our DataFrame contains records that exceed the 1 MB row size limit of Synapse (PolyBase). Our Databricks scripts fail with this error:

'The size of the schema/row at ordinal 'n' exceeds the maximum allowed row size of 1000000 bytes.'

I'm trying to find out which row in my DataFrame is causing this, but I can't identify it. I managed to print the length of each column, but how can I print the size of each record? Can someone please help?

Smaran Thoomu 24,110 Reputation points Microsoft External Staff Moderator

2024-03-03T16:24:52.7966667+00:00

@Vikranth-AI Following up to see whether the answer below was helpful. If it answers your question, please click Accept Answer and Yes for "Was this answer helpful". If you have any further questions, let us know.

Accepted answer

Answer 1

Hi @Vikranth-AI

Thank you for using Microsoft Q&A platform and thanks for your question.

I understand that you are facing an issue when ingesting data into Synapse with PySpark. The error occurs because at least one row exceeds the 1 MB row size limit of Synapse (PolyBase).

Use the code below to get the size of each row.

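rows = df.collect()  # brings every row to the driver; only safe when the data fits in driver memory
for i, rw in enumerate(rows):
    # Convert each value to str first; ''.join() raises TypeError on non-string or null columns
    text = ''.join('' if v is None else str(v) for v in rw)
    # UTF-8 byte length of the values; sys.getsizeof() would also count Python object overhead
    print(str(i) + ": " + str(len(text.encode('utf-8'))) + " bytes")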

This prints the size of each row in bytes.
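To go from printing every size to picking out only the rows over the limit, a small filter along these lines may help. This is a minimal sketch, not part of the original answer: the names `row_size_bytes`, `oversized_rows`, and `MAX_ROW_BYTES` are hypothetical, the size is only an estimate based on the UTF-8 length of each value (Synapse may measure a row differently), and the `sample` list stands in for real DataFrame rows.

```python
from typing import Any, Iterable, Iterator, Sequence, Tuple

# Limit quoted in the error message (1000000 bytes).
MAX_ROW_BYTES = 1_000_000


def row_size_bytes(row: Sequence[Any]) -> int:
    """Estimate a row's size as the UTF-8 byte length of its values."""
    total = 0
    for value in row:
        if value is None:
            continue  # assumption: nulls contribute no payload
        if isinstance(value, (bytes, bytearray)):
            total += len(value)  # binary columns are counted as-is
        else:
            total += len(str(value).encode("utf-8"))
    return total


def oversized_rows(
    rows: Iterable[Sequence[Any]], limit: int = MAX_ROW_BYTES
) -> Iterator[Tuple[int, int]]:
    """Yield (position, estimated size) for each row larger than limit."""
    for position, row in enumerate(rows):
        size = row_size_bytes(row)
        if size > limit:
            yield position, size


# In-memory stand-in for DataFrame rows (a PySpark Row behaves like a tuple).
sample = [
    ("a", 1, None),
    ("x" * 1_200_000, 2, "ok"),
    ("é" * 10, 3, b"\x00\x01"),
]

for position, size in oversized_rows(sample):
    print(f"row {position}: {size} bytes")  # prints: row 1: 1200003 bytes
```

With a real DataFrame, `df.toLocalIterator()` could be passed in place of `sample`, which streams rows to the driver one partition at a time instead of collecting everything at once. The reported position is only the iteration order, so printing a key column alongside it is usually more useful for locating the record.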

Hope this helps. If this answers your question, please click Accept Answer and Yes for "Was this answer helpful". If you have any further questions, let us know.
