Automated ML (interface) - maximum allowed size of parquet file exceeded

Linus Östlund 20 Reputation points
2024-08-21T07:25:27.6166667+00:00

I have created an MLTable with SDK v2. I can read it in my workspace, and convert it to a df without any problems. My goal is to use the Automated ML for this dataset. When I try to load the data, I get the following:

Error loading data preview
For the provided parquet file, row group size 124.01 MB exceeds the maximum allowed size of 20.97 MB. Please use MLTable sdk to explore the data or for legacy dataset generate a full profile to continue.

Clicking "More details", I can read:

ScriptExecution.StreamAccess.Validation.ParquetSize-ParquetReadSizeLimit
For the provided parquet file, row group size 124.01 MB exceeds the maximum allowed size of 20.97 MB. Please use MLTable sdk to explore the data or for legacy dataset generate a full profile to continue.

Stack trace:
Error: Request failed with status code 400
    at lO (https://ml.azure.com/assets/index-xxxxxx.js:210:4072)
    at cO (https://ml.azure.com/assets/index-xxxxxx.js:210:4257)
    at XMLHttpRequest.D (https://ml.azure.com/assets/index-xxxxxxx.js:211:1665)

Googling the error leads me to this GitHub issue:

https://github.com/Azure/azure-sdk-for-python/issues/31320

The issue concludes with:

Update to azureml-dataprep 4.12.9 (https://pypi.org/project/azureml-dataprep/4.12.9/) should fix this issue. Thanks

Could it be that the Automated ML backend is still running an older version? My dataset is not that big.

Any solutions or workarounds I could try?

Thanks,
L

Azure Machine Learning

Accepted answer
  1. santoshkc 7,865 Reputation points Microsoft Vendor
    2024-08-27T14:37:27.1633333+00:00

    Hi @Linus Östlund,

    I'm glad to hear that your issue has been resolved, and thank you for sharing the solution, which may help other community members reading this thread. Since the Microsoft Q&A community has a policy that "The question author cannot accept their own answer. They can only accept answers by others", I'm reposting your solution so that you can accept it as the answer.

    Question: Automated ML (interface) - maximum allowed size of parquet file exceeded.

    Solution: I solved it by partitioning the data in Azure Data Factory (ADF). I then use glob patterns to load the multiple folders with my MLTable.
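
For readers who want to try the same approach, here is a minimal sketch of loading several partition folders into one MLTable with glob patterns. It assumes the `mltable` package from SDK v2 is installed. The helper names `build_glob_paths` and `create_mltable`, and the datastore URI and folder names in the usage note, are hypothetical and not taken from the original poster's setup.

```python
def build_glob_paths(base_uri: str, folder_globs: list[str]) -> list[dict]:
    """Build MLTable path entries, one glob pattern per partition folder.

    base_uri and folder_globs are illustrative inputs; adapt them to the
    folder layout that the partitioning step actually produces.
    """
    base = base_uri.rstrip("/")
    return [{"pattern": f"{base}/{glob}/*.parquet"} for glob in folder_globs]


def create_mltable(paths: list[dict], save_dir: str):
    """Create an MLTable from parquet glob patterns and save its definition."""
    # Imported here so build_glob_paths can be used without mltable installed.
    import mltable

    table = mltable.from_parquet_files(paths=paths)
    table.save(save_dir)  # writes the MLTable definition file into save_dir
    return table
```

A call might look like `create_mltable(build_glob_paths("azureml://datastores/<datastore>/paths/<folder>", ["year=2023", "year=2024/month=*"]), "./my_mltable")`, with the placeholders replaced by real values. Whether the preview succeeds still depends on the row groups in each partitioned file staying under the limit quoted in the error message.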

    This will help other users who may have a similar query find the solution more easily. If you have any further questions or concerns, please don't hesitate to ask. We're always here to help.


    Please click "Accept Answer" and select "Yes" for "Was this answer helpful".

    1 person found this answer helpful.

1 additional answer

  1. santoshkc 7,865 Reputation points Microsoft Vendor
    2024-08-21T10:29:24.3133333+00:00

    Hi @Linus Östlund,

    Thank you for reaching out to Microsoft Q&A forum!

    The error you're experiencing is due to the row group size of your Parquet file exceeding the allowed limit when using Automated ML in Azure Machine Learning. This can often be resolved by updating the azureml-dataprep package to the latest version. Alternatively, you can preprocess your Parquet file to split it into smaller row groups.
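
As a rough illustration of the second option, the sketch below rewrites a parquet file with smaller row groups using `pyarrow`. It is an assumption-laden example, not the method confirmed in this thread: the helper names are hypothetical, the 10 MiB default target is simply about half of the 20.97 MB limit in the error message, and it is not stated here whether the service measures compressed or uncompressed row group size (`total_byte_size` in pyarrow is the uncompressed size).

```python
def rows_per_group(total_rows: int, total_bytes: int, max_group_bytes: int) -> int:
    """Estimate how many rows fit in a row group of at most max_group_bytes.

    Based on the average row size, so it is only a heuristic: unusually
    large rows can still push a single group over the target.
    """
    if total_rows <= 0 or total_bytes <= 0:
        raise ValueError("total_rows and total_bytes must be positive")
    return max(1, max_group_bytes * total_rows // total_bytes)


def rewrite_with_smaller_row_groups(
    src: str, dst: str, max_group_bytes: int = 10 * 1024 * 1024
) -> int:
    """Rewrite the parquet file at src to dst; return the new row group count."""
    # Imported here so rows_per_group can be used without pyarrow installed.
    import pyarrow.parquet as pq

    meta = pq.ParquetFile(src).metadata
    total_bytes = sum(
        meta.row_group(i).total_byte_size for i in range(meta.num_row_groups)
    )
    rows = rows_per_group(meta.num_rows, total_bytes, max_group_bytes)

    # Loads the whole file into memory, which is acceptable for modest files.
    table = pq.read_table(src)
    pq.write_table(table, dst, row_group_size=rows)  # row_group_size is in rows
    return pq.ParquetFile(dst).metadata.num_row_groups
```

Because `row_group_size` is expressed in rows rather than bytes, it is worth inspecting the rewritten file's metadata afterwards to confirm that each group is actually below the limit.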

    I hope this helps! Thank you.
