Synapse PySpark: efficiently read large numbers of small files from a data lake

Finn Schmidt 86 Reputation points
2024-04-09T10:32:33.16+00:00

Hello,

I am trying to design an architecture that can handle processing large numbers of small files (the classic "small file problem"). Using the `spark.read.json` method takes quite a while, because it first calls the Storage/Data Lake SDK to list (glob) every file before it even begins reading them.
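For reference, this is roughly what the read looks like today (the storage account, container, path, and schema below are placeholders for my actual lake layout):

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.getOrCreate()

# Hypothetical ADLS Gen2 path; substitute your own account/container/folder.
path = "abfss://raw@mydatalake.dfs.core.windows.net/events/*/*.json"

# Supplying an explicit schema avoids a second pass over the files for
# schema inference, but the initial listing still has to enumerate every
# small object under the path, which is where most of the time goes.
schema = StructType([
    StructField("id", StringType()),
    StructField("timestamp", LongType()),
    StructField("payload", StringType()),
])

df = spark.read.schema(schema).json(path)
```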

In AWS Glue, the `create_dynamic_frame` methods accept grouping options that batch small files together before reading, which makes the read process much more efficient (https://docs.aws.amazon.com/glue/latest/dg/grouping-input-files.html). I was wondering whether something similar or equivalent exists in Synapse.
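For comparison, this is roughly what the Glue version looks like, based on the grouping options described in the linked page (the S3 bucket and prefix are placeholders):

```python
from pyspark.context import SparkContext
from awsglue.context import GlueContext

glue_context = GlueContext(SparkContext.getOrCreate())

# groupFiles/groupSize tell Glue to coalesce many small objects into larger
# read tasks instead of creating one task per file (see the linked AWS docs).
dyf = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={
        "paths": ["s3://my-bucket/events/"],  # placeholder bucket/prefix
        "recurse": True,
        "groupFiles": "inPartition",
        "groupSize": "134217728",             # target roughly 128 MB per group
    },
    format="json",
)
```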

Thank you!

Azure Synapse Analytics