Synapse PySpark: efficiently reading large numbers of small files from a data lake
Finn Schmidt
Hello,
I am trying to design an architecture that can handle processing large numbers of small files (the "small file problem"). Using the spark.read.json method takes quite a while, because it first calls the Storage/Datalake SDK glob method to list all the files before it even begins reading them.
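For context, this is roughly what the read looks like today (the storage account, container, and path below are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Reading many small JSON files directly: before any data is read, Spark
# has to list every file under the path, which is where most of the time goes.
df = spark.read.json(
    "abfss://raw@mystorageaccount.dfs.core.windows.net/events/"  # placeholder path
)
print(df.count())
```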
In AWS Glue there is a `create_dynamic_frame` method that groups files beforehand and allows for a more efficient read process (https://docs.aws.amazon.com/glue/latest/dg/grouping-input-files.html). I was wondering if something similar or equivalent exists in Synapse.
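For reference, the Glue pattern I mean looks roughly like this (bucket, path, and option values are illustrative, based on the linked grouping docs):

```python
from pyspark.context import SparkContext
from awsglue.context import GlueContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Glue can coalesce many small files into larger read groups via
# connection_options, instead of planning one task per tiny file.
dyf = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={
        "paths": ["s3://my-bucket/events/"],   # placeholder bucket/path
        "recurse": True,
        "groupFiles": "inPartition",
        "groupSize": "1048576",                # target group size in bytes (~1 MB)
    },
    format="json",
)
```

The appeal is that listing and task planning happen per group of files rather than per individual file.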
Thank you!
Azure Synapse Analytics