Synapse PySpark: efficiently read large numbers of small files from a data lake

Finn Schmidt 86 Reputation points
2024-04-09T10:32:33.16+00:00

Hello,

I am trying to design an architecture that can handle processing large numbers of small files (the classic "small file problem"). Using the `spark.read.json` method takes quite a while, because it first calls the Storage/Data Lake SDK to list (glob) every file before it even begins reading them.
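For reference, this is roughly what the read looks like today (the storage account, container, path, and schema below are placeholders for my actual lake layout):

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.getOrCreate()

# Hypothetical ADLS Gen2 path; substitute your own account/container/folder.
path = "abfss://raw@mydatalake.dfs.core.windows.net/events/*/*.json"

# Supplying an explicit schema avoids a second pass over the files for
# schema inference, but the initial listing still has to enumerate every
# small object under the path, which is where most of the time goes.
schema = StructType([
    StructField("id", StringType()),
    StructField("timestamp", LongType()),
    StructField("payload", StringType()),
])

df = spark.read.schema(schema).json(path)
```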

In AWS Glue, the `create_dynamic_frame` methods accept grouping options that batch small files together before reading, which makes the read process much more efficient (https://docs.aws.amazon.com/glue/latest/dg/grouping-input-files.html). I was wondering whether something similar or equivalent exists in Synapse.
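For comparison, this is roughly what the Glue version looks like, based on the grouping options described in the linked page (the S3 bucket and prefix are placeholders):

```python
from pyspark.context import SparkContext
from awsglue.context import GlueContext

glue_context = GlueContext(SparkContext.getOrCreate())

# groupFiles/groupSize tell Glue to coalesce many small objects into larger
# read tasks instead of creating one task per file (see the linked AWS docs).
dyf = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={
        "paths": ["s3://my-bucket/events/"],  # placeholder bucket/prefix
        "recurse": True,
        "groupFiles": "inPartition",
        "groupSize": "134217728",             # target roughly 128 MB per group
    },
    format="json",
)
```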

Thank you!

Azure Synapse Analytics