How to distribute the contents of a Spark DataFrame over different Azure storage containers

Question

I want to build a multi-tenant Delta Lake based on a database which contains the data of all tenants. I am using Azure Synapse Pipelines and Spark Notebooks. In the database there is one table which contains all tenants, plus several other tables which are linked to the tenant table. I want to store the data of each tenant in a different storage container.

For example, I have a table with organisation units. Each organisation unit belongs to a tenant, and a tenant can have multiple organisation units.

OrgUnitName   TenantId
Dept 1        1
Dept 1.1      1
Dept 2        1
Sales         2
Finance       2

I want to end up with one storage container per tenant in my Azure storage account. In each container, I want to create a Delta table for each table in the source database. My first idea was to build a ForEach loop in Pipelines over the tenants and, for each tenant, call a Spark notebook that loads each individual table. My impression is that this process is very slow. I would like to use Spark's parallelism instead, but the "partitionBy" functionality of a DataFrame always writes to a single storage container. Any advice?

See above. Everything is explained there :-)

Answer

Hi @MartinJaffer-MSFT,

For now I have solved it with a simple for loop in the Spark notebook script. I loop over every tenant, and for each tenant I can use the normal "partitionBy" operation.
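For anyone landing here later, a rough PySpark sketch of that loop could look like the following. It is only an illustration, not the exact notebook code: the helper names `tenant_table_path` and `write_per_tenant`, the container naming scheme `tenant-<id>`, and the storage account name are all assumptions, and it presumes the notebook already has access to every container.

```python
def tenant_table_path(storage_account, tenant_id, table_name):
    """Build the abfss:// URI of a Delta table inside the tenant's own container.

    The container naming scheme "tenant-<id>" is an assumption for this sketch.
    """
    container = f"tenant-{tenant_id}"
    return f"abfss://{container}@{storage_account}.dfs.core.windows.net/{table_name}"


def write_per_tenant(df, storage_account, table_name, partition_cols=()):
    """Write one Delta table per tenant, each into that tenant's container.

    `df` is a Spark DataFrame that contains a TenantId column.
    """
    # Cache so the source is not re-read once per tenant.
    df.cache()
    # The list of tenants is small, so collecting it on the driver is fine.
    tenant_ids = [row["TenantId"] for row in df.select("TenantId").distinct().collect()]
    for tenant_id in tenant_ids:
        tenant_df = df.filter(df["TenantId"] == tenant_id)
        writer = tenant_df.write.format("delta").mode("overwrite")
        if partition_cols:
            # Within a single container, the normal partitionBy works as usual.
            writer = writer.partitionBy(*partition_cols)
        writer.save(tenant_table_path(storage_account, tenant_id, table_name))
    df.unpersist()
```

Calling something like `write_per_tenant(org_units_df, "mystorageacct", "orgunits")` once per source table keeps everything in a single Spark session, so each tenant's write still runs in parallel across the cluster even though the tenants themselves are processed one after another.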
