How to distribute the contents of a Spark DataFrame over different Azure storage containers

Dirk Vrancken 21 Reputation points
2023-03-03T07:54:52.65+00:00

I want to build a multi-tenant Delta Lake based on a database which contains the data of all tenants. I am using Azure Synapse Pipelines and Spark Notebooks. In the database there is one table which contains all tenants. Besides that, we have several other tables which are linked to the tenant table. I want to store the data of each tenant in a different storage container. For example, I have a table with organisation units. Each organisation unit belongs to a tenant, and a tenant can have multiple organisation units.

OrgUnitName  TenantId
Dept 1       1
Dept 1.1     1
Dept 2       1
Sales        2
Finance      2

I want to end up with one storage container per tenant in my Azure storage account. In that storage container, I want to create a Delta table for each table in the source database. My first idea was to build a ForEach loop in Pipelines over the different tenants. For each tenant I would call a Spark notebook which would load each individual table. I have the impression that this process is very slow. I would like to use the parallelism of Spark to achieve this, but the "partitionBy" functionality of a DataFrame is always limited to one storage container. Any advice?

See mainly above. Everything is explained there :-)

Azure Data Lake Storage
An Azure service that provides an enterprise-wide hyper-scale repository for big data analytic workloads and is integrated with Azure Blob Storage.

1 answer

  1. Dirk Vrancken 21 Reputation points
    2023-03-24T21:18:09.9966667+00:00

    Hi @MartinJaffer-MSFT,

    This is now solved with a simple for loop in the Spark notebook script. I loop over every tenant, and for each tenant I can use the normal "partitionBy" operation, because each iteration writes to a single storage container.
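
For readers who want to see the shape of that loop, here is a minimal PySpark sketch. It is not the original notebook code: the container naming scheme (`tenant-<id>`), the storage account argument and the helper names `tenant_table_path` and `write_per_tenant` are all assumptions made for illustration. It also assumes that the containers already exist and that the Spark pool can write Delta to them.

```python
def tenant_table_path(storage_account, tenant_id, table_name):
    """Build the ADLS Gen2 URI for one tenant's copy of a table.

    The "tenant-<id>" container naming scheme is hypothetical.
    """
    container = f"tenant-{tenant_id}"
    return f"abfss://{container}@{storage_account}.dfs.core.windows.net/{table_name}"


def write_per_tenant(df, storage_account, table_name,
                     tenant_col="TenantId", partition_cols=()):
    """Write one Delta table per tenant, each into that tenant's container.

    df is a Spark DataFrame holding the rows of all tenants.
    partition_cols is optional; when given, the usual partitionBy is
    applied inside each tenant's table.
    """
    # Cache so the source is not re-read once per tenant.
    df = df.cache()
    try:
        # Collect the distinct tenant ids to the driver (assumed to be a small list).
        tenant_ids = [row[tenant_col]
                      for row in df.select(tenant_col).distinct().collect()]
        for tenant_id in tenant_ids:
            writer = (
                df.filter(df[tenant_col] == tenant_id)
                .write.format("delta")
                .mode("overwrite")
            )
            if partition_cols:
                writer = writer.partitionBy(*partition_cols)
            writer.save(tenant_table_path(storage_account, tenant_id, table_name))
    finally:
        df.unpersist()
```

In a notebook this might be called as `write_per_tenant(org_units_df, "mystorageaccount", "OrgUnit")`, where both the DataFrame name and the account name are placeholders. The loop itself runs sequentially on the driver, but each individual write is still distributed across the executors.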
