Combining Data from Two CosmosDB Containers Using Azure Data Factory

Eliu Prachedes Moraes 0 Reputation points
2023-03-03T22:51:23.2466667+00:00

I am working on a project where I need to combine data from two different CosmosDB containers and save the result in a third container using Azure Data Factory (ADF). I have two containers, one containing a nested array of products, and another with information about the products. The goal is to combine the information about the products from both containers and save the result in a third container.

Container 1 has the following structure:

[
    {
        "id": "123",
        "someField": "someInformation"
        "products": [
            {
                "id": "123456789",
                "value": [
                    "xxx",
                    "yyy"
                ]
            }
        ]
    },
    {...}
]

Container 2 has the following structure:

[
    {
        "id": "123456789",
        "name": "EXAMPLE",
        "company": 222,
        "type": "EXAMPLETYPE"
    },
    {...}
]

The desired output in the third container should have the following structure:

[
    {
        "id": "123",
        "someField": "someInformation"
        "products": [
            {
                "id": "123456789",
                "name": "EXAMPLE",
                "company": 222,
                "type": "EXAMPLETYPE",
                "value": [
                    "xxx",
                    "yyy"
                ]
            }
        ]
    },
    {...}
]

I would like to know how to achieve this using Azure Data Factory. Any guidance or help would be appreciated.

Azure Cosmos DB
Azure Data Factory

2 answers

  1. Amira Bedhiafi 14,806 Reputation points
    2023-03-06T15:57:06.4033333+00:00

    In an ADF mapping data flow, add two Cosmos DB source datasets (one for each container) and configure the connection string and database name for each. Then add a Join transformation and connect the two sources to it.

    Choose the appropriate join type and specify the join key.

    You can use a Derived Column transformation to combine the "value" field from container 1 with the fields from container 2.

    If you need to exclude unwanted fields, you can use a Select transformation.

    Add a Cosmos DB output dataset as the sink and connect the mapping data flow to it in your pipeline.

    You can select the Upsert write behavior to follow the logic of an incremental load.
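    If you ever script the load into the third container yourself instead of using the data flow sink, the same upsert behavior is available in the azure-cosmos Python SDK. A minimal sketch, assuming placeholder account, key, database, and container names:

        # Sketch: upsert a merged document into the third container.
        # The endpoint, key, and database/container names are placeholders.
        from azure.cosmos import CosmosClient

        client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")
        container = client.get_database_client("<database>").get_container_client("<merged-container>")

        merged_doc = {
            "id": "123",
            "someField": "someInformation",
            "products": [
                {"id": "123456789", "name": "EXAMPLE", "company": 222,
                 "type": "EXAMPLETYPE", "value": ["xxx", "yyy"]},
            ],
        }

        # upsert_item inserts the document, or replaces it when the id
        # already exists, which makes repeated incremental loads safe to re-run.
        container.upsert_item(merged_doc)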


  2. Eliu Prachedes Moraes 0 Reputation points
    2023-03-24T01:01:51.6633333+00:00

    Sorry for the late reply. I found a solution to this issue.

    As it turns out, I was not able to use the join operation just by adding two sources in Azure Data Factory. Instead, I first had to apply the Flatten transformation to the nested array in Container 1. Flatten generates one output row per array element, which let me use the extracted product IDs for the join with Container 2.

    Once the join was done, I used the Aggregate transformation to group the rows back into the nested array structure.

    Overall, the solution involved the following steps (a sketch of the equivalent logic follows the list):

    1. Add two datasets as inputs for the Cosmos DB containers.
    2. Use the Flatten operation on the nested array in Container 1 to extract the IDs for the join operation.
    3. Use the join operation to combine data from both containers based on the extracted IDs.
    4. Use the Aggregate operation to group the data back into the nested array structure.
    5. Add a Cosmos DB output dataset to the pipeline and connect it to the mapping data flow.
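    For reference, a minimal Python sketch of the equivalent flatten, join, and aggregate logic on in-memory data (field names are taken from the examples above; this only illustrates the steps, not the ADF transformations themselves):

        # Sketch of the flatten -> join -> aggregate steps on sample data.
        container1 = [
            {
                "id": "123",
                "someField": "someInformation",
                "products": [{"id": "123456789", "value": ["xxx", "yyy"]}],
            }
        ]
        container2 = [
            {"id": "123456789", "name": "EXAMPLE", "company": 222, "type": "EXAMPLETYPE"}
        ]

        # Flatten: one row per product, keeping the parent document's fields.
        flat = [
            {"docId": doc["id"], "someField": doc["someField"], "product": p}
            for doc in container1
            for p in doc["products"]
        ]

        # Join: look up each product's details in Container 2 by product id.
        details = {d["id"]: d for d in container2}
        joined = [
            {**row, "product": {**details.get(row["product"]["id"], {}), **row["product"]}}
            for row in flat
        ]

        # Aggregate: group rows back into the nested array structure.
        merged = {}
        for row in joined:
            doc = merged.setdefault(
                row["docId"],
                {"id": row["docId"], "someField": row["someField"], "products": []},
            )
            doc["products"].append(row["product"])

        result = list(merged.values())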

    Thank you Amira for trying to help me.
