Copy Activity from Cosmos to Cosmos not doing what I expect

ralphjjr 106 Reputation points
2022-01-06T21:58:32.337+00:00

So here is an interesting one...

I have two containers in Cosmos SQL: Events and Customers.
Events are things that happen to customer records and Customers is unique per customer based on a customer id.
Events have a customer id and an event id.

I can run one big upsert into Customers from the Customers API. But I can't fetch updates from the Customers API endpoint, I need use the Events api to get those updates.
As you can imagine, there could be multiple events for a single customer. one event when its created, and then another each time any record is modified.

My update pipeline looks like this:

a. read events from the original events api and copy into Events container. This container has a TTL of 1801 seconds.
b. then I run 2 separate copy activities:

  1. Source: Read from the events cosmos container using this sql: select * from c WHERE c.event_type='customer.created' order by c.event_created asc - Sink into the Customers cosmos container with upsert and I properly map the id so that the correct customer record gets updated.
  2. Source: Read from the events cosmos container using this sql: select * from c WHERE c.event_type='customer.updated' order by c.event_created asc - Sink into the Customers cosmos container with upsert and I properly map the id so that the correct customer record gets updated.

This all works great - until there are multiple customer.updated records for the same customer_id. For some reason, the updates don't actually happen in any particular order. I have one use case where there are 3 different events for the same customer id in a given period. When I run the 2 copy for just that customer id (adding it to the where clause) I properly get only the most recent record. But as soon as more records are involved (meaning that the query looks exactly as you see above) it only ever upsets the first record, or some other record, that is never actually the most recent.

I tried setting the degree of copy parallelism to 1. But that did not help. Am I hitting some limitation in terms of order of updates/upserts? I'm trying to force the engine to upsert based on that event_created date in ascending order. I assume the lowest date would get upserted first, and then the subsequent ones, in ascending order. Is that an incorrect assumption?

Azure Synapse Analytics
Azure Synapse Analytics
An Azure analytics service that brings together data integration, enterprise data warehousing, and big data analytics. Previously known as Azure SQL Data Warehouse.
4,369 questions
Azure Cosmos DB
Azure Cosmos DB
An Azure NoSQL database service for app development.
1,442 questions
{count} votes