Hi @Maverick
The issue lies in how the Synapse Copy activity handles upserts into MongoDB at scale. Here’s how you can address it effectively:
Upsert Key Handling:
- Synapse currently doesn't allow explicit assignment of the _id field as the upsert key in the UI. Settings like "idFieldName": "_id" aren't supported.
- To ensure correct upsert behavior, make sure the _id field is included in your source query and correctly mapped in the sink schema.
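For reference, a MongoDB upsert is keyed on _id, which is why the field has to survive both the source query and the sink mapping. Below is a minimal pymongo sketch of the behavior the Copy activity relies on; the connection string, database, and collection names are placeholders.

```python
from pymongo import MongoClient

# Placeholder connection string and names; adjust to your environment.
client = MongoClient("mongodb://localhost:27017")
collection = client["salesdb"]["orders"]

doc = {"_id": "ORD-1001", "status": "shipped", "total": 129.50}

# An upsert keyed on _id: replaces the document if it exists, inserts it otherwise.
# If _id were missing or mis-mapped, this would insert a new document instead of
# updating the intended one, which is how silent duplicates creep in.
collection.replace_one({"_id": doc["_id"]}, doc, upsert=True)
```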
Preventing Incorrect Updates: Incorrect updates or data corruption can occur if:
- _id is missing, duplicated, or incorrectly mapped
- Synapse mishandles nested fields during transformation
Validate your source data and confirm that _id is unique and correctly mapped before the copy runs.
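A quick pre-flight check can catch both problems before the copy runs. Here is a sketch that assumes the staged source rows are available as a list of dicts (for example from a sample export); the row contents are illustrative.

```python
from collections import Counter

def validate_ids(rows):
    """Return row indexes missing an _id and _id values that occur more than once."""
    missing = [i for i, row in enumerate(rows) if not row.get("_id")]
    counts = Counter(row["_id"] for row in rows if row.get("_id"))
    duplicates = {key: n for key, n in counts.items() if n > 1}
    return missing, duplicates

# Hypothetical rows pulled from the source query.
rows = [
    {"_id": "ORD-1001", "status": "shipped"},
    {"_id": "ORD-1002", "status": "pending"},
    {"_id": "ORD-1001", "status": "cancelled"},  # duplicate key
    {"status": "unknown"},                       # missing key
]

missing, duplicates = validate_ids(rows)
print("Row indexes missing _id:", missing)  # -> [3]
print("Duplicate _id values:", duplicates)  # -> {'ORD-1001': 2}
```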
Performance at Scale: While the pipeline works well on smaller datasets, performance issues can appear at higher volumes due to:
- Lack of parallelism or partitioning
- MongoDB throttling or write concern issues
Use parallel copy settings and partition your source query to improve performance and stability.
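One way to partition is to split the extract into disjoint slices by a stable hash of the key and let a ForEach activity drive one Copy per slice. The sketch below assumes a SQL source (CHECKSUM is T-SQL; swap the hash expression for your source dialect) and a placeholder source query.

```python
BASE_QUERY = "SELECT _id, status, total FROM dbo.Orders"  # placeholder source query
PARTITIONS = 4  # tune to source throughput and MongoDB write capacity

def partitioned_queries(base_query: str, partitions: int) -> list[str]:
    """Build disjoint partition queries keyed on a stable hash of _id."""
    return [
        f"{base_query} WHERE ABS(CHECKSUM(_id)) % {partitions} = {i}"
        for i in range(partitions)
    ]

# Each query can be handed to a separate Copy activity run in parallel.
for query in partitioned_queries(BASE_QUERY, PARTITIONS):
    print(query)
```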
Alternative Approach for Large Datasets: If issues persist, consider using a Custom Activity (e.g., Azure Function or Databricks) to handle the upsert logic with MongoDB drivers. This provides full control over _id handling and update logic.
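As a sketch of that route (pymongo here; the connection string, database/collection names, and fetch_source_rows are placeholders for however your function or notebook reads the source), a batched bulk upsert that keys explicitly on _id looks like this:

```python
from pymongo import MongoClient, ReplaceOne

BATCH_SIZE = 1000  # tune to document size and MongoDB write capacity

def upsert_batch(collection, docs):
    """Upsert one batch of documents, keyed explicitly on _id."""
    requests = [ReplaceOne({"_id": d["_id"]}, d, upsert=True) for d in docs]
    # ordered=False lets MongoDB keep applying the remaining writes if one
    # fails, and generally improves throughput for large batches.
    result = collection.bulk_write(requests, ordered=False)
    return result.upserted_count, result.modified_count

def run(fetch_source_rows):
    client = MongoClient("mongodb://localhost:27017")  # placeholder connection string
    collection = client["salesdb"]["orders"]           # placeholder database/collection

    batch = []
    for doc in fetch_source_rows():  # placeholder: yields dicts that include _id
        batch.append(doc)
        if len(batch) >= BATCH_SIZE:
            upsert_batch(collection, batch)
            batch = []
    if batch:
        upsert_batch(collection, batch)
```

This keeps full control over the upsert filter and error handling, and can be wrapped in an Azure Function or a Databricks notebook triggered from the pipeline.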
Hope this helps. Do let us know if you have any further queries.
If this answers your query, please click Accept Answer and Yes for "Was this answer helpful".