How can I optimize the upsert method in a Synapse Copy activity when transferring data from ADLS (using a SQL query) to MongoDB?

Maverick
2025-04-24T12:57:55.43+00:00

I’ve built a pipeline that copies data to MongoDB using an on-demand SQL query with upsert functionality. The MongoDB collection currently holds around 300 million records. My goal is to update approximately 50,000 records daily and insert around 700,000 new ones.

When testing with sample data (smaller in volume), the pipeline performs as expected. However, when running it with the full dataset, I notice that 2–3% of the data becomes corrupted. While the pipeline reliably inserts new records each day, it doesn’t consistently update existing records as intended. In fact, it appears that some existing records are being overwritten or incorrectly updated, even when there’s no actual match between the new and existing entries. This suggests that the upsert logic might be misidentifying records during the update process.

I've already defined a unique identifier in the dataset using the _id field. However, in the Synapse UI, I'm unable to explicitly assign this key for use in the upsert operation.

  • Would it be a valid approach to include something in the JSON payload—such as "idFieldName": "_id"—to help MongoDB correctly identify records for update?
  • How can I ensure that the _id field is used correctly in Synapse to prevent unwanted overwrites during upserts?
  • Also, why does the Synapse Copy activity struggle with upserts at this scale, even though it performs well with smaller datasets?
Azure Synapse Analytics

1 answer

  1. Venkat Reddy Navari · Microsoft External Staff Moderator
    2025-04-24T15:58:02.54+00:00

    Hi @Maverick

    The issue lies in how the Synapse Copy activity handles upserts into MongoDB at scale. Here’s how you can address it effectively:

    Upsert Key Handling:

    • Synapse doesn't let you explicitly assign the upsert key in the UI, and sink settings such as "idFieldName": "_id" aren’t supported; the MongoDB sink keys its upserts on _id implicitly. To get correct upsert behavior, make sure _id is returned by your source query and is explicitly mapped to _id in the sink schema. If _id is missing or unmapped, MongoDB assigns a new ObjectId and the row lands as a fresh insert instead of an update (see the sketch after this bullet).
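
    For reference, the per-document semantics of an _id-keyed upsert look roughly like the pymongo sketch below. This is an illustration of the MongoDB write behavior the sink relies on, not the connector's internal code; the connection string, database, and collection names are placeholders.

    ```python
    from pymongo import MongoClient

    # Placeholder connection string and namespace; substitute your own.
    client = MongoClient("mongodb://localhost:27017")
    coll = client["mydb"]["mycollection"]

    def upsert_document(doc: dict) -> None:
        """Replace the document whose _id matches doc['_id'], or insert it if absent.

        The match is purely on _id, so a missing or wrong _id silently turns an
        intended update into a brand-new insert.
        """
        coll.replace_one({"_id": doc["_id"]}, doc, upsert=True)

    upsert_document({"_id": "cust-0001", "name": "Alice", "status": "active"})
    ```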

    Preventing Incorrect Updates: Incorrect updates or apparent data corruption can occur when:

    • _id is missing from, duplicated in, or incorrectly mapped from the source query output
    • Synapse mishandles nested fields during the copy, so the replacement document no longer matches the existing one

    Before each run, validate the source extract: confirm that every row carries an _id, that the values are unique, and that the column maps one-to-one to the sink's _id field. A quick pre-flight check like the sketch below can catch missing or duplicated keys before the copy starts.
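
    A minimal sketch of such a pre-flight check, assuming the daily extract has been exported as newline-delimited JSON (the file path is a placeholder):

    ```python
    import json
    from collections import Counter

    # Hypothetical path to the daily extract in newline-delimited JSON form.
    EXTRACT_PATH = "daily_extract.jsonl"

    def check_ids(path: str) -> None:
        """Report rows missing _id and any _id values that appear more than once."""
        counts = Counter()
        missing = 0
        with open(path, encoding="utf-8") as f:
            for line in f:
                row = json.loads(line)
                _id = row.get("_id")
                if _id is None:
                    missing += 1
                else:
                    counts[_id] += 1
        duplicates = {k: v for k, v in counts.items() if v > 1}
        print(f"rows missing _id: {missing}")
        print(f"distinct duplicated _id values: {len(duplicates)}")

    if __name__ == "__main__":
        check_ids(EXTRACT_PATH)
    ```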

    Performance at Scale: The pipeline behaves well on smaller datasets, but at higher volumes throughput and reliability can degrade because of:

    • Lack of parallelism or partitioning in the copy
    • MongoDB throttling or write-concern pressure under a heavy write load

    Enable parallel copy on the Copy activity and partition the source query (for example, by date or by key range) so each partition writes a bounded batch. This improves both performance and stability.

    Alternative Approach for Large Datasets: If issues persist, consider handling the upsert yourself in a Custom Activity (e.g., an Azure Function or a Databricks job) that uses the MongoDB driver directly. This gives you full control over how _id is matched and how documents are replaced, batched, and retried; a minimal driver-based sketch follows.
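
    A minimal sketch of that driver-based approach, assuming the daily records arrive as an iterable of dicts that already carry their _id (connection string, database, and collection names are placeholders):

    ```python
    from itertools import islice
    from pymongo import MongoClient, ReplaceOne

    # Placeholder connection details; substitute your own.
    client = MongoClient("mongodb://localhost:27017")
    coll = client["mydb"]["mycollection"]

    BATCH_SIZE = 1000  # tune to the cluster's write capacity

    def bulk_upsert(records) -> None:
        """Upsert records keyed on _id, in bounded, unordered batches."""
        it = iter(records)
        while True:
            batch = list(islice(it, BATCH_SIZE))
            if not batch:
                break
            ops = [ReplaceOne({"_id": r["_id"]}, r, upsert=True) for r in batch]
            # ordered=False lets the server process the batch in parallel and
            # keeps a single bad document from aborting the rest of the batch.
            result = coll.bulk_write(ops, ordered=False)
            print(f"matched={result.matched_count} upserted={len(result.upserted_ids)}")
    ```

    Batching keeps memory bounded on 700K+ daily rows, and running a few partitions of the extract through this function in parallel (e.g., one per date or key range) recovers the throughput the Copy activity gets from parallel copies.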

    Hope this helps. Do let us know if you have any further queries.

    If this answers your query, please click Accept Answer and Yes for "Was this answer helpful?".

