Copy from CosmosDB to CosmosDB Error: "Request size is too large" using Data Factory

Keith Miller 1 Reputation point
2020-08-18T05:18:59.823+00:00

I am using the basic Copy wizard in ADF v2. I have a source Cosmos DB in one subscription and am moving the data to a new Cosmos DB in a new subscription. The database and containers are configured the same.
One container copies 212 of 216 documents and then fails with the "Request size is too large" error.

I have set Write batch size = 1 and parallel copies = 1, and just about every other setting I've seen recommended. I am thinking this has to be an issue with the copy task.
I can copy this same container using the DTUI.exe tool, but I would prefer to use ADF so that I can set a schedule to copy any new documents to the new container as a pre-migration step while we move to the new subscription.

Any ideas?
"errorCode": "2200",
"message": "ErrorCode=UserErrorDocumentDBWriteError,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=Documents failed to import

Tags: Azure Cosmos DB, Azure Data Factory

1 answer

  1. HarithaMaddi-MSFT 10,131 Reputation points
    2020-08-18T10:38:04.27+00:00

    Hi @Keith Miller ,

    Welcome to the Microsoft Q&A platform, and thanks for the query.

    As per the Microsoft documentation, Cosmos DB limits the size of a single request to 2 MB. The formula is: Request Size = Single Document Size * Write Batch Size. Note that with Write batch size = 1, the request size equals the size of one document, so any single document larger than 2 MB will still fail.

    If a document exceeds this limit, the suggested workaround is to split it into multiple smaller files with a data flow, and then use a ForEach activity with a Copy activity inside it to write each file into Cosmos DB, as shown below.

    Adding the JSONs as requested:

    Pipeline JSON - pl_splitData

    {  
        "name": "pl_splitData",  
        "properties": {  
            "activities": [  
                {  
                    "name": "splitData",  
                    "type": "ExecuteDataFlow",  
                    "dependsOn": [],  
                    "policy": {  
                        "timeout": "7.00:00:00",  
                        "retry": 0,  
                        "retryIntervalInSeconds": 30,  
                        "secureOutput": false,  
                        "secureInput": false  
                    },  
                    "userProperties": [],  
                    "typeProperties": {  
                        "compute": {  
                            "coreCount": 8,  
                            "computeType": "General"  
                        }  
                    }  
                },  
                {  
                    "name": "ForEach1",  
                    "type": "ForEach",  
                    "dependsOn": [  
                        {  
                            "activity": "Get Metadata1",  
                            "dependencyConditions": [  
                                "Succeeded"  
                            ]  
                        }  
                    ],  
                    "userProperties": [],  
                    "typeProperties": {  
                        "items": {  
                            "value": "@activity('Get Metadata1').output.childItems",  
                            "type": "Expression"  
                        },  
                        "isSequential": true,  
                        "activities": [  
                            {  
                                "name": "Copy data1",  
                                "type": "Copy",  
                                "dependsOn": [],  
                                "policy": {  
                                    "timeout": "7.00:00:00",  
                                    "retry": 0,  
                                    "retryIntervalInSeconds": 30,  
                                    "secureOutput": false,  
                                    "secureInput": false  
                                },  
                                "userProperties": [],  
                                "typeProperties": {  
                                    "source": {  
                                        "type": "JsonSource",  
                                        "storeSettings": {  
                                            "type": "AzureBlobStorageReadSettings",  
                                            "recursive": true,  
                                            "wildcardFolderPath": "toCosmos",  
                                            "wildcardFileName": {  
                                                "value": "@item().name",  
                                                "type": "Expression"  
                                            },  
                                            "enablePartitionDiscovery": false  
                                        },  
                                        "formatSettings": {  
                                            "type": "JsonReadSettings"  
                                        }  
                                    },  
                                    "sink": {  
                                        "type": "CosmosDbSqlApiSink",  
                                        "writeBehavior": "upsert",  
                                        "disableMetricsCollection": false  
                                    },  
                                    "enableStaging": false  
                                },  
                                "outputs": [  
                                    {  
                                        "referenceName": "CosmosDbSqlApiCollection1",  
                                        "type": "DatasetReference"  
                                    }  
                                ]  
                            }  
                        ]  
                    }  
                },  
                {  
                    "name": "Get Metadata1",  
                    "type": "GetMetadata",  
                    "dependsOn": [  
                        {  
                            "activity": "splitData",  
                            "dependencyConditions": [  
                                "Succeeded"  
                            ]  
                        }  
                    ],  
                    "policy": {  
                        "timeout": "7.00:00:00",  
                        "retry": 0,  
                        "retryIntervalInSeconds": 30,  
                        "secureOutput": false,  
                        "secureInput": false  
                    },  
                    "userProperties": [],  
                    "typeProperties": {  
                        "fieldList": [  
                            "childItems"  
                        ],  
                        "storeSettings": {  
                            "type": "AzureBlobStorageReadSettings",  
                            "recursive": true  
                        },  
                        "formatSettings": {  
                            "type": "JsonReadSettings"  
                        }  
                    }  
                }  
            ],  
            "annotations": []  
        }  
    }  
    

    DataFlow JSON - splitData

    {  
        "name": "splitData",  
        "properties": {  
            "type": "MappingDataFlow",  
            "typeProperties": {  
                "sources": [  
                    {  
                        "dataset": {  
                            "referenceName": "input_json_59MB",  
                            "type": "DatasetReference"  
                        },  
                        "name": "source1"  
                    }  
                ],  
                "sinks": [  
                    {  
                        "dataset": {  
                            "referenceName": "ds_OutputJSON",  
                            "type": "DatasetReference"  
                        },  
                        "name": "sink1"  
                    }  
                ],  
                "transformations": [],  
                "script": "source(output(\n\t\ttype as string,\n\t\tcrs as (type as string, properties as (name as string)),\n\t\tfeatures as (type as string, properties as (pid as string, lat_max as double, lat_min as double, long_max as double, long_min as double), geometry as (type as string, coordinates as double[][][][]))[]\n\t),\n\tallowSchemaDrift: true,\n\tvalidateSchema: false,\n\tsingleDocument: true) ~> source1\nsource1 sink(input(\n\t\ttype as string,\n\t\tcrs as (type as string, properties as (name as string)),\n\t\tfeatures as (type as string, properties as (pid as string, lat_max as double, lat_min as double, long_max as double, long_min as double), geometry as (type as string, coordinates as double[][][][]))[]\n\t),\n\tallowSchemaDrift: true,\n\tvalidateSchema: false,\n\tfilePattern:'file[n].json',\n\ttruncate: true,\n\tpartitionBy('roundRobin', 50),\n\tskipDuplicateMapInputs: true,\n\tskipDuplicateMapOutputs: true) ~> sink1"  
            }  
        }  
    }  
    

    [18289-pipelinesnap.png: screenshot of the pipeline canvas]

    [18275-partitioning.png: screenshot of the sink partitioning settings]

    Hope this helps! Please let us know if the issue persists and we will be glad to assist.