ADF CopyActivity for tar file Http Source to Blob Sink
Hi, I am using the ADF C# SDK to try to copy a single 15 GB tar file from an HTTP address to Blob Storage. I am using an HTTP connector for the source, an Azure Blob Storage connector for the sink, and Binary datasets for both. However, when I look at the file in Blob Storage, I see its size repeatedly grow to ~2 MB and then reset to 1015 B. Eventually the size stops increasing and stays at 1015 B, even though the pipeline monitor shows that more bytes are being written.
Have I misconfigured something somewhere? I've attached the configuration for the pipeline/datasets/linked services in the question.
Azure Blob Storage
Azure Data Factory
-
Filmon Belay • 0 Reputation points • Microsoft Employee
2023-04-21T12:49:26.0566667+00:00

### LINKED SERVICES INFO

{
  "name": "sinkLinkedService",
  "type": "Microsoft.DataFactory/factories/linkedservices",
  "properties": {
    "type": "AzureBlobStorage",
    "typeProperties": {
      "connectionString": "DefaultEndpointsProtocol=https;AccountName=rcdatasetstorage;",
      "accountKey": {
        "type": "AzureKeyVaultSecret",
        "store": { "referenceName": "keyVaultLinkedService", "type": "LinkedServiceReference" },
        "secretName": "rcdatasetSecret"
      }
    },
    "annotations": []
  }
}

{
  "name": "sourceLinkedService",
  "type": "Microsoft.DataFactory/factories/linkedservices",
  "properties": {
    "type": "HttpServer",
    "typeProperties": {
      "url": "https://bulkdata.uspto.gov",
      "authenticationType": "Anonymous"
    },
    "annotations": []
  }
}

### DATASET INFO

{
  "name": "sinkDataset",
  "properties": {
    "linkedServiceName": { "referenceName": "sinkLinkedService", "type": "LinkedServiceReference" },
    "annotations": [],
    "type": "AzureBlob",
    "typeProperties": {
      "fileName": "app_pdf_20220106.tar",
      "folderPath": "yameng"
    }
  },
  "type": "Microsoft.DataFactory/factories/datasets"
}

{
  "name": "sourceDataset",
  "properties": {
    "linkedServiceName": { "referenceName": "sourceLinkedService", "type": "LinkedServiceReference" },
    "annotations": [],
    "type": "HttpFile",
    "typeProperties": {
      "relativeUrl": "/data/patent/application/multipagepdf/2022/app_pdf_20220106.tar",
      "requestMethod": "GET"
    }
  },
  "type": "Microsoft.DataFactory/factories/datasets"
}

### COPY ACTIVITY INFO

{
  "name": "CopySinkToSource",
  "type": "Copy",
  "dependsOn": [],
  "policy": { "retry": 0, "retryIntervalInSeconds": 30, "secureOutput": false, "secureInput": false },
  "userProperties": [],
  "typeProperties": {
    "source": {
      "type": "BinarySource",
      "storeSettings": { "type": "HttpReadSettings", "maxConcurrentConnections": 32, "requestMethod": "GET" }
    },
    "sink": {
      "type": "BinarySink",
      "storeSettings": { "type": "AzureBlobStorageWriteSettings" }
    },
    "enableStaging": false,
    "parallelCopies": 4
  },
  "inputs": [ { "referenceName": "sourceDataset", "type": "DatasetReference" } ],
  "outputs": [ { "referenceName": "sinkDataset", "type": "DatasetReference" } ]
}

### PIPELINE INFO

{
  "name": "pipeline",
  "properties": {
    "activities": [
      {
        "name": "CopySinkToSource",
        "type": "Copy",
        "dependsOn": [],
        "policy": { "retry": 0, "retryIntervalInSeconds": 30, "secureOutput": false, "secureInput": false },
        "userProperties": [],
        "typeProperties": {
          "source": {
            "type": "BinarySource",
            "storeSettings": { "type": "HttpReadSettings", "maxConcurrentConnections": 32, "requestMethod": "GET" }
          },
          "sink": {
            "type": "BinarySink",
            "storeSettings": { "type": "AzureBlobStorageWriteSettings" }
          },
          "enableStaging": false,
          "parallelCopies": 4
        },
        "inputs": [ { "referenceName": "sourceDataset", "type": "DatasetReference" } ],
        "outputs": [ { "referenceName": "sinkDataset", "type": "DatasetReference" } ]
      }
    ],
    "annotations": [],
    "lastPublishTime": "2023-04-21T10:50:22Z"
  },
  "type": "Microsoft.DataFactory/factories/pipelines"
}
-
Bhargava-MSFT • 31,261 Reputation points • Microsoft Employee • Moderator
2023-04-24T23:58:49.1133333+00:00 Hello Filmon Belay, Welcome to the MS Q&A platform. I am testing this from my end. I will get back to you with my findings. I appreciate your patience.
-
Bhargava-MSFT • 31,261 Reputation points • Microsoft Employee • Moderator
2023-04-25T19:42:01.33+00:00 Hello Filmon Belay, With the same configuration, I was able to copy the file. It took 20 mins. Please check if there is any network latency causing the issue.
-
Filmon Belay • 0 Reputation points • Microsoft Employee
2023-04-26T15:43:40.8366667+00:00 @Bhargava-MSFT that is significantly faster; is your Blob Storage instance optimized in any way? I see your access tier is 'Hot (inferred)', while mine is 'Cool (inferred)'.
-
Bhargava-MSFT • 31,261 Reputation points • Microsoft Employee • Moderator
2023-04-26T20:20:39.54+00:00 Hello Filmon Belay,
I am not using any optimization techniques. I changed my tier from Hot (inferred) to Cool (inferred), and the copy took almost the same time.
I also see you have set 'Degree of copy parallelism' to 4. Can you leave it on Auto and try again?
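For reference, leaving 'Degree of copy parallelism' on Auto just means omitting parallelCopies from the copy activity's typeProperties. A minimal sketch based on the config you posted (only the typeProperties block is shown; everything else stays the same):

```json
"typeProperties": {
  "source": {
    "type": "BinarySource",
    "storeSettings": { "type": "HttpReadSettings", "maxConcurrentConnections": 32, "requestMethod": "GET" }
  },
  "sink": {
    "type": "BinarySink",
    "storeSettings": { "type": "AzureBlobStorageWriteSettings" }
  },
  "enableStaging": false
}
```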
Btw, my ADF and storage account are both in East US.
-
Filmon Belay • 0 Reputation points • Microsoft Employee
2023-04-26T21:58:09.4966667+00:00 @Bhargava-MSFT thanks for your responses and information about the location of the services.
I ran without setting parallel copies and the time stayed roughly the same. I will retest with both services in the same region; for me, the ADF factory is in South Central US and the Blob Storage account is in West US 2.
-
Bhargava-MSFT • 31,261 Reputation points • Microsoft Employee • Moderator
2023-04-27T16:15:57.7666667+00:00 Hello Filmon Belay,
Thank you, and please let me know how it goes with the same region.
-
Filmon Belay • 0 Reputation points • Microsoft Employee
2023-04-28T19:52:10.3133333+00:00 @Bhargava-MSFT I have a new Blob Storage account set up, but I am running into an error saying the connection string is incorrect. I am using the same format as before, which worked for connecting to the original Blob Storage account.
Are you familiar with, or have you ever seen, the error below?
Error code 9003
Details: Invalid storage connection string provided to 'UnknownLocation'. Check the storage connection string in configuration. No valid combination of account information found.
-
Bhargava-MSFT • 31,261 Reputation points • Microsoft Employee • Moderator
2023-04-28T20:43:55.81+00:00 Hello Filmon Belay,
Are you seeing the error when testing the Blob Storage linked service connection, or is your pipeline failing with this error?
Error code 9003 indicates that the provided storage connection string is invalid.
If you are using the account key, please check that the account name and account key are correct.
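For reference, if you put the full connection string in the linked service (account key inline rather than in Key Vault), it would look roughly like the sketch below; <storage-account-name> and <storage-account-key> are placeholders:

```json
{
  "name": "sinkLinkedService",
  "properties": {
    "type": "AzureBlobStorage",
    "typeProperties": {
      "connectionString": "DefaultEndpointsProtocol=https;AccountName=<storage-account-name>;AccountKey=<storage-account-key>;EndpointSuffix=core.windows.net"
    }
  }
}
```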
-
Filmon Belay • 0 Reputation points • Microsoft Employee
2023-04-28T20:59:01.4733333+00:00 Seeing it with the Blob Storage Linked Service.
One difference is that I have my account key stored in Key Vault and read the secret value using a Key Vault linked service. This is how I was connecting to the other Blob Storage instance, and it worked well.
My connection string is
"DefaultEndpointsProtocol=https;AccountName=myaccountname;"
-
Bhargava-MSFT • 31,261 Reputation points • Microsoft Employee • Moderator
2023-05-01T19:18:04.26+00:00 Hello Filmon Belay,
I see you manually entered the secret name by clicking the edit button. If you have configured it correctly, you should be able to see the secret name in the drop-down(as shown n my screenshot)
Per your error message, here are the possible causes to check:
- Verify that the connection string is correct
- Verify the storage account access key
- Verify the storage account name
- Check the permissions
-
Filmon Belay • 0 Reputation points • Microsoft Employee
2023-05-02T15:48:00.6033333+00:00 Hey @Bhargava-MSFT, thanks for your follow-up. Yes, I am able to see the secret in the drop-down; it has been there since my first attempt. I will get RBAC roles granted on the new Blob Storage account today and will be sure to confirm the connection string is correct.
I will post the resolution here, along with the speed test results for Blob Storage in the same region.
-
Bhargava-MSFT • 31,261 Reputation points • Microsoft Employee • Moderator
2023-05-02T20:45:16.2033333+00:00 Thank you, Filmon Belay
-
Bhargava-MSFT • 31,261 Reputation points • Microsoft Employee • Moderator
2023-05-03T22:54:44.31+00:00 Hello Filmon Belay,
I am checking to see if you have any further questions here.
-
Filmon Belay • 0 Reputation points • Microsoft Employee
2023-05-04T20:40:02.4+00:00 Hi @Bhargava-MSFT, thanks for the help on this topic. I was able to resolve the connection string issue by updating the secret that was stored in Key Vault.
I also ran the throughput test with a Blob Storage account in the same region (South Central US) as the Data Factory instance, but throughput only increased from 1.5 MB/s to 2.7 MB/s while downloading and unzipping the USPTO dataset. Without unzipping, the speed only increases to roughly 2.0 MB/s.
-
Bhargava-MSFT • 31,261 Reputation points • Microsoft Employee • Moderator
2023-05-04T23:35:30.7033333+00:00 Hello Filmon Belay,
Thank you for the details. It seems the throughput increase is not significant even with both services in the same region.
As the next step, I would suggest filing a support request so that a support engineer can look deeper into the backend logs and troubleshoot the issue further. If you need any help with the support request, please let me know.