Upload large amounts of random data in parallel to Azure storage
This tutorial is part two of a series. It shows you how to deploy an application that uploads large amounts of random data to an Azure storage account.
In part two of the series, you learn how to:
- Configure the connection string
- Build the application
- Run the application
- Validate the number of connections
Microsoft Azure Blob Storage provides a scalable service for storing your data. To make your application as performant as possible, you should understand how Blob storage works and the limits that apply to Azure blobs. To learn more about these limits, see Scalability and performance targets for Blob storage.
Partition naming is another potentially important factor when designing a high-performance application that uses blobs. For block sizes greater than or equal to 4 MiB, high-throughput block blobs are used, and partition naming doesn't impact performance. For block sizes less than 4 MiB, Azure storage uses a range-based partitioning scheme to scale and load balance, which means files with similar naming conventions or prefixes go to the same partition. This logic includes the name of the container that the files are uploaded to. In this tutorial, you use files that have GUIDs for names and randomly generated content, uploaded to five different containers with random names.
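The following is a minimal sketch, not part of the sample application, of how a GUID-named file with random content can be generated for a staging directory. The file name, size, and path are illustrative assumptions:

using System;
using System.IO;

// Generate a GUID-named file filled with random bytes.
// GUID names share no common prefix, so they spread across partitions.
string uploadPath = Path.Combine(Directory.GetCurrentDirectory(), "upload");
Directory.CreateDirectory(uploadPath);

string fileName = Guid.NewGuid().ToString() + ".pdf";
byte[] content = new byte[1024 * 1024]; // 1 MiB of random content (illustrative)
new Random().NextBytes(content);

File.WriteAllBytes(Path.Combine(uploadPath, fileName), content);
Console.WriteLine($"Created staging file: {fileName}");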
Prerequisites
To complete this tutorial, you must have completed the previous Storage tutorial: Create a virtual machine and storage account for a scalable application.
Remote into your virtual machine
Use the following command on your local machine to create a remote desktop session with the virtual machine. Replace the IP address with the publicIPAddress of your virtual machine. When prompted, enter the credentials you used when creating the virtual machine.
mstsc /v:<publicIpAddress>
Configure the connection string
In the Azure portal, navigate to your storage account. Select Access keys under Settings in your storage account. Copy the connection string from the primary or secondary key. Log in to the virtual machine you created in the previous tutorial. Open a Command Prompt as an administrator and run the setx command with the /m switch, which saves the setting as a machine-level environment variable. The environment variable is not available until you reload the Command Prompt. Replace <storageConnectionString> in the following sample:
setx storageconnectionstring "<storageConnectionString>" /m
Important
This code example uses a connection string to authorize access to your storage account. This configuration is for example purposes. Connection strings and account access keys should be used with caution in application code. If your account access key is lost or accidentally placed in an insecure location, your service may become vulnerable. Anyone who has the access key is able to authorize requests against the storage account, and effectively has access to all the data.
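At run time, the application can read the connection string back from the environment. A minimal sketch of that pattern, assuming the variable name set earlier (the sample application's actual code may differ):

using System;

// Read the connection string saved with setx /m. A machine-level
// variable is only visible to processes started after it was set.
string storageConnectionString =
    Environment.GetEnvironmentVariable("storageconnectionstring");

if (string.IsNullOrEmpty(storageConnectionString))
{
    throw new InvalidOperationException(
        "The storageconnectionstring environment variable is not set.");
}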
When finished, open another Command Prompt, navigate to D:\git\storage-dotnet-perf-scale-app, and type dotnet build to build the application.
Run the application
Navigate to D:\git\storage-dotnet-perf-scale-app and type dotnet run to run the application. The first time you run dotnet, it populates your local package cache to improve restore speed and enable offline access. This operation takes up to a minute to complete and only happens once.
dotnet run
The application creates five randomly named containers and begins uploading the files in the staging directory to the storage account.
The UploadFilesAsync method is shown in the following example:
private static async Task UploadFilesAsync()
{
    // Create five randomly named containers to store the uploaded files.
    BlobContainerClient[] containers = await GetRandomContainersAsync();

    // Path to the directory to upload
    string uploadPath = Directory.GetCurrentDirectory() + "\\upload";

    // Start a timer to measure how long it takes to upload all the files.
    Stopwatch timer = Stopwatch.StartNew();

    try
    {
        Console.WriteLine($"Iterating in directory: {uploadPath}");
        int count = 0;

        Console.WriteLine($"Found {Directory.GetFiles(uploadPath).Length} file(s)");

        // Specify the StorageTransferOptions
        BlobUploadOptions options = new BlobUploadOptions
        {
            TransferOptions = new StorageTransferOptions
            {
                // Set the maximum number of workers that
                // may be used in a parallel transfer.
                MaximumConcurrency = 8,

                // Set the maximum length of a transfer to 50MB.
                MaximumTransferSize = 50 * 1024 * 1024
            }
        };

        // Create a queue of tasks that will each upload one file.
        var tasks = new Queue<Task<Response<BlobContentInfo>>>();

        // Iterate through the files
        foreach (string filePath in Directory.GetFiles(uploadPath))
        {
            BlobContainerClient container = containers[count % 5];
            string fileName = Path.GetFileName(filePath);
            Console.WriteLine($"Uploading {fileName} to container {container.Name}");
            BlobClient blob = container.GetBlobClient(fileName);

            // Add the upload task to the queue
            tasks.Enqueue(blob.UploadAsync(filePath, options));
            count++;
        }

        // Run all the tasks asynchronously.
        await Task.WhenAll(tasks);

        timer.Stop();
        Console.WriteLine($"Uploaded {count} files in {timer.Elapsed.TotalSeconds} seconds");
    }
    catch (RequestFailedException ex)
    {
        Console.WriteLine($"Azure request failed: {ex.Message}");
    }
    catch (DirectoryNotFoundException ex)
    {
        Console.WriteLine($"Error parsing files in the directory: {ex.Message}");
    }
    catch (Exception ex)
    {
        Console.WriteLine($"Exception: {ex.Message}");
    }
}
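The GetRandomContainersAsync helper isn't shown in this article. A minimal sketch of what it might look like, assuming it uses the connection string environment variable configured earlier:

private static async Task<BlobContainerClient[]> GetRandomContainersAsync()
{
    // Connect to the storage account with the connection string
    // saved in the environment earlier in this tutorial.
    string connectionString =
        Environment.GetEnvironmentVariable("storageconnectionstring");
    BlobServiceClient serviceClient = new BlobServiceClient(connectionString);

    // Create five containers with random (GUID) names.
    BlobContainerClient[] containers = new BlobContainerClient[5];
    for (int i = 0; i < containers.Length; i++)
    {
        containers[i] = await serviceClient.CreateBlobContainerAsync(Guid.NewGuid().ToString());
        Console.WriteLine($"Created container {containers[i].Name}");
    }

    return containers;
}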
The following example shows truncated output from the application running on a Windows system.
Created container 2dbb45f4-099e-49eb-880c-5b02ebac135e
Created container 0d784365-3bdf-4ef2-b2b2-c17b6480792b
Created container 42ac67f2-a316-49c9-8fdb-860fb32845d7
Created container f0357772-cb04-45c3-b6ad-ff9b7a5ee467
Created container 92480da9-f695-4a42-abe8-fb35e71eb887
Iterating in directory: C:\git\myapp\upload
Found 5 file(s)
Uploading 1d596d16-f6de-4c4c-8058-50ebd8141e4d.pdf to container 2dbb45f4-099e-49eb-880c-5b02ebac135e
Uploading 242ff392-78be-41fb-b9d4-aee8152a6279.pdf to container 0d784365-3bdf-4ef2-b2b2-c17b6480792b
Uploading 38d4d7e2-acb4-4efc-ba39-f9611d0d55ef.pdf to container 42ac67f2-a316-49c9-8fdb-860fb32845d7
Uploading 45930d63-b0d0-425f-a766-cda27ff00d32.pdf to container f0357772-cb04-45c3-b6ad-ff9b7a5ee467
Uploading 5129b385-5781-43be-8bac-e2fbb7d2bd82.pdf to container 92480da9-f695-4a42-abe8-fb35e71eb887
Uploaded 5 files in 16.9552163 seconds
Validate the connections
While the files are being uploaded, you can verify the number of concurrent connections to your storage account. Open a console window and type netstat -a | find /c "blob:https". This command shows the number of connections that are currently open. As you can see from the following example, 800 connections were open while the random files were being uploaded to the storage account. This value changes throughout the upload, since it depends on how many uploads are in flight and on the MaximumConcurrency setting in StorageTransferOptions. By uploading in parallel block chunks, the amount of time required to transfer the contents is greatly reduced.
C:\>netstat -a | find /c "blob:https"
800
C:\>
Next steps
In part two of the series, you learned about uploading large amounts of random data to a storage account in parallel, including how to:
- Configure the connection string
- Build the application
- Run the application
- Validate the number of connections
Advance to part three of the series to download large amounts of data from a storage account.