Upload large amounts of random data in parallel to Azure storage

This tutorial is part two of a series. It shows you how to deploy an application that uploads large amounts of random data to an Azure storage account.

In part two of the series, you learn how to:

  • Configure the connection string
  • Build the application
  • Run the application
  • Validate the number of connections

Microsoft Azure Blob Storage provides a scalable service for storing your data. To ensure that your application is as performant as possible, an understanding of how blob storage works is recommended. Knowledge of the limits for Azure blobs is also important; to learn more about these limits, see Scalability and performance targets for Blob storage.

Partition naming is another potentially important factor when designing a high-performance application using blobs. For block sizes greater than or equal to 4 MiB, high-throughput block blobs are used, and partition naming doesn't impact performance. For block sizes less than 4 MiB, Azure Storage uses a range-based partitioning scheme to scale and load balance, which means that files with similar naming conventions or prefixes go to the same partition. This logic includes the name of the container that the files are being uploaded to. In this tutorial, you use files that have GUIDs for names and randomly generated content; they're then uploaded to five different containers with random names.
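To see why GUID names spread the load, note that sequentially named files (for example, backup-001.dat, backup-002.dat) share a long common prefix and therefore tend to map to the same partition, while GUID names begin with effectively random characters. The following minimal sketch (not part of the sample application) prints a few GUID-style names like those the app generates:

using System;

// Sequential names like backup-001.dat share a long prefix and tend
// to land in the same partition. GUID names start with effectively
// random hex characters, so uploads spread across partitions.
for (int i = 0; i < 3; i++)
{
    Console.WriteLine($"{Guid.NewGuid()}.pdf");
}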

Prerequisites

To complete this tutorial, you must have completed the previous Storage tutorial: Create a virtual machine and storage account for a scalable application.

Remote into your virtual machine

Use the following command on your local machine to create a remote desktop session with the virtual machine. Replace the IP address with the publicIPAddress of your virtual machine. When prompted, enter the credentials you used when creating the virtual machine.

mstsc /v:<publicIpAddress>

Configure the connection string

In the Azure portal, navigate to your storage account. Select Access keys under Settings in your storage account. Copy the connection string from the primary or secondary key. Sign in to the virtual machine you created in the previous tutorial. Open a Command Prompt as an administrator and run the setx command with the /m switch. This switch saves a machine-wide environment variable. The environment variable isn't available until you open a new Command Prompt. Replace <storageConnectionString> in the following sample:

setx storageconnectionstring "<storageConnectionString>" /m
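
The sample application reads this environment variable at run time. A minimal sketch of that pattern, assuming the Azure.Storage.Blobs package (the sample's actual startup code may differ):

using System;
using Azure.Storage.Blobs;

// Read the machine-wide variable saved by setx.
string connectionString = Environment.GetEnvironmentVariable("storageconnectionstring");

// Create a client for the storage account.
BlobServiceClient serviceClient = new BlobServiceClient(connectionString);
Console.WriteLine($"Connected to account: {serviceClient.AccountName}");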

When finished, open another Command Prompt, navigate to D:\git\storage-dotnet-perf-scale-app and type dotnet build to build the application.

Run the application

Navigate to D:\git\storage-dotnet-perf-scale-app.

Type dotnet run to run the application. The first time you run dotnet, it populates your local package cache to improve restore speed and enable offline access. This step takes up to a minute to complete and only happens once.

dotnet run

The application creates five randomly named containers and begins uploading the files in the staging (upload) directory to the storage account.

The UploadFilesAsync method is shown in the following example:

private static async Task UploadFilesAsync()
{
    // Create five randomly named containers to store the uploaded files.
    BlobContainerClient[] containers = await GetRandomContainersAsync();

    // Path to the directory to upload
    string uploadPath = Directory.GetCurrentDirectory() + "\\upload";

    // Start a timer to measure how long it takes to upload all the files.
    Stopwatch timer = Stopwatch.StartNew();

    try
    {
        Console.WriteLine($"Iterating in directory: {uploadPath}");
        int count = 0;

        Console.WriteLine($"Found {Directory.GetFiles(uploadPath).Length} file(s)");

        // Specify the StorageTransferOptions
        BlobUploadOptions options = new BlobUploadOptions
        {
            TransferOptions = new StorageTransferOptions
            {
                // Set the maximum number of workers that 
                // may be used in a parallel transfer.
                MaximumConcurrency = 8,

                // Set the maximum length of a transfer to 50 MiB.
                MaximumTransferSize = 50 * 1024 * 1024
            }
        };

        // Create a queue of tasks that will each upload one file.
        var tasks = new Queue<Task<Response<BlobContentInfo>>>();

        // Iterate through the files
        foreach (string filePath in Directory.GetFiles(uploadPath))
        {
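            // Distribute the files round-robin across the five containers.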
            BlobContainerClient container = containers[count % 5];
            string fileName = Path.GetFileName(filePath);
            Console.WriteLine($"Uploading {fileName} to container {container.Name}");
            BlobClient blob = container.GetBlobClient(fileName);

            // Add the upload task to the queue
            tasks.Enqueue(blob.UploadAsync(filePath, options));
            count++;
        }

        // Run all the tasks asynchronously.
        await Task.WhenAll(tasks);

        timer.Stop();
        Console.WriteLine($"Uploaded {count} files in {timer.Elapsed.TotalSeconds} seconds");
    }
    catch (RequestFailedException ex)
    {
        Console.WriteLine($"Azure request failed: {ex.Message}");
    }
    catch (DirectoryNotFoundException ex)
    {
        Console.WriteLine($"Error parsing files in the directory: {ex.Message}");
    }
    catch (Exception ex)
    {
        Console.WriteLine($"Exception: {ex.Message}");
    }
}
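
The GetRandomContainersAsync method isn't shown here. A minimal sketch of what it might look like, assuming the connection string is read from the storageconnectionstring environment variable configured earlier (the sample's actual implementation may differ):

private static async Task<BlobContainerClient[]> GetRandomContainersAsync()
{
    // Read the connection string saved with setx earlier in this tutorial.
    string connectionString = Environment.GetEnvironmentVariable("storageconnectionstring");
    BlobServiceClient serviceClient = new BlobServiceClient(connectionString);

    // Create five containers with GUID-based (effectively random) names.
    BlobContainerClient[] containers = new BlobContainerClient[5];
    for (int i = 0; i < containers.Length; i++)
    {
        containers[i] = serviceClient.GetBlobContainerClient(Guid.NewGuid().ToString());
        await containers[i].CreateIfNotExistsAsync();
        Console.WriteLine($"Created container {containers[i].Name}");
    }

    return containers;
}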

The following example shows truncated output from the application running on a Windows system.

Created container 2dbb45f4-099e-49eb-880c-5b02ebac135e
Created container 0d784365-3bdf-4ef2-b2b2-c17b6480792b
Created container 42ac67f2-a316-49c9-8fdb-860fb32845d7
Created container f0357772-cb04-45c3-b6ad-ff9b7a5ee467
Created container 92480da9-f695-4a42-abe8-fb35e71eb887
Iterating in directory: C:\git\myapp\upload
Found 5 file(s)
Uploading 1d596d16-f6de-4c4c-8058-50ebd8141e4d.pdf to container 2dbb45f4-099e-49eb-880c-5b02ebac135e
Uploading 242ff392-78be-41fb-b9d4-aee8152a6279.pdf to container 0d784365-3bdf-4ef2-b2b2-c17b6480792b
Uploading 38d4d7e2-acb4-4efc-ba39-f9611d0d55ef.pdf to container 42ac67f2-a316-49c9-8fdb-860fb32845d7
Uploading 45930d63-b0d0-425f-a766-cda27ff00d32.pdf to container f0357772-cb04-45c3-b6ad-ff9b7a5ee467
Uploading 5129b385-5781-43be-8bac-e2fbb7d2bd82.pdf to container 92480da9-f695-4a42-abe8-fb35e71eb887
Uploaded 5 files in 16.9552163 seconds

Validate the connections

While the files are being uploaded, you can verify the number of concurrent connections to your storage account. Open a console window and type netstat -a | find /c "blob:https". This command shows the number of connections that are currently open. As you can see from the following example, 800 connections were open while the random files were being uploaded to the storage account. This value changes over the course of the upload. By uploading in parallel block chunks, the amount of time required to transfer the contents is greatly reduced.

C:\>netstat -a | find /c "blob:https"
800

C:\>

Next steps

In part two of the series, you learned how to upload large amounts of random data to a storage account in parallel, including how to:

  • Configure the connection string
  • Build the application
  • Run the application
  • Validate the number of connections

Advance to part three of the series to download large amounts of data from a storage account.