Copy Azure Fileshare to blob and change everything to lowercase, using ADF?

P.Deschuytter 26 Reputation points
2023-03-15T17:38:58.4533333+00:00

Hi,

At work, we are looking to move an Azure file share to Azure Blob Storage.

The file share consists of 5 containers, each with folders, subfolders, and files.

Total size of the data: around 9 TB.

I'm looking for the most efficient way to copy them.

There is one big naming convention to put in place: everything has to be lowercase on the blob, because blob storage is case-sensitive and we have interfaces interacting with the file share/blob.

The initial goal was to use AzCopy, but AzCopy doesn't have a 'toLower' option.

So I created a script that starts a container instance per root folder. The container then runs on all subfolders of the specified root folder: it lists the files, lowercases their names, and copies them using AzCopy, file per file. Everything is scripted so that jobs of at most 20 files are created.
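
In essence, each container instance does something like this; the local root, destination account, and SAS token are hypothetical placeholders:

```powershell
# Rough sketch of what one container instance does, file per file.
$root = "X:\share\rootfolder"
$dest = "https://myaccount.blob.core.windows.net/mycontainer"
$sas  = "?<SAS>"   # placeholder

Get-ChildItem -Path $root -File -Recurse | ForEach-Object {
    # Lowercase the path relative to the root, then copy with AzCopy.
    $rel = $_.FullName.Substring($root.Length + 1).Replace('\', '/').ToLowerInvariant()
    azcopy copy $_.FullName ($dest + "/" + $rel + $sas)
}
```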

As you can imagine, performance is much lower than with plain AzCopy.

In the end, I stumbled onto Azure Data Factory. Reading this post, I can see that it should be possible to perform the following actions:

  1. Specify the specific file share container to be copied.
  2. Loop over all folders and subfolders of that container.
  3. Copy the files/folders to blob storage (same directory structure) and change the directory and file names to lowercase.

Preferably everything is quite performant, as we are moving 9 TB...

Could you give me some support or initial pointers on what I can do with ADF? I only discovered it about an hour ago. :s

Kind regards,

Pieter

Tags: Azure Files, Azure Blob Storage, Azure Data Factory

Accepted answer
  1. MartinJaffer-MSFT 26,096 Reputation points
    2023-03-20T22:33:17.5166667+00:00

    @P.Deschuytter TL;DR of my longer answer below:

    1. Use AzCopy to copy to Gen2 all files whose names are already lowercase, even if the file path contains uppercase directories. Use the --exclude-regex option for this.
    2. Use AzCopy --include-regex with the same expression plus the --dry-run option to get the list of files with uppercase names (a sketch of steps 1 and 2 follows after this list).
    3. Iterate over the list produced by step 2, copying each file and lowercasing its name while keeping the path the same. This can run in parallel with the step 1 AzCopy run.
    4. When both step 1 and step 3 are complete, and any failed copies have been taken care of, rename the directories.
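
    A minimal sketch of steps 1 and 2, assuming hypothetical account, share, and SAS values. The regex flags any entry whose final path segment (the file name itself) contains an uppercase letter:

    ```powershell
    # Hypothetical source share, Gen2 destination, and SAS placeholders.
    $src = "https://myaccount.file.core.windows.net/myshare?<SAS>"
    $dst = "https://myaccount.dfs.core.windows.net/myfilesystem?<SAS>"
    $re  = "[A-Z][^/]*$"   # uppercase letter somewhere in the file name itself

    # Step 1: bulk copy every file whose name is already all lowercase.
    azcopy copy $src $dst --recursive --exclude-regex $re

    # Step 2: same filter inverted, as a dry run, to list files that still need a rename.
    azcopy copy $src $dst --recursive --include-regex $re --dry-run > uppercase-files.txt
    ```

    Step 3 is then a per-file azcopy copy of each listed file to the same path, with only the file name lowercased.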

    On a side note, it seems they added a rename-file capability to Gen2. It was hidden under "Create".
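
    If you do end up on Gen2, a rename there can be done with the Az.Storage module; a sketch, with hypothetical account and paths:

    ```powershell
    # Hedged sketch: Gen2 (hierarchical namespace) rename via Az.Storage.
    $ctx = New-AzStorageContext -StorageAccountName "myaccount" -UseConnectedAccount
    Move-AzDataLakeGen2Item -Context $ctx -FileSystem "myfilesystem" `
        -Path "Folder/SubFolder" -DestFileSystem "myfilesystem" -DestPath "folder/subfolder"
    ```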

    1 person found this answer helpful.

2 additional answers

  1. P.Deschuytter 26 Reputation points
    2023-03-22T15:53:47.8533333+00:00

    Hi @MartinJaffer-MSFT

    After some internal reviews, we came up with the following copy approach.

    The current file share (to be copied) is synced through Azure File Sync to disk space on a VM.

    I ran a little test, and changing the directory names (using PowerShell to rename a directory to lowercase) apparently doesn't trigger a sync of the files in that folder. I find this rather odd, but it's an advantage for this approach. I was afraid that every file underneath would be updated to the new path in Azure file storage, and thus the File Sync service would have an enormous amount of work.
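
    For reference, the test was along these lines; the local path is a hypothetical placeholder, and since NTFS is case-insensitive I rename through a temporary name:

    ```powershell
    # Sketch of the lowercase directory rename on the synced VM disk.
    # Deepest directories first, so parent paths stay valid while renaming.
    $root = "D:\SyncedShare"   # hypothetical sync folder
    Get-ChildItem -Path $root -Directory -Recurse |
        Sort-Object { $_.FullName.Length } -Descending |
        Where-Object { $_.Name -cne $_.Name.ToLowerInvariant() } |
        ForEach-Object {
            # Case-only renames can be refused on NTFS, so go via a temporary name.
            $tmp = $_.Name + "~tmp"
            Rename-Item -LiteralPath $_.FullName -NewName $tmp
            Rename-Item -LiteralPath (Join-Path $_.Parent.FullName $tmp) -NewName $_.Name.ToLowerInvariant()
        }
    ```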

    Then the file names themselves: after extra review, it turns out it is not that important that the file names are lowercase. Mostly the interfaces/users enter a folder path to search for or place items; when the logic then retrieves the file names, it immediately gets the correct names.

    So we will test renaming the directory structure to lowercase, and if all goes well we should be able to use plain AzCopy afterwards for the copy to Azure Blob Storage.

    As for ADLS Gen2: we will investigate this further, but I will have to follow the internal bureaucracy since it is an infrastructural change. Thanks for mentioning it; I'm sure it will be part of future implementations. :)

    Thanks again for thinking this through with me. I (and we at work) learned a lot from this.

    Kind regards,

    Pieter


  2. MartinJaffer-MSFT 26,096 Reputation points
    2023-03-17T15:48:22.6066667+00:00

    @P.Deschuytter Hello and welcome to Microsoft Q&A.

    I understand you are trying to migrate from an Azure file share to Azure Blob Storage while altering file names, and you are seeking advice.

    You are correct that it is technically possible. However, it will be rather unpleasant. Before I get to that, I'd like to make a suggestion.

    You are currently targeting Blob storage. I highly recommend you use Azure Data Lake Storage Gen2 instead; that is, enable the hierarchical namespace feature on your blob storage account. This feature can only be turned on at creation. It will improve performance and access times by making the folders and subfolders 'real' rather than imaginary: without a hierarchical namespace, all blobs share a single flat namespace, and folders are really just parts of the blob name. A hierarchical namespace also allows fine-grained access control.
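
    Note that this is a creation-time choice; with Az PowerShell, it is set roughly like this (resource group, account name, and location are hypothetical):

    ```powershell
    # Hedged sketch: hierarchical namespace must be enabled when the account is created.
    New-AzStorageAccount -ResourceGroupName "my-rg" -Name "mylakeaccount" `
        -Location "westeurope" -SkuName "Standard_LRS" -Kind "StorageV2" `
        -EnableHierarchicalNamespace $true
    ```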

    Okay, with that said, onto the Data Factory part.

    So, the ADF Copy activity has a "Recursively" option on the source side, which lets all the subfolders be included in the copy. There are also options to preserve the folder structure and to preserve file attributes. Together, these let you copy everything in one go.

    However, they preserve the file names as-is, with no opportunity to change them to lowercase.

    To copy while changing to lowercase, one needs to first get the list of files, iterate over it, and then specify each file in the Copy activity, on both source and sink. It ends up being a lot like your current solution, except there are limits on Get Metadata (used to list files) and on levels of looping (no loops within loops), which make deep nesting problematic and turn the pipeline into a breadth-first tree traversal. Each layer of folder depth makes the pipeline more complicated. It really is painful.

    The file share has a "Rename File" operation, which is absent in both Blob and Gen2. Instead of renaming at copy time, can you rename first and then copy? I think that might be more performant with the available tools. TBH, I almost never touch the file share side of things.
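
    If you try the rename-first route directly against the share, recent versions of the Az.Storage module expose the rename operation; a sketch with hypothetical names (do verify the cmdlet is available in your module version):

    ```powershell
    # Hedged sketch: server-side rename on the file share via Az.Storage.
    $ctx = New-AzStorageContext -StorageAccountName "myaccount" -UseConnectedAccount
    Rename-AzStorageDirectory -Context $ctx -ShareName "myshare" `
        -SourcePath "Folder/SubFolder" -DestinationPath "folder/subfolder"
    ```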

    Although there is no rename operation on blobs, there is a "Copy Blob" operation which could do the renaming for you. Unfortunately, it is blob-to-blob only. It could be useful if you copied all the data with file names as-is and then set a script to go through and do the copy-rename, as sketched below.
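
    Such a copy-rename pass could look like this; account and container names are hypothetical:

    ```powershell
    # Hedged sketch: rename-by-copy on flat blob storage, then delete the original.
    $ctx = New-AzStorageContext -StorageAccountName "myaccount" -UseConnectedAccount
    Get-AzStorageBlob -Context $ctx -Container "mycontainer" |
        Where-Object { $_.Name -cne $_.Name.ToLowerInvariant() } |
        ForEach-Object {
            $copy = Start-AzStorageBlobCopy -Context $ctx -SrcContainer "mycontainer" `
                -SrcBlob $_.Name -DestContainer "mycontainer" -DestBlob $_.Name.ToLowerInvariant()
            $copy | Get-AzStorageBlobCopyState -WaitForComplete | Out-Null   # wait for the server-side copy
            Remove-AzStorageBlob -Context $ctx -Container "mycontainer" -Blob $_.Name
        }
    ```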

    There are some characters permitted in file share names that are not permitted in blob / Gen2 names.

