(Azure Data Factory) Uncompressing an archive file and uploading to blob - how to handle invalid blob names that trigger the Bad Request error

Question

b5000 1

Hi everyone!

I am working on a simple Azure Data Factory pipeline that extracts a .tar.gz archive stored in blob storage and writes the extracted files back to blob storage. Here is my setup.

The pipeline only contains a Copy Data activity. The source of the activity is my .tar.gz blob. The sink is a blob container. I set the compression type of the source to TarGZip and set the compression type of my target blob container to None. This setup worked fine until I came across a .tar.gz that has this file inside:

"123fa5b3-fde5-4763-8100-1b0dabf33606. à¸ à¸£à¸ à¸ à¸£"

I tried creating a file with that name on my computer. Windows accepted it as a valid file name, but uploading it manually through Azure Storage Explorer failed. Removing the unreadable part at the end allows the upload to go through, so I strongly suspect this strange name is the root cause.
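One plausible explanation, offered as an assumption rather than a confirmed diagnosis: the unreadable tail looks like UTF-8 text that was decoded with a single-byte code page, which tends to leave C1 control characters such as U+0081 in the name. Azure's blob naming guidance lists control characters (0x00 to 0x1F, \u0081, and so on) as not allowed. The sketch below flags such characters in a candidate name. The function name `find_blob_name_problems` and the sample string are hypothetical; the exact characters of the original file name cannot be recovered from this post.

```python
from __future__ import annotations

import unicodedata

MAX_BLOB_NAME_LENGTH = 1024  # documented upper bound for a blob name, in characters


def find_blob_name_problems(name: str) -> list[str]:
    """Return reasons why `name` may be rejected (or is discouraged) as a blob name."""
    problems = []
    if not 1 <= len(name) <= MAX_BLOB_NAME_LENGTH:
        problems.append(f"length {len(name)} is outside 1..{MAX_BLOB_NAME_LENGTH}")
    for index, char in enumerate(name):
        # Unicode category "Cc" covers the C0 controls (U+0000-U+001F), DEL,
        # and the C1 controls (U+0080-U+009F).
        if unicodedata.category(char) == "Cc":
            problems.append(f"control character U+{ord(char):04X} at index {index}")
    # This one is a documented recommendation rather than a hard rule.
    if name.endswith((".", "/")):
        problems.append("name ends with a dot or a forward slash")
    return problems


if __name__ == "__main__":
    # Hypothetical stand-in for the problematic name: it contains U+0081.
    suspect = "123fa5b3-fde5-4763-8100-1b0dabf33606. \u00e0\u00b8\u0081"
    for reason in find_blob_name_problems(suspect):
        print(reason)
```

Running a check like this over the archive's member names before the copy would show whether control characters are really what the storage service rejects in this case.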

Questions:

1. Is there a way for the Copy Data activity to skip files that fail to upload to the target blob container? As long as it logs which files it skips, I am all for it. From the documentation it seems the "Enable Fault Tolerance" option will do the job, but it is grayed out in the ADF UI.
2. If anyone has a different idea for achieving the same thing without Azure Data Factory, please feel free to share. I chose ADF because my .tar.gz is huge (almost 1 TB), and the other solutions I can think of cannot handle files this big.
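For the second question, one possible approach outside Data Factory is to read the archive as a forward-only stream and upload each member as it is encountered, recording any member that is skipped or fails instead of aborting. The following is a minimal sketch under stated assumptions: `extract_and_upload` and `has_control_chars` are hypothetical names, and `upload` is a caller-supplied callable (for example, a thin wrapper around a storage SDK) rather than a real Azure API. Throughput for a 1 TB archive has not been assessed here.

```python
import io
import tarfile
import unicodedata
from typing import BinaryIO, Callable, List, Tuple


def has_control_chars(name: str) -> bool:
    """True if the name contains a control character (Unicode category Cc)."""
    return any(unicodedata.category(ch) == "Cc" for ch in name)


def extract_and_upload(
    archive: BinaryIO,
    upload: Callable[[str, BinaryIO], None],
) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Stream a .tar.gz and hand each regular file to `upload`.

    Returns (uploaded names, [(skipped name, reason), ...]).
    """
    uploaded: List[str] = []
    skipped: List[Tuple[str, str]] = []
    # Mode "r|gz" reads the gzip stream strictly forward, so the archive is
    # never loaded into memory or unpacked to local disk as a whole.
    with tarfile.open(fileobj=archive, mode="r|gz") as tar:
        for member in tar:
            if not member.isfile():
                continue  # directories, links, and so on
            if has_control_chars(member.name):
                skipped.append((member.name, "control character in name"))
                continue
            try:
                upload(member.name, tar.extractfile(member))
                uploaded.append(member.name)
            except Exception as error:  # log the failure and keep going
                skipped.append((member.name, str(error)))
    return uploaded, skipped


if __name__ == "__main__":
    # Build a small in-memory archive with one bad name in the middle.
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as out:
        for name in ["a.txt", "bad\u0081.txt", "c.txt"]:
            payload = name.encode("utf-8")
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            out.addfile(info, io.BytesIO(payload))
    buffer.seek(0)

    store = {}  # stand-in for the target container
    done, missed = extract_and_upload(
        buffer, lambda name, data: store.__setitem__(name, data.read())
    )
    print("uploaded:", done)
    print("skipped:", missed)
```

Because each member is processed independently, a file that comes after the bad name in the archive is still handled, and the returned `skipped` list serves as the log of what was left out.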

Thanks!

Jimmy B

svijay-MSFT 5,256 Reputation points Microsoft Employee Moderator

2021-08-30T12:14:33.817+00:00

Hello @b5000 (Jimmy B)

Welcome to the Microsoft Q&A platform!

Fault tolerance is disabled because one of its prerequisites is that no compression format is specified in the source or sink dataset.

Coming back to your scenario: I ran a quick test at my end with the file name you mentioned, "123fa5b3-fde5-4763-8100-1b0dabf33606. à¸ à¸£à¸ à¸ à¸£", and I was able to copy it using the Copy Data activity.

I created a folder "New Folder" and added a file within it with the name '123fa5b3-fde5-4763-8100-1b0dabf33606. à¸ à¸£à¸ à¸ à¸£'. I compressed the file to New Folder.tar.gzip.

I uploaded this to the source blob container (sourcea) and used the Copy Data activity to decompress the data and move it to the target blob container (destinationa), with the compression type set to None.

The below is the source dataset

This is the result post the copy activity, the file got copied to the Destination

I wanted to check whether there is any additional configuration on your side that I might be missing and that could be causing the error. Also, if possible, can you share the complete error text you received and the pipeline run ID?
svijay-MSFT 5,256 Reputation points Microsoft Employee Moderator

2021-09-03T12:02:10.907+00:00

Hello @b5000 - I was able to replicate the error at my end. However, I observed that files with valid names are copied without any issues; only the files with invalid names are not copied. Coming back to your question about getting the list of files that are not copied, I came up with the workaround below.
svijay-MSFT 5,256 Reputation points Microsoft Employee Moderator

2021-09-03T12:02:23.65+00:00
You can enable logging for the copy activity under its Settings section.

Once the pipeline has run, you can download the copy activity log, which is in .txt format.

You can run the below script against the generated log file:

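# Load the downloaded copy activity log (CSV content in a .txt file).
$csv = Get-Content <PATH TO LOG FILE.txt> | ConvertFrom-Csv

# To get just the counts:
# $number_of_files_start_to_write = ($csv | ?{$_.message -like '*Start to write file*'}).length
# $number_of_files_Written = ($csv | ?{$_.message -like '*Complete writing file*'}).length

# Expand OperationItem to plain strings so Compare-Object compares the file names themselves.
$start_to__write = $csv | ?{$_.message -like '*Start to write file*'} | select -ExpandProperty OperationItem
$written = $csv | ?{$_.message -like '*Complete writing file*'} | select -ExpandProperty OperationItem

# Files that started writing but never completed.
(Compare-Object $start_to__write $written).InputObject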

The log usually has information about the files that have begun to write and that have been completed successfully.

Skipped files can be obtained by comparing and getting the files that don't have "Complete Writing File".
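The same comparison can be sketched in Python for anyone who cannot run PowerShell. This assumes the log parses as CSV with `Message` and `OperationItem` columns, which is what the script's property names imply. The message fragments are copied from the script's filters; the function name `find_unfinished_files` and the sample rows are hypothetical.

```python
import csv
import io
from typing import Iterable, List


def find_unfinished_files(log_lines: Iterable[str]) -> List[str]:
    """Return items that have a 'start to write' entry but no 'complete writing' entry."""
    started: List[str] = []
    completed = set()
    for row in csv.DictReader(log_lines):
        # Lower-case the header names: PowerShell property lookup is
        # case-insensitive, Python dict keys are not.
        record = {(key or "").strip().lower(): (value or "") for key, value in row.items()}
        message = str(record.get("message", "")).lower()
        item = str(record.get("operationitem", ""))
        if "start to write file" in message:
            started.append(item)
        elif "complete writing file" in message:
            completed.add(item)
    # Keep log order so the result is easy to match against the archive.
    return [item for item in started if item not in completed]


if __name__ == "__main__":
    # Hypothetical log excerpt; the real column layout may differ.
    sample = io.StringIO(
        "Timestamp,Level,OperationName,OperationItem,Message\n"
        "t1,Info,FileWrite,good.txt,Start to write file\n"
        "t2,Info,FileWrite,good.txt,Complete writing file\n"
        "t3,Info,FileWrite,bad.txt,Start to write file\n"
    )
    print(find_unfinished_files(sample))
```

With a real log, pass an open file handle instead of the in-memory sample, for example `open(path, newline="", encoding="utf-8")`.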

b5000 1 Reputation point

2021-09-04T03:43:55.68+00:00

Hi @svijay-MSFT ,

Thanks for getting back. In your test the good file comes first in the archive, so it extracts just fine. If the archive had a third file that sorts alphabetically after the bad one, it would not be extracted: the Copy Data activity stops as soon as it encounters the bad file name that won't upload to blob storage. Please advise.
svijay-MSFT 5,256 Reputation points Microsoft Employee Moderator

2021-09-13T19:09:36.167+00:00

@b5000 - Apologies for the delay. I have been testing things at my end, but unfortunately I was not able to find an alternative to the script above. That script can still be used to find all folders that contain invalid files.

Answer 1

Simon Poortman 1

Set all files active to my account anonymized@USER
