question

Chaoyue-5700 asked MartinJaffer-MSFT commented

Data flow sink step in Azure Data Factory creates an empty blob alongside the folder

Hi, when I use a data flow in Azure Data Factory, the sink step creates an extra empty file alongside the output folder, which is not what I want. Please see the screenshot below.

[Screenshot: 144961-empty-file.png]

I searched and found similar questions raised by others that have not been resolved yet:
https://docs.microsoft.com/en-us/answers/questions/382374/writing-to-parquet-creates-empty-blob.html
https://stackoverflow.com/questions/68662065/extra-blob-created-after-sink-in-data-flow

I think this could be a bug. I don't know the inner implementation of data flows, but I found a similar issue in adlfs that was closed in v0.6.3. I'm not sure whether it helps, but here is the link: https://github.com/dask/adlfs/issues/137

My question is whether this empty blob creation can be fixed, and if it cannot be fixed, whether there is a workaround.

azure-data-factory azure-blob-storage

Did my response help you, @Chaoyue-5700? If it solved your issue, please mark it as the accepted answer; otherwise let me know how I may better assist.


1 Answer

MartinJaffer-MSFT answered

Hello @Chaoyue-5700 and welcome to Microsoft Q&A. Please allow me to explain what I know about this issue, starting with the difference a hierarchical namespace makes.

Blob storage and Data Lake Gen 2 are very similar. Data Lake Gen 2 is the result of enabling the hierarchical namespace feature on the storage account. The effect is to enhance the blob storage so that folders are no longer virtual, but actually have an entry. That entry, when viewed through the blob driver, presents as an empty blob.

An empty blob can be created when you try to use a Data Lake Gen 2 driver to write to blob storage (without hierarchical namespace). In an attempt to make the new folder, this empty blob is created.
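If you need to spot these leftover entries after a run, one option is to scan a container listing for zero-byte blobs whose name is also the prefix of other blobs. The following is only a sketch under assumptions: `find_folder_markers` is a hypothetical helper, the listing is hard-coded sample data, and in practice you would build the `(name, size)` pairs from your own listing of the container. It does not stop the sink from creating the blob; it only identifies candidates you might then delete.

```python
from typing import Iterable, List, Tuple


def find_folder_markers(listing: Iterable[Tuple[str, int]]) -> List[str]:
    """Return names of zero-byte blobs that share their name with a virtual folder.

    `listing` is a sequence of (blob_name, size_in_bytes) pairs.
    Hypothetical helper for illustration; not part of any Azure SDK.
    """
    entries = list(listing)
    names = [name for name, _ in entries]
    markers = []
    for name, size in entries:
        if size != 0:
            continue  # a folder marker holds no data
        prefix = name.rstrip("/") + "/"
        # It is a marker only if other blobs live "inside" this name.
        if any(other.startswith(prefix) for other in names if other != name):
            markers.append(name)
    return markers


# Sample listing (made-up names): "output" is the empty blob beside the folder.
listing = [
    ("output", 0),
    ("output/part-00000.parquet", 1024),
    ("notes.txt", 0),  # empty, but not a folder marker
]
print(find_folder_markers(listing))
```

Ordinary empty files such as `notes.txt` are left alone because no other blob name starts with `notes.txt/`.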

To better visualize, imagine a filing cabinet. The filing cabinet is your storage account. The drawers are the containers.
Without hierarchical namespace (blob storage), you are placing papers directly into the drawer. The papers might be sorted, and parts of their names might declare that some papers belong together, but they are still loose.
With hierarchical namespace (Data Lake Gen 2), you are placing file folders in the drawer, and placing papers inside those file folders.
Remember, file folders are also made of paper just like your documents, so if you ask "show me all the things made of paper in the drawer", you will get the folders too. However, the folders don't contain any data, so they appear empty.
If you try to insert a file folder in the first case (a drawer of loose papers), it will appear beside the other papers but hold nothing inside.
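The same idea can be shown with a toy model. This is a minimal sketch, assuming a flat container is just a map from blob name to bytes; `FlatContainer`, `write_like_hns_driver`, and `write_like_blob_driver` are hypothetical names invented for illustration and do not reflect how Azure or the data flow sink is actually implemented.

```python
class FlatContainer:
    """Toy model of a container without hierarchical namespace: name -> bytes."""

    def __init__(self):
        self.blobs = {}

    def upload(self, name, data):
        self.blobs[name] = data

    def list_blobs(self):
        # Return (name, size) pairs, sorted by name.
        return sorted((name, len(data)) for name, data in self.blobs.items())


def write_like_blob_driver(container, folder, filename, data):
    # The "folder" exists only as a prefix in the blob name.
    container.upload(f"{folder}/{filename}", data)


def write_like_hns_driver(container, folder, filename, data):
    # Assumption: the driver first creates a directory entry. On a flat
    # container, that entry can only be stored as a zero-byte blob.
    container.upload(folder, b"")
    container.upload(f"{folder}/{filename}", data)


flat = FlatContainer()
write_like_blob_driver(flat, "output", "data.csv", b"hello")
print(flat.list_blobs())    # only the data blob

mixed = FlatContainer()
write_like_hns_driver(mixed, "output", "data.csv", b"hello")
print(mixed.list_blobs())   # an extra zero-byte "output" blob beside the data
```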

Does this make sense? Or have I missed the mark?
