Data flow sink step in Azure Data Factory creates an empty blob alongside the folder

Chaoyue 1 Reputation point
2021-10-29T08:44:55.657+00:00

Hi. When I use a data flow in Azure Data Factory, the sink step creates an extra empty file alongside the output folder, which is not what I want. Please see the screenshot below.

144961-empty-file.png

I searched and found similar questions raised by others, and those questions have not been resolved yet:
https://learn.microsoft.com/en-us/answers/questions/382374/writing-to-parquet-creates-empty-blob.html
https://stackoverflow.com/questions/68662065/extra-blob-created-after-sink-in-data-flow

I think this could be a bug. I don't know the internal implementation of data flows, but I found an issue in adlfs that was closed in v0.6.3. I'm not sure whether it helps, but here is the link: https://github.com/dask/adlfs/issues/137

My question is whether this empty-blob issue can be fixed and, if it cannot, whether there is a workaround.

Azure Blob Storage
An Azure service that stores unstructured data in the cloud as blobs.
Azure Data Factory
An Azure service for ingesting, preparing, and transforming data at scale.

1 answer

  1. MartinJaffer-MSFT 26,021 Reputation points
    2021-10-29T21:07:13.193+00:00

    Hello @Chaoyue and welcome to Microsoft Q&A. Please allow me to explain what I know related to this issue. This starts with elaborating on the difference a Hierarchical Namespace makes.

    Blob storage and Data Lake Gen 2 are very similar. Data Lake Gen 2 is the result of enabling the hierarchical namespace feature on the storage account. The effect is to enhance the blob storage so that folders are no longer virtual, but actually have an entry of their own. That entry, when viewed through the blob driver, appears as an empty blob.

    An empty blob can be created when you try to use a Data Lake Gen 2 driver to write to blob storage (without hierarchical namespace). In an attempt to make the new folder, this empty blob is created.
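The mechanism described above can be sketched with a toy model. This is only an illustration under stated assumptions: `FlatContainer` and `gen2_style_write` are hypothetical names, not part of any Azure SDK, and the code does not reproduce what the data flow sink actually does internally. It shows why a "create directory, then write file" sequence leaves a zero-byte entry behind when the store has no real directories.

```python
class FlatContainer:
    """Toy model of a container without hierarchical namespace:
    a flat mapping from blob name to bytes, with no directory objects."""

    def __init__(self):
        self.blobs = {}

    def put_blob(self, name, data):
        self.blobs[name] = data

    def list_blobs(self):
        return sorted(self.blobs)


def gen2_style_write(container, folder, filename, data):
    """Hypothetical driver behavior: create the directory first, then the file."""
    # On a flat namespace there is no directory object, so the
    # "create directory" step can only materialize as a zero-byte blob
    # named after the folder.
    container.put_blob(folder, b"")
    # The file itself is stored under a name that merely contains a slash.
    container.put_blob(f"{folder}/{filename}", data)


container = FlatContainer()
gen2_style_write(container, "output", "part-00000.parquet", b"rows")
listing = container.list_blobs()
print(listing)  # ['output', 'output/part-00000.parquet']
```

The empty `output` entry in the listing corresponds to the extra empty file in the question: it sits beside the virtual folder of the same name and holds no data.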

    To better visualize, imagine a filing cabinet. The filing cabinet is your storage account. The drawers are the containers.
    Without hierarchical namespace (blob storage), you are placing papers directly into the drawer. The papers might be sorted, and parts of their names might declare that some papers belong together, but they are still loose.
    With hierarchical namespace (Data Lake Gen 2), you are placing file folders in the drawer, and placing papers inside those file folders.
    Remember, file folders are also made of paper just like your documents, so if you ask "show me all the things made of paper in the drawer", you will get the folders too. However, the folders don't contain any data, so they appear empty.
    If you try to insert a file folder into the first case, it will sit beside the other loose papers but hold nothing inside.
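As a possible workaround for the question's second part, the stray entries could be identified after the sink runs and then cleaned up. The sketch below is an assumption, not an official fix: `find_folder_markers` is a hypothetical helper that works on plain `(name, size)` pairs, and it treats a blob as a folder marker only when it is zero bytes long and other blobs live under `<name>/`. Against a real account, the pairs could come from a listing call such as `ContainerClient.list_blobs()` in the `azure-storage-blob` package, whose items expose `name` and `size`; deleting the markers would be a separate, deliberate step.

```python
from typing import Iterable, List, Tuple


def find_folder_markers(blobs: Iterable[Tuple[str, int]]) -> List[str]:
    """Return names of zero-byte blobs that share their name with a virtual folder."""
    entries = list(blobs)
    names = [name for name, _ in entries]
    markers = []
    for name, size in entries:
        if size != 0:
            continue  # blobs with content are never folder markers
        prefix = name.rstrip("/") + "/"
        # A marker is only reported if some other blob sits "inside" it.
        if any(other.startswith(prefix) for other in names if other != name):
            markers.append(name)
    return markers


# Sample listing (made-up names and sizes, for illustration only).
sample = [
    ("output", 0),                        # stray folder marker
    ("output/part-00000.parquet", 1024),  # real data written by the sink
    ("empty.csv", 0),                     # legitimately empty file, must be kept
    ("logs/run.txt", 12),
]
markers = find_folder_markers(sample)
print(markers)  # ['output']
```

Requiring both conditions keeps genuinely empty files such as `empty.csv` out of the result, so a cleanup step built on this check would not remove real outputs that happen to be empty.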

    Does this make sense? Or have I missed the mark?