Is it possible to catalog data inside csv files inside Azure Blob Storage using Azure Data Catalog?

Muneeb Mirza 41 Reputation points
2021-06-28T10:43:12.617+00:00

I want to catalog data stored in CSV files in Azure Blob Storage. I looked for a way to get metadata for Blob Storage and found that Data Catalog is an option. The problem is that a CSV file is handled as a blob type, so we cannot profile it. I want the CSV files in Blob Storage to act as tables.

Is this possible using Azure Data Catalog?

Azure Blob Storage
An Azure service that stores unstructured data in the cloud as blobs.

Accepted answer
  1. Sumarigo-MSFT 47,471 Reputation points Microsoft Employee Moderator
    2021-07-02T09:13:20.163+00:00

    @Muneeb Mirza Are you saying that the CSV contains, essentially, a pointer to a dataset you want to catalog?

    For example, the CSV might look like…

    Server,schema,database,table,tableDescription
    Server1,dbo,mydb,table0001,"This is a really great table!"

    If so, you’ll have to write some code on your own to parse the data and conform it to the Atlas Entity REST API calls.
    You might take inspiration from the work I’ve done on PyApacheAtlas: pyapacheatlas/reader.py at master · wjohnson/pyapacheatlas (github.com)

    Alternatively, you might restructure the data (in a Python program) to conform to the templates that I provided. The required headers are:
    • "typeName"
    • "name"
    • "qualifiedName"
    • "classifications"

    If you have that information as a list of dictionaries in Python, you could call Reader.parse_bulk_entities(list_of_dicts), then pass the results to client.upload_entities(results) to bulk upload the entities in that CSV.
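
    As a rough sketch of the restructuring step, the snippet below reads a CSV shaped like the example above and builds a list of dictionaries carrying the four required headers. The type name "custom_table", the slash-separated qualifiedName scheme, the empty classifications value, and the extra "description" column are all assumptions made for illustration, not values PyApacheAtlas or Purview prescribes. The upload calls are left as comments because they need an authenticated client, and the import path shown there is also an assumption.

```python
import csv
import io

# The four headers named in the answer above.
REQUIRED_HEADERS = ("typeName", "name", "qualifiedName", "classifications")


def rows_to_template(csv_text, type_name="custom_table"):
    """Convert pointer-style CSV rows into template-shaped dictionaries.

    type_name is a hypothetical placeholder; use a type that exists
    in your own catalog.
    """
    entities = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Hypothetical naming scheme: server/database/schema/table.
        qualified_name = "/".join(
            [row["Server"], row["database"], row["schema"], row["table"]]
        )
        entities.append(
            {
                "typeName": type_name,
                "name": row["table"],
                "qualifiedName": qualified_name,
                # Left empty here; fill in if you classify assets.
                "classifications": "",
                # Assumption: extra columns are accepted as attributes.
                "description": row["tableDescription"],
            }
        )
    return entities


sample = (
    "Server,schema,database,table,tableDescription\n"
    'Server1,dbo,mydb,table0001,"This is a really great table!"\n'
)
list_of_dicts = rows_to_template(sample)
print(list_of_dicts)

# Upload step as described in the answer (not run here; requires an
# authenticated client, and the import path is an assumption):
# from pyapacheatlas.readers.reader import Reader, ReaderConfiguration
# results = Reader(ReaderConfiguration()).parse_bulk_entities(list_of_dicts)
# client.upload_entities(results)
```

    In practice you would read the CSV text from the blob first and pick a qualifiedName pattern that matches how the rest of your catalog names its assets.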

    Hope this helps!

    Kindly let us know if the above helps or you need further assistance on this issue.

    ----------------------------------------------------------------------------------------------------------------------------------------

    Please do not forget to "Accept the answer" and "up-vote" wherever the information provided helps you; this can be beneficial to other community members.


1 additional answer

  1. Sumarigo-MSFT 47,471 Reputation points Microsoft Employee Moderator
    2021-06-29T10:15:08.593+00:00

    @Muneeb Mirza Firstly, apologies for the delay in responding here and any inconvenience this issue may have caused.

    Yes, you can use Data Catalog. However, I would recommend using the newer Azure Purview service instead, although this is still possible through Data Catalog.

    Registering assets from a data source copies the assets’ metadata to Azure, but the data remains in the existing data-source location.

    For updated Data Catalog features, please use the new Azure Purview service, which offers unified data governance for your entire data estate.
    Introduction to Azure Purview (preview) - Azure Purview
    This article provides an overview of Azure Purview, including its features and the problems it addresses. Azure Purview enables any user to register, discover, understand, and consume data sources.

    This article outlines how to register an Azure Blob Storage account in Purview and set up a scan.

    • For more information on blob index tags, see "Use blob index tags to manage and find data on Azure Blob Storage". Blob index tags categorize data in your storage account using key-value tag attributes. These tags are automatically indexed and exposed as a searchable multi-dimensional index to easily find data. The article shows you how to set, get, and find data using blob index tags.

    Hope this helps!

    Kindly let us know if the above helps or you need further assistance on this issue.



