Automating Export of Parquet Schema Elements from Purview Using PyApacheAtlas

Janvi 0 Reputation points
2025-01-14T18:15:17.4566667+00:00

How can the fully qualified name, classifications, sensitivity labels, glossary terms, and column descriptions for each column be exported from an Azure Data Lake Storage Gen2 Resource Set within a scanned collection?

After completing the scan, the goal is to download all attributes for specific assets into an Excel file. This would facilitate adding additional information and later uploading it back using PyApacheAtlas.

Efforts have been made to use get_entity, which requires a GUID for each asset, but manually retrieving this information is tedious and inefficient.

Is there a method to automate this process to export the required details for all assets into an Excel file for seamless updates and re-uploading?

Microsoft Security | Microsoft Purview

1 answer

  1. Ganesh Gurram 7,295 Reputation points Microsoft External Staff Moderator
    2025-01-24T06:59:33.56+00:00

    @Janvi - Thanks for the update and for providing more context on your requirements.

    I understand that manually retrieving GUIDs for each asset can be time-consuming and inefficient, so let's automate this process and make it easier for you to work with metadata.

    Use search_entities (exposed through the client's discovery endpoint in PyApacheAtlas) to find the relevant assets (for example, Parquet schema elements). This returns search hits, each containing a GUID that you can use to fetch the full metadata.
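    As a sketch of this step (assuming a service-principal login; all angle-bracket values are placeholders, and the exact keys on a search hit can vary by Purview version):

```python
def collect_hits(search_results):
    """Reduce raw search hits to (guid, qualifiedName) pairs,
    skipping hits without a GUID."""
    return [
        (hit["id"], hit.get("qualifiedName", ""))
        for hit in search_results
        if hit.get("id")
    ]

def main():
    # Imports live here so collect_hits stays usable without a
    # Purview connection.
    from pyapacheatlas.auth import ServicePrincipalAuthentication
    from pyapacheatlas.core import PurviewClient

    auth = ServicePrincipalAuthentication(
        tenant_id="<tenant-id>",          # placeholder
        client_id="<client-id>",          # placeholder
        client_secret="<client-secret>",  # placeholder
    )
    client = PurviewClient(account_name="<purview-account>", authentication=auth)

    # search_entities yields search-hit dicts; "*.parquet" narrows the
    # results to Parquet assets.
    hits = client.discovery.search_entities("*.parquet")
    for guid, qualified_name in collect_hits(hits):
        print(guid, qualified_name)

# Call main() once the placeholders are filled in.
```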

    For each entity, use get_entity to retrieve detailed metadata, such as the fully qualified name, classifications, sensitivity labels, glossary terms, and column descriptions.
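    The per-entity fetch and flattening could look like the following sketch. Field names follow the Atlas REST payload (attributes, classifications, relationshipAttributes/meanings) and may need adjusting to what your Purview account actually returns; sensitivity labels in particular surface differently across versions, so they are omitted here:

```python
def flatten_entity(entity):
    """Flatten one Atlas entity dict into a single export row."""
    attrs = entity.get("attributes", {})
    rel = entity.get("relationshipAttributes", {})
    return {
        "guid": entity.get("guid"),
        "qualifiedName": attrs.get("qualifiedName"),
        "name": attrs.get("name"),
        "description": attrs.get("userDescription") or attrs.get("description"),
        "classifications": ";".join(
            c.get("typeName", "") for c in entity.get("classifications", [])
        ),
        "glossaryTerms": ";".join(
            t.get("displayText", "") for t in rel.get("meanings", [])
        ),
    }

def fetch_rows(client, guids):
    """Fetch each GUID with get_entity and flatten every returned entity."""
    rows = []
    for guid in guids:
        response = client.get_entity(guid=guid)  # {"entities": [...]}
        for entity in response.get("entities", []):
            rows.append(flatten_entity(entity))
    return rows
```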

    Once the metadata is collected, export it to an Excel file for easy viewing and editing. This will allow you to update the metadata outside of Purview as needed.
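    For the export itself, pandas is a convenient choice. A sketch, assuming rows shaped like the flattened dictionaries described above (to_excel also needs the openpyxl package installed):

```python
import pandas as pd

# Stable column order so the spreadsheet round-trips predictably.
COLUMNS = ["guid", "qualifiedName", "name", "description",
           "classifications", "glossaryTerms"]

def rows_to_frame(rows):
    """Arrange metadata rows into a DataFrame with a fixed column order."""
    return pd.DataFrame(rows, columns=COLUMNS)

def export_to_excel(rows, path="purview_assets.xlsx"):
    # index=False keeps the sheet clean for editing and re-importing later.
    rows_to_frame(rows).to_excel(path, index=False)
```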

    After editing the metadata in the Excel file, you can create a script that reads the updated file and pushes the changes back to Purview, for example with PyApacheAtlas's partial_update_entity for attribute changes or upload_entities for bulk updates.
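    A sketch of the re-upload side, assuming the spreadsheet keeps the guid and description columns from the export; partial_update_entity patches attributes on an existing entity (check your PyApacheAtlas version for the exact signature), and read_excel needs openpyxl:

```python
import pandas as pd

def updates_from_frame(frame):
    """Turn edited spreadsheet rows into (guid, attributes) pairs,
    skipping rows with an empty description."""
    updates = []
    for _, row in frame.iterrows():
        if pd.notna(row.get("description")):
            updates.append((row["guid"], {"userDescription": row["description"]}))
    return updates

def push_updates(client, path="purview_assets.xlsx"):
    frame = pd.read_excel(path)
    for guid, attributes in updates_from_frame(frame):
        # Patches only the listed attributes on the existing entity.
        client.partial_update_entity(guid=guid, attributes=attributes)
```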

    Once the process is working, you can schedule the script to run automatically at regular intervals (e.g., daily or weekly) using cron jobs (Linux) or Task Scheduler (Windows), ensuring that the metadata is always up to date.
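    For the Linux side, a crontab entry along these lines would run the script nightly at 02:00 (both paths are placeholders for your own script and log locations):

```shell
# minute hour day-of-month month day-of-week  command
0 2 * * * /usr/bin/python3 /path/to/purview_metadata_sync.py >> /path/to/purview_sync.log 2>&1
```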

    I hope this information helps.

    Kindly consider upvoting the comment if the information provided is helpful. This can assist other community members in resolving similar issues.

