Automating Export of Parquet Schema Elements from Purview Using PyApacheAtlas

Janvi 0 Reputation points
2025-01-14T18:15:17.4566667+00:00

How can the fully qualified name, classifications, sensitivity labels, glossary terms, and column descriptions for each column be exported from an Azure Data Lake Storage Gen2 Resource Set within a scanned collection?

After completing the scan, the goal is to download all attributes for specific assets into an Excel file. This would facilitate adding additional information and later uploading it back using PyApacheAtlas.

Efforts have been made to use get_entity, which requires a GUID for each asset, but manually retrieving this information is tedious and inefficient.

Is there a method to automate this process to export the required details for all assets into an Excel file for seamless updates and re-uploading?

Microsoft Security | Microsoft Purview

1 answer

  1. Ganesh Gurram 7,295 Reputation points Microsoft External Staff Moderator
    2025-01-24T06:59:33.56+00:00

    @Janvi - Thanks for the update and for providing more context on your requirements.

    I understand that manually retrieving GUIDs for each asset can be time-consuming and inefficient, so let's automate this process and make it easier for you to work with metadata.

    Use search_entities (exposed through the client's discovery endpoint in PyApacheAtlas) to find the relevant assets (for example, Parquet schema elements). This returns search hits, each containing a GUID that you can use to fetch the full metadata.
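    As a sketch of this step (assuming a service-principal login; all angle-bracket values are placeholders, and the exact keys on a search hit can vary by Purview version):

```python
def collect_hits(search_results):
    """Reduce raw search hits to (guid, qualifiedName) pairs,
    skipping hits without a GUID."""
    return [
        (hit["id"], hit.get("qualifiedName", ""))
        for hit in search_results
        if hit.get("id")
    ]

def main():
    # Imports live here so collect_hits stays usable without a
    # Purview connection.
    from pyapacheatlas.auth import ServicePrincipalAuthentication
    from pyapacheatlas.core import PurviewClient

    auth = ServicePrincipalAuthentication(
        tenant_id="<tenant-id>",          # placeholder
        client_id="<client-id>",          # placeholder
        client_secret="<client-secret>",  # placeholder
    )
    client = PurviewClient(account_name="<purview-account>", authentication=auth)

    # search_entities yields search-hit dicts; "*.parquet" narrows the
    # results to Parquet assets.
    hits = client.discovery.search_entities("*.parquet")
    for guid, qualified_name in collect_hits(hits):
        print(guid, qualified_name)

# Call main() once the placeholders are filled in.
```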

    For each entity, use get_entity to retrieve detailed metadata, such as the fully qualified name, classifications, sensitivity labels, glossary terms, and column descriptions.
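    The per-entity fetch and flattening could look like the following sketch. Field names follow the Atlas REST payload (attributes, classifications, relationshipAttributes/meanings) and may need adjusting to what your Purview account actually returns; sensitivity labels in particular surface differently across versions, so they are omitted here:

```python
def flatten_entity(entity):
    """Flatten one Atlas entity dict into a single export row."""
    attrs = entity.get("attributes", {})
    rel = entity.get("relationshipAttributes", {})
    return {
        "guid": entity.get("guid"),
        "qualifiedName": attrs.get("qualifiedName"),
        "name": attrs.get("name"),
        "description": attrs.get("userDescription") or attrs.get("description"),
        "classifications": ";".join(
            c.get("typeName", "") for c in entity.get("classifications", [])
        ),
        "glossaryTerms": ";".join(
            t.get("displayText", "") for t in rel.get("meanings", [])
        ),
    }

def fetch_rows(client, guids):
    """Fetch each GUID with get_entity and flatten every returned entity."""
    rows = []
    for guid in guids:
        response = client.get_entity(guid=guid)  # {"entities": [...]}
        for entity in response.get("entities", []):
            rows.append(flatten_entity(entity))
    return rows
```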

    Once the metadata is collected, export it to an Excel file for easy viewing and editing. This will allow you to update the metadata outside of Purview as needed.
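    For the export itself, pandas is a convenient choice. A sketch, assuming rows shaped like the flattened dictionaries described above (to_excel also needs the openpyxl package installed):

```python
import pandas as pd

# Stable column order so the spreadsheet round-trips predictably.
COLUMNS = ["guid", "qualifiedName", "name", "description",
           "classifications", "glossaryTerms"]

def rows_to_frame(rows):
    """Arrange metadata rows into a DataFrame with a fixed column order."""
    return pd.DataFrame(rows, columns=COLUMNS)

def export_to_excel(rows, path="purview_assets.xlsx"):
    # index=False keeps the sheet clean for editing and re-importing later.
    rows_to_frame(rows).to_excel(path, index=False)
```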

    After editing the metadata in the Excel file, you can create a script that reads the updated file and pushes the changes back to Purview, for example with PyApacheAtlas's partial_update_entity for attribute changes or upload_entities for bulk updates.
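    A sketch of the re-upload side, assuming the spreadsheet keeps the guid and description columns from the export; partial_update_entity patches attributes on an existing entity (check your PyApacheAtlas version for the exact signature), and read_excel needs openpyxl:

```python
import pandas as pd

def updates_from_frame(frame):
    """Turn edited spreadsheet rows into (guid, attributes) pairs,
    skipping rows with an empty description."""
    updates = []
    for _, row in frame.iterrows():
        if pd.notna(row.get("description")):
            updates.append((row["guid"], {"userDescription": row["description"]}))
    return updates

def push_updates(client, path="purview_assets.xlsx"):
    frame = pd.read_excel(path)
    for guid, attributes in updates_from_frame(frame):
        # Patches only the listed attributes on the existing entity.
        client.partial_update_entity(guid=guid, attributes=attributes)
```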

    Once the process is working, you can schedule the script to run automatically at regular intervals (e.g., daily or weekly) using cron jobs (Linux) or Task Scheduler (Windows), ensuring that the metadata is always up to date.
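    For the Linux side, a crontab entry along these lines would run the script nightly at 02:00 (both paths are placeholders for your own script and log locations):

```shell
# minute hour day-of-month month day-of-week  command
0 2 * * * /usr/bin/python3 /path/to/purview_metadata_sync.py >> /path/to/purview_sync.log 2>&1
```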

    I hope this information helps.

    Kindly consider upvoting the comment if the information provided is helpful. This can assist other community members in resolving similar issues.

