Purview - Extracting MetaData

Rob Sharpe 20 Reputation points

Hi there,

I am looking to extract Metadata from a data source in Purview - its source is an AWS S3 bucket.

I have seen that I can extract the metadata using REST APIs into a JSON format.

My question is - what language should I use to do this? I have basic Python knowledge, so that might be an option.

Is there an existing option that I can use to do this, or do I need to write something to extract the metadata?

Thanks

PRADEEPCHEEKATLA 90,226 Reputation points

2024-05-27T04:53:52.3133333+00:00

@Rob Sharpe - Thanks for the question and using MS Q&A platform.

To extract metadata from an AWS S3 bucket data source in Microsoft Purview, you can use the Multi-Cloud Scanning Connector for Microsoft Purview. This connector allows you to explore your organizational data across cloud providers, including Amazon Web Services in addition to Azure storage services.

The Multi-Cloud Scanning Connector for Microsoft Purview uses a Microsoft account with secure access to AWS to read your data and report the scanning results back to Azure. You can then use the Microsoft Purview classification and labeling reports to analyze and review your data scan results.

To use the Multi-Cloud Scanning Connector for Microsoft Purview, you need to have a Microsoft Azure account and the connector needs to be set up and configured. You can find more information on how to set up and configure the connector in the following document: Amazon S3 multi-cloud scanning connector for Microsoft Purview.

If you prefer to use REST APIs to extract metadata from an AWS S3 bucket data source, you can use the AWS SDK for Python (Boto3) to interact with the AWS S3 API. You can find more information on how to use Boto3 to interact with the AWS S3 API in the following document: Boto3 documentation.

However, please note that using the Multi-Cloud Scanning Connector for Microsoft Purview is the recommended approach for extracting metadata from an AWS S3 bucket data source in Microsoft Purview. Hope this helps. Do let us know if you have any further queries.
PRADEEPCHEEKATLA 90,226 Reputation points

2024-05-29T08:10:11.0533333+00:00

@Rob Sharpe - We haven’t heard from you since the last response and were just checking back to see if you have a resolution yet. If you do, please share it with the community, as it can be helpful to others. Otherwise, reply with more details and we will try to help.
Rob Sharpe 20 Reputation points

2024-05-29T08:17:04.7866667+00:00

Hi @PRADEEPCHEEKATLA ,

An update, or actually another question or two.

(1) If I want to extract the metadata from Purview into a file or set of files, in order to then load it into Power BI with an ETL or ELT process, should I use the REST APIs?

(2) How would I get it to include ALL the metadata?

(3) Could I programmatically limit it to just extract the data from S3, as there may be other sources used?

Thanks

Rob
PRADEEPCHEEKATLA 90,226 Reputation points

2024-05-30T07:38:41.5566667+00:00
@Rob Sharpe - Let me answer your questions more accurately:

To extract metadata from Purview into a file or set of files, you can use the Purview REST APIs from any programming language that can make HTTP requests. Python is a good option, as it has libraries that help with REST API calls. You can use the following REST API to extract metadata from Purview:

GET https://<account-name>.catalog.purview.azure.com/api/atlas/v2/search/basic?query=<query>&typeName=<type-name>&classification=<classification>&limit=<limit>&offset=<offset>

To get all the metadata, you can set the limit parameter to a high number, such as 1000. This will return up to 1000 results per page. You can then increase the offset parameter by the limit on each request to paginate through the remaining results.
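
The paging described above could be sketched in Python roughly as follows. This is only a sketch under assumptions: the base URL is the pattern from the request above with "my-account" as a placeholder, fetch_page is a hypothetical callable that performs the authenticated GET and returns the parsed JSON, and the "entities" response key is assumed rather than confirmed for this endpoint.

```python
from typing import Callable, Dict, Iterator
from urllib.parse import urlencode

# Assumption: URL pattern copied from the request above; "my-account" is a placeholder.
BASE_URL = "https://my-account.catalog.purview.azure.com/api/atlas/v2/search/basic"


def build_url(query: str, limit: int, offset: int) -> str:
    """Return the search URL for one page; urlencode percent-encodes the values."""
    params = {"query": query, "limit": limit, "offset": offset}
    return BASE_URL + "?" + urlencode(params)


def iter_entities(
    fetch_page: Callable[[str], Dict], query: str, limit: int = 1000
) -> Iterator[Dict]:
    """Yield every result by advancing offset until a short or empty page arrives.

    fetch_page is a hypothetical callable: it takes a URL, performs the
    authenticated GET, and returns the parsed JSON body as a dict.
    """
    offset = 0
    while True:
        page = fetch_page(build_url(query, limit, offset))
        entities = page.get("entities") or []  # assumed response key
        yield from entities
        if len(entities) < limit:
            break  # last page reached
        offset += limit
```

With the requests library, fetch_page could be something like `lambda url: requests.get(url, headers={"Authorization": "Bearer " + token}).json()`, where obtaining the token is left out of this sketch.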

To programmatically limit the metadata extraction to just the data from S3, you can use the query parameter to filter the results. For example, you can use the following query to filter the results to only include assets that are sourced from S3:

query=sourceType%3Ds3

This will return only the assets that are sourced from S3. You can modify the query to include other filters as needed.

Once you have extracted the metadata, you can use ETL or ELT processes to load the data into Power BI. You can use tools like Azure Data Factory or Azure Databricks to perform ETL or ELT processes.
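
As a rough illustration of the "extract to a file" step, the sketch below flattens search results into a CSV file that Power BI or an ETL tool could pick up. The column names and the "attributes" structure are assumptions about the response shape, not confirmed fields, so adjust them to whatever the real JSON contains.

```python
import csv
import io
from typing import Dict, Iterable, TextIO

# Assumed field names; change these to match the keys in the real response.
COLUMNS = ["guid", "typeName", "name", "qualifiedName"]


def flatten(entity: Dict) -> Dict[str, str]:
    """Pick a few top-level and nested attribute values into one flat row."""
    attrs = entity.get("attributes") or {}
    return {
        "guid": entity.get("guid", ""),
        "typeName": entity.get("typeName", ""),
        "name": attrs.get("name", ""),
        "qualifiedName": attrs.get("qualifiedName", ""),
    }


def write_csv(entities: Iterable[Dict], out: TextIO) -> int:
    """Write one CSV row per entity and return the number of rows written."""
    writer = csv.DictWriter(out, fieldnames=COLUMNS)
    writer.writeheader()
    count = 0
    for entity in entities:
        writer.writerow(flatten(entity))
        count += 1
    return count


if __name__ == "__main__":
    # Made-up sample record, only to show the output shape.
    sample = [{"guid": "g1", "attributes": {"name": "example.csv"}}]
    buffer = io.StringIO()
    write_csv(sample, buffer)
    print(buffer.getvalue())
```

To write a real file, open it with `open("purview_metadata.csv", "w", newline="")` (the file name is just an example) and pass the file object as out.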

I hope this helps! Let me know if you have any other questions.
Rob Sharpe 20 Reputation points

2024-05-30T08:26:40.15+00:00

@PRADEEPCHEEKATLA ,

Thank you very much for your reply, that is what I thought. One thing - I cannot see the queries that you put in the answer :(

I am so sorry to trouble you, but could you please post them again?

Thanks

Rob.
PRADEEPCHEEKATLA 90,226 Reputation points

2024-05-30T09:00:17.3766667+00:00

@Rob Sharpe - The query parameter in the Purview REST API allows you to filter the results based on specific criteria. In this case, you can use the query parameter to filter the results to only include assets that are sourced from S3.

The query parameter takes a string value that specifies the filter criteria. In the example I provided, the filter is sourceType%3Ds3, which specifies that the results should only include assets where the sourceType attribute is equal to s3.

The %3D in the filter criteria is an encoded form of the equals sign (=). This is because the = character has a special meaning in URLs, so it needs to be encoded to be used as part of a query parameter.

So, when you use the query parameter with the filter criteria sourceType%3Ds3, the Purview REST API will only return assets that are sourced from S3.
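
If you build the URL in Python, you do not have to write %3D by hand. Here is a small sketch using only the standard library; the filter string is the one from the example above, and the limit and offset values are just illustrative.

```python
from urllib.parse import quote, unquote, urlencode

raw_filter = "sourceType=s3"  # filter string taken from the example above

# quote() with safe="" percent-encodes reserved characters, including "=".
encoded = quote(raw_filter, safe="")
print(encoded)  # sourceType%3Ds3

# urlencode() builds the whole query string and encodes each value.
query_string = urlencode({"query": raw_filter, "limit": 1000, "offset": 0})
print(query_string)  # query=sourceType%3Ds3&limit=1000&offset=0

# unquote() reverses the encoding.
print(unquote(encoded))  # sourceType=s3
```

If you use the requests library instead, passing the same dictionary as `params=` to `requests.get` applies this encoding for you.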

I hope this helps! Let me know if you have any other questions.
Rob Sharpe 20 Reputation points

2024-05-30T14:03:27.2333333+00:00

@PRADEEPCHEEKATLA

thank you for the explanation.

So looking at your information, many thanks, the GET I would need to use to get everything from an S3 source, with a limit of 1000 and an offset of 24 would be

GET https://<account-name>.catalog.purview.azure.com/api/atlas/v2/search/basic?query=sourceType%3Ds3&limit=1000&offset=24

I am sorry - I have looked through the documentation for this command and cannot find it.

thanks

Rob.
PRADEEPCHEEKATLA 90,226 Reputation points

2024-06-06T07:06:35.61+00:00

@Rob Sharpe - We haven’t heard from you since the last response and were just checking back to see if you have a resolution yet. If you do, please share it with the community, as it can be helpful to others. Otherwise, reply with more details and we will try to help.

Accepted answer

PRADEEPCHEEKATLA 90,226 Reputation points

2024-06-03T06:58:13.44+00:00
@Rob Sharpe - To extract metadata from a data source system into Microsoft Purview Data Map, you can use REST APIs. You can use any programming language that supports REST APIs to extract metadata from Purview. Python is a good option as it has libraries that can help you make REST API calls.

To extract metadata from Purview into a file or set of files, you can use the REST APIs. You can use the GET /catalog/dataAssets/{dataAssetId}/metadata API to get all the metadata for a specific data asset. You can also use the GET /catalog/dataAssets/{dataAssetId}/metadata/{nameSpace} API to get metadata for a specific namespace.

To programmatically limit the metadata extraction to just S3, you can use the GET /catalog/dataAssets/{dataAssetId}/metadata/{nameSpace} API and specify the S3 namespace. For example, if the S3 namespace is awsS3, you can use the following API call to get metadata for just the S3 data asset:

GET /catalog/dataAssets/{dataAssetId}/metadata/awsS3

I hope this helps! Let me know if you have any other questions.

1 person found this answer helpful.
Rob Sharpe 20 Reputation points

2024-06-06T08:43:00.2766667+00:00

Hi @PRADEEPCHEEKATLA ,

I am happy with this answer - it gives me enough information to make a start on the process.

Thank you

Rob

PRADEEPCHEEKATLA 90,226 Reputation points

2024-06-06T09:02:25.5666667+00:00

@Rob Sharpe - Glad to know it helped. Please do continue to use MS Q&A platform for any question related to Azure!

Please don’t forget to Accept Answer and Yes for "was this answer helpful" wherever the information provided helps you, this can be beneficial to other community members.