Microsoft Purview Data Catalog Responsible AI FAQ

What is Microsoft Purview Data Catalog?

The Data Catalog is a Data Governance solution that enables business experts and technical data owners to collaborate and contribute to a shared understanding of data. Data Catalog enables inventory of metadata and provides a framework for the governance of metadata and the use of the underlying data. Data Catalog helps organizations create value from data, manage access to data, improve data quality, and monitor data health.

What are the Microsoft Purview Data Catalog's intended uses?

The intent of the Data Catalog is to help companies create value from their data and scale the management and governance of data. Data owners and stewards can create business concepts and data products to manage their most valuable data and make it available. They can apply data quality rules to continually improve the quality of the data. Data consumers such as data analysts, data scientists, and business users can discover data products within their enterprise data catalog and get access to the data for business decision making. Data officers can drive value creation from the data estate while applying common controls with federated accountability to ensure data is healthy and safe.

AI features in Data Catalog can help data professionals such as data owners and stewards, data consumers, and data officers to create value from their data while managing and governing their data following data governance best practices.

Natural language data product search (Preview)

Data consumers can use natural language to search for data products within their organization's data catalog. They can quickly find data products for a business purpose by describing the data they are looking for and its intended use, for example: "I need three years of revenue data from the finance department to analyze sales trends". Natural language search returns a list of data products within the organization's data catalog that best match the user's needs.

Natural language search currently supports data products created in English, French, and Spanish, with search terms in the corresponding language.

Get suggestions for Data Assets Mapping (Preview)

When creating a data product, you need to add associated data assets from the data map. You can browse and select data assets manually, or accelerate the process by using the AI system to auto-suggest data assets. The AI system provides suggestions using metadata of the data product, such as name, description, and use cases. It searches for relevant data assets within the data map and returns a prioritized list of the 10 data assets that best fit the data product's needs. You can then choose to select and add one or more data assets to your data product.

The AI output includes only data assets for which you have read permission in the data map. If you have many highly similar data assets, consider adding more metadata to them, such as detailed descriptions.
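
The following Python sketch illustrates the general shape of this kind of suggestion: filter to assets the user can read, score each one against the data product's metadata, and return the top 10. It is not how Data Catalog implements the feature. The DataAsset class, its readable flag, the suggest_assets function, and the token-overlap (Jaccard) scoring are all assumptions made for illustration.

```python
from __future__ import annotations

import re
from dataclasses import dataclass


@dataclass
class DataAsset:
    # Hypothetical shape of an asset; not the Purview data model.
    name: str
    description: str
    readable: bool  # assumed flag: does the user have read permission?


def tokens(text: str) -> set[str]:
    """Lowercase word tokens; a crude stand-in for real text matching."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def suggest_assets(product_text: str, assets: list[DataAsset], limit: int = 10) -> list[DataAsset]:
    """Rank readable assets by token overlap with the data product's metadata."""
    query = tokens(product_text)
    scored = []
    for asset in assets:
        if not asset.readable:
            continue  # assets the user cannot read are never suggested
        candidate = tokens(f"{asset.name} {asset.description}")
        union = query | candidate
        score = len(query & candidate) / len(union) if union else 0.0
        if score > 0:
            scored.append((score, asset))
    # Highest score first; keep only the top `limit` suggestions.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [asset for _, asset in scored[:limit]]
```

In this sketch, two assets with near-identical names and descriptions receive near-identical scores, which is why richer descriptions help separate highly similar assets.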

Get suggested Data Quality Rules (Preview)

Data quality rules can help improve the quality of data. You can create data quality rules and apply them to data assets manually, or accelerate the process by using the AI system to auto-generate rules. The AI system suggests data quality rules based on metadata collected from profiling the data assets. Data profiling collects a statistical representation of the data in a data asset by evaluating the values contained in the data. You can then choose to select and add one or more data quality rules to your data asset.

The AI-generated data quality rules are meant to be used as a starting point. You should always review and test the auto-generated data quality rules before incorporating them.
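
To make the idea of profiling concrete, the following Python sketch computes a small statistical profile for one column and derives candidate rules from it. It is an illustration under stated assumptions, not the Data Catalog implementation: the profile_column and suggest_rules functions, the chosen statistics, and the rule wording are all hypothetical.

```python
from __future__ import annotations


def profile_column(values: list) -> dict:
    """Summarize a column: row count, nulls, distinct values, and min/max.

    Assumes the non-null values are mutually comparable (for example, all numbers).
    """
    present = [v for v in values if v is not None]
    return {
        "rows": len(values),
        "nulls": len(values) - len(present),
        "distinct": len(set(present)),
        "min": min(present) if present else None,
        "max": max(present) if present else None,
    }


def suggest_rules(column: str, profile: dict) -> list[str]:
    """Turn a profile into candidate rules for a person to review."""
    rules = []
    non_null = profile["rows"] - profile["nulls"]
    if profile["rows"] and profile["nulls"] == 0:
        rules.append(f"{column} must not be empty")
    if non_null and profile["distinct"] == non_null:
        rules.append(f"{column} must be unique")
    if profile["min"] is not None:
        rules.append(f"{column} must be between {profile['min']} and {profile['max']}")
    return rules
```

Rules derived this way reflect only the values that were profiled. A range rule, for instance, would reject a legitimate new value outside the observed minimum and maximum, which is one reason suggested rules need review and testing before use.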

Get suggested Glossary Terms (Private Preview)

Business glossary terms enhance knowledge and context of data products. You can create glossary terms and add them to a governance domain manually, or accelerate the process by using the AI system to auto-generate glossary terms. The AI system provides suggestions based on metadata of the governance domain, such as name and description, combined with general internet knowledge. You can then choose to select and add one or more auto-generated glossary terms to your governance domain.

You should review the AI-generated glossary terms for accuracy, relevance, and format, and make sure they meet your organization's specific needs before adding them to the governance domain. If the AI-generated glossary terms are too general, try improving the governance domain description, for example by adding industry and segment information or intent. You can also use text inputs to guide the AI system.

This AI feature is in private preview and only available to select early access customers.

What are the limitations of AI features in Data Catalog? How can users minimize the impact of limitations when using the system?

  • Preview features aren't meant for production use and might have limited functionality.

  • Like any AI-powered technology, AI features in Data Catalog don't get everything right. However, you can help improve their responses by providing feedback.

  • AI features in Data Catalog are designed for data governance scenarios. Use of AI features outside the scope of data governance might result in responses that lack accuracy and comprehensiveness.

  • In scenarios where user text inputs are allowed, the system might not be able to process very long prompts, such as those with hundreds of thousands of characters.

How are AI features in Data Catalog evaluated? What metrics are used to measure performance?

AI features in Data Catalog underwent various types of testing before release. Testing included red teaming, which is the practice of testing the product to identify failure modes and scenarios that might cause AI to do or say things outside of its intended uses or that don't support the Microsoft AI Principles.

User feedback is critical in helping Microsoft improve the system. You have the option of providing feedback on whether an AI-generated response is useful and accurate, or inaccurate, incomplete, or unclear. Your feedback goes directly to Microsoft to help us improve the product's performance.

What operational factors and settings allow for effective and responsible use of AI features in Data Catalog?

  • You should always review AI-generated responses.

  • In scenarios where text inputs are supported, you can use natural language to describe what you would like the AI feature to do.

  • You can provide feedback about a response's quality, including reporting anything unacceptable to Microsoft.