Data discovery - Finding data for AI projects

Article
03/30/2024

This article discusses common data discovery challenges related to finding data for use in AI projects.

Discovery challenges for data consumers

Traditionally, discovering enterprise data sources has been an organic process based on communal knowledge. For companies that want the most value from their information assets, this approach presents many challenges:

Because there's no central location to register data sources, users might be unaware of a data source unless they come into contact with it as part of another process.
Unless users know the location of a data source, they can't connect to the data by using a client application. Data-consumption experiences require users to know the connection string or path.
The intended use of the data is hidden from users unless they know the location of a data source's documentation. Data sources and documentation might live in several places and be consumed through different kinds of experiences.
If users have questions about an information asset, they must locate the expert or team responsible for that data and engage them offline. There's no explicit connection between the data and the experts who understand the data's context.
Unless users understand the process for requesting access to the data source, discovering the data source and its documentation won't help them access the data.
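
The gaps listed above (location, documentation, expert contact, and access process) map onto the fields that a central registry of data sources would need to hold. The following is a minimal sketch of such a record, not the schema of any particular product. The names `CatalogEntry` and `Catalog`, all field names, and the example values are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """One registered data source (hypothetical schema for illustration)."""
    name: str
    connection: str       # connection string or path that clients need
    description: str      # intended use of the data
    docs_url: str         # where the documentation lives
    owner_contact: str    # expert or team responsible for the data
    access_request: str   # how to request access
    tags: set[str] = field(default_factory=set)


class Catalog:
    """In-memory registry so that users can find sources by keyword."""

    def __init__(self) -> None:
        self._entries: dict[str, CatalogEntry] = {}

    def register(self, entry: CatalogEntry) -> None:
        self._entries[entry.name] = entry

    def search(self, term: str) -> list[CatalogEntry]:
        # Case-insensitive match on name, description, or an exact tag.
        term = term.lower()
        return [
            e for e in self._entries.values()
            if term in e.name.lower()
            or term in e.description.lower()
            or term in {t.lower() for t in e.tags}
        ]


catalog = Catalog()
catalog.register(CatalogEntry(
    name="sales_orders",
    connection="Server=example-server;Database=sales",  # placeholder value
    description="Daily order facts for revenue reporting",
    docs_url="https://example.com/docs/sales_orders",   # placeholder URL
    owner_contact="sales-data-team@example.com",        # placeholder contact
    access_request="Submit a request to the owning team",
    tags={"finance", "orders"},
))
for entry in catalog.search("orders"):
    print(entry.name, entry.owner_contact)
```

Keeping the owner contact and access process on the same record as the connection details is the point of the sketch: a user who finds the source also finds who to ask and how to get in.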

Discovery challenges for data producers

Although data consumers face the previously mentioned challenges, users who are responsible for producing and maintaining information assets face challenges of their own:

Annotating data sources with descriptive metadata is often a lost effort. Client applications typically ignore descriptions that are stored in the data source.
Creating documentation for data sources can be difficult and it's an ongoing responsibility to keep documentation in sync with data sources. Users might not trust documentation that's perceived as being out of date.
Creating and maintaining documentation for data sources is complex and time-consuming. Making that documentation readily available to everyone who uses the data source can be even more so.
Restricting access to data sources and ensuring that data consumers know how to request access is an ongoing challenge.

All these challenges present a significant barrier for companies that want to encourage and promote the use of enterprise data.

Discovery challenges for security administrators

Users who are responsible for securing their organization's data might face any of the consumer and producer challenges listed above, along with the following additional challenges:

An organization's data is constantly growing and is stored and shared in new ways. The task of discovering, protecting, and governing sensitive data never ends. You need to ensure that your organization's content is shared with the correct people and applications, and with the correct permissions.
Understanding the risk levels in your organization's data requires inspecting it in depth for keywords, regular expression (RegEx) patterns, and sensitive data types. For example, sensitive data types might include credit card numbers, Social Security numbers, or bank account numbers. You must constantly monitor all data sources for sensitive content, because even a small amount of data loss can be critical to your organization.
Ensuring compliance with corporate security policies can be a challenge. As content grows and policies are updated to address evolving digital realities, security administrators need to ensure data security as quickly as possible.
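
As a rough illustration of scanning content for RegEx patterns and sensitive data types, the sketch below flags text that looks like a US Social Security number or a credit card number. The patterns and the function names `luhn_valid` and `scan_text` are assumptions made for this example. Production classifiers use far more rules and context checks, so treat this as a starting point only.

```python
import re

# Simplified, illustrative patterns; real classifiers are much stricter.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
# 13 to 19 digits, optionally separated by single spaces or hyphens.
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(c) for c in number if c.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:      # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return len(digits) >= 13 and total % 10 == 0


def scan_text(text: str) -> list[tuple[str, str]]:
    """Return (type, match) pairs for suspected sensitive values."""
    findings = [("ssn", m.group()) for m in SSN_RE.finditer(text)]
    for m in CARD_RE.finditer(text):
        # The checksum filters out digit runs that are not card numbers.
        if luhn_valid(m.group()):
            findings.append(("credit_card", m.group()))
    return findings


sample = "Card 4111 1111 1111 1111, SSN 123-45-6789, order 1234 5678 9012 3456"
for kind, value in scan_text(sample):
    print(kind, value)
```

Pairing the card pattern with a checksum reduces false positives from order numbers and other long digit strings, which matters when every data source is monitored continuously.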

Microsoft Purview provides capabilities to help address these challenges.

Data discovery from an MLOps perspective

Machine learning (ML) can enhance the data discovery process. With ML techniques, data discovery can surface relationships between data and accelerate an organization's understanding of its data.

When ML-assisted discovery is coupled with visualizations, data analysts and business domain experts can quickly derive insights from previously unexplored data.

The following are examples of how ML can address typical challenges in the data discovery process:

Propose data preparation steps such as normalization and handling of missing data
Infer relationships between unstructured data types such as documents, video and images, which are difficult to work with
Detect Personally Identifiable Information (PII) and other types of sensitive data
Enrich and index data so that users can easily search and discover it on their own
Perform automatic data translation so that the data is accessible to users in different languages
Identify outliers and patterns in the data
Detect anomalies in the data
Improve understanding of the behavioral data of users and customers, and generate recommendations from it
Discover more data sources, which may be useful to ML practitioners

Some useful resources for data discovery:

Resource	Description
Data Discovery Toolkit - Unstructured data	A repository containing guidance and code assets that use various Machine Learning techniques to discover insights in unstructured data such as documents, images and videos.
AI Enrichment Pipeline tutorial	A complete sample for processing text, image, and video files through a full enrichment pipeline with Event Grid, Service Bus, Functions, Logic Apps, Cognitive Services, and Video Indexer.
Azure AI Search - An AI-first approach to content understanding	This project demonstrates how you can use both the built-in and custom AI in AI Search. AI Search ingests your data from almost any data source, enriches it by using a set of cognitive skills that extract knowledge, and then lets you explore the data by using Search.
Azure AI Search Powerskills	Power Skills are a collection of useful functions to be deployed as custom skills for Azure AI Search.
Microsoft Presidio	Presidio can help identify sensitive data and PII in structured and unstructured text.
PII Detection Cognitive Skill	The PII Detection skill extracts personal information from an input text and gives you the option of masking it.
End to end Knowledge Mining for Video	A video discovery pipeline that includes Azure Search and user feedback.

ℹ️ Refer to the Data: Data Discovery section for more information tailored to a Data Engineer or Data Governance role.
