Data Discovery solution for unstructured data

Every machine learning project requires a deep understanding of the data. Successful AI solutions need training data that is representative of real-world data. Data must also be sufficient to deliver the desired outcome. As AI teams learn about the data, they get a sense of what outcomes are possible and which approaches to follow on the project.

AI teams learn about data during Exploratory Data Analysis (EDA). During this phase, the data is cleaned, outliers are identified, and the suitability of the data is assessed. This work helps teams create initial hypotheses and plan experiments to test them.

Business problem

Customers have large volumes of unstructured data, such as documents, scanned forms, videos, and images, that contain a wealth of information. Performing EDA on such unstructured data is challenging and time-consuming: there are many approaches to follow and many tools to use, and exploring and analyzing the data can easily occupy a data science team for weeks or months. The Data Discovery Toolkit streamlines EDA for unstructured data by applying various machine learning techniques to discover insights in the data.

Solution

Unstructured data consists of artifacts like documents, images and videos. Examples include content from news and social media sites and legal discovery content.

The toolkit's repo provides well-documented code that is ready to use.

Value proposition

The Data Discovery toolkit provides value in many areas:

Accelerates time to value by:

  • Accelerating the Exploratory Data Analysis (EDA) phase.
  • Validating that the data is representative of the business problem.
  • Rapidly providing a labeled dataset for ML experimentation.

Enables EDA at scale by:

  • Providing Azure Synapse Notebooks to allow performant access to large volumes of data by taking advantage of the in-memory and distributed nature of Azure Synapse and Spark.
  • Scaling MLOps best practices gained from multiple deployments, and tracking all experiments, parameters, and hyperparameters.
  • Commoditizing common data science functions, both for consistency and for cases where no data scientist is available.

Makes tools and analysis accessible by:

  • Allowing a domain expert or product owner to rapidly access the data to see broader patterns and insights.
  • Allowing non-technical interactive access to the data.
  • Facilitating communication between customer and project teams.

Logical architecture

The Data Discovery toolkit breaks EDA down into a sequence of easily understood tasks, as shown in the following logical architecture diagram.

(Diagram: Data Discovery Process)

Implementation

The Data Discovery toolkit is provided as a GitHub repo that you can clone and begin using:

:fontawesome-brands-github: View GitHub repo{ .md-button .md-button--primary }

The Data Discovery toolkit provides code to quickly discover and label data, usually as part of the Exploratory Data Analysis phase of a project. The overall approach is to take a large unstructured dataset that has no labels and to iterate over it with various techniques to aggregate, cluster, and ultimately label the data in a cost-effective and timely manner. Labeling is achieved by using these processes (see the sketch after this list):

  • Unsupervised ML clustering algorithms
  • Heuristic approaches
  • Direct input and validation by a domain expert
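
The following is a minimal, illustrative sketch of how heuristic rules and unsupervised clustering can combine to label an unlabeled corpus. It is not the toolkit's own code; it assumes scikit-learn is available, and the documents, keyword rules, and labels are hypothetical.

```python
# Illustrative sketch (not the toolkit's code): label an unlabeled text corpus by
# combining simple heuristic keyword rules with unsupervised K-Means clustering.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

documents = [
    "invoice total due by the end of the month",
    "please find the attached purchase order",
    "minutes from the quarterly board meeting",
    "agenda for next week's planning session",
]

# 1. Heuristic pass: keyword rules classify documents where a rule clearly applies.
rules = {"invoice": "finance", "purchase order": "finance"}
labels = [next((lbl for kw, lbl in rules.items() if kw in doc), None) for doc in documents]

# 2. Unsupervised pass: cluster the documents the heuristics could not label.
remaining = [doc for doc, lbl in zip(documents, labels) if lbl is None]
if remaining:
    vectors = TfidfVectorizer(stop_words="english").fit_transform(remaining)
    clusters = KMeans(n_clusters=2, random_state=0).fit_predict(vectors)
    # 3. A domain expert reviews each cluster, assigns a semantically relevant label,
    #    and that label is propagated back to the underlying records.
    print(list(zip(remaining, clusters)))
```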

For text-based data, you can also ask questions of the data in natural language by using the semantic search features in Azure AI Search.
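
As a rough illustration, the snippet below shows one way to issue such a natural-language query with the azure-search-documents Python SDK. The endpoint, key, index name, semantic configuration name, and field names are placeholders and depend on your own search service and index schema.

```python
# Illustrative sketch: a semantic (natural-language) query against Azure AI Search.
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

client = SearchClient(
    endpoint="https://<your-search-service>.search.windows.net",
    index_name="discovered-documents",          # placeholder index name
    credential=AzureKeyCredential("<query-key>"),
)

results = client.search(
    search_text="Which contracts mention early termination penalties?",
    query_type="semantic",
    semantic_configuration_name="default",      # must match a configuration on the index
    top=5,
)

for doc in results:
    # "id" is a placeholder field; available fields depend on your index schema.
    print(doc.get("id"), doc.get("@search.reranker_score"))
```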

By combining these approaches, you can apply structure and labels to large datasets so that the data can then be used for either:

  • Discovery via a search solution such as Azure AI Search.
  • Training a supervised ML model so that future unseen data can be classified accordingly (see the training sketch after this list).
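
As a rough sketch of the second option, the snippet below trains a simple supervised text classifier on data that has already been labeled. It assumes scikit-learn, and the documents and labels are hypothetical stand-ins for the output of the labeling process.

```python
# Illustrative sketch: train a supervised text classifier on the now-labeled corpus
# so that future unseen documents can be classified automatically.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

labeled_docs = [
    "invoice total due",
    "purchase order attached",
    "board meeting minutes",
    "planning session agenda",
]
labels = ["finance", "finance", "meetings", "meetings"]

model = make_pipeline(TfidfVectorizer(stop_words="english"), LogisticRegression(max_iter=1000))
model.fit(labeled_docs, labels)

# Classify an unseen document; expected to print ['finance'] for this toy data.
print(model.predict(["new invoice for consulting services"]))
```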

The following steps illustrate this approach at a high level for a text-based problem where large amounts of unstructured data exist:

  1. Cluster and explore the data quickly in the generated interactive Power BI report.
  2. Ask specific questions of your data from within the Synapse notebooks by using Azure AI Search and Azure SynapseML.
  3. Assess the data to determine whether some simple heuristics can be applied to classify it with a semantically relevant term (see the Heuristics notebook).
  4. Apply the heuristic classification to the underlying data and remove the classified data from the larger corpus.
  5. Run text clustering on the remaining data and generate word clouds; iterate until an ideal number of clusters emerges and the clusters make sense to a domain expert.
  6. Have the domain expert assess the word clouds in more detail and make obvious corrections by programmatically moving terms between clusters.
  7. Have the domain expert label each cluster with a semantically relevant term, which is then programmatically propagated to the underlying records in the dataset (a small sketch of the word cloud and label propagation steps follows this list).
  8. Merge steps 1 and 6, which now allows a classification model to be trained.
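
The sketch below illustrates the word cloud and label propagation steps in miniature. It is not the toolkit's implementation; it assumes pandas and the wordcloud package, and the column names, clusters, and labels are hypothetical.

```python
# Illustrative sketch: generate one word cloud per cluster for expert review, then
# propagate the expert's cluster labels to the underlying records.
import pandas as pd
from wordcloud import WordCloud

df = pd.DataFrame({
    "text": [
        "board meeting minutes",
        "planning session agenda",
        "invoice total due",
        "purchase order attached",
    ],
    "cluster": [0, 0, 1, 1],
})

# One word cloud per cluster, saved as an image for the domain expert to review.
for cluster_id, group in df.groupby("cluster"):
    WordCloud(width=400, height=200).generate(" ".join(group["text"])).to_file(f"cluster_{cluster_id}.png")

# After review, the expert's semantically relevant cluster labels are propagated to every record.
expert_labels = {0: "meetings", 1: "finance"}   # supplied by the domain expert
df["label"] = df["cluster"].map(expert_labels)
print(df)
```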

(Screenshot: Power BI dashboard)