Engineering usable data for AI projects

Data must be carefully engineered if an AI project is to be successful. Data engineering includes data normalization, pre-processing, enrichment and other forms of data preparation.

Data ingestion

Data ingestion is the process of extracting data from one or more sources and preparing it for training an ML model. This process can be time-intensive when performed manually or when dealing with substantial amounts of data, and it covers both structured and unstructured data in different formats from varying source types.
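
As a rough illustration, the sketch below ingests a structured CSV alongside unstructured text documents and joins them on a shared key. It is a minimal sketch, not a production pipeline; the paths and column names (`transactions.csv`, `record_id`) are hypothetical.

```python
from pathlib import Path

import pandas as pd

# Hypothetical local paths; in practice these might point at a database
# export, an API dump, or files synced from Azure Blob Storage.
STRUCTURED_SOURCE = Path("data/raw/transactions.csv")
UNSTRUCTURED_SOURCE = Path("data/raw/documents")


def ingest() -> pd.DataFrame:
    """Pull structured and unstructured records into one training frame."""
    # Structured data: tabular rows that map directly onto DataFrame columns.
    frame = pd.read_csv(STRUCTURED_SOURCE)

    # Unstructured data: free-form text files keyed by a shared id so they
    # can be joined onto the structured rows.
    texts = {
        path.stem: path.read_text(encoding="utf-8")
        for path in UNSTRUCTURED_SOURCE.glob("*.txt")
    }
    frame["document_text"] = frame["record_id"].astype(str).map(texts)
    return frame


if __name__ == "__main__":
    print(ingest().head())
```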

ℹ️ Refer to the Data: Data Ingestion section for more information tailored to a Data Engineering/Governance role.

Data discovery

Data discovery is the collection and evaluation of data from disparate sources with the goal of deriving business value from that data.

For more information, see Data discovery: Finding data sources for AI projects.

Data enrichment

Data enrichment is a general term for processes that enhance, refine, or otherwise improve raw data. Within the context of MLOps, we use it to refer to the process of enriching data using ML models and techniques.
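
As one concrete example of model-based enrichment, the sketch below attaches a predicted sentiment label to each raw record using the Azure AI Language sentiment API. It is a minimal sketch; the `LANGUAGE_ENDPOINT`/`LANGUAGE_KEY` environment variables and the record shape are assumptions, not fixed conventions.

```python
import os

from azure.ai.textanalytics import TextAnalyticsClient
from azure.core.credentials import AzureKeyCredential

# Endpoint and key come from a hypothetical Azure AI Language resource.
client = TextAnalyticsClient(
    endpoint=os.environ["LANGUAGE_ENDPOINT"],
    credential=AzureKeyCredential(os.environ["LANGUAGE_KEY"]),
)


def enrich_with_sentiment(records: list[dict]) -> list[dict]:
    """Attach a model-predicted sentiment label to each raw record."""
    results = client.analyze_sentiment([r["text"] for r in records])
    for record, result in zip(records, results):
        # Failed documents are kept but left unlabeled.
        record["sentiment"] = None if result.is_error else result.sentiment
    return records
```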

Case studies: Aggregating data for AutoML Image Object Detection

To train a computer vision model such as AutoML Image Object Detection, you need labeled training data. The images need to be uploaded to Azure Blob Storage, and the label annotations for each image need to be in JSONL format. Once all images are labeled, you can perform data aggregation, which transforms these per-image label annotations into a single JSONL file. From that file you can create an MLTable that serves as the data input for the Azure ML model.

During this step, data aggregation transforms raw data into meaningful, useful information that can be used to train, test, and evaluate the machine learning model.
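
To make the aggregation step concrete, the sketch below merges per-image JSONL annotation files into a single JSONL file and writes an MLTable definition next to it. The directory layout (`labels/`, `training-mltable-folder/`) is hypothetical, and the MLTable schema shown follows the published AutoML image task convention; adapt both to your project.

```python
import json
from pathlib import Path

# Hypothetical layout: one JSONL annotation file per labeled image,
# produced by the labeling step.
ANNOTATIONS_DIR = Path("labels")
AGGREGATED_FILE = Path("training-mltable-folder/train_annotations.jsonl")

# MLTable definition pointing AutoML at the aggregated annotations; the
# image_url column is converted to stream_info so images are streamed
# from storage during training.
MLTABLE_CONTENT = """paths:
  - file: ./train_annotations.jsonl
transformations:
  - read_json_lines:
      encoding: utf8
      invalid_lines: error
      include_path_column: false
  - convert_column_types:
      - columns: image_url
        column_type: stream_info
"""


def aggregate_annotations() -> None:
    """Merge per-image JSONL annotations into one file plus an MLTable."""
    AGGREGATED_FILE.parent.mkdir(parents=True, exist_ok=True)
    with AGGREGATED_FILE.open("w", encoding="utf-8") as out:
        # Each per-image file holds one or more JSONL annotation records;
        # re-serialize them into a single file for AutoML to consume.
        for annotation_file in sorted(ANNOTATIONS_DIR.glob("*.jsonl")):
            for line in annotation_file.read_text(encoding="utf-8").splitlines():
                if line.strip():
                    out.write(json.dumps(json.loads(line)) + "\n")
    (AGGREGATED_FILE.parent / "MLTable").write_text(MLTABLE_CONTENT, encoding="utf-8")


if __name__ == "__main__":
    aggregate_annotations()
```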

Refer to the AML v2 P&ID symbol detection train sample project for an example implementation of an AML workflow that trains a symbol detection model. It includes a data aggregation step that transforms the stored image and label datasets into a format the AutoML training job can consume.

Case studies: Normalizing data for Form Recognizer

In the following example, scanned images require enrichment to handle noise or poor-quality scans that impact the usability of an OCR model. Image quality also directly affects the accuracy of data extraction from the form. Data normalization is part of preparing data for training an ML model: it transforms raw ingested data into a consistent format that downstream steps, tasks, and processes can readily consume.

During this step, the data is normalized and de-noised for improved results, and evaluated to determine how it can be segmented.
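
As an illustration of the kind of normalization involved, the sketch below converts a scanned form to grayscale, removes noise, and binarizes it before OCR. It is a minimal sketch using OpenCV; the parameters and file paths are hypothetical, and the repo's actual pre-processing may differ.

```python
from pathlib import Path

import cv2


def normalize_scan(source: Path, destination: Path) -> None:
    """Normalize a scanned form image before it is sent to OCR."""
    image = cv2.imread(str(source), cv2.IMREAD_GRAYSCALE)

    # Remove scanner noise while preserving text edges.
    denoised = cv2.fastNlMeansDenoising(image, None, 10)

    # Binarize with Otsu's method so downstream OCR sees consistent
    # black-on-white text regardless of scan exposure.
    _, binary = cv2.threshold(denoised, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

    cv2.imwrite(str(destination), binary)


if __name__ == "__main__":
    # Hypothetical input/output paths.
    normalize_scan(Path("scans/form-001.png"), Path("normalized/form-001.png"))
```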

Refer to the Pre-Processing step in the auto-labeling section of the forms knowledge extraction repo.

Case studies: Enriching data

The following repository showcases a collection of small, discrete data enrichment functions built on a variety of infrastructure. The functions are built for Azure AI Search, but they can be used in any data enrichment pipeline. These PowerSkills expose a standard API interface so they can be consumed consistently.
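
That standard interface is the Azure AI Search custom skill contract: a JSON body of `values` records keyed by `recordId` goes in, and a matching shape comes back. Below is a minimal sketch of that contract, using Flask for brevity; the `enrich` function is a placeholder for a real enrichment model or service.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)


def enrich(record_data: dict) -> dict:
    # Placeholder enrichment: upper-case the incoming text field.
    # A real PowerSkill would call a model or external service here.
    return {"text": record_data.get("text", "").upper()}


@app.post("/api/enrich")
def custom_skill():
    """Handle the Azure AI Search custom skill request/response shape."""
    body = request.get_json()
    results = []
    for value in body.get("values", []):
        # Each record is processed independently and matched back to the
        # indexer by its recordId.
        results.append(
            {
                "recordId": value["recordId"],
                "data": enrich(value.get("data", {})),
                "errors": [],
                "warnings": [],
            }
        )
    return jsonify({"values": results})
```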

Refer to the Azure Search PowerSkills for assets, guidance and examples.