Understanding data curation and management for AI projects
Data is vital to the success of an AI project: the quality of a model depends on the quality of its training data. Unlike traditional software development, production-level AI solutions couple software engineering with data-based models.
Before an AI-based solution can be implemented and operationalized for production, specific data fundamentals, known as DataOps, have to be in place. DataOps is a lifecycle approach to data management. It uses agile practices to orchestrate tools, code, and infrastructure to quickly deliver high-quality data with improved security.
Pragmatically stated, DataOps aims to reduce the lifecycle cost of data management while increasing the value that data generates for an organization. Like MLOps, DataOps is concerned with the automation and testing of both code and data. Both share the goal of delivering business value from data.
There are large areas of overlap between DataOps and MLOps, but they are nuanced by role. For example, both data and AI engineers deal with synthesizing data, but their focuses differ. A data engineer may emphasize testing data privacy controls, while an AI engineer concentrates on training AI models that accurately represent real data and do not reveal personally identifiable information (PII).
💡 Key outcomes of data curation:
- The sources of data have been identified and the data can be ingested for curation.
- Data sensitivity, security and privacy controls are in place to protect data, and project engineers can only access data for which they have permission.
- Normalization, transformation, and other required pre-processing have been at least identified, and potentially applied, to the data.
We will cover a few key areas of DataOps and their significance to MLOps.
Data governance
Data governance is a set of capabilities that ensure the availability, reliability, and security of high-quality data throughout its lifecycle. It involves implementing data controls to support business objectives. It also encompasses the people, processes and technologies required to discover, manage and protect internal data assets.
For more detail, refer to AI data governance
DevOps for data
DevOps can be defined as the union of people, process, and products to enable continuous delivery of value to the business.
DevOps for data is concerned with the orchestration and automation of data pipelines that convert raw data to data of value to an organization.
ℹ️ For more detail, refer to the Data: DevOps for Data section.
Data requirements for an AI project
Before an AI project is undertaken and any fundamental data requirements are implemented, the team needs clear, and where applicable affirmative, answers to each of the following questions:
- Can the data be accessed both for training and inference purposes?
- Is the data siloed or centralized?
- Is only the correct data accessible? Are data sensitivity and privacy controls in place?
- Is the data representative of the problem to be solved?
- Is a Subject Matter Expert (SME) who understands the data available throughout the project?
- Does the project team understand the data?
Data engineering
Data engineering covers all aspects of data normalization, pre-processing, and other forms of data preparation. It is a prerequisite for a successful ML project.
For more detail, refer to Data engineering for AI projects
Data labeling
Data labeling is the process of adding metadata and information to existing data.
For more detail, see Labeling AI data
Data versioning
Data versioning tracks datasets as they evolve over time, allowing us to identify what version of a dataset was used to train a given model and thus enabling reproducibility.
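As a minimal sketch of the idea, assuming the dataset is a single file on disk, a content hash can serve as a version identifier that is stored next to the trained model. The function names, the manifest layout, and the model name are hypothetical; dedicated data versioning tools offer far more than this.

```python
import hashlib
import json
from pathlib import Path


def dataset_fingerprint(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a dataset file's bytes."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read in chunks so large datasets do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_training_manifest(dataset: Path, model_name: str, manifest: Path) -> dict:
    """Record which dataset version a (hypothetical) model was trained on."""
    record = {
        "model": model_name,
        "dataset_file": dataset.name,
        "dataset_sha256": dataset_fingerprint(dataset),
    }
    manifest.write_text(json.dumps(record, indent=2))
    return record
```

Any change to the dataset file produces a different fingerprint, so comparing the stored hash with the current one shows whether a model can still be reproduced from the data at hand.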
For more detail, see Exploratory data analysis: Data versioning
Data privacy
Data privacy ensures that sensitive data is properly managed and governed.
For more detail, refer to AI data privacy
Data security
Data Security Governance (DSG) comprises the policies and processes that deal specifically with protecting corporate data, in both structured database and unstructured file-based forms. It greatly affects the ability to access the data on which the model will be trained.
For more detail, see Data privacy and security for AI
Customer readiness: Data curation
Data curation is an important responsibility for data engineers, who must clean raw data ingested from multiple sources. Typical data problems include duplicates, invalid data, inconsistency, non-compliance with standard formats, and the presence of PII data.
Having high confidence in data quality and compliance with data access rules is crucial for the success of AI projects.
Data engineers find that different customers have different levels of maturity when it comes to data curation. Engineers should assess each customer's capability to identify and classify data, to assess data quality, and to protect data security. If a customer does not understand these aspects of the data that has been ingested, or is still developing this capability, data engineers should work with the customer to verify data quality and security before using the data for the AI project. If a customer has low confidence in the available data, there is a risk that AI models will not deliver expected or usable results.
Standardizing data
Data from multiple sources can be in different formats. For example, dates can be in dd/mm/yyyy format from one source and mm/dd/yy from another. It is important that the customer and their engineers are able to provide the insights that enable data to be standardized.
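The date example can be sketched as follows, assuming the customer has confirmed which format each source uses. The source names and the choice of ISO 8601 as the target format are illustrative assumptions.

```python
from datetime import datetime

# Hypothetical source names; each maps to the date format the customer confirmed.
SOURCE_DATE_FORMATS = {
    "source_a": "%d/%m/%Y",  # dd/mm/yyyy
    "source_b": "%m/%d/%y",  # mm/dd/yy
}


def standardize_date(raw: str, source: str) -> str:
    """Parse a raw date string using its source's format and return ISO 8601."""
    fmt = SOURCE_DATE_FORMATS[source]
    return datetime.strptime(raw.strip(), fmt).date().isoformat()


print(standardize_date("03/04/2023", "source_a"))  # 2023-04-03 (3 April)
print(standardize_date("03/04/23", "source_b"))    # 2023-03-04 (4 March)
```

The two inputs look almost identical but denote different days, which is why the format has to come from someone who knows the source rather than being guessed from the values.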
Identifying PII
Some data may contain personally identifiable information (PII) about consumers, customers, or employees. This data should be obfuscated, masked, or even encrypted. Customers should be able to identify their PII and obfuscate or mask it before the source data is ingested for use as part of the AI project. Customers might also be able to generate synthetic data to replace PII.
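A minimal sketch of two common approaches, using only the Python standard library: masking replaces PII with a placeholder, while pseudonymization replaces it with a keyed hash so that records can still be joined. The regular expression is a deliberately simple assumption that will miss some valid addresses, and the function names and placeholder are hypothetical.

```python
import hashlib
import hmac
import re

# Simplified email pattern; an assumption for illustration, not a full validator.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def mask_emails(text: str, placeholder: str = "[EMAIL]") -> str:
    """Replace anything that looks like an email address with a placeholder."""
    return EMAIL_PATTERN.sub(placeholder, text)


def pseudonymize(value: str, secret_key: bytes) -> str:
    """Replace an identifier with a keyed hash so records stay joinable."""
    return hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()


print(mask_emails("Contact jane.doe@example.com today"))  # Contact [EMAIL] today
```

With `pseudonymize`, the same input and key always yield the same token, so the key must be kept secret and outside the curated dataset.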
For more information, see Obfuscation and Masking.
Data integrity and uniformity
Rules for data uniformity and correctness are important aspects of customer readiness. Customers should be able to determine rules for uniformity. For example, 'United States' from one source can be mapped to 'US' from another source, and vice versa.
While performing the data curation exercise, the customer should also be aware of domain integrity rules, so that the data respects valid ranges, uniqueness, and relational integrity.
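These rules can be sketched as follows. The alias table, the field names (`order_id`, `quantity`, `customer_id`), and the quantity limit are hypothetical and would be supplied by the customer in practice.

```python
# Hypothetical mapping agreed with the customer: variant spellings -> canonical code.
COUNTRY_ALIASES = {"united states": "US", "usa": "US", "us": "US"}


def canonical_country(raw: str) -> str:
    """Map a raw country value to its canonical code, or fail loudly."""
    key = raw.strip().lower()
    if key not in COUNTRY_ALIASES:
        raise ValueError(f"No uniformity rule for country value: {raw!r}")
    return COUNTRY_ALIASES[key]


def check_integrity(orders, known_customer_ids, max_quantity=1000):
    """Return a list of (order_id, problem) tuples for rule violations."""
    problems = []
    seen_ids = set()
    for order in orders:
        order_id = order["order_id"]
        # Uniqueness: each order_id may appear only once.
        if order_id in seen_ids:
            problems.append((order_id, "duplicate order_id"))
        seen_ids.add(order_id)
        # Range: quantity must fall inside the agreed bounds.
        if not 1 <= order["quantity"] <= max_quantity:
            problems.append((order_id, "quantity out of range"))
        # Relational integrity: the customer must exist in the reference set.
        if order["customer_id"] not in known_customer_ids:
            problems.append((order_id, "unknown customer_id"))
    return problems
```

Raising an error for an unmapped country value, rather than passing it through, is a design choice that surfaces missing uniformity rules early in curation.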