This article is a solution idea. If you'd like us to expand the content with more information, such as potential use cases, alternative services, implementation considerations, or pricing guidance, let us know by providing GitHub feedback.
This article presents a solution that enriches text and image documents by using image processing, natural language processing, and custom skills to capture domain-specific data. Azure Cognitive Search with AI enrichment can help identify and explore relevant content at scale. This solution uses AI enrichment to extract meaning from the original complex, unstructured JFK Assassination Records (JFK Files) dataset.
The above diagram illustrates the process of passing the unstructured JFK Files dataset through the Azure Cognitive Search skills pipeline to produce structured, indexable data:
- Unstructured data in Azure Blob Storage, such as documents and images, is ingested into Azure Cognitive Search.
- The document cracking step initiates the indexing process by extracting images and text from the data, followed by content enrichment. The enrichment steps that occur in this process depend on the data and type of skills selected.
- Built-in skills based on the Computer Vision and Language Service APIs enable AI enrichments including image optical character recognition (OCR), image analysis, text translation, entity recognition, and full-text search.
- Custom skills support scenarios that require more complex AI models or services. Examples include Form Recognizer, Azure Machine Learning models, and Azure Functions.
- Following the enrichment process, the indexer saves the outputs into a search index that contains the enriched and indexed documents. Full-text search and other query forms can use this index.
- The enriched documents can also be projected into a knowledge store, which downstream apps like knowledge mining or data science apps can use.
- Queries access the enriched content in the search index. The index supports custom analyzers, fuzzy search queries, filters, and a scoring profile to tune search relevance.
- Any application that connects to Blob Storage or to Azure Table Storage can access the knowledge store.
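The query step in the preceding list can be sketched as a request body for the Search Documents REST operation. This is a minimal sketch, not the JFK Files project's code: the field name `entities` and the scoring profile name `boostEntities` are hypothetical, and the helper `build_search_request` is introduced only for illustration. It assumes the full Lucene query syntax (`queryType` set to `full`), where appending `~N` to a term requests a fuzzy match.

```python
import json


def build_search_request(text, fuzzy_distance=1, filter_expr=None,
                         scoring_profile=None, top=10):
    """Build a JSON body for a search query (illustrative helper, not an SDK call)."""
    # Lucene fuzzy search accepts an edit distance between 0 and 2.
    if fuzzy_distance not in (0, 1, 2):
        raise ValueError("fuzzy_distance must be 0, 1, or 2")

    # Append ~N to each term so that near-miss spellings (common in OCR output) still match.
    terms = [f"{term}~{fuzzy_distance}" for term in text.split()]

    body = {
        "search": " ".join(terms),
        "queryType": "full",   # full Lucene syntax is required for the ~ operator
        "top": top,
        "count": True,
    }
    if filter_expr:
        body["filter"] = filter_expr            # OData filter expression
    if scoring_profile:
        body["scoringProfile"] = scoring_profile  # name of a profile defined on the index
    return body


# Hypothetical field and profile names, used only to show the shape of the request.
request = build_search_request(
    "oswald mexico",
    filter_expr="entities/any(e: e eq 'Mexico City')",
    scoring_profile="boostEntities",
)
print(json.dumps(request, indent=2))
```

The resulting body would be sent with an HTTP POST to the index's `docs/search` endpoint along with an API key. The sketch leaves that call out so that it runs without an Azure subscription.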
Azure Cognitive Search works with other Azure components to provide this solution.
Azure Cognitive Search
Azure Cognitive Search indexes the content and powers the user experience in this solution. Azure Cognitive Search can apply pre-built cognitive skills to the content, and the extensibility mechanism can add custom skills for specific enrichment transformations.
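To make the combination of pre-built and custom skills more concrete, the following sketch assembles a skillset definition as a Python dictionary in the shape that the skillset REST API accepts. It is an assumption-laden illustration, not the skillset that the JFK Files project ships: the skillset name, the target names (`merged_text`, `people`, `cryptonyms`), and the custom skill URI are all hypothetical.

```python
import json


def build_skillset(name, custom_skill_uri):
    """Return a skillset definition: OCR -> merge text -> entity recognition -> custom skill."""
    # Built-in OCR skill. It runs once per image that document cracking extracted.
    # (This assumes the indexer is configured to generate normalized images.)
    ocr = {
        "@odata.type": "#Microsoft.Skills.Vision.OcrSkill",
        "context": "/document/normalized_images/*",
        "inputs": [{"name": "image", "source": "/document/normalized_images/*"}],
        "outputs": [{"name": "text", "targetName": "text"}],
    }
    # Built-in merge skill. It inserts the OCR text back into the document's own text.
    merge = {
        "@odata.type": "#Microsoft.Skills.Text.MergeSkill",
        "context": "/document",
        "inputs": [
            {"name": "text", "source": "/document/content"},
            {"name": "itemsToInsert", "source": "/document/normalized_images/*/text"},
            {"name": "offsets", "source": "/document/normalized_images/*/contentOffset"},
        ],
        "outputs": [{"name": "mergedText", "targetName": "merged_text"}],
    }
    # Built-in entity recognition skill over the merged text.
    entities = {
        "@odata.type": "#Microsoft.Skills.Text.V3.EntityRecognitionSkill",
        "context": "/document",
        "categories": ["Person", "Location", "Organization"],
        "inputs": [{"name": "text", "source": "/document/merged_text"}],
        "outputs": [{"name": "persons", "targetName": "people"}],
    }
    # Custom skill. The indexer calls an external web API, for example an Azure function.
    custom = {
        "@odata.type": "#Microsoft.Skills.Custom.WebApiSkill",
        "context": "/document",
        "uri": custom_skill_uri,
        "httpMethod": "POST",
        "batchSize": 1,
        "inputs": [{"name": "text", "source": "/document/merged_text"}],
        "outputs": [{"name": "cryptonyms", "targetName": "cryptonyms"}],
    }
    return {
        "name": name,
        "description": "Illustrative enrichment pipeline",
        "skills": [ocr, merge, entities, custom],
    }


# Hypothetical name and URI, for illustration only.
skillset = build_skillset("example-skillset", "https://example.invalid/api/cryptonyms")
print(json.dumps(skillset, indent=2))
```

Each skill reads from and writes to paths in the enriched document tree, so the order of the skills in the list matters less than the chain of `source` and `targetName` values that connects them.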
Azure Computer Vision
Azure Computer Vision uses text recognition to extract and recognize text information from images. The Read API uses the latest OCR recognition models, and is optimized for large, text-heavy documents and noisy images.
The legacy OCR API isn't optimized for large documents, but supports more languages. OCR results can vary depending on scan and image quality. The current solution idea uses OCR to produce data in the hOCR format.
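For readers unfamiliar with hOCR, it is an HTML-based format that stores each recognized word together with its bounding box, which lets a front end highlight search hits on the original page image. The following sketch shows the general idea under simplifying assumptions: the input is a hypothetical list of `(text, (x0, y0, x1, y1))` tuples rather than a real OCR response, and the helper `words_to_hocr` is not part of any Azure SDK.

```python
from html import escape


def words_to_hocr(words, page_width, page_height):
    """Render recognized words as a single hOCR page element.

    `words` is a list of (text, (x0, y0, x1, y1)) tuples in pixel coordinates.
    """
    spans = []
    for index, (text, (x0, y0, x1, y1)) in enumerate(words, start=1):
        # hOCR keeps the bounding box in the title attribute of each word element.
        spans.append(
            f"<span class='ocrx_word' id='word_{index}' "
            f"title='bbox {x0} {y0} {x1} {y1}'>{escape(text)}</span>"
        )
    body = " ".join(spans)
    return (
        f"<div class='ocr_page' id='page_1' "
        f"title='bbox 0 0 {page_width} {page_height}'>{body}</div>"
    )


# Made-up sample words and coordinates, for illustration only.
sample_words = [("SECRET", (10, 20, 90, 40)), ("R&D", (100, 20, 140, 40))]
hocr = words_to_hocr(sample_words, page_width=850, page_height=1100)
print(hocr)
```

Escaping the word text matters because OCR output from noisy scans can contain characters such as `&` or `<` that would otherwise break the markup.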
Azure Cognitive Service for Language
Azure Cognitive Service for Language extracts text information from unstructured documents by using text analytics capabilities like Named Entity Recognition (NER), key phrase extraction, and full-text search.
Azure Blob Storage
Azure Blob Storage is REST-based object storage for data that you can access from anywhere in the world via HTTPS. You can use Blob Storage to expose data publicly to the world or to store application data privately. Blob Storage is ideal for large amounts of unstructured data like text or graphics.
Azure Table Storage
Azure Table Storage stores highly available, scalable, structured or semi-structured NoSQL data in the cloud.
Azure Functions
Azure Functions is a serverless compute service that lets you run small pieces of event-triggered code without having to explicitly provision or manage infrastructure. This solution uses an Azure Functions method to apply the CIA Cryptonyms list to the JFK Assassination Records as a custom skill.
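A custom skill of this kind receives a JSON body with a `values` array of records and returns a `values` array with one result per `recordId`. The following sketch shows only that request and response handling as a plain Python function. It is not the JFK Files project's implementation, it omits the Azure Functions HTTP trigger wiring, and the entries in `CRYPTONYMS` are hypothetical placeholders rather than real cryptonyms. The input name `text` and the output name `cryptonyms` are also assumptions.

```python
import json
import re

# Hypothetical placeholder entries. A real skill would load the CIA Cryptonyms list.
CRYPTONYMS = {
    "EXAMPLEONE": "placeholder meaning one",
    "EXAMPLETWO": "placeholder meaning two",
}

# Candidate cryptonyms: whole words made of four or more uppercase letters.
CANDIDATE = re.compile(r"\b[A-Z]{4,}\b")


def run_skill(request_body):
    """Process a custom skill request and return the response body."""
    results = []
    for record in request_body.get("values", []):
        record_id = record.get("recordId")
        text = record.get("data", {}).get("text")

        if not isinstance(text, str):
            # Report a per-record error instead of failing the whole batch.
            results.append({
                "recordId": record_id,
                "data": {},
                "errors": [{"message": "Missing 'text' input."}],
                "warnings": [],
            })
            continue

        found = []
        for word in CANDIDATE.findall(text):
            if word in CRYPTONYMS and word not in found:
                found.append(word)  # keep first-seen order, skip duplicates

        results.append({
            "recordId": record_id,
            "data": {
                "cryptonyms": [
                    {"value": word, "description": CRYPTONYMS[word]} for word in found
                ]
            },
            "errors": [],
            "warnings": [],
        })
    return {"values": results}


sample_request = {
    "values": [
        {"recordId": "1", "data": {"text": "Cable mentions EXAMPLEONE and EXAMPLEONE again, plus OTHERWORD."}},
        {"recordId": "2", "data": {}},
    ]
}
response = run_skill(sample_request)
print(json.dumps(response, indent=2))
```

Returning an entry for every incoming `recordId`, including records that fail, lets the indexer match each result back to the document that produced it.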
Azure App Service
This solution idea also builds a standalone web app in Azure App Service to test, demonstrate, search the index, and explore connections in the enriched and indexed documents.
Large, unstructured datasets can include typewritten and handwritten notes, photos and diagrams, and other unstructured data that standard search solutions can't parse. The JFK Assassination Records contain over 34,000 pages of documents about the CIA investigation of the 1963 JFK assassination.
The JFK Files sample project and online demo showcase a particular Azure Cognitive Search use case. This solution idea isn't intended to be a framework or scalable architecture for all scenarios, but to provide a general guideline and example. The code project and demo create a public website and publicly readable storage container for extracted images, so you shouldn't use this solution with non-public data.
AI enrichment in Azure Cognitive Search can extract and enhance searchable, indexable text from images, blobs, and other unstructured data sources like the JFK Files. AI enrichment uses pre-trained machine learning skill sets from the Cognitive Services Computer Vision and Cognitive Service for Language APIs. You can also create and attach custom skills to add special processing for domain-specific data like CIA Cryptonyms. Azure Cognitive Search can then index and search that content.
The Azure Cognitive Search skills in this solution fall into the following categories:
Image processing. Built-in text extraction and image analysis skills include object and face detection, tag and caption generation, and celebrity and landmark identification. These skills create text representations of image content, which are searchable by using the query capabilities of Azure Cognitive Search. Document cracking is the process of extracting or creating text content from non-text sources.
Potential use cases
- Increase the value and utility of unstructured text and image content in search and data science apps.
- Use custom skills to integrate open-source, third-party, or first-party code into indexing pipelines.
- Make scanned JPG, PNG, or bitmap documents full-text searchable.
- Produce better outcomes than standard PDF text extraction for PDFs with combined image and text. Some scanned and native PDF formats might not parse correctly in Azure Cognitive Search.
- Create new information from inherently meaningful raw content or context that's hidden in larger unstructured or semi-structured documents.
This article is maintained by Microsoft. It was originally written by the following contributor.
- Carlos Alexandre Santos | Senior Specialized AI Cloud Solution Architect
Learn more about this solution:
- Explore the JFK Files project on GitHub.
- Watch the process in action in an online video.
- Explore the JFK Files online demo.
Read product documentation:
- AI enrichment in Azure Cognitive Search
- What is Computer Vision?
- What is Azure Cognitive Service for Language?
- What is optical character recognition?
- What is Named Entity Recognition (NER) in Azure Cognitive Service for Language?
- Introduction to Azure Blob Storage
- Introduction to Azure Functions