Indexers in Azure Cognitive Search
An indexer in Azure Cognitive Search is a crawler that extracts searchable content from cloud data sources and populates a search index using field-to-field mappings between source data and a search index. This approach is sometimes referred to as a 'pull model' because the search service pulls data in without you having to write any code that adds data to an index. Indexers also drive the AI enrichment capabilities of Cognitive Search, integrating external processing of content en route to an index.
Indexers are cloud-only, with individual indexers for supported data sources. When configuring an indexer, you'll specify a data source (origin) and a search index (destination). Several sources, such as Azure Blob Storage, have more configuration properties specific to that content type.
You can run indexers on demand or on a recurring data refresh schedule that runs as often as every five minutes. More frequent updates require a 'push model' that simultaneously updates data in both Azure Cognitive Search and your external data source.
Indexer scenarios and use cases
You can use an indexer as the sole means for data ingestion, or in combination with other techniques. The following table summarizes the main scenarios.
|Single data source||This pattern is the simplest: one data source is the sole content provider for a search index. Most supported data sources provide some form of change detection so that subsequent indexer runs pick up the difference when content is added or updated in the source.|
|Multiple data sources||An indexer specification can have only one data source, but the search index itself can accept content from multiple sources, where each indexer run brings new content from a different data provider. Each source can contribute its share of full documents, or populate selected fields in each document. For a closer look at this scenario, see Tutorial: Index from multiple data sources.|
|Multiple indexers||Multiple data sources are typically paired with multiple indexers if you need to vary run time parameters, the schedule, or field mappings. Cross-region scale out of Cognitive Search is another scenario. You might have copies of the same search index in different regions. To synchronize search index content, you could have multiple indexers pulling from the same data source, where each indexer targets a different search index in each region.Parallel indexing of very large data sets also requires a multi-indexer strategy, where each indexer targets a subset of the data.|
|Content transformation||Indexers drive AI enrichment. Content transforms are defined in a skillset that you attach to the indexer.|
Supported data sources
Indexers crawl data stores on Azure and outside of Azure.
- Azure Blob Storage
- Azure Cosmos DB
- Azure Data Lake Storage Gen2
- Azure SQL Database
- Azure Table Storage
- Azure SQL Managed Instance
- SQL Server on Azure Virtual Machines
- Azure Files (in preview)
- Azure MySQL (in preview)
- SharePoint in Microsoft 365 (in preview)
- Azure Cosmos DB for MongoDB (in preview)
- Azure Cosmos DB for Apache Gremlin (in preview)
Indexers accept flattened row sets, such as a table or view, or items in a container or folder. In most cases, it creates one search document per row, record, or item.
Indexer connections to remote data sources can be made using standard Internet connections (public) or encrypted private connections when you use Azure virtual networks for client apps. You can also set up connections to authenticate using a managed identity. For more information about secure connections, see Indexer access to content protected by Azure network security features and Connect to a data source using a managed identity.
Stages of indexing
On an initial run, when the index is empty, an indexer will read in all of the data provided in the table or container. On subsequent runs, the indexer can usually detect and retrieve just the data that has changed. For blob data, change detection is automatic. For other data sources like Azure SQL or Azure Cosmos DB, change detection must be enabled.
For each document it receives, an indexer implements or coordinates multiple steps, from document retrieval to a final search engine "handoff" for indexing. Optionally, an indexer also drives skillset execution and outputs, assuming a skillset is defined.
Stage 1: Document cracking
Document cracking is the process of opening files and extracting content. Text-based content can be extracted from files on a service, rows in a table, or items in container or collection. If you add a skillset and image skills, document cracking can also extract images and queue them for image processing.
Depending on the data source, the indexer will try different operations to extract potentially indexable content:
When the document is a file with embedded images, such as a PDF, the indexer extracts text, images, and metadata. Indexers can open files from Azure Blob Storage, Azure Data Lake Storage Gen2, and SharePoint.
When the document is a record in Azure SQL, the indexer will extract non-binary content from each field in each record.
When the document is a record in Azure Cosmos DB, the indexer will extract non-binary content from fields and subfields from the Azure Cosmos DB document.
Stage 2: Field mappings
An indexer extracts text from a source field and sends it to a destination field in an index or knowledge store. When field names and data types coincide, the path is clear. However, you might want different names or types in the output, in which case you need to tell the indexer how to map the field.
To specify field mappings, enter the source and destination fields in the indexer definition.
Field mapping occurs after document cracking, but before transformations, when the indexer is reading from the source documents. When you define a field mapping, the value of the source field is sent as-is to the destination field with no modifications.
Stage 3: Skillset execution
Skillset execution is an optional step that invokes built-in or custom AI processing. Skillsets can add optical character recognition (OCR) or other forms of image analysis if the content is binary. Skillsets can also add natural language processing. For example, you can add text translation or key phrase extraction.
Whatever the transformation, skillset execution is where enrichment occurs. If an indexer is a pipeline, you can think of a skillset as a "pipeline within the pipeline".
Stage 4: Output field mappings
If you include a skillset, you'll need to specify output field mappings in the indexer definition. The output of a skillset is manifested internally as a tree structure referred to as an enriched document. Output field mappings allow you to select which parts of this tree to map into fields in your index.
Despite the similarity in names, output field mappings and field mappings build associations from different sources. Field mappings associate the content of source field to a destination field in a search index. Output field mappings associate the content of an internal enriched document (skill outputs) to destination fields in the index. Unlike field mappings, which are considered optional, an output field mapping is required for any transformed content that should be in the index.
The next image shows a sample indexer debug session representation of the indexer stages: document cracking, field mappings, skillset execution, and output field mappings.
Indexers can offer features that are unique to the data source. In this respect, some aspects of indexer or data source configuration will vary by indexer type. However, all indexers share the same basic composition and requirements. Steps that are common to all indexers are covered below.
Step 1: Create a data source
Data sources are independent objects. Multiple indexers can use the same data source object to load more than one index at a time.
Step 2: Create an index
An indexer will automate some tasks related to data ingestion, but creating an index is generally not one of them. As a prerequisite, you must have a predefined index that contains corresponding target fields for any source fields in your external data source. Fields need to match by name and data type. If not, you can define field mappings to establish the association. For more information about structuring an index, see Create an Index (REST) or SearchIndex class.
Although indexers cannot generate an index for you, the Import data wizard in the portal can help. In most cases, the wizard can infer an index schema from existing metadata in the source, presenting a preliminary index schema which you can edit in-line while the wizard is active. Once the index is created on the service, further edits in the portal are mostly limited to adding new fields. Consider the wizard for creating, but not revising, an index. For hands-on learning, step through the portal walkthrough.
Step 3: Create and run (or schedule) the indexer
By default, the first indexer execution occurs when you create an indexer on the search service. You can set the "disabled" property in an indexer to create it without running it.
Any errors or warnings about data access or skillset validation will occur during indexer execution. Until indexer execution starts, dependent objects such as data sources and skillsets are passive on the search service.
Indexers don't have dedicated processing resources. Based on this, indexers' status may show as idle before running (depending on other jobs in the queue) and run times may not be predictable. Other factors define indexer performance as well, such as document size, document complexity, image analysis, among others.
Now that you've been introduced to indexers, a next step is to review indexer properties and parameters, scheduling, and indexer monitoring. Alternatively, you could return to the list of supported data sources for more information about a specific source.
Submit and view feedback for