Indexes - Create Or Update
Creates a new search index or updates an index if it already exists.
PUT {endpoint}/indexes('{indexName}')?api-version=2024-07-01
PUT {endpoint}/indexes('{indexName}')?allowIndexDowntime={allowIndexDowntime}&api-version=2024-07-01
URI Parameters
Name | In | Required | Type | Description
---|---|---|---|---
endpoint | path | True | string | The endpoint URL of the search service.
indexName | path | True | string | The name of the index to create or update.
api-version | query | True | string | Client API version.
allowIndexDowntime | query | | boolean | Allows new analyzers, tokenizers, token filters, or char filters to be added to an index by taking the index offline for at least a few seconds. This temporarily causes indexing and query requests to fail. Performance and write availability of the index can be impaired for several minutes after the index is updated, or longer for very large indexes.
Request Header
Name | Required | Type | Description
---|---|---|---
x-ms-client-request-id | | string (uuid) | The tracking ID sent with the request to help with debugging.
If-Match | | string | Defines the If-Match condition. The operation will be performed only if the ETag on the server matches this value.
If-None-Match | | string | Defines the If-None-Match condition. The operation will be performed only if the ETag on the server does not match this value.
Prefer | True | string | For HTTP PUT requests, instructs the service to return the created/updated resource on success.
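The If-Match and Prefer headers combine naturally for safe updates: fetch the index, keep its ETag, and send it back so the update only succeeds if nobody else has changed the definition in the meantime. The sketch below illustrates this; the ETag value is a placeholder, and the api-key header and the Prefer value return=representation are assumptions based on common usage of this API rather than part of the sample later in this article.

PUT {endpoint}/indexes('hotels')?api-version=2024-07-01
Content-Type: application/json
api-key: <admin-api-key>
Prefer: return=representation
If-Match: "0x8DCAFE0123456789"

{ ... full index definition, as in the sample request below ... }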
Request Body
Name | Required | Type | Description
---|---|---|---
fields | True | SearchField[] | The fields of the index.
name | True | string | The name of the index.
@odata.etag | | string | The ETag of the index.
analyzers | | LexicalAnalyzer[] | The analyzers for the index.
charFilters | | CharFilter[] | The character filters for the index.
corsOptions | | CorsOptions | Options to control Cross-Origin Resource Sharing (CORS) for the index.
defaultScoringProfile | | string | The name of the scoring profile to use if none is specified in the query. If this property is not set and no scoring profile is specified in the query, then default scoring (tf-idf) will be used.
encryptionKey | | SearchResourceEncryptionKey | A description of an encryption key that you create in Azure Key Vault. This key is used to provide an additional level of encryption-at-rest for your data when you want full assurance that no one, not even Microsoft, can decrypt your data. Once you have encrypted your data, it will always remain encrypted. The search service will ignore attempts to set this property to null. You can change this property as needed if you want to rotate your encryption key; your data will be unaffected. Encryption with customer-managed keys is not available for free search services, and is only available for paid services created on or after January 1, 2019.
scoringProfiles | | ScoringProfile[] | The scoring profiles for the index.
semantic | | SemanticSettings | Defines parameters for a search index that influence semantic capabilities.
similarity | | Similarity | The type of similarity algorithm to be used when scoring and ranking the documents matching a search query. The similarity algorithm can only be defined at index creation time and cannot be modified on existing indexes. If null, the ClassicSimilarity algorithm is used.
suggesters | | Suggester[] | The suggesters for the index.
tokenFilters | | TokenFilter[] | The token filters for the index.
tokenizers | | LexicalTokenizer[] | The tokenizers for the index.
vectorSearch | | VectorSearch | Contains configuration options related to vector search.
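Only name and fields are required; every index needs exactly one key field of type Edm.String. As a minimal, hypothetical request body (not the sample used later in this article):

{
  "name": "hotels-minimal",
  "fields": [
    { "name": "hotelId", "type": "Edm.String", "key": true, "searchable": false },
    { "name": "description", "type": "Edm.String" }
  ]
}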
Responses
Name | Type | Description
---|---|---
200 OK | SearchIndex | 
201 Created | SearchIndex | 
Other Status Codes | ErrorResponse | Error response.
Examples
SearchServiceCreateOrUpdateIndex
Sample request
PUT https://myservice.search.windows.net/indexes('hotels')?allowIndexDowntime=False&api-version=2024-07-01
{
"name": "hotels",
"fields": [
{
"name": "hotelId",
"type": "Edm.String",
"key": true,
"searchable": false
},
{
"name": "baseRate",
"type": "Edm.Double"
},
{
"name": "description",
"type": "Edm.String",
"filterable": false,
"sortable": false,
"facetable": false
},
{
"name": "descriptionEmbedding",
"type": "Collection(Edm.Single)",
"dimensions": 1536,
"vectorSearchProfile": "myHnswProfile",
"searchable": true,
"retrievable": true
},
{
"name": "description_fr",
"type": "Edm.String",
"filterable": false,
"sortable": false,
"facetable": false,
"analyzer": "fr.lucene"
},
{
"name": "hotelName",
"type": "Edm.String"
},
{
"name": "category",
"type": "Edm.String"
},
{
"name": "tags",
"type": "Collection(Edm.String)",
"analyzer": "tagsAnalyzer"
},
{
"name": "parkingIncluded",
"type": "Edm.Boolean"
},
{
"name": "smokingAllowed",
"type": "Edm.Boolean"
},
{
"name": "lastRenovationDate",
"type": "Edm.DateTimeOffset"
},
{
"name": "rating",
"type": "Edm.Int32"
},
{
"name": "location",
"type": "Edm.GeographyPoint"
}
],
"scoringProfiles": [
{
"name": "geo",
"text": {
"weights": {
"hotelName": 5
}
},
"functions": [
{
"type": "distance",
"boost": 5,
"fieldName": "location",
"interpolation": "logarithmic",
"distance": {
"referencePointParameter": "currentLocation",
"boostingDistance": 10
}
}
]
}
],
"defaultScoringProfile": "geo",
"suggesters": [
{
"name": "sg",
"searchMode": "analyzingInfixMatching",
"sourceFields": [
"hotelName"
]
}
],
"analyzers": [
{
"name": "tagsAnalyzer",
"@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
"charFilters": [
"html_strip"
],
"tokenizer": "standard_v2"
}
],
"corsOptions": {
"allowedOrigins": [
"tempuri.org"
],
"maxAgeInSeconds": 60
},
"encryptionKey": {
"keyVaultKeyName": "myUserManagedEncryptionKey-createdinAzureKeyVault",
"keyVaultKeyVersion": "myKeyVersion-32charAlphaNumericString",
"keyVaultUri": "https://myKeyVault.vault.azure.net",
"accessCredentials": null
},
"similarity": {
"@odata.type": "#Microsoft.Azure.Search.ClassicSimilarity"
},
"semantic": {
"configurations": [
{
"name": "semanticHotels",
"prioritizedFields": {
"titleField": {
"fieldName": "hotelName"
},
"prioritizedContentFields": [
{
"fieldName": "description"
},
{
"fieldName": "description_fr"
}
],
"prioritizedKeywordsFields": [
{
"fieldName": "tags"
},
{
"fieldName": "category"
}
]
}
}
]
},
"vectorSearch": {
"profiles": [
{
"name": "myHnswProfile",
"algorithm": "myHnsw"
},
{
"name": "myAlgorithm",
"algorithm": "myExhaustive"
}
],
"algorithms": [
{
"name": "myHnsw",
"kind": "hnsw",
"hnswParameters": {
"m": 4,
"metric": "cosine"
}
},
{
"name": "myExhaustive",
"kind": "exhaustiveKnn",
"exhaustiveKnnParameters": {
"metric": "cosine"
}
}
]
}
}
Sample response

Status code: 200
{
"name": "hotels",
"fields": [
{
"name": "hotelId",
"type": "Edm.String",
"searchable": false,
"filterable": true,
"retrievable": true,
"sortable": true,
"facetable": true,
"key": true,
"indexAnalyzer": null,
"searchAnalyzer": null,
"analyzer": null,
"dimensions": null,
"vectorSearchProfile": null,
"synonymMaps": []
},
{
"name": "baseRate",
"type": "Edm.Double",
"searchable": false,
"filterable": true,
"retrievable": true,
"sortable": true,
"facetable": true,
"key": false,
"indexAnalyzer": null,
"searchAnalyzer": null,
"analyzer": null,
"dimensions": null,
"vectorSearchProfile": null,
"synonymMaps": []
},
{
"name": "description",
"type": "Edm.String",
"searchable": true,
"filterable": false,
"retrievable": true,
"sortable": false,
"facetable": false,
"key": false,
"indexAnalyzer": null,
"searchAnalyzer": null,
"analyzer": null,
"dimensions": null,
"vectorSearchProfile": null,
"synonymMaps": []
},
{
"name": "descriptionEmbedding",
"type": "Collection(Edm.Single)",
"searchable": true,
"filterable": false,
"retrievable": true,
"sortable": false,
"facetable": false,
"key": false,
"indexAnalyzer": null,
"searchAnalyzer": null,
"analyzer": null,
"dimensions": 1536,
"vectorSearchProfile": "myHnswProfile",
"synonymMaps": []
},
{
"name": "description_fr",
"type": "Edm.String",
"searchable": true,
"filterable": false,
"retrievable": true,
"sortable": false,
"facetable": false,
"key": false,
"indexAnalyzer": null,
"searchAnalyzer": null,
"analyzer": "fr.lucene",
"dimensions": null,
"vectorSearchProfile": null,
"synonymMaps": []
},
{
"name": "hotelName",
"type": "Edm.String",
"searchable": true,
"filterable": true,
"retrievable": true,
"sortable": true,
"facetable": true,
"key": false,
"indexAnalyzer": null,
"searchAnalyzer": null,
"analyzer": null,
"dimensions": null,
"vectorSearchProfile": null,
"synonymMaps": []
},
{
"name": "category",
"type": "Edm.String",
"searchable": true,
"filterable": true,
"retrievable": true,
"sortable": true,
"facetable": true,
"key": false,
"indexAnalyzer": null,
"searchAnalyzer": null,
"analyzer": null,
"dimensions": null,
"vectorSearchProfile": null,
"synonymMaps": []
},
{
"name": "tags",
"type": "Collection(Edm.String)",
"searchable": true,
"filterable": true,
"retrievable": true,
"sortable": false,
"facetable": true,
"key": false,
"indexAnalyzer": null,
"searchAnalyzer": null,
"analyzer": "tagsAnalyzer",
"dimensions": null,
"vectorSearchProfile": null,
"synonymMaps": []
},
{
"name": "parkingIncluded",
"type": "Edm.Boolean",
"searchable": false,
"filterable": true,
"retrievable": true,
"sortable": true,
"facetable": true,
"key": false,
"indexAnalyzer": null,
"searchAnalyzer": null,
"analyzer": null,
"dimensions": null,
"vectorSearchProfile": null,
"synonymMaps": []
},
{
"name": "smokingAllowed",
"type": "Edm.Boolean",
"searchable": false,
"filterable": true,
"retrievable": true,
"sortable": true,
"facetable": true,
"key": false,
"indexAnalyzer": null,
"searchAnalyzer": null,
"analyzer": null,
"dimensions": null,
"vectorSearchProfile": null,
"synonymMaps": []
},
{
"name": "lastRenovationDate",
"type": "Edm.DateTimeOffset",
"searchable": false,
"filterable": true,
"retrievable": true,
"sortable": true,
"facetable": true,
"key": false,
"indexAnalyzer": null,
"searchAnalyzer": null,
"analyzer": null,
"dimensions": null,
"vectorSearchProfile": null,
"synonymMaps": []
},
{
"name": "rating",
"type": "Edm.Int32",
"searchable": false,
"filterable": true,
"retrievable": true,
"sortable": true,
"facetable": true,
"key": false,
"indexAnalyzer": null,
"searchAnalyzer": null,
"analyzer": null,
"dimensions": null,
"vectorSearchProfile": null,
"synonymMaps": []
},
{
"name": "location",
"type": "Edm.GeographyPoint",
"searchable": false,
"filterable": true,
"retrievable": true,
"sortable": true,
"facetable": false,
"key": false,
"indexAnalyzer": null,
"searchAnalyzer": null,
"analyzer": null,
"dimensions": null,
"vectorSearchProfile": null,
"synonymMaps": []
}
],
"scoringProfiles": [
{
"name": "geo",
"functionAggregation": "sum",
"text": {
"weights": {
"hotelName": 5
}
},
"functions": [
{
"type": "distance",
"boost": 5,
"fieldName": "location",
"interpolation": "logarithmic",
"distance": {
"referencePointParameter": "currentLocation",
"boostingDistance": 10
}
}
]
}
],
"defaultScoringProfile": "geo",
"suggesters": [
{
"name": "sg",
"searchMode": "analyzingInfixMatching",
"sourceFields": [
"hotelName"
]
}
],
"analyzers": [
{
"name": "tagsAnalyzer",
"@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
"charFilters": [
"html_strip"
],
"tokenizer": "standard_v2"
}
],
"tokenizers": [],
"tokenFilters": [],
"charFilters": [],
"corsOptions": {
"allowedOrigins": [
"tempuri.org"
],
"maxAgeInSeconds": 60
},
"encryptionKey": {
"keyVaultKeyName": "myUserManagedEncryptionKey-createdinAzureKeyVault",
"keyVaultKeyVersion": "myKeyVersion-32charAlphaNumericString",
"keyVaultUri": "https://myKeyVault.vault.azure.net",
"accessCredentials": null
},
"similarity": {
"@odata.type": "#Microsoft.Azure.Search.ClassicSimilarity"
},
"semantic": {
"configurations": [
{
"name": "semanticHotels",
"prioritizedFields": {
"titleField": {
"fieldName": "hotelName"
},
"prioritizedContentFields": [
{
"fieldName": "description"
},
{
"fieldName": "description_fr"
}
],
"prioritizedKeywordsFields": [
{
"fieldName": "tags"
},
{
"fieldName": "category"
}
]
}
}
]
},
"vectorSearch": {
"algorithms": [
{
"name": "myHnsw",
"kind": "hnsw",
"hnswParameters": {
"metric": "cosine",
"m": 4,
"efConstruction": 400,
"efSearch": 500
}
},
{
"name": "myExhaustive",
"kind": "exhaustiveKnn",
"exhaustiveKnnParameters": {
"metric": "cosine"
}
}
],
"profiles": [
{
"name": "myHnswProfile",
"algorithm": "myHnsw"
},
{
"name": "myAlgorithm",
"algorithm": "myExhaustive"
}
]
}
}

Status code: 201

{
"name": "hotels",
"fields": [
{
"name": "hotelId",
"type": "Edm.String",
"searchable": false,
"filterable": true,
"retrievable": true,
"sortable": true,
"facetable": true,
"key": true,
"indexAnalyzer": null,
"searchAnalyzer": null,
"analyzer": null,
"dimensions": null,
"vectorSearchProfile": null,
"synonymMaps": []
},
{
"name": "baseRate",
"type": "Edm.Double",
"searchable": false,
"filterable": true,
"retrievable": true,
"sortable": true,
"facetable": true,
"key": false,
"indexAnalyzer": null,
"searchAnalyzer": null,
"analyzer": null,
"dimensions": null,
"vectorSearchProfile": null,
"synonymMaps": []
},
{
"name": "description",
"type": "Edm.String",
"searchable": true,
"filterable": false,
"retrievable": true,
"sortable": false,
"facetable": false,
"key": false,
"indexAnalyzer": null,
"searchAnalyzer": null,
"analyzer": null,
"dimensions": null,
"vectorSearchProfile": null,
"synonymMaps": []
},
{
"name": "descriptionEmbedding",
"type": "Collection(Edm.Single)",
"searchable": true,
"filterable": false,
"retrievable": true,
"sortable": false,
"facetable": false,
"key": false,
"indexAnalyzer": null,
"searchAnalyzer": null,
"analyzer": null,
"dimensions": 1536,
"vectorSearchProfile": "myHnswProfile",
"synonymMaps": []
},
{
"name": "description_fr",
"type": "Edm.String",
"searchable": true,
"filterable": false,
"retrievable": true,
"sortable": false,
"facetable": false,
"key": false,
"indexAnalyzer": null,
"searchAnalyzer": null,
"analyzer": "fr.lucene",
"dimensions": null,
"vectorSearchProfile": null,
"synonymMaps": []
},
{
"name": "hotelName",
"type": "Edm.String",
"searchable": true,
"filterable": true,
"retrievable": true,
"sortable": true,
"facetable": true,
"key": false,
"indexAnalyzer": null,
"searchAnalyzer": null,
"analyzer": null,
"dimensions": null,
"vectorSearchProfile": null,
"synonymMaps": []
},
{
"name": "category",
"type": "Edm.String",
"searchable": true,
"filterable": true,
"retrievable": true,
"sortable": true,
"facetable": true,
"key": false,
"indexAnalyzer": null,
"searchAnalyzer": null,
"analyzer": null,
"dimensions": null,
"vectorSearchProfile": null,
"synonymMaps": []
},
{
"name": "tags",
"type": "Collection(Edm.String)",
"searchable": true,
"filterable": true,
"retrievable": true,
"sortable": false,
"facetable": true,
"key": false,
"indexAnalyzer": null,
"searchAnalyzer": null,
"analyzer": "tagsAnalyzer",
"dimensions": null,
"vectorSearchProfile": null,
"synonymMaps": []
},
{
"name": "parkingIncluded",
"type": "Edm.Boolean",
"searchable": false,
"filterable": true,
"retrievable": true,
"sortable": true,
"facetable": true,
"key": false,
"indexAnalyzer": null,
"searchAnalyzer": null,
"analyzer": null,
"dimensions": null,
"vectorSearchProfile": null,
"synonymMaps": []
},
{
"name": "smokingAllowed",
"type": "Edm.Boolean",
"searchable": false,
"filterable": true,
"retrievable": true,
"sortable": true,
"facetable": true,
"key": false,
"indexAnalyzer": null,
"searchAnalyzer": null,
"analyzer": null,
"dimensions": null,
"vectorSearchProfile": null,
"synonymMaps": []
},
{
"name": "lastRenovationDate",
"type": "Edm.DateTimeOffset",
"searchable": false,
"filterable": true,
"retrievable": true,
"sortable": true,
"facetable": true,
"key": false,
"indexAnalyzer": null,
"searchAnalyzer": null,
"analyzer": null,
"dimensions": null,
"vectorSearchProfile": null,
"synonymMaps": []
},
{
"name": "rating",
"type": "Edm.Int32",
"searchable": false,
"filterable": true,
"retrievable": true,
"sortable": true,
"facetable": true,
"key": false,
"indexAnalyzer": null,
"searchAnalyzer": null,
"analyzer": null,
"dimensions": null,
"vectorSearchProfile": null,
"synonymMaps": []
},
{
"name": "location",
"type": "Edm.GeographyPoint",
"searchable": false,
"filterable": true,
"retrievable": true,
"sortable": true,
"facetable": false,
"key": false,
"indexAnalyzer": null,
"searchAnalyzer": null,
"analyzer": null,
"dimensions": null,
"vectorSearchProfile": null,
"synonymMaps": []
}
],
"scoringProfiles": [
{
"name": "geo",
"functionAggregation": "sum",
"text": {
"weights": {
"hotelName": 5
}
},
"functions": [
{
"type": "distance",
"boost": 5,
"fieldName": "location",
"interpolation": "logarithmic",
"distance": {
"referencePointParameter": "currentLocation",
"boostingDistance": 10
}
}
]
}
],
"defaultScoringProfile": "geo",
"suggesters": [
{
"name": "sg",
"searchMode": "analyzingInfixMatching",
"sourceFields": [
"hotelName"
]
}
],
"analyzers": [
{
"name": "tagsAnalyzer",
"@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
"charFilters": [
"html_strip"
],
"tokenizer": "standard_v2"
}
],
"tokenizers": [],
"tokenFilters": [],
"charFilters": [],
"corsOptions": {
"allowedOrigins": [
"tempuri.org"
],
"maxAgeInSeconds": 60
},
"encryptionKey": {
"keyVaultKeyName": "myUserManagedEncryptionKey-createdinAzureKeyVault",
"keyVaultKeyVersion": "myKeyVersion-32charAlphaNumericString",
"keyVaultUri": "https://myKeyVault.vault.azure.net",
"accessCredentials": null
},
"semantic": {
"configurations": [
{
"name": "semanticHotels",
"prioritizedFields": {
"titleField": {
"fieldName": "hotelName"
},
"prioritizedContentFields": [
{
"fieldName": "description"
},
{
"fieldName": "description_fr"
}
],
"prioritizedKeywordsFields": [
{
"fieldName": "tags"
},
{
"fieldName": "category"
}
]
}
}
]
},
"vectorSearch": {
"algorithms": [
{
"name": "myHnsw",
"kind": "hnsw",
"hnswParameters": {
"metric": "cosine",
"m": 4,
"efConstruction": 400,
"efSearch": 500
}
},
{
"name": "myExhaustive",
"kind": "exhaustiveKnn",
"exhaustiveKnnParameters": {
"metric": "cosine"
}
}
],
"profiles": [
{
"name": "myHnswProfile",
"algorithm": "myHnsw"
},
{
"name": "myAlgorithm",
"algorithm": "myExhaustive"
}
]
}
}
Definitions
Name | Description
---|---
AsciiFoldingTokenFilter | Converts alphabetic, numeric, and symbolic Unicode characters which are not in the first 127 ASCII characters (the "Basic Latin" Unicode block) into their ASCII equivalents, if such equivalents exist. This token filter is implemented using Apache Lucene.
AzureActiveDirectoryApplicationCredentials | Credentials of a registered application created for your search service, used for authenticated access to the encryption keys stored in Azure Key Vault.
AzureOpenAIEmbeddingSkill | Allows you to generate a vector embedding for a given text input using the Azure OpenAI resource.
AzureOpenAIModelName | The Azure Open AI model name that will be called.
AzureOpenAIParameters | Specifies the parameters for connecting to the Azure OpenAI resource.
AzureOpenAIVectorizer | Specifies the Azure OpenAI resource used to vectorize a query string.
BinaryQuantizationVectorSearchCompressionConfiguration | Contains configuration options specific to the binary quantization compression method used during indexing and querying.
BM25Similarity | Ranking function based on the Okapi BM25 similarity algorithm. BM25 is a TF-IDF-like algorithm that includes length normalization (controlled by the 'b' parameter) as well as term frequency saturation (controlled by the 'k1' parameter).
CharFilterName | Defines the names of all character filters supported by the search engine.
CjkBigramTokenFilter | Forms bigrams of CJK terms that are generated from the standard tokenizer. This token filter is implemented using Apache Lucene.
CjkBigramTokenFilterScripts | Scripts that can be ignored by CjkBigramTokenFilter.
ClassicSimilarity | Legacy similarity algorithm which uses the Lucene TFIDFSimilarity implementation of TF-IDF. This variation of TF-IDF introduces static document length normalization as well as coordinating factors that penalize documents that only partially match the searched queries.
ClassicTokenizer | Grammar-based tokenizer that is suitable for processing most European-language documents. This tokenizer is implemented using Apache Lucene.
CommonGramTokenFilter | Construct bigrams for frequently occurring terms while indexing. Single terms are still indexed too, with bigrams overlaid. This token filter is implemented using Apache Lucene.
CorsOptions | Defines options to control Cross-Origin Resource Sharing (CORS) for an index.
CustomAnalyzer | Allows you to take control over the process of converting text into indexable/searchable tokens. It's a user-defined configuration consisting of a single predefined tokenizer and one or more filters. The tokenizer is responsible for breaking text into tokens, and the filters for modifying tokens emitted by the tokenizer.
DictionaryDecompounderTokenFilter | Decomposes compound words found in many Germanic languages. This token filter is implemented using Apache Lucene.
DistanceScoringFunction | Defines a function that boosts scores based on distance from a geographic location.
DistanceScoringParameters | Provides parameter values to a distance scoring function.
EdgeNGramTokenFilter | Generates n-grams of the given size(s) starting from the front or the back of an input token. This token filter is implemented using Apache Lucene.
EdgeNGramTokenFilterSide | Specifies which side of the input an n-gram should be generated from.
EdgeNGramTokenFilterV2 | Generates n-grams of the given size(s) starting from the front or the back of an input token. This token filter is implemented using Apache Lucene.
EdgeNGramTokenizer | Tokenizes the input from an edge into n-grams of the given size(s). This tokenizer is implemented using Apache Lucene.
ElisionTokenFilter | Removes elisions. For example, "l'avion" (the plane) will be converted to "avion" (plane). This token filter is implemented using Apache Lucene.
ErrorAdditionalInfo | The resource management error additional info.
ErrorDetail | The error detail.
ErrorResponse | Error response.
ExhaustiveKnnParameters | Contains the parameters specific to exhaustive KNN algorithm.
ExhaustiveKnnVectorSearchAlgorithmConfiguration | Contains configuration options specific to the exhaustive KNN algorithm used during querying, which will perform brute-force search across the entire vector index.
FreshnessScoringFunction | Defines a function that boosts scores based on the value of a date-time field.
FreshnessScoringParameters | Provides parameter values to a freshness scoring function.
HnswParameters | Contains the parameters specific to the HNSW algorithm.
HnswVectorSearchAlgorithmConfiguration | Contains configuration options specific to the HNSW approximate nearest neighbors algorithm used during indexing and querying. The HNSW algorithm offers a tunable trade-off between search speed and accuracy.
InputFieldMappingEntry | Input field mapping for a skill.
KeepTokenFilter | A token filter that only keeps tokens with text contained in a specified list of words. This token filter is implemented using Apache Lucene.
KeywordMarkerTokenFilter | Marks terms as keywords. This token filter is implemented using Apache Lucene.
KeywordTokenizer | Emits the entire input as a single token. This tokenizer is implemented using Apache Lucene.
KeywordTokenizerV2 | Emits the entire input as a single token. This tokenizer is implemented using Apache Lucene.
LengthTokenFilter | Removes words that are too long or too short. This token filter is implemented using Apache Lucene.
LexicalAnalyzerName | Defines the names of all text analyzers supported by the search engine.
LexicalTokenizerName | Defines the names of all tokenizers supported by the search engine.
LimitTokenFilter | Limits the number of tokens while indexing. This token filter is implemented using Apache Lucene.
LuceneStandardAnalyzer | Standard Apache Lucene analyzer; Composed of the standard tokenizer, lowercase filter and stop filter.
LuceneStandardTokenizer | Breaks text following the Unicode Text Segmentation rules. This tokenizer is implemented using Apache Lucene.
LuceneStandardTokenizerV2 | Breaks text following the Unicode Text Segmentation rules. This tokenizer is implemented using Apache Lucene.
MagnitudeScoringFunction | Defines a function that boosts scores based on the magnitude of a numeric field.
MagnitudeScoringParameters | Provides parameter values to a magnitude scoring function.
MappingCharFilter | A character filter that applies mappings defined with the mappings option. Matching is greedy (longest pattern matching at a given point wins). Replacement is allowed to be the empty string. This character filter is implemented using Apache Lucene.
MicrosoftLanguageStemmingTokenizer | Divides text using language-specific rules and reduces words to their base forms.
MicrosoftLanguageTokenizer | Divides text using language-specific rules.
MicrosoftStemmingTokenizerLanguage | Lists the languages supported by the Microsoft language stemming tokenizer.
MicrosoftTokenizerLanguage | Lists the languages supported by the Microsoft language tokenizer.
NGramTokenFilter | Generates n-grams of the given size(s). This token filter is implemented using Apache Lucene.
NGramTokenFilterV2 | Generates n-grams of the given size(s). This token filter is implemented using Apache Lucene.
NGramTokenizer | Tokenizes the input into n-grams of the given size(s). This tokenizer is implemented using Apache Lucene.
OutputFieldMappingEntry | Output field mapping for a skill.
PathHierarchyTokenizerV2 | Tokenizer for path-like hierarchies. This tokenizer is implemented using Apache Lucene.
PatternAnalyzer | Flexibly separates text into terms via a regular expression pattern. This analyzer is implemented using Apache Lucene.
PatternCaptureTokenFilter | Uses Java regexes to emit multiple tokens - one for each capture group in one or more patterns. This token filter is implemented using Apache Lucene.
PatternReplaceCharFilter | A character filter that replaces characters in the input string. It uses a regular expression to identify character sequences to preserve and a replacement pattern to identify characters to replace. For example, given the input text "aa bb aa bb", pattern "(aa)\s+(bb)", and replacement "$1#$2", the result would be "aa#bb aa#bb". This character filter is implemented using Apache Lucene.
PatternReplaceTokenFilter | A character filter that replaces characters in the input string. It uses a regular expression to identify character sequences to preserve and a replacement pattern to identify characters to replace. For example, given the input text "aa bb aa bb", pattern "(aa)\s+(bb)", and replacement "$1#$2", the result would be "aa#bb aa#bb". This token filter is implemented using Apache Lucene.
PatternTokenizer | Tokenizer that uses regex pattern matching to construct distinct tokens. This tokenizer is implemented using Apache Lucene.
PhoneticEncoder | Identifies the type of phonetic encoder to use with a PhoneticTokenFilter.
PhoneticTokenFilter | Create tokens for phonetic matches. This token filter is implemented using Apache Lucene.
PrioritizedFields | Describes the title, content, and keywords fields to be used for semantic ranking, captions, highlights, and answers.
RegexFlags | Defines flags that can be combined to control how regular expressions are used in the pattern analyzer and pattern tokenizer.
ScalarQuantizationParameters | Contains the parameters specific to Scalar Quantization.
ScalarQuantizationVectorSearchCompressionConfiguration | Contains configuration options specific to the scalar quantization compression method used during indexing and querying.
ScoringFunctionAggregation | Defines the aggregation function used to combine the results of all the scoring functions in a scoring profile.
ScoringFunctionInterpolation | Defines the function used to interpolate score boosting across a range of documents.
ScoringProfile | Defines parameters for a search index that influence scoring in search queries.
SearchField | Represents a field in an index definition, which describes the name, data type, and search behavior of a field.
SearchFieldDataType | Defines the data type of a field in a search index.
SearchIndex | Represents a search index definition, which describes the fields and search behavior of an index.
SearchIndexerDataNoneIdentity | Clears the identity property of a datasource.
SearchIndexerDataUserAssignedIdentity | Specifies the identity for a datasource to use.
SearchResourceEncryptionKey | A customer-managed encryption key in Azure Key Vault. Keys that you create and manage can be used to encrypt or decrypt data-at-rest, such as indexes and synonym maps.
SemanticConfiguration | Defines a specific configuration to be used in the context of semantic capabilities.
SemanticField | A field that is used as part of the semantic configuration.
SemanticSettings | Defines parameters for a search index that influence semantic capabilities.
ShingleTokenFilter | Creates combinations of tokens as a single token. This token filter is implemented using Apache Lucene.
SnowballTokenFilter | A filter that stems words using a Snowball-generated stemmer. This token filter is implemented using Apache Lucene.
SnowballTokenFilterLanguage | The language to use for a Snowball token filter.
StemmerOverrideTokenFilter | Provides the ability to override other stemming filters with custom dictionary-based stemming. Any dictionary-stemmed terms will be marked as keywords so that they will not be stemmed with stemmers down the chain. Must be placed before any stemming filters. This token filter is implemented using Apache Lucene.
StemmerTokenFilter | Language specific stemming filter. This token filter is implemented using Apache Lucene.
StemmerTokenFilterLanguage | The language to use for a stemmer token filter.
StopAnalyzer | Divides text at non-letters; Applies the lowercase and stopword token filters. This analyzer is implemented using Apache Lucene.
StopwordsList | Identifies a predefined list of language-specific stopwords.
StopwordsTokenFilter | Removes stop words from a token stream. This token filter is implemented using Apache Lucene.
Suggester | Defines how the Suggest API should apply to a group of fields in the index.
SuggesterSearchMode | A value indicating the capabilities of the suggester.
SynonymTokenFilter | Matches single or multi-word synonyms in a token stream. This token filter is implemented using Apache Lucene.
TagScoringFunction | Defines a function that boosts scores of documents with string values matching a given list of tags.
TagScoringParameters | Provides parameter values to a tag scoring function.
TextWeights | Defines weights on index fields for which matches should boost scoring in search queries.
TokenCharacterKind | Represents classes of characters on which a token filter can operate.
TokenFilterName | Defines the names of all token filters supported by the search engine.
TruncateTokenFilter | Truncates the terms to a specific length. This token filter is implemented using Apache Lucene.
UaxUrlEmailTokenizer | Tokenizes urls and emails as one token. This tokenizer is implemented using Apache Lucene.
UniqueTokenFilter | Filters out tokens with same text as the previous token. This token filter is implemented using Apache Lucene.
VectorEncodingFormat | The encoding format for interpreting vector field contents.
VectorSearch | Contains configuration options related to vector search.
VectorSearchAlgorithmKind | The algorithm used for indexing and querying.
VectorSearchAlgorithmMetric | The similarity metric to use for vector comparisons. It is recommended to choose the same similarity metric as the embedding model was trained on.
VectorSearchCompressionKind | The compression method used for indexing and querying.
VectorSearchCompressionTargetDataType | The quantized data type of compressed vector values.
VectorSearchProfile | Defines a combination of configurations to use with vector search.
VectorSearchVectorizerKind | The vectorization method to be used during query time.
WebApiParameters | Specifies the properties for connecting to a user-defined vectorizer.
WebApiVectorizer | Specifies a user-defined vectorizer for generating the vector embedding of a query string. Integration of an external vectorizer is achieved using the custom Web API interface of a skillset.
WordDelimiterTokenFilter | Splits words into subwords and performs optional transformations on subword groups. This token filter is implemented using Apache Lucene.
AsciiFoldingTokenFilter
Converts alphabetic, numeric, and symbolic Unicode characters which are not in the first 127 ASCII characters (the "Basic Latin" Unicode block) into their ASCII equivalents, if such equivalents exist. This token filter is implemented using Apache Lucene.
Name | Type | Default value | Description
---|---|---|---
@odata.type | string: #Microsoft. | | A URI fragment specifying the type of token filter.
name | string | | The name of the token filter. It must only contain letters, digits, spaces, dashes or underscores, can only start and end with alphanumeric characters, and is limited to 128 characters.
preserveOriginal | boolean | False | A value indicating whether the original token will be kept. Default is false.
AzureActiveDirectoryApplicationCredentials
Credentials of a registered application created for your search service, used for authenticated access to the encryption keys stored in Azure Key Vault.
Name | Type | Description
---|---|---
applicationId | string | An AAD Application ID that was granted the required access permissions to the Azure Key Vault that is to be used when encrypting your data at rest. The Application ID should not be confused with the Object ID for your AAD Application.
applicationSecret | string | The authentication key of the specified AAD application.
AzureOpenAIEmbeddingSkill
Allows you to generate a vector embedding for a given text input using the Azure OpenAI resource.
Name | Type | Description
---|---|---
@odata.type | string: #Microsoft. | A URI fragment specifying the type of skill.
apiKey | string | API key of the designated Azure OpenAI resource.
authIdentity | SearchIndexerDataIdentity | The user-assigned managed identity used for outbound connections.
context | string | Represents the level at which operations take place, such as the document root or document content (for example, /document or /document/content). The default is /document.
deploymentId | string | ID of the Azure OpenAI model deployment on the designated resource.
description | string | The description of the skill which describes the inputs, outputs, and usage of the skill.
dimensions | integer | The number of dimensions the resulting output embeddings should have. Only supported in text-embedding-3 and later models.
inputs | InputFieldMappingEntry[] | Inputs of the skills could be a column in the source data set, or the output of an upstream skill.
modelName | AzureOpenAIModelName | The name of the embedding model that is deployed at the provided deploymentId path.
name | string | The name of the skill which uniquely identifies it within the skillset. A skill with no name defined will be given a default name of its 1-based index in the skills array, prefixed with the character '#'.
outputs | OutputFieldMappingEntry[] | The output of a skill is either a field in a search index, or a value that can be consumed as an input by another skill.
resourceUri | string | The resource URI of the Azure OpenAI resource.
AzureOpenAIModelName
The Azure Open AI model name that will be called.
Name | Type | Description
---|---|---
text-embedding-3-large | string | 
text-embedding-3-small | string | 
text-embedding-ada-002 | string | 
AzureOpenAIParameters
Specifies the parameters for connecting to the Azure OpenAI resource.
Name | Type | Description
---|---|---
apiKey | string | API key of the designated Azure OpenAI resource.
authIdentity | SearchIndexerDataIdentity | The user-assigned managed identity used for outbound connections.
deploymentId | string | ID of the Azure OpenAI model deployment on the designated resource.
modelName | AzureOpenAIModelName | The name of the embedding model that is deployed at the provided deploymentId path.
resourceUri | string | The resource URI of the Azure OpenAI resource.
AzureOpenAIVectorizer
Specifies the Azure OpenAI resource used to vectorize a query string.
Name | Type | Description
---|---|---
azureOpenAIParameters | AzureOpenAIParameters | Contains the parameters specific to Azure OpenAI embedding vectorization.
kind | string: azure | The name of the kind of vectorization method being configured for use with vector search.
name | string | The name to associate with this particular vectorization method.
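The sample request in this article defines vector profiles and algorithms but no vectorizer. As a rough sketch, assuming the myHnsw algorithm from the sample and placeholder resource names, an Azure OpenAI vectorizer sits in vectorSearch.vectorizers and is referenced by name from a profile:

"vectorSearch": {
  "vectorizers": [
    {
      "name": "myOpenAiVectorizer",
      "kind": "azureOpenAI",
      "azureOpenAIParameters": {
        "resourceUri": "https://my-openai-resource.openai.azure.com",
        "deploymentId": "my-embedding-deployment",
        "modelName": "text-embedding-ada-002",
        "apiKey": "<api-key>"
      }
    }
  ],
  "profiles": [
    {
      "name": "myHnswProfile",
      "algorithm": "myHnsw",
      "vectorizer": "myOpenAiVectorizer"
    }
  ]
}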
BinaryQuantizationVectorSearchCompressionConfiguration
Contains configuration options specific to the binary quantization compression method used during indexing and querying.
Name | Type | Default value | Description
---|---|---|---
defaultOversampling | number | | Default oversampling factor. Oversampling will internally request more documents (specified by this multiplier) in the initial search. This increases the set of results that will be reranked using recomputed similarity scores from full-precision vectors. Minimum value is 1, meaning no oversampling (1x). This parameter can only be set when rerankWithOriginalVectors is true. Higher values improve recall at the expense of latency.
kind | string: binary | | The name of the kind of compression method being configured for use with vector search.
name | string | | The name to associate with this particular configuration.
rerankWithOriginalVectors | boolean | True | If set to true, once the ordered set of results calculated using compressed vectors are obtained, they will be reranked again by recalculating the full-precision similarity scores. This will improve recall at the expense of latency.
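A hedged sketch of how a compression configuration might be wired up, assuming a vectorSearch.compressions array and the binaryQuantization kind value (neither appears in the sample request, so verify the exact property names against the VectorSearch definition):

"vectorSearch": {
  "compressions": [
    {
      "name": "myBinaryCompression",
      "kind": "binaryQuantization",
      "rerankWithOriginalVectors": true,
      "defaultOversampling": 4
    }
  ]
}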
BM25Similarity
Ranking function based on the Okapi BM25 similarity algorithm. BM25 is a TF-IDF-like algorithm that includes length normalization (controlled by the 'b' parameter) as well as term frequency saturation (controlled by the 'k1' parameter).
Name | Type | Description
---|---|---
@odata.type | string: #Microsoft. | 
b | number | This property controls how the length of a document affects the relevance score. By default, a value of 0.75 is used. A value of 0.0 means no length normalization is applied, while a value of 1.0 means the score is fully normalized by the length of the document.
k1 | number | This property controls the scaling function between the term frequency of each matching term and the final relevance score of a document-query pair. By default, a value of 1.2 is used. A value of 0.0 means the score does not scale with an increase in term frequency.
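The sample request in this article pins the index to ClassicSimilarity; switching to BM25 and tuning it would look roughly like the fragment below, where the b and k1 values simply restate the documented defaults:

"similarity": {
  "@odata.type": "#Microsoft.Azure.Search.BM25Similarity",
  "b": 0.75,
  "k1": 1.2
}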
CharFilterName
Defines the names of all character filters supported by the search engine.
Name | Type | Description
---|---|---
html_strip | string | A character filter that attempts to strip out HTML constructs. See https://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/charfilter/HTMLStripCharFilter.html
CjkBigramTokenFilter
Forms bigrams of CJK terms that are generated from the standard tokenizer. This token filter is implemented using Apache Lucene.
Name | Type | Default value | Description
---|---|---|---
@odata.type | string: #Microsoft. | | A URI fragment specifying the type of token filter.
ignoreScripts | CjkBigramTokenFilterScripts[] | | The scripts to ignore.
name | string | | The name of the token filter. It must only contain letters, digits, spaces, dashes or underscores, can only start and end with alphanumeric characters, and is limited to 128 characters.
outputUnigrams | boolean | False | A value indicating whether to output both unigrams and bigrams (if true), or just bigrams (if false). Default is false.
CjkBigramTokenFilterScripts
Scripts that can be ignored by CjkBigramTokenFilter.
Name | Type | Description
---|---|---
han | string | Ignore Han script when forming bigrams of CJK terms.
hangul | string | Ignore Hangul script when forming bigrams of CJK terms.
hiragana | string | Ignore Hiragana script when forming bigrams of CJK terms.
katakana | string | Ignore Katakana script when forming bigrams of CJK terms.
ClassicSimilarity
Legacy similarity algorithm which uses the Lucene TFIDFSimilarity implementation of TF-IDF. This variation of TF-IDF introduces static document length normalization as well as coordinating factors that penalize documents that only partially match the searched queries.
Name | Type | Description
---|---|---
@odata.type | string: #Microsoft. | 
ClassicTokenizer
Grammar-based tokenizer that is suitable for processing most European-language documents. This tokenizer is implemented using Apache Lucene.
Name | Type | Default value | Description
---|---|---|---
@odata.type | string: #Microsoft. | | A URI fragment specifying the type of tokenizer.
maxTokenLength | integer | 255 | The maximum token length. Default is 255. Tokens longer than the maximum length are split. The maximum token length that can be used is 300 characters.
name | string | | The name of the tokenizer. It must only contain letters, digits, spaces, dashes or underscores, can only start and end with alphanumeric characters, and is limited to 128 characters.
CommonGramTokenFilter
Construct bigrams for frequently occurring terms while indexing. Single terms are still indexed too, with bigrams overlaid. This token filter is implemented using Apache Lucene.
Name | Type | Default value | Description
---|---|---|---
@odata.type | string: #Microsoft. | | A URI fragment specifying the type of token filter.
commonWords | string[] | | The set of common words.
ignoreCase | boolean | False | A value indicating whether common words matching will be case insensitive. Default is false.
name | string | | The name of the token filter. It must only contain letters, digits, spaces, dashes or underscores, can only start and end with alphanumeric characters, and is limited to 128 characters.
queryMode | boolean | False | A value that indicates whether the token filter is in query mode. When in query mode, the token filter generates bigrams and then removes common words and single terms followed by a common word. Default is false.
CorsOptions
Defines options to control Cross-Origin Resource Sharing (CORS) for an index.
Name | Type | Description
---|---|---
allowedOrigins | string[] | The list of origins from which JavaScript code will be granted access to your index. Can contain a list of hosts of the form {protocol}://{fully-qualified-domain-name}[:{port#}], or a single '*' to allow all origins (not recommended).
maxAgeInSeconds | integer | The duration for which browsers should cache CORS preflight responses. Defaults to 5 minutes.
CustomAnalyzer
Allows you to take control over the process of converting text into indexable/searchable tokens. It's a user-defined configuration consisting of a single predefined tokenizer and one or more filters. The tokenizer is responsible for breaking text into tokens, and the filters for modifying tokens emitted by the tokenizer.
Name | Type | Description
---|---|---
@odata.type | string: #Microsoft. | A URI fragment specifying the type of analyzer.
charFilters | CharFilterName[] | A list of character filters used to prepare input text before it is processed by the tokenizer. For instance, they can replace certain characters or symbols. The filters are run in the order in which they are listed.
name | string | The name of the analyzer. It must only contain letters, digits, spaces, dashes or underscores, can only start and end with alphanumeric characters, and is limited to 128 characters.
tokenFilters | TokenFilterName[] | A list of token filters used to filter out or modify the tokens generated by a tokenizer. For example, you can specify a lowercase filter that converts all characters to lowercase. The filters are run in the order in which they are listed.
tokenizer | LexicalTokenizerName | The name of the tokenizer to use to divide continuous text into a sequence of tokens, such as breaking a sentence into words.
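The tagsAnalyzer in the sample request uses only a char filter and a tokenizer; a sketch of a custom analyzer that also chains token filters (the analyzer name is illustrative, the filter names are built-in TokenFilterName values) might look like:

{
  "name": "folding_analyzer",
  "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
  "charFilters": [ "html_strip" ],
  "tokenizer": "standard_v2",
  "tokenFilters": [ "lowercase", "asciifolding" ]
}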
DictionaryDecompounderTokenFilter
Decomposes compound words found in many Germanic languages. This token filter is implemented using Apache Lucene.
Name | Type | Default value | Description
---|---|---|---
@odata.type | string: #Microsoft. | | A URI fragment specifying the type of token filter.
maxSubwordSize | integer | 15 | The maximum subword size. Only subwords shorter than this are outputted. Default is 15. Maximum is 300.
minSubwordSize | integer | 2 | The minimum subword size. Only subwords longer than this are outputted. Default is 2. Maximum is 300.
minWordSize | integer | 5 | The minimum word size. Only words longer than this get processed. Default is 5. Maximum is 300.
name | string | | The name of the token filter. It must only contain letters, digits, spaces, dashes or underscores, can only start and end with alphanumeric characters, and is limited to 128 characters.
onlyLongestMatch | boolean | False | A value indicating whether to add only the longest matching subword to the output. Default is false.
wordList | string[] | | The list of words to match against.
DistanceScoringFunction
Defines a function that boosts scores based on distance from a geographic location.
Name | Type | Description
---|---|---
boost | number | A multiplier for the raw score. Must be a positive number not equal to 1.0.
distance | DistanceScoringParameters | Parameter values for the distance scoring function.
fieldName | string | The name of the field used as input to the scoring function.
interpolation | ScoringFunctionInterpolation | A value indicating how boosting will be interpolated across document scores; defaults to "Linear".
type | string: distance | Indicates the type of function to use. Valid values include magnitude, freshness, distance, and tag. The function type must be lower case.
DistanceScoringParameters
Provides parameter values to a distance scoring function.
Name | Type | Description
---|---|---
boostingDistance | number | The distance in kilometers from the reference location where the boosting range ends.
referencePointParameter | string | The name of the parameter passed in search queries to specify the reference location.
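referencePointParameter only names the parameter; the actual location is supplied at query time. Assuming the geo profile and currentLocation parameter from the sample, a search request might pass it roughly as follows (the longitude,latitude values are placeholders, and the exact syntax should be checked against the Search Documents operation):

POST {endpoint}/indexes('hotels')/docs/search?api-version=2024-07-01
{
  "search": "spa",
  "scoringProfile": "geo",
  "scoringParameters": [ "currentLocation--122.335114,47.612839" ]
}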
EdgeNGramTokenFilter
Generates n-grams of the given size(s) starting from the front or the back of an input token. This token filter is implemented using Apache Lucene.
Name | Type | Default value | Description
---|---|---|---
@odata.type | string: #Microsoft. | | A URI fragment specifying the type of token filter.
maxGram | integer | 2 | The maximum n-gram length. Default is 2.
minGram | integer | 1 | The minimum n-gram length. Default is 1. Must be less than the value of maxGram.
name | string | | The name of the token filter. It must only contain letters, digits, spaces, dashes or underscores, can only start and end with alphanumeric characters, and is limited to 128 characters.
side | EdgeNGramTokenFilterSide | front | Specifies which side of the input the n-gram should be generated from. Default is "front".
EdgeNGramTokenFilterSide
Specifies which side of the input an n-gram should be generated from.
Name | Type | Description
---|---|---
back | string | Specifies that the n-gram should be generated from the back of the input.
front | string | Specifies that the n-gram should be generated from the front of the input.
EdgeNGramTokenFilterV2
Generates n-grams of the given size(s) starting from the front or the back of an input token. This token filter is implemented using Apache Lucene.
Name | Type | Default value | Description
---|---|---|---
@odata.type | string: #Microsoft. | | A URI fragment specifying the type of token filter.
maxGram | integer | 2 | The maximum n-gram length. Default is 2. Maximum is 300.
minGram | integer | 1 | The minimum n-gram length. Default is 1. Maximum is 300. Must be less than the value of maxGram.
name | string | | The name of the token filter. It must only contain letters, digits, spaces, dashes or underscores, can only start and end with alphanumeric characters, and is limited to 128 characters.
side | EdgeNGramTokenFilterSide | front | Specifies which side of the input the n-gram should be generated from. Default is "front".
EdgeNGramTokenizer
Tokenizes the input from an edge into n-grams of the given size(s). This tokenizer is implemented using Apache Lucene.
Name | Type | Default value | Description
---|---|---|---
@odata.type | string: #Microsoft. | | A URI fragment specifying the type of tokenizer.
maxGram | integer | 2 | The maximum n-gram length. Default is 2. Maximum is 300.
minGram | integer | 1 | The minimum n-gram length. Default is 1. Maximum is 300. Must be less than the value of maxGram.
name | string | | The name of the tokenizer. It must only contain letters, digits, spaces, dashes or underscores, can only start and end with alphanumeric characters, and is limited to 128 characters.
tokenChars | TokenCharacterKind[] | | Character classes to keep in the tokens.
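An edge n-gram tokenizer is typically paired with a lowercase token filter inside a custom analyzer to get prefix-style matching; the following is an illustrative sketch with hypothetical names:

"tokenizers": [
  {
    "name": "my_edge_tokenizer",
    "@odata.type": "#Microsoft.Azure.Search.EdgeNGramTokenizer",
    "minGram": 2,
    "maxGram": 10,
    "tokenChars": [ "letter", "digit" ]
  }
],
"analyzers": [
  {
    "name": "prefix_analyzer",
    "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
    "tokenizer": "my_edge_tokenizer",
    "tokenFilters": [ "lowercase" ]
  }
]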
ElisionTokenFilter
Removes elisions. For example, "l'avion" (the plane) will be converted to "avion" (plane). This token filter is implemented using Apache Lucene.
Name | Type | Description
---|---|---
@odata.type | string: #Microsoft. | A URI fragment specifying the type of token filter.
articles | string[] | The set of articles to remove.
name | string | The name of the token filter. It must only contain letters, digits, spaces, dashes or underscores, can only start and end with alphanumeric characters, and is limited to 128 characters.
ErrorAdditionalInfo
The resource management error additional info.
Name | Type | Description
---|---|---
info | object | The additional info.
type | string | The additional info type.
ErrorDetail
The error detail.
Name | Type | Description
---|---|---
additionalInfo | ErrorAdditionalInfo[] | The error additional info.
code | string | The error code.
details | ErrorDetail[] | The error details.
message | string | The error message.
target | string | The error target.
ErrorResponse
Error response
Name | Type | Description
---|---|---
error | ErrorDetail | The error object.
ExhaustiveKnnParameters
Contains the parameters specific to exhaustive KNN algorithm.
Name | Type | Description
---|---|---
metric | VectorSearchAlgorithmMetric | The similarity metric to use for vector comparisons.
ExhaustiveKnnVectorSearchAlgorithmConfiguration
Contains configuration options specific to the exhaustive KNN algorithm used during querying, which will perform brute-force search across the entire vector index.
Name | Type | Description
---|---|---
exhaustiveKnnParameters | ExhaustiveKnnParameters | Contains the parameters specific to exhaustive KNN algorithm.
kind | string: exhaustive | The name of the kind of algorithm being configured for use with vector search.
name | string | The name to associate with this particular configuration.
FreshnessScoringFunction
Defines a function that boosts scores based on the value of a date-time field.
Name | Type | Description
---|---|---
boost | number | A multiplier for the raw score. Must be a positive number not equal to 1.0.
fieldName | string | The name of the field used as input to the scoring function.
freshness | FreshnessScoringParameters | Parameter values for the freshness scoring function.
interpolation | ScoringFunctionInterpolation | A value indicating how boosting will be interpolated across document scores; defaults to "Linear".
type | string: freshness | Indicates the type of function to use. Valid values include magnitude, freshness, distance, and tag. The function type must be lower case.
FreshnessScoringParameters
Provides parameter values to a freshness scoring function.
Name | Type | Description
---|---|---
boostingDuration | string | The expiration period after which boosting will stop for a particular document.
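boostingDuration uses XSD dayTimeDuration syntax (for example, P30D or PT12H). Assuming the lastRenovationDate field from the sample, a freshness function that boosts hotels renovated within the last year could be sketched as:

{
  "type": "freshness",
  "fieldName": "lastRenovationDate",
  "boost": 2,
  "interpolation": "quadratic",
  "freshness": {
    "boostingDuration": "P365D"
  }
}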
HnswParameters
Contains the parameters specific to the HNSW algorithm.
Name | Type | Default value | Description
---|---|---|---
efConstruction | integer | 400 | The size of the dynamic list containing the nearest neighbors, which is used during index time. Increasing this parameter may improve index quality, at the expense of increased indexing time. At a certain point, increasing this parameter leads to diminishing returns.
efSearch | integer | 500 | The size of the dynamic list containing the nearest neighbors, which is used during search time. Increasing this parameter may improve search results, at the expense of slower search. At a certain point, increasing this parameter leads to diminishing returns.
m | integer | 4 | The number of bi-directional links created for every new element during construction. Increasing this parameter value may improve recall and reduce retrieval times for datasets with high intrinsic dimensionality at the expense of increased memory consumption and longer indexing time.
metric | VectorSearchAlgorithmMetric | | The similarity metric to use for vector comparisons.
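The sample request sets only m and metric for the myHnsw algorithm, so efConstruction and efSearch fall back to their defaults; the fully expanded configuration returned in the sample response spells them out:

{
  "name": "myHnsw",
  "kind": "hnsw",
  "hnswParameters": {
    "metric": "cosine",
    "m": 4,
    "efConstruction": 400,
    "efSearch": 500
  }
}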
HnswVectorSearchAlgorithmConfiguration
Contains configuration options specific to the HNSW approximate nearest neighbors algorithm used during indexing and querying. The HNSW algorithm offers a tunable trade-off between search speed and accuracy.
Name | Type | Description
---|---|---
hnswParameters | HnswParameters | Contains the parameters specific to HNSW algorithm.
kind | string: hnsw | The name of the kind of algorithm being configured for use with vector search.
name | string | The name to associate with this particular configuration.
InputFieldMappingEntry
Input field mapping for a skill.
Name | Type | Description
---|---|---
inputs | InputFieldMappingEntry[] | The recursive inputs used when creating a complex type.
name | string | The name of the input.
source | string | The source of the input.
sourceContext | string | The source context used for selecting recursive inputs.
KeepTokenFilter
A token filter that only keeps tokens with text contained in a specified list of words. This token filter is implemented using Apache Lucene.
Name | Type | Default value | Description
---|---|---|---
@odata.type | string: #Microsoft. | | A URI fragment specifying the type of token filter.
keepWords | string[] | | The list of words to keep.
keepWordsCase | boolean | False | A value indicating whether to lower case all words first. Default is false.
name | string | | The name of the token filter. It must only contain letters, digits, spaces, dashes or underscores, can only start and end with alphanumeric characters, and is limited to 128 characters.
KeywordMarkerTokenFilter
Marks terms as keywords. This token filter is implemented using Apache Lucene.
Name | Type | Default value | Description
---|---|---|---
@odata.type | string: #Microsoft. | | A URI fragment specifying the type of token filter.
ignoreCase | boolean | False | A value indicating whether to ignore case. If true, all words are converted to lower case first. Default is false.
keywords | string[] | | A list of words to mark as keywords.
name | string | | The name of the token filter. It must only contain letters, digits, spaces, dashes or underscores, can only start and end with alphanumeric characters, and is limited to 128 characters.
KeywordTokenizer
Emits the entire input as a single token. This tokenizer is implemented using Apache Lucene.
Name | Type | Default value | Description
---|---|---|---
@odata.type | string: #Microsoft. | | A URI fragment specifying the type of tokenizer.
bufferSize | integer | 256 | The read buffer size in bytes. Default is 256.
name | string | | The name of the tokenizer. It must only contain letters, digits, spaces, dashes or underscores, can only start and end with alphanumeric characters, and is limited to 128 characters.
KeywordTokenizerV2
Emits the entire input as a single token. This tokenizer is implemented using Apache Lucene.
Name | Type | Default value | Description
---|---|---|---
@odata.type | string: #Microsoft. | | A URI fragment specifying the type of tokenizer.
maxTokenLength | integer | 256 | The maximum token length. Default is 256. Tokens longer than the maximum length are split. The maximum token length that can be used is 300 characters.
name | string | | The name of the tokenizer. It must only contain letters, digits, spaces, dashes or underscores, can only start and end with alphanumeric characters, and is limited to 128 characters.
LengthTokenFilter
Removes words that are too long or too short. This token filter is implemented using Apache Lucene.
Name | Type | Default value | Description
---|---|---|---
@odata.type | string: #Microsoft. | | A URI fragment specifying the type of token filter.
max | integer | 300 | The maximum length in characters. Default and maximum is 300.
min | integer | 0 | The minimum length in characters. Default is 0. Maximum is 300. Must be less than the value of max.
name | string | | The name of the token filter. It must only contain letters, digits, spaces, dashes or underscores, can only start and end with alphanumeric characters, and is limited to 128 characters.
LexicalAnalyzerName
Defines the names of all text analyzers supported by the search engine.
Name | Type | Description |
---|---|---|
ar.lucene |
string |
Lucene analyzer for Arabic. |
ar.microsoft |
string |
Microsoft analyzer for Arabic. |
bg.lucene |
string |
Lucene analyzer for Bulgarian. |
bg.microsoft |
string |
Microsoft analyzer for Bulgarian. |
bn.microsoft |
string |
Microsoft analyzer for Bangla. |
ca.lucene |
string |
Lucene analyzer for Catalan. |
ca.microsoft |
string |
Microsoft analyzer for Catalan. |
cs.lucene |
string |
Lucene analyzer for Czech. |
cs.microsoft |
string |
Microsoft analyzer for Czech. |
da.lucene |
string |
Lucene analyzer for Danish. |
da.microsoft |
string |
Microsoft analyzer for Danish. |
de.lucene |
string |
Lucene analyzer for German. |
de.microsoft |
string |
Microsoft analyzer for German. |
el.lucene |
string |
Lucene analyzer for Greek. |
el.microsoft |
string |
Microsoft analyzer for Greek. |
en.lucene |
string |
Lucene analyzer for English. |
en.microsoft |
string |
Microsoft analyzer for English. |
es.lucene |
string |
Lucene analyzer for Spanish. |
es.microsoft |
string |
Microsoft analyzer for Spanish. |
et.microsoft |
string |
Microsoft analyzer for Estonian. |
eu.lucene |
string |
Lucene analyzer for Basque. |
fa.lucene |
string |
Lucene analyzer for Persian. |
fi.lucene |
string |
Lucene analyzer for Finnish. |
fi.microsoft |
string |
Microsoft analyzer for Finnish. |
fr.lucene |
string |
Lucene analyzer for French. |
fr.microsoft |
string |
Microsoft analyzer for French. |
ga.lucene |
string |
Lucene analyzer for Irish. |
gl.lucene |
string |
Lucene analyzer for Galician. |
gu.microsoft |
string |
Microsoft analyzer for Gujarati. |
he.microsoft |
string |
Microsoft analyzer for Hebrew. |
hi.lucene |
string |
Lucene analyzer for Hindi. |
hi.microsoft |
string |
Microsoft analyzer for Hindi. |
hr.microsoft |
string |
Microsoft analyzer for Croatian. |
hu.lucene |
string |
Lucene analyzer for Hungarian. |
hu.microsoft |
string |
Microsoft analyzer for Hungarian. |
hy.lucene |
string |
Lucene analyzer for Armenian. |
id.lucene |
string |
Lucene analyzer for Indonesian. |
id.microsoft |
string |
Microsoft analyzer for Indonesian (Bahasa). |
is.microsoft |
string |
Microsoft analyzer for Icelandic. |
it.lucene |
string |
Lucene analyzer for Italian. |
it.microsoft |
string |
Microsoft analyzer for Italian. |
ja.lucene |
string |
Lucene analyzer for Japanese. |
ja.microsoft |
string |
Microsoft analyzer for Japanese. |
keyword |
string |
Treats the entire content of a field as a single token. This is useful for data like zip codes, IDs, and some product names. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/core/KeywordAnalyzer.html |
kn.microsoft |
string |
Microsoft analyzer for Kannada. |
ko.lucene |
string |
Lucene analyzer for Korean. |
ko.microsoft |
string |
Microsoft analyzer for Korean. |
lt.microsoft |
string |
Microsoft analyzer for Lithuanian. |
lv.lucene |
string |
Lucene analyzer for Latvian. |
lv.microsoft |
string |
Microsoft analyzer for Latvian. |
ml.microsoft |
string |
Microsoft analyzer for Malayalam. |
mr.microsoft |
string |
Microsoft analyzer for Marathi. |
ms.microsoft |
string |
Microsoft analyzer for Malay (Latin). |
nb.microsoft |
string |
Microsoft analyzer for Norwegian (Bokmål). |
nl.lucene |
string |
Lucene analyzer for Dutch. |
nl.microsoft |
string |
Microsoft analyzer for Dutch. |
no.lucene |
string |
Lucene analyzer for Norwegian. |
pa.microsoft |
string |
Microsoft analyzer for Punjabi. |
pattern |
string |
Flexibly separates text into terms via a regular expression pattern. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/miscellaneous/PatternAnalyzer.html |
pl.lucene |
string |
Lucene analyzer for Polish. |
pl.microsoft |
string |
Microsoft analyzer for Polish. |
pt-BR.lucene |
string |
Lucene analyzer for Portuguese (Brazil). |
pt-BR.microsoft |
string |
Microsoft analyzer for Portuguese (Brazil). |
pt-PT.lucene |
string |
Lucene analyzer for Portuguese (Portugal). |
pt-PT.microsoft |
string |
Microsoft analyzer for Portuguese (Portugal). |
ro.lucene |
string |
Lucene analyzer for Romanian. |
ro.microsoft |
string |
Microsoft analyzer for Romanian. |
ru.lucene |
string |
Lucene analyzer for Russian. |
ru.microsoft |
string |
Microsoft analyzer for Russian. |
simple |
string |
Divides text at non-letters and converts them to lower case. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/core/SimpleAnalyzer.html |
sk.microsoft |
string |
Microsoft analyzer for Slovak. |
sl.microsoft |
string |
Microsoft analyzer for Slovenian. |
sr-cyrillic.microsoft |
string |
Microsoft analyzer for Serbian (Cyrillic). |
sr-latin.microsoft |
string |
Microsoft analyzer for Serbian (Latin). |
standard.lucene |
string |
Standard Lucene analyzer. |
standardasciifolding.lucene |
string |
Standard ASCII Folding Lucene analyzer. See https://learn.microsoft.com/rest/api/searchservice/Custom-analyzers-in-Azure-Search#Analyzers |
stop |
string |
Divides text at non-letters; applies the lowercase and stopword token filters. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/core/StopAnalyzer.html |
sv.lucene |
string |
Lucene analyzer for Swedish. |
sv.microsoft |
string |
Microsoft analyzer for Swedish. |
ta.microsoft |
string |
Microsoft analyzer for Tamil. |
te.microsoft |
string |
Microsoft analyzer for Telugu. |
th.lucene |
string |
Lucene analyzer for Thai. |
th.microsoft |
string |
Microsoft analyzer for Thai. |
tr.lucene |
string |
Lucene analyzer for Turkish. |
tr.microsoft |
string |
Microsoft analyzer for Turkish. |
uk.microsoft |
string |
Microsoft analyzer for Ukrainian. |
ur.microsoft |
string |
Microsoft analyzer for Urdu. |
vi.microsoft |
string |
Microsoft analyzer for Vietnamese. |
whitespace |
string |
An analyzer that uses the whitespace tokenizer. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/core/WhitespaceAnalyzer.html |
zh-Hans.lucene |
string |
Lucene analyzer for Chinese (Simplified). |
zh-Hans.microsoft |
string |
Microsoft analyzer for Chinese (Simplified). |
zh-Hant.lucene |
string |
Lucene analyzer for Chinese (Traditional). |
zh-Hant.microsoft |
string |
Microsoft analyzer for Chinese (Traditional). |
LexicalTokenizerName
Defines the names of all tokenizers supported by the search engine.
LimitTokenFilter
Limits the number of tokens while indexing. This token filter is implemented using Apache Lucene.
Name | Type | Default value | Description |
---|---|---|---|
@odata.type |
string:
#Microsoft. |
A URI fragment specifying the type of token filter. |
|
consumeAllTokens |
boolean |
False |
A value indicating whether all tokens from the input must be consumed even if maxTokenCount is reached. Default is false. |
maxTokenCount |
integer |
1 |
The maximum number of tokens to produce. Default is 1. |
name |
string |
The name of the token filter. It must only contain letters, digits, spaces, dashes or underscores, can only start and end with alphanumeric characters, and is limited to 128 characters. |
LuceneStandardAnalyzer
Standard Apache Lucene analyzer; composed of the standard tokenizer, lowercase filter, and stop filter.
Name | Type | Default value | Description |
---|---|---|---|
@odata.type |
string:
#Microsoft. |
A URI fragment specifying the type of analyzer. |
|
maxTokenLength |
integer |
255 |
The maximum token length. Default is 255. Tokens longer than the maximum length are split. The maximum token length that can be used is 300 characters. |
name |
string |
The name of the analyzer. It must only contain letters, digits, spaces, dashes or underscores, can only start and end with alphanumeric characters, and is limited to 128 characters. |
|
stopwords |
string[] |
A list of stopwords. |
LuceneStandardTokenizer
Breaks text following the Unicode Text Segmentation rules. This tokenizer is implemented using Apache Lucene.
Name | Type | Default value | Description |
---|---|---|---|
@odata.type |
string:
#Microsoft. |
A URI fragment specifying the type of tokenizer. |
|
maxTokenLength |
integer |
255 |
The maximum token length. Default is 255. Tokens longer than the maximum length are split. |
name |
string |
The name of the tokenizer. It must only contain letters, digits, spaces, dashes or underscores, can only start and end with alphanumeric characters, and is limited to 128 characters. |
LuceneStandardTokenizerV2
Breaks text following the Unicode Text Segmentation rules. This tokenizer is implemented using Apache Lucene.
Name | Type | Default value | Description |
---|---|---|---|
@odata.type |
string:
#Microsoft. |
A URI fragment specifying the type of tokenizer. |
|
maxTokenLength |
integer |
255 |
The maximum token length. Default is 255. Tokens longer than the maximum length are split. The maximum token length that can be used is 300 characters. |
name |
string |
The name of the tokenizer. It must only contain letters, digits, spaces, dashes or underscores, can only start and end with alphanumeric characters, and is limited to 128 characters. |
MagnitudeScoringFunction
Defines a function that boosts scores based on the magnitude of a numeric field.
Name | Type | Description |
---|---|---|
boost |
number |
A multiplier for the raw score. Must be a positive number not equal to 1.0. |
fieldName |
string |
The name of the field used as input to the scoring function. |
interpolation |
A value indicating how boosting will be interpolated across document scores; defaults to "Linear". |
|
magnitude |
Parameter values for the magnitude scoring function. |
|
type |
string:
magnitude |
Indicates the type of function to use. Valid values include magnitude, freshness, distance, and tag. The function type must be lower case. |
MagnitudeScoringParameters
Provides parameter values to a magnitude scoring function.
Name | Type | Description |
---|---|---|
boostingRangeEnd |
number |
The field value at which boosting ends. |
boostingRangeStart |
number |
The field value at which boosting starts. |
constantBoostBeyondRange |
boolean |
A value indicating whether to apply a constant boost for field values beyond the range end value; default is false. |
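For illustration, a magnitude scoring function as it could appear inside a scoring profile's functions array; the field name "rating" and the boost values are hypothetical:

```json
{
  "type": "magnitude",
  "fieldName": "rating",
  "boost": 2.0,
  "interpolation": "linear",
  "magnitude": {
    "boostingRangeStart": 0,
    "boostingRangeEnd": 5,
    "constantBoostBeyondRange": false
  }
}
```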
MappingCharFilter
A character filter that applies mappings defined with the mappings option. Matching is greedy (longest pattern matching at a given point wins). Replacement is allowed to be the empty string. This character filter is implemented using Apache Lucene.
Name | Type | Description |
---|---|---|
@odata.type |
string:
#Microsoft. |
A URI fragment specifying the type of char filter. |
mappings |
string[] |
A list of mappings of the following format: "a=>b" (all occurrences of the character "a" will be replaced with character "b"). |
name |
string |
The name of the char filter. It must only contain letters, digits, spaces, dashes or underscores, can only start and end with alphanumeric characters, and is limited to 128 characters. |
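A minimal illustrative charFilters entry (the filter name and mappings are hypothetical); each mapping follows the "a=>b" format described above, and the fully qualified @odata.type should be confirmed against the service:

```json
{
  "charFilters": [
    {
      "@odata.type": "#Microsoft.Azure.Search.MappingCharFilter",
      "name": "normalize_symbols",
      "mappings": [ "-=>_", "&=>and" ]
    }
  ]
}
```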
MicrosoftLanguageStemmingTokenizer
Divides text using language-specific rules and reduces words to their base forms.
Name | Type | Default value | Description |
---|---|---|---|
@odata.type |
string:
#Microsoft. |
A URI fragment specifying the type of tokenizer. |
|
isSearchTokenizer |
boolean |
False |
A value indicating how the tokenizer is used. Set to true if used as the search tokenizer, set to false if used as the indexing tokenizer. Default is false. |
language |
The language to use. The default is English. |
||
maxTokenLength |
integer |
255 |
The maximum token length. Tokens longer than the maximum length are split. Maximum token length that can be used is 300 characters. Tokens longer than 300 characters are first split into tokens of length 300 and then each of those tokens is split based on the max token length set. Default is 255. |
name |
string |
The name of the tokenizer. It must only contain letters, digits, spaces, dashes or underscores, can only start and end with alphanumeric characters, and is limited to 128 characters. |
MicrosoftLanguageTokenizer
Divides text using language-specific rules.
Name | Type | Default value | Description |
---|---|---|---|
@odata.type |
string:
#Microsoft. |
A URI fragment specifying the type of tokenizer. |
|
isSearchTokenizer |
boolean |
False |
A value indicating how the tokenizer is used. Set to true if used as the search tokenizer, set to false if used as the indexing tokenizer. Default is false. |
language |
The language to use. The default is English. |
||
maxTokenLength |
integer |
255 |
The maximum token length. Tokens longer than the maximum length are split. Maximum token length that can be used is 300 characters. Tokens longer than 300 characters are first split into tokens of length 300 and then each of those tokens is split based on the max token length set. Default is 255. |
name |
string |
The name of the tokenizer. It must only contain letters, digits, spaces, dashes or underscores, can only start and end with alphanumeric characters, and is limited to 128 characters. |
MicrosoftStemmingTokenizerLanguage
Lists the languages supported by the Microsoft language stemming tokenizer.
Name | Type | Description |
---|---|---|
arabic |
string |
Selects the Microsoft stemming tokenizer for Arabic. |
bangla |
string |
Selects the Microsoft stemming tokenizer for Bangla. |
bulgarian |
string |
Selects the Microsoft stemming tokenizer for Bulgarian. |
catalan |
string |
Selects the Microsoft stemming tokenizer for Catalan. |
croatian |
string |
Selects the Microsoft stemming tokenizer for Croatian. |
czech |
string |
Selects the Microsoft stemming tokenizer for Czech. |
danish |
string |
Selects the Microsoft stemming tokenizer for Danish. |
dutch |
string |
Selects the Microsoft stemming tokenizer for Dutch. |
english |
string |
Selects the Microsoft stemming tokenizer for English. |
estonian |
string |
Selects the Microsoft stemming tokenizer for Estonian. |
finnish |
string |
Selects the Microsoft stemming tokenizer for Finnish. |
french |
string |
Selects the Microsoft stemming tokenizer for French. |
german |
string |
Selects the Microsoft stemming tokenizer for German. |
greek |
string |
Selects the Microsoft stemming tokenizer for Greek. |
gujarati |
string |
Selects the Microsoft stemming tokenizer for Gujarati. |
hebrew |
string |
Selects the Microsoft stemming tokenizer for Hebrew. |
hindi |
string |
Selects the Microsoft stemming tokenizer for Hindi. |
hungarian |
string |
Selects the Microsoft stemming tokenizer for Hungarian. |
icelandic |
string |
Selects the Microsoft stemming tokenizer for Icelandic. |
indonesian |
string |
Selects the Microsoft stemming tokenizer for Indonesian. |
italian |
string |
Selects the Microsoft stemming tokenizer for Italian. |
kannada |
string |
Selects the Microsoft stemming tokenizer for Kannada. |
latvian |
string |
Selects the Microsoft stemming tokenizer for Latvian. |
lithuanian |
string |
Selects the Microsoft stemming tokenizer for Lithuanian. |
malay |
string |
Selects the Microsoft stemming tokenizer for Malay. |
malayalam |
string |
Selects the Microsoft stemming tokenizer for Malayalam. |
marathi |
string |
Selects the Microsoft stemming tokenizer for Marathi. |
norwegianBokmaal |
string |
Selects the Microsoft stemming tokenizer for Norwegian (Bokmål). |
polish |
string |
Selects the Microsoft stemming tokenizer for Polish. |
portuguese |
string |
Selects the Microsoft stemming tokenizer for Portuguese. |
portugueseBrazilian |
string |
Selects the Microsoft stemming tokenizer for Portuguese (Brazil). |
punjabi |
string |
Selects the Microsoft stemming tokenizer for Punjabi. |
romanian |
string |
Selects the Microsoft stemming tokenizer for Romanian. |
russian |
string |
Selects the Microsoft stemming tokenizer for Russian. |
serbianCyrillic |
string |
Selects the Microsoft stemming tokenizer for Serbian (Cyrillic). |
serbianLatin |
string |
Selects the Microsoft stemming tokenizer for Serbian (Latin). |
slovak |
string |
Selects the Microsoft stemming tokenizer for Slovak. |
slovenian |
string |
Selects the Microsoft stemming tokenizer for Slovenian. |
spanish |
string |
Selects the Microsoft stemming tokenizer for Spanish. |
swedish |
string |
Selects the Microsoft stemming tokenizer for Swedish. |
tamil |
string |
Selects the Microsoft stemming tokenizer for Tamil. |
telugu |
string |
Selects the Microsoft stemming tokenizer for Telugu. |
turkish |
string |
Selects the Microsoft stemming tokenizer for Turkish. |
ukrainian |
string |
Selects the Microsoft stemming tokenizer for Ukrainian. |
urdu |
string |
Selects the Microsoft stemming tokenizer for Urdu. |
MicrosoftTokenizerLanguage
Lists the languages supported by the Microsoft language tokenizer.
Name | Type | Description |
---|---|---|
bangla |
string |
Selects the Microsoft tokenizer for Bangla. |
bulgarian |
string |
Selects the Microsoft tokenizer for Bulgarian. |
catalan |
string |
Selects the Microsoft tokenizer for Catalan. |
chineseSimplified |
string |
Selects the Microsoft tokenizer for Chinese (Simplified). |
chineseTraditional |
string |
Selects the Microsoft tokenizer for Chinese (Traditional). |
croatian |
string |
Selects the Microsoft tokenizer for Croatian. |
czech |
string |
Selects the Microsoft tokenizer for Czech. |
danish |
string |
Selects the Microsoft tokenizer for Danish. |
dutch |
string |
Selects the Microsoft tokenizer for Dutch. |
english |
string |
Selects the Microsoft tokenizer for English. |
french |
string |
Selects the Microsoft tokenizer for French. |
german |
string |
Selects the Microsoft tokenizer for German. |
greek |
string |
Selects the Microsoft tokenizer for Greek. |
gujarati |
string |
Selects the Microsoft tokenizer for Gujarati. |
hindi |
string |
Selects the Microsoft tokenizer for Hindi. |
icelandic |
string |
Selects the Microsoft tokenizer for Icelandic. |
indonesian |
string |
Selects the Microsoft tokenizer for Indonesian. |
italian |
string |
Selects the Microsoft tokenizer for Italian. |
japanese |
string |
Selects the Microsoft tokenizer for Japanese. |
kannada |
string |
Selects the Microsoft tokenizer for Kannada. |
korean |
string |
Selects the Microsoft tokenizer for Korean. |
malay |
string |
Selects the Microsoft tokenizer for Malay. |
malayalam |
string |
Selects the Microsoft tokenizer for Malayalam. |
marathi |
string |
Selects the Microsoft tokenizer for Marathi. |
norwegianBokmaal |
string |
Selects the Microsoft tokenizer for Norwegian (Bokmål). |
polish |
string |
Selects the Microsoft tokenizer for Polish. |
portuguese |
string |
Selects the Microsoft tokenizer for Portuguese. |
portugueseBrazilian |
string |
Selects the Microsoft tokenizer for Portuguese (Brazil). |
punjabi |
string |
Selects the Microsoft tokenizer for Punjabi. |
romanian |
string |
Selects the Microsoft tokenizer for Romanian. |
russian |
string |
Selects the Microsoft tokenizer for Russian. |
serbianCyrillic |
string |
Selects the Microsoft tokenizer for Serbian (Cyrillic). |
serbianLatin |
string |
Selects the Microsoft tokenizer for Serbian (Latin). |
slovenian |
string |
Selects the Microsoft tokenizer for Slovenian. |
spanish |
string |
Selects the Microsoft tokenizer for Spanish. |
swedish |
string |
Selects the Microsoft tokenizer for Swedish. |
tamil |
string |
Selects the Microsoft tokenizer for Tamil. |
telugu |
string |
Selects the Microsoft tokenizer for Telugu. |
thai |
string |
Selects the Microsoft tokenizer for Thai. |
ukrainian |
string |
Selects the Microsoft tokenizer for Ukrainian. |
urdu |
string |
Selects the Microsoft tokenizer for Urdu. |
vietnamese |
string |
Selects the Microsoft tokenizer for Vietnamese. |
NGramTokenFilter
Generates n-grams of the given size(s). This token filter is implemented using Apache Lucene.
Name | Type | Default value | Description |
---|---|---|---|
@odata.type |
string:
#Microsoft. |
A URI fragment specifying the type of token filter. |
|
maxGram |
integer |
2 |
The maximum n-gram length. Default is 2. |
minGram |
integer |
1 |
The minimum n-gram length. Default is 1. Must be less than the value of maxGram. |
name |
string |
The name of the token filter. It must only contain letters, digits, spaces, dashes or underscores, can only start and end with alphanumeric characters, and is limited to 128 characters. |
NGramTokenFilterV2
Generates n-grams of the given size(s). This token filter is implemented using Apache Lucene.
Name | Type | Default value | Description |
---|---|---|---|
@odata.type |
string:
#Microsoft. |
A URI fragment specifying the type of token filter. |
|
maxGram |
integer |
2 |
The maximum n-gram length. Default is 2. Maximum is 300. |
minGram |
integer |
1 |
The minimum n-gram length. Default is 1. Maximum is 300. Must be less than the value of maxGram. |
name |
string |
The name of the token filter. It must only contain letters, digits, spaces, dashes or underscores, can only start and end with alphanumeric characters, and is limited to 128 characters. |
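A hedged sketch of an NGramTokenFilterV2 entry (name and gram sizes are hypothetical); minGram must stay below maxGram:

```json
{
  "tokenFilters": [
    {
      "@odata.type": "#Microsoft.Azure.Search.NGramTokenFilterV2",
      "name": "my_ngram_filter",
      "minGram": 2,
      "maxGram": 3
    }
  ]
}
```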
NGramTokenizer
Tokenizes the input into n-grams of the given size(s). This tokenizer is implemented using Apache Lucene.
Name | Type | Default value | Description |
---|---|---|---|
@odata.type |
string:
#Microsoft. |
A URI fragment specifying the type of tokenizer. |
|
maxGram |
integer |
2 |
The maximum n-gram length. Default is 2. Maximum is 300. |
minGram |
integer |
1 |
The minimum n-gram length. Default is 1. Maximum is 300. Must be less than the value of maxGram. |
name |
string |
The name of the tokenizer. It must only contain letters, digits, spaces, dashes or underscores, can only start and end with alphanumeric characters, and is limited to 128 characters. |
|
tokenChars |
Character classes to keep in the tokens. |
OutputFieldMappingEntry
Output field mapping for a skill.
Name | Type | Description |
---|---|---|
name |
string |
The name of the output defined by the skill. |
targetName |
string |
The target name of the output. It is optional and defaults to the name. |
PathHierarchyTokenizerV2
Tokenizer for path-like hierarchies. This tokenizer is implemented using Apache Lucene.
Name | Type | Default value | Description |
---|---|---|---|
@odata.type |
string:
#Microsoft. |
A URI fragment specifying the type of tokenizer. |
|
delimiter |
string |
/ |
The delimiter character to use. Default is "/". |
maxTokenLength |
integer |
300 |
The maximum token length. Default and maximum is 300. |
name |
string |
The name of the tokenizer. It must only contain letters, digits, spaces, dashes or underscores, can only start and end with alphanumeric characters, and is limited to 128 characters. |
|
replacement |
string |
/ |
A value that, if set, replaces the delimiter character. Default is "/". |
reverse |
boolean |
False |
A value indicating whether to generate tokens in reverse order. Default is false. |
skip |
integer |
0 |
The number of initial tokens to skip. Default is 0. |
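For illustration only (the tokenizer name is hypothetical), a path hierarchy tokenizer that would emit the successive path prefixes of an input such as /products/cameras/slr:

```json
{
  "tokenizers": [
    {
      "@odata.type": "#Microsoft.Azure.Search.PathHierarchyTokenizerV2",
      "name": "path_tokenizer",
      "delimiter": "/",
      "reverse": false
    }
  ]
}
```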
PatternAnalyzer
Flexibly separates text into terms via a regular expression pattern. This analyzer is implemented using Apache Lucene.
Name | Type | Default value | Description |
---|---|---|---|
@odata.type |
string:
#Microsoft. |
A URI fragment specifying the type of analyzer. |
|
flags |
Regular expression flags. |
||
lowercase |
boolean |
True |
A value indicating whether terms should be lower-cased. Default is true. |
name |
string |
The name of the analyzer. It must only contain letters, digits, spaces, dashes or underscores, can only start and end with alphanumeric characters, and is limited to 128 characters. |
|
pattern |
string |
\W+ |
A regular expression pattern to match token separators. Default is an expression that matches one or more non-word characters. |
stopwords |
string[] |
A list of stopwords. |
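A minimal sketch of a pattern analyzer that splits on commas instead of the default non-word pattern; the analyzer name is hypothetical:

```json
{
  "analyzers": [
    {
      "@odata.type": "#Microsoft.Azure.Search.PatternAnalyzer",
      "name": "comma_analyzer",
      "pattern": ",",
      "lowercase": true,
      "stopwords": []
    }
  ]
}
```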
PatternCaptureTokenFilter
Uses Java regexes to emit multiple tokens - one for each capture group in one or more patterns. This token filter is implemented using Apache Lucene.
Name | Type | Default value | Description |
---|---|---|---|
@odata.type |
string:
#Microsoft. |
A URI fragment specifying the type of token filter. |
|
name |
string |
The name of the token filter. It must only contain letters, digits, spaces, dashes or underscores, can only start and end with alphanumeric characters, and is limited to 128 characters. |
|
patterns |
string[] |
A list of patterns to match against each token. |
|
preserveOriginal |
boolean |
True |
A value indicating whether to return the original token even if one of the patterns matches. Default is true. |
PatternReplaceCharFilter
A character filter that replaces characters in the input string. It uses a regular expression to identify character sequences to preserve and a replacement pattern to identify characters to replace. For example, given the input text "aa bb aa bb", pattern "(aa)\s+(bb)", and replacement "$1#$2", the result would be "aa#bb aa#bb". This character filter is implemented using Apache Lucene.
Name | Type | Description |
---|---|---|
@odata.type |
string:
#Microsoft. |
A URI fragment specifying the type of char filter. |
name |
string |
The name of the char filter. It must only contain letters, digits, spaces, dashes or underscores, can only start and end with alphanumeric characters, and is limited to 128 characters. |
pattern |
string |
A regular expression pattern. |
replacement |
string |
The replacement text. |
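As an illustrative example (the filter name is hypothetical), a pattern replace char filter that collapses runs of whitespace into a single space:

```json
{
  "charFilters": [
    {
      "@odata.type": "#Microsoft.Azure.Search.PatternReplaceCharFilter",
      "name": "collapse_whitespace",
      "pattern": "\\s+",
      "replacement": " "
    }
  ]
}
```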
PatternReplaceTokenFilter
A character filter that replaces characters in the input string. It uses a regular expression to identify character sequences to preserve and a replacement pattern to identify characters to replace. For example, given the input text "aa bb aa bb", pattern "(aa)\s+(bb)", and replacement "$1#$2", the result would be "aa#bb aa#bb". This token filter is implemented using Apache Lucene.
Name | Type | Description |
---|---|---|
@odata.type |
string:
#Microsoft. |
A URI fragment specifying the type of token filter. |
name |
string |
The name of the token filter. It must only contain letters, digits, spaces, dashes or underscores, can only start and end with alphanumeric characters, and is limited to 128 characters. |
pattern |
string |
A regular expression pattern. |
replacement |
string |
The replacement text. |
PatternTokenizer
Tokenizer that uses regex pattern matching to construct distinct tokens. This tokenizer is implemented using Apache Lucene.
Name | Type | Default value | Description |
---|---|---|---|
@odata.type |
string:
#Microsoft. |
A URI fragment specifying the type of tokenizer. |
|
flags |
Regular expression flags. |
||
group |
integer |
-1 |
The zero-based ordinal of the matching group in the regular expression pattern to extract into tokens. Use -1 if you want to use the entire pattern to split the input into tokens, irrespective of matching groups. Default is -1. |
name |
string |
The name of the tokenizer. It must only contain letters, digits, spaces, dashes or underscores, can only start and end with alphanumeric characters, and is limited to 128 characters. |
|
pattern |
string |
\W+ |
A regular expression pattern to match token separators. Default is an expression that matches one or more non-word characters. |
PhoneticEncoder
Identifies the type of phonetic encoder to use with a PhoneticTokenFilter.
Name | Type | Description |
---|---|---|
beiderMorse |
string |
Encodes a token into a Beider-Morse value. |
caverphone1 |
string |
Encodes a token into a Caverphone 1.0 value. |
caverphone2 |
string |
Encodes a token into a Caverphone 2.0 value. |
cologne |
string |
Encodes a token into a Cologne Phonetic value. |
doubleMetaphone |
string |
Encodes a token into a double metaphone value. |
haasePhonetik |
string |
Encodes a token using the Haase refinement of the Kölner Phonetik algorithm. |
koelnerPhonetik |
string |
Encodes a token using the Kölner Phonetik algorithm. |
metaphone |
string |
Encodes a token into a Metaphone value. |
nysiis |
string |
Encodes a token into a NYSIIS value. |
refinedSoundex |
string |
Encodes a token into a Refined Soundex value. |
soundex |
string |
Encodes a token into a Soundex value. |
PhoneticTokenFilter
Create tokens for phonetic matches. This token filter is implemented using Apache Lucene.
Name | Type | Default value | Description |
---|---|---|---|
@odata.type |
string:
#Microsoft. |
A URI fragment specifying the type of token filter. |
|
encoder | metaphone |
The phonetic encoder to use. Default is "metaphone". |
|
name |
string |
The name of the token filter. It must only contain letters, digits, spaces, dashes or underscores, can only start and end with alphanumeric characters, and is limited to 128 characters. |
|
replace |
boolean |
True |
A value indicating whether encoded tokens should replace original tokens. If false, encoded tokens are added as synonyms. Default is true. |
PrioritizedFields
Describes the title, content, and keywords fields to be used for semantic ranking, captions, highlights, and answers.
Name | Type | Description |
---|---|---|
prioritizedContentFields |
Defines the content fields to be used for semantic ranking, captions, highlights, and answers. For the best result, the selected fields should contain text in natural language form. The order of the fields in the array represents their priority. Fields with lower priority may get truncated if the content is long. |
|
prioritizedKeywordsFields |
Defines the keyword fields to be used for semantic ranking, captions, highlights, and answers. For the best result, the selected fields should contain a list of keywords. The order of the fields in the array represents their priority. Fields with lower priority may get truncated if the content is long. |
|
titleField |
Defines the title field to be used for semantic ranking, captions, highlights, and answers. If you don't have a title field in your index, leave this blank. |
RegexFlags
Defines flags that can be combined to control how regular expressions are used in the pattern analyzer and pattern tokenizer.
Name | Type | Description |
---|---|---|
CANON_EQ |
string |
Enables canonical equivalence. |
CASE_INSENSITIVE |
string |
Enables case-insensitive matching. |
COMMENTS |
string |
Permits whitespace and comments in the pattern. |
DOTALL |
string |
Enables dotall mode. |
LITERAL |
string |
Enables literal parsing of the pattern. |
MULTILINE |
string |
Enables multiline mode. |
UNICODE_CASE |
string |
Enables Unicode-aware case folding. |
UNIX_LINES |
string |
Enables Unix lines mode. |
ScalarQuantizationParameters
Contains the parameters specific to Scalar Quantization.
Name | Type | Description |
---|---|---|
quantizedDataType |
The quantized data type of compressed vector values. |
ScalarQuantizationVectorSearchCompressionConfiguration
Contains configuration options specific to the scalar quantization compression method used during indexing and querying.
Name | Type | Default value | Description |
---|---|---|---|
defaultOversampling |
number |
Default oversampling factor. Oversampling will internally request more documents (specified by this multiplier) in the initial search. This increases the set of results that will be reranked using recomputed similarity scores from full-precision vectors. Minimum value is 1, meaning no oversampling (1x). This parameter can only be set when rerankWithOriginalVectors is true. Higher values improve recall at the expense of latency. |
|
kind |
string:
scalar |
The name of the kind of compression method being configured for use with vector search. |
|
name |
string |
The name to associate with this particular configuration. |
|
rerankWithOriginalVectors |
boolean |
True |
If set to true, once the ordered set of results calculated using compressed vectors is obtained, they will be reranked again by recalculating the full-precision similarity scores. This will improve recall at the expense of latency. |
scalarQuantizationParameters |
Contains the parameters specific to Scalar Quantization. |
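A hedged sketch of a scalar quantization compression entry under vectorSearch.compressions; the configuration name is hypothetical, and the exact kind discriminator and quantizedDataType values should be verified against the 2024-07-01 reference:

```json
{
  "vectorSearch": {
    "compressions": [
      {
        "kind": "scalarQuantization",
        "name": "my_scalar_quantization",
        "rerankWithOriginalVectors": true,
        "defaultOversampling": 4,
        "scalarQuantizationParameters": { "quantizedDataType": "int8" }
      }
    ]
  }
}
```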
ScoringFunctionAggregation
Defines the aggregation function used to combine the results of all the scoring functions in a scoring profile.
Name | Type | Description |
---|---|---|
average |
string |
Boost scores by the average of all scoring function results. |
firstMatching |
string |
Boost scores using the first applicable scoring function in the scoring profile. |
maximum |
string |
Boost scores by the maximum of all scoring function results. |
minimum |
string |
Boost scores by the minimum of all scoring function results. |
sum |
string |
Boost scores by the sum of all scoring function results. |
ScoringFunctionInterpolation
Defines the function used to interpolate score boosting across a range of documents.
Name | Type | Description |
---|---|---|
constant |
string |
Boosts scores by a constant factor. |
linear |
string |
Boosts scores by a linearly decreasing amount. This is the default interpolation for scoring functions. |
logarithmic |
string |
Boosts scores by an amount that decreases logarithmically. Boosts decrease quickly for higher scores, and more slowly as the scores decrease. This interpolation option is not allowed in tag scoring functions. |
quadratic |
string |
Boosts scores by an amount that decreases quadratically. Boosts decrease slowly for higher scores, and more quickly as the scores decrease. This interpolation option is not allowed in tag scoring functions. |
ScoringProfile
Defines parameters for a search index that influence scoring in search queries.
Name | Type | Description |
---|---|---|
functionAggregation |
A value indicating how the results of individual scoring functions should be combined. Defaults to "Sum". Ignored if there are no scoring functions. |
|
functions | ScoringFunction[]: |
The collection of functions that influence the scoring of documents. |
name |
string |
The name of the scoring profile. |
text |
Parameters that boost scoring based on text matches in certain index fields. |
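For illustration, a scoring profile combining text weights with the magnitude function sketched earlier; the profile name, field names, and weight values are hypothetical:

```json
{
  "scoringProfiles": [
    {
      "name": "boost_rating_and_title",
      "functionAggregation": "sum",
      "text": {
        "weights": { "title": 5, "description": 1.5 }
      },
      "functions": [
        {
          "type": "magnitude",
          "fieldName": "rating",
          "boost": 2,
          "interpolation": "linear",
          "magnitude": { "boostingRangeStart": 0, "boostingRangeEnd": 5 }
        }
      ]
    }
  ]
}
```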
SearchField
Represents a field in an index definition, which describes the name, data type, and search behavior of a field.
Name | Type | Description |
---|---|---|
analyzer |
The name of the analyzer to use for the field. This option can be used only with searchable fields and it can't be set together with either searchAnalyzer or indexAnalyzer. Once the analyzer is chosen, it cannot be changed for the field. Must be null for complex fields. |
|
dimensions |
integer |
The dimensionality of the vector field. |
facetable |
boolean |
A value indicating whether to enable the field to be referenced in facet queries. Typically used in a presentation of search results that includes hit count by category (for example, search for digital cameras and see hits by brand, by megapixels, by price, and so on). This property must be null for complex fields. Fields of type Edm.GeographyPoint or Collection(Edm.GeographyPoint) cannot be facetable. Default is true for all other simple fields. |
fields |
A list of sub-fields if this is a field of type Edm.ComplexType or Collection(Edm.ComplexType). Must be null or empty for simple fields. |
|
filterable |
boolean |
A value indicating whether to enable the field to be referenced in $filter queries. filterable differs from searchable in how strings are handled. Fields of type Edm.String or Collection(Edm.String) that are filterable do not undergo word-breaking, so comparisons are for exact matches only. For example, if you set such a field f to "sunny day", $filter=f eq 'sunny' will find no matches, but $filter=f eq 'sunny day' will. This property must be null for complex fields. Default is true for simple fields and null for complex fields. |
indexAnalyzer |
The name of the analyzer used at indexing time for the field. This option can be used only with searchable fields. It must be set together with searchAnalyzer and it cannot be set together with the analyzer option. This property cannot be set to the name of a language analyzer; use the analyzer property instead if you need a language analyzer. Once the analyzer is chosen, it cannot be changed for the field. Must be null for complex fields. |
|
key |
boolean |
A value indicating whether the field uniquely identifies documents in the index. Exactly one top-level field in each index must be chosen as the key field and it must be of type Edm.String. Key fields can be used to look up documents directly and update or delete specific documents. Default is false for simple fields and null for complex fields. |
name |
string |
The name of the field, which must be unique within the fields collection of the index or parent field. |
retrievable |
boolean |
A value indicating whether the field can be returned in a search result. You can disable this option if you want to use a field (for example, margin) as a filter, sorting, or scoring mechanism but do not want the field to be visible to the end user. This property must be true for key fields, and it must be null for complex fields. This property can be changed on existing fields. Enabling this property does not cause any increase in index storage requirements. Default is true for simple fields, false for vector fields, and null for complex fields. |
searchAnalyzer |
The name of the analyzer used at search time for the field. This option can be used only with searchable fields. It must be set together with indexAnalyzer and it cannot be set together with the analyzer option. This property cannot be set to the name of a language analyzer; use the analyzer property instead if you need a language analyzer. This analyzer can be updated on an existing field. Must be null for complex fields. |
|
searchable |
boolean |
A value indicating whether the field is full-text searchable. This means it will undergo analysis such as word-breaking during indexing. If you set a searchable field to a value like "sunny day", internally it will be split into the individual tokens "sunny" and "day". This enables full-text searches for these terms. Fields of type Edm.String or Collection(Edm.String) are searchable by default. This property must be false for simple fields of other non-string data types, and it must be null for complex fields. Note: searchable fields consume extra space in your index to accommodate additional tokenized versions of the field value for full-text searches. If you want to save space in your index and you don't need a field to be included in searches, set searchable to false. |
sortable |
boolean |
A value indicating whether to enable the field to be referenced in $orderby expressions. By default, the search engine sorts results by score, but in many experiences users will want to sort by fields in the documents. A simple field can be sortable only if it is single-valued (it has a single value in the scope of the parent document). Simple collection fields cannot be sortable, since they are multi-valued. Simple sub-fields of complex collections are also multi-valued, and therefore cannot be sortable. This is true whether it's an immediate parent field, or an ancestor field, that's the complex collection. Complex fields cannot be sortable and the sortable property must be null for such fields. The default for sortable is true for single-valued simple fields, false for multi-valued simple fields, and null for complex fields. |
stored |
boolean |
An immutable value indicating whether the field will be persisted separately on disk to be returned in a search result. You can disable this option if you don't plan to return the field contents in a search response to save on storage overhead. This can only be set during index creation and only for vector fields. This property cannot be changed for existing fields or set as false for new fields. If this property is set as false, the property 'retrievable' must also be set to false. This property must be true or unset for key fields, for new fields, and for non-vector fields, and it must be null for complex fields. Disabling this property will reduce index storage requirements. The default is true for vector fields. |
synonymMaps |
string[] |
A list of the names of synonym maps to associate with this field. This option can be used only with searchable fields. Currently only one synonym map per field is supported. Assigning a synonym map to a field ensures that query terms targeting that field are expanded at query-time using the rules in the synonym map. This attribute can be changed on existing fields. Must be null or an empty collection for complex fields. |
type |
The data type of the field. |
|
vectorEncoding |
The encoding format to interpret the field contents. |
|
vectorSearchProfile |
string |
The name of the vector search profile that specifies the algorithm and vectorizer to use when searching the vector field. |
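A minimal illustrative fields collection showing a key field, a searchable string field, and a vector field; the field names, dimensions, and vector search profile name are hypothetical, and the referenced profile must exist under vectorSearch.profiles:

```json
{
  "fields": [
    { "name": "id", "type": "Edm.String", "key": true, "filterable": true },
    { "name": "title", "type": "Edm.String", "searchable": true, "analyzer": "en.lucene" },
    {
      "name": "titleVector",
      "type": "Collection(Edm.Single)",
      "searchable": true,
      "dimensions": 1536,
      "vectorSearchProfile": "my_vector_profile",
      "retrievable": false,
      "stored": false
    }
  ]
}
```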
SearchFieldDataType
Defines the data type of a field in a search index.
Name | Type | Description |
---|---|---|
Edm.Boolean |
string |
Indicates that a field contains a Boolean value (true or false). |
Edm.Byte |
string |
Indicates that a field contains an 8-bit unsigned integer. This is only valid when used with Collection(Edm.Byte). |
Edm.ComplexType |
string |
Indicates that a field contains one or more complex objects that in turn have sub-fields of other types. |
Edm.DateTimeOffset |
string |
Indicates that a field contains a date/time value, including timezone information. |
Edm.Double |
string |
Indicates that a field contains an IEEE double-precision floating point number. |
Edm.GeographyPoint |
string |
Indicates that a field contains a geo-location in terms of longitude and latitude. |
Edm.Half |
string |
Indicates that a field contains a half-precision floating point number. This is only valid when used with Collection(Edm.Half). |
Edm.Int16 |
string |
Indicates that a field contains a 16-bit signed integer. This is only valid when used with Collection(Edm.Int16). |
Edm.Int32 |
string |
Indicates that a field contains a 32-bit signed integer. |
Edm.Int64 |
string |
Indicates that a field contains a 64-bit signed integer. |
Edm.SByte |
string |
Indicates that a field contains an 8-bit signed integer. This is only valid when used with Collection(Edm.SByte). |
Edm.Single |
string |
Indicates that a field contains a single-precision floating point number. This is only valid when used with Collection(Edm.Single). |
Edm.String |
string |
Indicates that a field contains a string. |
SearchIndex
Represents a search index definition, which describes the fields and search behavior of an index.
Name | Type | Description |
---|---|---|
@odata.etag |
string |
The ETag of the index. |
analyzers | LexicalAnalyzer[]: |
The analyzers for the index. |
charFilters | CharFilter[]: |
The character filters for the index. |
corsOptions |
Options to control Cross-Origin Resource Sharing (CORS) for the index. |
|
defaultScoringProfile |
string |
The name of the scoring profile to use if none is specified in the query. If this property is not set and no scoring profile is specified in the query, then default scoring (tf-idf) will be used. |
encryptionKey |
A description of an encryption key that you create in Azure Key Vault. This key is used to provide an additional level of encryption-at-rest for your data when you want full assurance that no one, not even Microsoft, can decrypt your data. Once you have encrypted your data, it will always remain encrypted. The search service will ignore attempts to set this property to null. You can change this property as needed if you want to rotate your encryption key; your data will be unaffected. Encryption with customer-managed keys is not available for free search services, and is only available for paid services created on or after January 1, 2019. |
|
fields |
The fields of the index. |
|
name |
string |
The name of the index. |
scoringProfiles |
The scoring profiles for the index. |
|
semantic |
Defines parameters for a search index that influence semantic capabilities. |
|
similarity | Similarity: |
The type of similarity algorithm to be used when scoring and ranking the documents matching a search query. The similarity algorithm can only be defined at index creation time and cannot be modified on existing indexes. If null, the ClassicSimilarity algorithm is used. |
suggesters |
The suggesters for the index. |
|
tokenFilters |
TokenFilter[]:
|
The token filters for the index. |
tokenizers | LexicalTokenizer[]: |
The tokenizers for the index. |
vectorSearch |
Contains configuration options related to vector search. |
SearchIndexerDataNoneIdentity
Clears the identity property of a datasource.
Name | Type | Description |
---|---|---|
@odata.type |
string:
#Microsoft. |
A URI fragment specifying the type of identity. |
SearchIndexerDataUserAssignedIdentity
Specifies the identity for a datasource to use.
Name | Type | Description |
---|---|---|
@odata.type |
string:
#Microsoft. |
A URI fragment specifying the type of identity. |
userAssignedIdentity |
string |
The fully qualified Azure resource ID of a user-assigned managed identity, typically in the form "/subscriptions/12345678-1234-1234-1234-1234567890ab/resourceGroups/rg/providers/Microsoft.ManagedIdentity/userAssignedIdentities/myId", that should have been assigned to the search service. |
SearchResourceEncryptionKey
A customer-managed encryption key in Azure Key Vault. Keys that you create and manage can be used to encrypt or decrypt data-at-rest, such as indexes and synonym maps.
Name | Type | Description |
---|---|---|
accessCredentials |
Optional Azure Active Directory credentials used for accessing your Azure Key Vault. Not required if using managed identity instead. |
|
keyVaultKeyName |
string |
The name of your Azure Key Vault key to be used to encrypt your data at rest. |
keyVaultKeyVersion |
string |
The version of your Azure Key Vault key to be used to encrypt your data at rest. |
keyVaultUri |
string |
The URI of your Azure Key Vault, also referred to as DNS name, that contains the key to be used to encrypt your data at rest. An example URI might be |
SemanticConfiguration
Defines a specific configuration to be used in the context of semantic capabilities.
Name | Type | Description |
---|---|---|
name |
string |
The name of the semantic configuration. |
prioritizedFields |
Describes the title, content, and keyword fields to be used for semantic ranking, captions, highlights, and answers. At least one of the three sub properties (titleField, prioritizedKeywordsFields and prioritizedContentFields) need to be set. |
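As a sketch only (all names are hypothetical), a semantic configuration wired into the index's semantic settings:

```json
{
  "semantic": {
    "defaultConfiguration": "my_semantic_config",
    "configurations": [
      {
        "name": "my_semantic_config",
        "prioritizedFields": {
          "titleField": { "fieldName": "title" },
          "prioritizedContentFields": [ { "fieldName": "description" } ],
          "prioritizedKeywordsFields": [ { "fieldName": "tags" } ]
        }
      }
    ]
  }
}
```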
SemanticField
A field that is used as part of the semantic configuration.
Name | Type | Description |
---|---|---|
fieldName |
string |
SemanticSettings
Defines parameters for a search index that influence semantic capabilities.
Name | Type | Description |
---|---|---|
configurations |
The semantic configurations for the index. |
|
defaultConfiguration |
string |
Allows you to set the name of a default semantic configuration in your index, making it optional to pass it on as a query parameter every time. |
ShingleTokenFilter
Creates combinations of tokens as a single token. This token filter is implemented using Apache Lucene.
Name | Type | Default value | Description |
---|---|---|---|
@odata.type |
string:
#Microsoft. |
A URI fragment specifying the type of token filter. |
|
filterToken |
string |
_ |
The string to insert for each position at which there is no token. Default is an underscore ("_"). |
maxShingleSize |
integer |
2 |
The maximum shingle size. Default and minimum value is 2. |
minShingleSize |
integer |
2 |
The minimum shingle size. Default and minimum value is 2. Must be less than the value of maxShingleSize. |
name |
string |
The name of the token filter. It must only contain letters, digits, spaces, dashes or underscores, can only start and end with alphanumeric characters, and is limited to 128 characters. |
|
outputUnigrams |
boolean |
True |
A value indicating whether the output stream will contain the input tokens (unigrams) as well as shingles. Default is true. |
outputUnigramsIfNoShingles |
boolean |
False |
A value indicating whether to output unigrams for those times when no shingles are available. This property takes precedence when outputUnigrams is set to false. Default is false. |
tokenSeparator |
string |
The string to use when joining adjacent tokens to form a shingle. Default is a single space (" "). |
SnowballTokenFilter
A filter that stems words using a Snowball-generated stemmer. This token filter is implemented using Apache Lucene.
Name | Type | Description |
---|---|---|
@odata.type |
string:
#Microsoft. |
A URI fragment specifying the type of token filter. |
language |
The language to use. |
|
name |
string |
The name of the token filter. It must only contain letters, digits, spaces, dashes or underscores, can only start and end with alphanumeric characters, and is limited to 128 characters. |
SnowballTokenFilterLanguage
The language to use for a Snowball token filter.
Name | Type | Description |
---|---|---|
armenian |
string |
Selects the Lucene Snowball stemming tokenizer for Armenian. |
basque |
string |
Selects the Lucene Snowball stemming tokenizer for Basque. |
catalan |
string |
Selects the Lucene Snowball stemming tokenizer for Catalan. |
danish |
string |
Selects the Lucene Snowball stemming tokenizer for Danish. |
dutch |
string |
Selects the Lucene Snowball stemming tokenizer for Dutch. |
english |
string |
Selects the Lucene Snowball stemming tokenizer for English. |
finnish |
string |
Selects the Lucene Snowball stemming tokenizer for Finnish. |
french |
string |
Selects the Lucene Snowball stemming tokenizer for French. |
german |
string |
Selects the Lucene Snowball stemming tokenizer for German. |
german2 |
string |
Selects the Lucene Snowball stemming tokenizer that uses the German variant algorithm. |
hungarian |
string |
Selects the Lucene Snowball stemming tokenizer for Hungarian. |
italian |
string |
Selects the Lucene Snowball stemming tokenizer for Italian. |
kp |
string |
Selects the Lucene Snowball stemming tokenizer for Dutch that uses the Kraaij-Pohlmann stemming algorithm. |
lovins |
string |
Selects the Lucene Snowball stemming tokenizer for English that uses the Lovins stemming algorithm. |
norwegian |
string |
Selects the Lucene Snowball stemming tokenizer for Norwegian. |
porter |
string |
Selects the Lucene Snowball stemming tokenizer for English that uses the Porter stemming algorithm. |
portuguese |
string |
Selects the Lucene Snowball stemming tokenizer for Portuguese. |
romanian |
string |
Selects the Lucene Snowball stemming tokenizer for Romanian. |
russian |
string |
Selects the Lucene Snowball stemming tokenizer for Russian. |
spanish |
string |
Selects the Lucene Snowball stemming tokenizer for Spanish. |
swedish |
string |
Selects the Lucene Snowball stemming tokenizer for Swedish. |
turkish |
string |
Selects the Lucene Snowball stemming tokenizer for Turkish. |
StemmerOverrideTokenFilter
Provides the ability to override other stemming filters with custom dictionary-based stemming. Any dictionary-stemmed terms will be marked as keywords so that they will not be stemmed with stemmers down the chain. Must be placed before any stemming filters. This token filter is implemented using Apache Lucene.
Name | Type | Description |
---|---|---|
@odata.type |
string:
#Microsoft. |
A URI fragment specifying the type of token filter. |
name |
string |
The name of the token filter. It must only contain letters, digits, spaces, dashes or underscores, can only start and end with alphanumeric characters, and is limited to 128 characters. |
rules |
string[] |
A list of stemming rules in the following format: "word => stem", for example: "ran => run". |
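An illustrative stemmer override entry using the rule format shown above; the filter name and rules are hypothetical:

```json
{
  "tokenFilters": [
    {
      "@odata.type": "#Microsoft.Azure.Search.StemmerOverrideTokenFilter",
      "name": "protect_terms",
      "rules": [ "ran => run", "mice => mouse" ]
    }
  ]
}
```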
StemmerTokenFilter
Language specific stemming filter. This token filter is implemented using Apache Lucene.
Name | Type | Description |
---|---|---|
@odata.type |
string:
#Microsoft. |
A URI fragment specifying the type of token filter. |
language |
The language to use. |
|
name |
string |
The name of the token filter. It must only contain letters, digits, spaces, dashes or underscores, can only start and end with alphanumeric characters, and is limited to 128 characters. |
StemmerTokenFilterLanguage
The language to use for a stemmer token filter.
Name | Type | Description |
---|---|---|
arabic |
string |
Selects the Lucene stemming tokenizer for Arabic. |
armenian |
string |
Selects the Lucene stemming tokenizer for Armenian. |
basque |
string |
Selects the Lucene stemming tokenizer for Basque. |
brazilian |
string |
Selects the Lucene stemming tokenizer for Portuguese (Brazil). |
bulgarian |
string |
Selects the Lucene stemming tokenizer for Bulgarian. |
catalan |
string |
Selects the Lucene stemming tokenizer for Catalan. |
czech |
string |
Selects the Lucene stemming tokenizer for Czech. |
danish |
string |
Selects the Lucene stemming tokenizer for Danish. |
dutch |
string |
Selects the Lucene stemming tokenizer for Dutch. |
dutchKp |
string |
Selects the Lucene stemming tokenizer for Dutch that uses the Kraaij-Pohlmann stemming algorithm. |
english |
string |
Selects the Lucene stemming tokenizer for English. |
finnish |
string |
Selects the Lucene stemming tokenizer for Finnish. |
french |
string |
Selects the Lucene stemming tokenizer for French. |
galician |
string |
Selects the Lucene stemming tokenizer for Galician. |
german |
string |
Selects the Lucene stemming tokenizer for German. |
german2 |
string |
Selects the Lucene stemming tokenizer that uses the German variant algorithm. |
greek |
string |
Selects the Lucene stemming tokenizer for Greek. |
hindi |
string |
Selects the Lucene stemming tokenizer for Hindi. |
hungarian |
string |
Selects the Lucene stemming tokenizer for Hungarian. |
indonesian |
string |
Selects the Lucene stemming tokenizer for Indonesian. |
irish |
string |
Selects the Lucene stemming tokenizer for Irish. |
italian |
string |
Selects the Lucene stemming tokenizer for Italian. |
latvian |
string |
Selects the Lucene stemming tokenizer for Latvian. |
lightEnglish |
string |
Selects the Lucene stemming tokenizer for English that does light stemming. |
lightFinnish |
string |
Selects the Lucene stemming tokenizer for Finnish that does light stemming. |
lightFrench |
string |
Selects the Lucene stemming tokenizer for French that does light stemming. |
lightGerman |
string |
Selects the Lucene stemming tokenizer for German that does light stemming. |
lightHungarian |
string |
Selects the Lucene stemming tokenizer for Hungarian that does light stemming. |
lightItalian |
string |
Selects the Lucene stemming tokenizer for Italian that does light stemming. |
lightNorwegian |
string |
Selects the Lucene stemming tokenizer for Norwegian (Bokmål) that does light stemming. |
lightNynorsk |
string |
Selects the Lucene stemming tokenizer for Norwegian (Nynorsk) that does light stemming. |
lightPortuguese |
string |
Selects the Lucene stemming tokenizer for Portuguese that does light stemming. |
lightRussian |
string |
Selects the Lucene stemming tokenizer for Russian that does light stemming. |
lightSpanish |
string |
Selects the Lucene stemming tokenizer for Spanish that does light stemming. |
lightSwedish |
string |
Selects the Lucene stemming tokenizer for Swedish that does light stemming. |
lovins |
string |
Selects the Lucene stemming tokenizer for English that uses the Lovins stemming algorithm. |
minimalEnglish |
string |
Selects the Lucene stemming tokenizer for English that does minimal stemming. |
minimalFrench |
string |
Selects the Lucene stemming tokenizer for French that does minimal stemming. |
minimalGalician |
string |
Selects the Lucene stemming tokenizer for Galician that does minimal stemming. |
minimalGerman |
string |
Selects the Lucene stemming tokenizer for German that does minimal stemming. |
minimalNorwegian |
string |
Selects the Lucene stemming tokenizer for Norwegian (Bokmål) that does minimal stemming. |
minimalNynorsk |
string |
Selects the Lucene stemming tokenizer for Norwegian (Nynorsk) that does minimal stemming. |
minimalPortuguese |
string |
Selects the Lucene stemming tokenizer for Portuguese that does minimal stemming. |
norwegian |
string |
Selects the Lucene stemming tokenizer for Norwegian (Bokmål). |
porter2 |
string |
Selects the Lucene stemming tokenizer for English that uses the Porter2 stemming algorithm. |
portuguese |
string |
Selects the Lucene stemming tokenizer for Portuguese. |
portugueseRslp |
string |
Selects the Lucene stemming tokenizer for Portuguese that uses the RSLP stemming algorithm. |
possessiveEnglish |
string |
Selects the Lucene stemming tokenizer for English that removes trailing possessives from words. |
romanian |
string |
Selects the Lucene stemming tokenizer for Romanian. |
russian |
string |
Selects the Lucene stemming tokenizer for Russian. |
sorani |
string |
Selects the Lucene stemming tokenizer for Sorani. |
spanish |
string |
Selects the Lucene stemming tokenizer for Spanish. |
swedish |
string |
Selects the Lucene stemming tokenizer for Swedish. |
turkish |
string |
Selects the Lucene stemming tokenizer for Turkish. |
StopAnalyzer
Divides text at non-letters; applies the lowercase and stopword token filters. This analyzer is implemented using Apache Lucene.
Name | Type | Description |
---|---|---|
@odata.type |
string:
#Microsoft. |
A URI fragment specifying the type of analyzer. |
name |
string |
The name of the analyzer. It must only contain letters, digits, spaces, dashes or underscores, can only start and end with alphanumeric characters, and is limited to 128 characters. |
stopwords |
string[] |
A list of stopwords. |
StopwordsList
Identifies a predefined list of language-specific stopwords.
Name | Type | Description |
---|---|---|
arabic | string | Selects the stopword list for Arabic. |
armenian | string | Selects the stopword list for Armenian. |
basque | string | Selects the stopword list for Basque. |
brazilian | string | Selects the stopword list for Portuguese (Brazil). |
bulgarian | string | Selects the stopword list for Bulgarian. |
catalan | string | Selects the stopword list for Catalan. |
czech | string | Selects the stopword list for Czech. |
danish | string | Selects the stopword list for Danish. |
dutch | string | Selects the stopword list for Dutch. |
english | string | Selects the stopword list for English. |
finnish | string | Selects the stopword list for Finnish. |
french | string | Selects the stopword list for French. |
galician | string | Selects the stopword list for Galician. |
german | string | Selects the stopword list for German. |
greek | string | Selects the stopword list for Greek. |
hindi | string | Selects the stopword list for Hindi. |
hungarian | string | Selects the stopword list for Hungarian. |
indonesian | string | Selects the stopword list for Indonesian. |
irish | string | Selects the stopword list for Irish. |
italian | string | Selects the stopword list for Italian. |
latvian | string | Selects the stopword list for Latvian. |
norwegian | string | Selects the stopword list for Norwegian. |
persian | string | Selects the stopword list for Persian. |
portuguese | string | Selects the stopword list for Portuguese. |
romanian | string | Selects the stopword list for Romanian. |
russian | string | Selects the stopword list for Russian. |
sorani | string | Selects the stopword list for Sorani. |
spanish | string | Selects the stopword list for Spanish. |
swedish | string | Selects the stopword list for Swedish. |
thai | string | Selects the stopword list for Thai. |
turkish | string | Selects the stopword list for Turkish. |
StopwordsTokenFilter
Removes stop words from a token stream. This token filter is implemented using Apache Lucene.
Name | Type | Default value | Description |
---|---|---|---|
@odata.type | string: #Microsoft.Azure.Search.StopwordsTokenFilter | | A URI fragment specifying the type of token filter. |
ignoreCase | boolean | False | A value indicating whether to ignore case. If true, all words are converted to lower case first. Default is false. |
name | string | | The name of the token filter. It must only contain letters, digits, spaces, dashes or underscores, can only start and end with alphanumeric characters, and is limited to 128 characters. |
removeTrailing | boolean | True | A value indicating whether to ignore the last search term if it's a stop word. Default is true. |
stopwords | string[] | | The list of stopwords. This property and the stopwordsList property cannot both be set. |
stopwordsList | StopwordsList | english | A predefined list of stopwords to use. This property and the stopwords property cannot both be set. Default is English. |
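A brief sketch (filter names are hypothetical) showing both ways to supply stopwords; because stopwords and stopwordsList are mutually exclusive, each filter below sets only one of them:

```json
{
  "tokenFilters": [
    {
      "@odata.type": "#Microsoft.Azure.Search.StopwordsTokenFilter",
      "name": "english_stop",
      "stopwordsList": "english",
      "ignoreCase": true,
      "removeTrailing": true
    },
    {
      "@odata.type": "#Microsoft.Azure.Search.StopwordsTokenFilter",
      "name": "domain_stop",
      "stopwords": ["hotel", "motel"],
      "ignoreCase": true
    }
  ]
}
```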
Suggester
Defines how the Suggest API should apply to a group of fields in the index.
Name | Type | Description |
---|---|---|
name | string | The name of the suggester. |
searchMode | SuggesterSearchMode | A value indicating the capabilities of the suggester. |
sourceFields | string[] | The list of field names to which the suggester applies. Each field must be searchable. |
SuggesterSearchMode
A value indicating the capabilities of the suggester.
Name | Type | Description |
---|---|---|
analyzingInfixMatching | string | Matches consecutive whole terms and prefixes in a field. For example, for the field 'The fastest brown fox', the queries 'fast' and 'fastest brow' would both match. |
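A minimal sketch of a suggester in the index definition (the suggester and field names are hypothetical); analyzingInfixMatching is the only searchMode value listed above:

```json
{
  "suggesters": [
    {
      "name": "sg",
      "searchMode": "analyzingInfixMatching",
      "sourceFields": ["hotelName", "city"]
    }
  ]
}
```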
SynonymTokenFilter
Matches single or multi-word synonyms in a token stream. This token filter is implemented using Apache Lucene.
Name | Type | Default value | Description |
---|---|---|---|
@odata.type | string: #Microsoft.Azure.Search.SynonymTokenFilter | | A URI fragment specifying the type of token filter. |
expand | boolean | True | A value indicating whether all words in the list of synonyms (if => notation is not used) will map to one another. If true, the list: incredible, unbelievable, fabulous, amazing is equivalent to: incredible, unbelievable, fabulous, amazing => incredible, unbelievable, fabulous, amazing. If false, it is equivalent to: incredible, unbelievable, fabulous, amazing => incredible. Default is true. |
ignoreCase | boolean | False | A value indicating whether to case-fold input for matching. Default is false. |
name | string | | The name of the token filter. It must only contain letters, digits, spaces, dashes or underscores, can only start and end with alphanumeric characters, and is limited to 128 characters. |
synonyms | string[] | | A list of synonyms in one of the following two formats: 1. incredible, unbelievable, fabulous => amazing - all terms on the left side of the => symbol will be replaced with all terms on its right side; 2. incredible, unbelievable, fabulous, amazing - a comma-separated list of equivalent words. Set the expand option to change how this list is interpreted. |
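An illustrative sketch (filter name and terms are hypothetical) defining a synonym filter that uses both rule formats described above:

```json
{
  "tokenFilters": [
    {
      "@odata.type": "#Microsoft.Azure.Search.SynonymTokenFilter",
      "name": "my_synonyms",
      "synonyms": [
        "incredible, unbelievable, fabulous => amazing",
        "usa, united states, united states of america"
      ],
      "ignoreCase": true,
      "expand": true
    }
  ]
}
```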
TagScoringFunction
Defines a function that boosts scores of documents with string values matching a given list of tags.
Name | Type | Description |
---|---|---|
boost | number | A multiplier for the raw score. Must be a positive number not equal to 1.0. |
fieldName | string | The name of the field used as input to the scoring function. |
interpolation | ScoringFunctionInterpolation | A value indicating how boosting will be interpolated across document scores; defaults to "Linear". |
tag | TagScoringParameters | Parameter values for the tag scoring function. |
type | string: tag | Indicates the type of function to use. Valid values include magnitude, freshness, distance, and tag. The function type must be lower case. |
TagScoringParameters
Provides parameter values to a tag scoring function.
Name | Type | Description |
---|---|---|
tagsParameter | string | The name of the parameter passed in search queries to specify the list of tags to compare against the target field. |
TextWeights
Defines weights on index fields for which matches should boost scoring in search queries.
Name | Type | Description |
---|---|---|
weights | object | The dictionary of per-field weights to boost document scoring. The keys are field names and the values are the weights for each field. |
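A hedged sketch of a scoring profile that combines TextWeights with a tag scoring function (the profile, field, and parameter names are hypothetical; scoringProfiles is the request-body property described at the top of this page):

```json
{
  "scoringProfiles": [
    {
      "name": "boost-by-tags",
      "text": {
        "weights": { "hotelName": 5, "description": 1.5 }
      },
      "functions": [
        {
          "type": "tag",
          "fieldName": "tags",
          "boost": 2,
          "interpolation": "linear",
          "tag": { "tagsParameter": "mytags" }
        }
      ]
    }
  ]
}
```

At query time, the tag values to compare against the target field are supplied through the query's scoring parameter named by tagsParameter.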
TokenCharacterKind
Represents classes of characters on which a token filter can operate.
Name | Type | Description |
---|---|---|
digit | string | Keeps digits in tokens. |
letter | string | Keeps letters in tokens. |
punctuation | string | Keeps punctuation in tokens. |
symbol | string | Keeps symbols in tokens. |
whitespace | string | Keeps whitespace in tokens. |
TokenFilterName
Defines the names of all token filters supported by the search engine.
TruncateTokenFilter
Truncates the terms to a specific length. This token filter is implemented using Apache Lucene.
Name | Type | Default value | Description |
---|---|---|---|
@odata.type | string: #Microsoft.Azure.Search.TruncateTokenFilter | | A URI fragment specifying the type of token filter. |
length | integer | 300 | The length at which terms will be truncated. Default and maximum is 300. |
name | string | | The name of the token filter. It must only contain letters, digits, spaces, dashes or underscores, can only start and end with alphanumeric characters, and is limited to 128 characters. |
UaxUrlEmailTokenizer
Tokenizes URLs and emails as one token. This tokenizer is implemented using Apache Lucene.
Name | Type | Default value | Description |
---|---|---|---|
@odata.type | string: #Microsoft.Azure.Search.UaxUrlEmailTokenizer | | A URI fragment specifying the type of tokenizer. |
maxTokenLength | integer | 255 | The maximum token length. Default is 255. Tokens longer than the maximum length are split. The maximum token length that can be used is 300 characters. |
name | string | | The name of the tokenizer. It must only contain letters, digits, spaces, dashes or underscores, can only start and end with alphanumeric characters, and is limited to 128 characters. |
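A hedged sketch (names are hypothetical, and the CustomAnalyzer type is defined elsewhere in this document) wiring the tokenizer into a custom analyzer so URLs and email addresses survive tokenization intact:

```json
{
  "tokenizers": [
    {
      "@odata.type": "#Microsoft.Azure.Search.UaxUrlEmailTokenizer",
      "name": "my_url_email_tokenizer",
      "maxTokenLength": 255
    }
  ],
  "analyzers": [
    {
      "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
      "name": "url_email_analyzer",
      "tokenizer": "my_url_email_tokenizer",
      "tokenFilters": ["lowercase"]
    }
  ]
}
```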
UniqueTokenFilter
Filters out tokens with the same text as the previous token. This token filter is implemented using Apache Lucene.
Name | Type | Default value | Description |
---|---|---|---|
@odata.type | string: #Microsoft.Azure.Search.UniqueTokenFilter | | A URI fragment specifying the type of token filter. |
name | string | | The name of the token filter. It must only contain letters, digits, spaces, dashes or underscores, can only start and end with alphanumeric characters, and is limited to 128 characters. |
onlyOnSamePosition | boolean | False | A value indicating whether to remove duplicates only at the same position. Default is false. |
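A brief sketch (filter names and the length value are hypothetical) declaring the two filters above in the index's tokenFilters collection; truncation caps term length, and the unique filter then drops consecutive duplicate tokens:

```json
{
  "tokenFilters": [
    {
      "@odata.type": "#Microsoft.Azure.Search.TruncateTokenFilter",
      "name": "truncate_50",
      "length": 50
    },
    {
      "@odata.type": "#Microsoft.Azure.Search.UniqueTokenFilter",
      "name": "dedupe_tokens",
      "onlyOnSamePosition": false
    }
  ]
}
```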
VectorEncodingFormat
The encoding format for interpreting vector field contents.
Name | Type | Description |
---|---|---|
packedBit | string | Encoding format representing bits packed into a wider data type. |
VectorSearch
Contains configuration options related to vector search.
Name | Type | Description |
---|---|---|
algorithms | VectorSearchAlgorithmConfiguration[] | Contains configuration options specific to the algorithm used during indexing or querying. |
compressions | VectorSearchCompressionConfiguration[] | Contains configuration options specific to the compression method used during indexing or querying. |
profiles | VectorSearchProfile[] | Defines combinations of configurations to use with vector search. |
vectorizers | VectorSearchVectorizer[] | Contains configuration options on how to vectorize text vector queries. |
VectorSearchAlgorithmKind
The algorithm used for indexing and querying.
Name | Type | Description |
---|---|---|
exhaustiveKnn | string | Exhaustive KNN algorithm which will perform brute-force search. |
hnsw | string | HNSW (Hierarchical Navigable Small World), a type of approximate nearest neighbors algorithm. |
VectorSearchAlgorithmMetric
The similarity metric to use for vector comparisons. It is recommended to choose the same similarity metric as the embedding model was trained on.
Name | Type | Description |
---|---|---|
cosine | string | Measures the angle between vectors to quantify their similarity, disregarding magnitude. The smaller the angle, the closer the similarity. |
dotProduct | string | Calculates the sum of element-wise products to gauge alignment and magnitude similarity. The larger and more positive, the closer the similarity. |
euclidean | string | Computes the straight-line distance between vectors in a multi-dimensional space. The smaller the distance, the closer the similarity. |
hamming | string | Only applicable to bit-packed binary data types. Determines dissimilarity by counting differing positions in binary vectors. The fewer differences, the closer the similarity. |
VectorSearchCompressionKind
The compression method used for indexing and querying.
Name | Type | Description |
---|---|---|
binaryQuantization | string | Binary Quantization, a type of compression method. In binary quantization, the original vector values are compressed to the narrower binary type by discretizing and representing each component of a vector using binary values, thereby reducing the overall data size. |
scalarQuantization | string | Scalar Quantization, a type of compression method. In scalar quantization, the original vector values are compressed to a narrower type by discretizing and representing each component of a vector using a reduced set of quantized values, thereby reducing the overall data size. |
VectorSearchCompressionTargetDataType
The quantized data type of compressed vector values.
Name | Type | Description |
---|---|---|
int8 | string | |
VectorSearchProfile
Defines a combination of configurations to use with vector search.
Name | Type | Description |
---|---|---|
algorithm | string | The name of the vector search algorithm configuration that specifies the algorithm and optional parameters. |
compression | string | The name of the compression method configuration that specifies the compression method and optional parameters. |
name | string | The name to associate with this particular vector search profile. |
vectorizer | string | The name of the vectorization being configured for use with vector search. |
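A hedged sketch of a vectorSearch section tying these pieces together. The configuration names are hypothetical, and the nested parameter objects (hnswParameters, exhaustiveKnnParameters, scalarQuantizationParameters) are defined elsewhere in this document, so their exact shape here is an assumption:

```json
{
  "vectorSearch": {
    "algorithms": [
      {
        "name": "hnsw-config",
        "kind": "hnsw",
        "hnswParameters": { "m": 4, "efConstruction": 400, "efSearch": 500, "metric": "cosine" }
      },
      {
        "name": "eknn-config",
        "kind": "exhaustiveKnn",
        "exhaustiveKnnParameters": { "metric": "cosine" }
      }
    ],
    "compressions": [
      {
        "name": "sq-config",
        "kind": "scalarQuantization",
        "scalarQuantizationParameters": { "quantizedDataType": "int8" }
      }
    ],
    "profiles": [
      {
        "name": "vector-profile-1",
        "algorithm": "hnsw-config",
        "compression": "sq-config"
      }
    ]
  }
}
```

A vector field would then reference one of these profiles by name; a profile's vectorizer property is shown in the WebApiVectorizer sketch further below.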
VectorSearchVectorizerKind
The vectorization method to be used during query time.
Name | Type | Description |
---|---|---|
azureOpenAI | string | Generate embeddings using an Azure OpenAI resource at query time. |
customWebApi | string | Generate embeddings using a custom web endpoint at query time. |
WebApiParameters
Specifies the properties for connecting to a user-defined vectorizer.
Name | Type | Description |
---|---|---|
authIdentity | SearchIndexerDataIdentity | The user-assigned managed identity used for outbound connections. If an authResourceId is provided and it's not specified, the system-assigned managed identity is used. On updates to the indexer, if the identity is unspecified, the value remains unchanged. If set to "none", the value of this property is cleared. |
authResourceId | string | Applies to custom endpoints that connect to external code in an Azure function or some other application that provides the transformations. This value should be the application ID created for the function or app when it was registered with Azure Active Directory. When specified, the vectorization connects to the function or app using a managed ID (either system or user-assigned) of the search service and the access token of the function or app, using this value as the resource id for creating the scope of the access token. |
httpHeaders | object | The headers required to make the HTTP request. |
httpMethod | string | The method for the HTTP request. |
timeout | string | The desired timeout for the request. Default is 30 seconds. |
uri | string | The URI of the Web API providing the vectorizer. |
WebApiVectorizer
Specifies a user-defined vectorizer for generating the vector embedding of a query string. Integration of an external vectorizer is achieved using the custom Web API interface of a skillset.
Name | Type | Description |
---|---|---|
customWebApiParameters | WebApiParameters | Specifies the properties of the user-defined vectorizer. |
kind | string: customWebApi | The name of the kind of vectorization method being configured for use with vector search. |
name | string | The name to associate with this particular vectorization method. |
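A hedged sketch of a custom Web API vectorizer and a profile that uses it. The endpoint, header, and names are hypothetical, the ISO 8601 duration format for timeout is an assumption about the expected string, and the algorithm name reuses hnsw-config from the earlier vectorSearch sketch:

```json
{
  "vectorSearch": {
    "vectorizers": [
      {
        "name": "my-custom-vectorizer",
        "kind": "customWebApi",
        "customWebApiParameters": {
          "uri": "https://my-embedding-function.azurewebsites.net/api/vectorize",
          "httpMethod": "POST",
          "httpHeaders": { "api-key": "<your-api-key>" },
          "timeout": "PT30S"
        }
      }
    ],
    "profiles": [
      {
        "name": "vector-profile-custom",
        "algorithm": "hnsw-config",
        "vectorizer": "my-custom-vectorizer"
      }
    ]
  }
}
```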
WordDelimiterTokenFilter
Splits words into subwords and performs optional transformations on subword groups. This token filter is implemented using Apache Lucene.
Name | Type | Default value | Description |
---|---|---|---|
@odata.type | string: #Microsoft.Azure.Search.WordDelimiterTokenFilter | | A URI fragment specifying the type of token filter. |
catenateAll | boolean | False | A value indicating whether all subword parts will be catenated. For example, if this is set to true, "Azure-Search-1" becomes "AzureSearch1". Default is false. |
catenateNumbers | boolean | False | A value indicating whether maximum runs of number parts will be catenated. For example, if this is set to true, "1-2" becomes "12". Default is false. |
catenateWords | boolean | False | A value indicating whether maximum runs of word parts will be catenated. For example, if this is set to true, "Azure-Search" becomes "AzureSearch". Default is false. |
generateNumberParts | boolean | True | A value indicating whether to generate number subwords. Default is true. |
generateWordParts | boolean | True | A value indicating whether to generate part words. If set, causes parts of words to be generated; for example, "AzureSearch" becomes "Azure" "Search". Default is true. |
name | string | | The name of the token filter. It must only contain letters, digits, spaces, dashes or underscores, can only start and end with alphanumeric characters, and is limited to 128 characters. |
preserveOriginal | boolean | False | A value indicating whether original words will be preserved and added to the subword list. Default is false. |
protectedWords | string[] | | A list of tokens to protect from being delimited. |
splitOnCaseChange | boolean | True | A value indicating whether to split words on case changes. For example, if this is set to true, "AzureSearch" becomes "Azure" "Search". Default is true. |
splitOnNumerics | boolean | True | A value indicating whether to split on numbers. For example, if this is set to true, "Azure1Search" becomes "Azure" "1" "Search". Default is true. |
stemEnglishPossessive | boolean | True | A value indicating whether to remove trailing "'s" for each subword. Default is true. |
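As a closing sketch (the filter name and protected terms are hypothetical), a word delimiter filter that keeps the original token alongside the generated subwords and protects a few terms from being split:

```json
{
  "tokenFilters": [
    {
      "@odata.type": "#Microsoft.Azure.Search.WordDelimiterTokenFilter",
      "name": "my_word_delimiter",
      "generateWordParts": true,
      "generateNumberParts": true,
      "splitOnCaseChange": true,
      "splitOnNumerics": true,
      "preserveOriginal": true,
      "protectedWords": ["Wi-Fi", "A/C"]
    }
  ]
}
```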