Indexes - Analyze

Service: Search Service

API Version: 2024-07-01

Shows how an analyzer breaks text into tokens.

POST {endpoint}/indexes('{indexName}')/search.analyze?api-version=2024-07-01

URI Parameters

Name	In	Required	Type	Description
endpoint	path	True	string	The endpoint URL of the search service.
indexName	path	True	string	The name of the index for which to test an analyzer.
api-version	query	True	string	Client API version.

Request Header

Name	Required	Type	Description
x-ms-client-request-id		string (uuid)	The tracking ID sent with the request to help with debugging.

Request Body

Name	Required	Type	Description
text	True	string	The text to break into tokens.
analyzer		LexicalAnalyzerName	The name of the analyzer to use to break the given text. If this parameter is not specified, you must specify a tokenizer instead. The tokenizer and analyzer parameters are mutually exclusive.
charFilters		CharFilterName[]	An optional list of character filters to use when breaking the given text. This parameter can only be set when using the tokenizer parameter.
tokenFilters		TokenFilterName[]	An optional list of token filters to use when breaking the given text. This parameter can only be set when using the tokenizer parameter.
tokenizer		LexicalTokenizerName	The name of the tokenizer to use to break the given text. If this parameter is not specified, you must specify an analyzer instead. The tokenizer and analyzer parameters are mutually exclusive.
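The constraints in the table above (exactly one of analyzer or tokenizer, and filters only with a tokenizer) can be checked on the client before sending a request. The following is a minimal Python sketch under those stated rules; the helper name `build_analyze_body` and its parameters are hypothetical and not part of the service API.

```python
from typing import Any, Dict, List, Optional


def build_analyze_body(
    text: str,
    analyzer: Optional[str] = None,
    tokenizer: Optional[str] = None,
    token_filters: Optional[List[str]] = None,
    char_filters: Optional[List[str]] = None,
) -> Dict[str, Any]:
    """Build an Analyze request body (hypothetical helper, not a service API)."""
    # Exactly one of analyzer / tokenizer must be given: they are mutually exclusive.
    if (analyzer is None) == (tokenizer is None):
        raise ValueError("Specify exactly one of 'analyzer' or 'tokenizer'.")
    # tokenFilters and charFilters can only be set together with a tokenizer.
    if analyzer is not None and (token_filters or char_filters):
        raise ValueError("'tokenFilters' and 'charFilters' require 'tokenizer'.")

    body: Dict[str, Any] = {"text": text}
    if analyzer is not None:
        body["analyzer"] = analyzer
    else:
        body["tokenizer"] = tokenizer
        if token_filters:
            body["tokenFilters"] = list(token_filters)
        if char_filters:
            body["charFilters"] = list(char_filters)
    return body


# Analyzer form, matching the sample request in this article.
analyzer_body = build_analyze_body("Text to analyze", analyzer="standard.lucene")

# Tokenizer form, combining names from the enumerations in this article.
tokenizer_body = build_analyze_body(
    "<p>Text to analyze</p>",
    tokenizer="whitespace",
    token_filters=["lowercase"],
    char_filters=["html_strip"],
)
```

Serializing either dictionary with `json.dumps` yields a body in the shape described by AnalyzeRequest.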

Responses

Name	Type	Description
200 OK	AnalyzeResult
Other Status Codes	ErrorResponse	Error response.

Examples

SearchServiceIndexAnalyze

Sample request

HTTP

POST https://myservice.search.windows.net/indexes('hotels')/search.analyze?api-version=2024-07-01

{
  "text": "Text to analyze",
  "analyzer": "standard.lucene"
}
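For readers calling the operation from code, the sample request above might be assembled in Python roughly as follows. This sketch uses only the standard library and builds the request without sending it. The `api-key` header and the placeholder key value are assumptions: this article does not describe authentication, so substitute whatever credentials your service requires. The function name `make_analyze_request` is hypothetical.

```python
import json
import urllib.request


def make_analyze_request(endpoint, index_name, body, api_key, api_version="2024-07-01"):
    """Build (but do not send) a POST request for the Analyze operation."""
    url = f"{endpoint}/indexes('{index_name}')/search.analyze?api-version={api_version}"
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Assumption: key-based authentication; not covered in this article.
            "api-key": api_key,
        },
    )


req = make_analyze_request(
    "https://myservice.search.windows.net",
    "hotels",
    {"text": "Text to analyze", "analyzer": "standard.lucene"},
    api_key="<your-api-key>",  # placeholder, not a real key
)

# To send it against a real service:
#   with urllib.request.urlopen(req) as resp:
#       result = json.load(resp)
```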

Sample response

Status code: 200

{
  "tokens": [
    {
      "token": "text",
      "startOffset": 0,
      "endOffset": 4,
      "position": 0
    },
    {
      "token": "to",
      "startOffset": 5,
      "endOffset": 7,
      "position": 1
    },
    {
      "token": "analyze",
      "startOffset": 8,
      "endOffset": 15,
      "position": 2
    }
  ]
}
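To see how the offsets in the sample response relate to the input, the short Python sketch below slices the original text with each token's `startOffset` and `endOffset`. The function name `source_spans` is hypothetical; the data is copied from the sample request and response above.

```python
import json

text = "Text to analyze"

response_json = """
{
  "tokens": [
    {"token": "text",    "startOffset": 0, "endOffset": 4,  "position": 0},
    {"token": "to",      "startOffset": 5, "endOffset": 7,  "position": 1},
    {"token": "analyze", "startOffset": 8, "endOffset": 15, "position": 2}
  ]
}
"""


def source_spans(text, tokens):
    """Pair each token with the slice of the input text it was produced from."""
    return [(t["token"], text[t["startOffset"]:t["endOffset"]]) for t in tokens]


spans = source_spans(text, json.loads(response_json)["tokens"])
for token, original in spans:
    print(f"{token!r} <- {original!r}")
```

Note that the first token is lowercased ("text") while the slice of the input keeps its original casing ("Text"), which shows that the offsets refer to the input text rather than to the emitted token.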

Definitions

Name	Description
AnalyzedTokenInfo	Information about a token returned by an analyzer.
AnalyzeRequest	Specifies some text and analysis components used to break that text into tokens.
AnalyzeResult	The result of testing an analyzer on text.
CharFilterName	Defines the names of all character filters supported by the search engine.
ErrorAdditionalInfo	The resource management error additional info.
ErrorDetail	The error detail.
ErrorResponse	Error response.
LexicalAnalyzerName	Defines the names of all text analyzers supported by the search engine.
LexicalTokenizerName	Defines the names of all tokenizers supported by the search engine.
TokenFilterName	Defines the names of all token filters supported by the search engine.

AnalyzedTokenInfo

Object

Information about a token returned by an analyzer.

Name	Type	Description
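endOffset	integer (int32)	The index just past the last character of the token in the input text (an exclusive end offset). In the sample response, the token "text" occupies characters 0 through 3 and has endOffset 4.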
position	integer (int32)	The position of the token in the input text relative to other tokens. The first token in the input text has position 0, the next has position 1, and so on. Depending on the analyzer used, some tokens might have the same position, for example if they are synonyms of each other.
startOffset	integer (int32)	The index of the first character of the token in the input text.
token	string	The token returned by the analyzer.
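Because several tokens can share a position (for example, synonyms), it can help to group tokens by `position` when inspecting a result. The Python sketch below models AnalyzedTokenInfo as a dataclass and groups token strings by position. The token data is invented for illustration and is not the output of any particular analyzer; `group_by_position` is a hypothetical helper.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class AnalyzedTokenInfo:
    """Mirrors the four fields in the table above."""
    token: str
    startOffset: int
    endOffset: int
    position: int


def group_by_position(tokens: List[AnalyzedTokenInfo]) -> Dict[int, List[str]]:
    """Map each position to the token strings emitted at that position."""
    groups: Dict[int, List[str]] = defaultdict(list)
    for t in tokens:
        groups[t.position].append(t.token)
    return dict(groups)


# Hypothetical result for the input "quick car", where an analyzer configured
# with synonyms emits "fast" at the same position as "quick".
raw_tokens = [
    {"token": "quick", "startOffset": 0, "endOffset": 5, "position": 0},
    {"token": "fast", "startOffset": 0, "endOffset": 5, "position": 0},
    {"token": "car", "startOffset": 6, "endOffset": 9, "position": 1},
]
tokens = [AnalyzedTokenInfo(**d) for d in raw_tokens]
grouped = group_by_position(tokens)
```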

AnalyzeRequest

Object

Specifies some text and analysis components used to break that text into tokens.

Name	Type	Description
analyzer	LexicalAnalyzerName	The name of the analyzer to use to break the given text. If this parameter is not specified, you must specify a tokenizer instead. The tokenizer and analyzer parameters are mutually exclusive.
charFilters	CharFilterName[]	An optional list of character filters to use when breaking the given text. This parameter can only be set when using the tokenizer parameter.
text	string	The text to break into tokens.
tokenFilters	TokenFilterName[]	An optional list of token filters to use when breaking the given text. This parameter can only be set when using the tokenizer parameter.
tokenizer	LexicalTokenizerName	The name of the tokenizer to use to break the given text. If this parameter is not specified, you must specify an analyzer instead. The tokenizer and analyzer parameters are mutually exclusive.

AnalyzeResult

Object

The result of testing an analyzer on text.

Name	Type	Description
tokens	AnalyzedTokenInfo[]	The list of tokens returned by the analyzer specified in the request.

CharFilterName

Enumeration

Defines the names of all character filters supported by the search engine.

Value	Description
html_strip	A character filter that attempts to strip out HTML constructs. See https://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/charfilter/HTMLStripCharFilter.html

ErrorAdditionalInfo

Object

The resource management error additional info.

Name	Type	Description
info	object	The additional info.
type	string	The additional info type.

ErrorDetail

Object

The error detail.

Name	Type	Description
additionalInfo	ErrorAdditionalInfo[]	The error additional info.
code	string	The error code.
details	ErrorDetail[]	The error details.
message	string	The error message.
target	string	The error target.

ErrorResponse

Object

Error response.

Name	Type	Description
error	ErrorDetail	The error object.
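Since ErrorDetail can nest further ErrorDetail objects in `details`, a client may want to flatten an ErrorResponse into readable lines. The Python sketch below walks the structure recursively. The payload is a made-up example that follows the field names in the tables above; the codes and messages are placeholders, not values the service is known to return, and `flatten_error` is a hypothetical helper.

```python
from typing import Any, Dict, List


def flatten_error(error: Dict[str, Any], depth: int = 0) -> List[str]:
    """Return one indented 'code: message' line per ErrorDetail, depth-first."""
    indent = "  " * depth
    lines = [f"{indent}{error.get('code', '')}: {error.get('message', '')}"]
    # 'details' is optional and holds nested ErrorDetail objects.
    for child in error.get("details") or []:
        lines.extend(flatten_error(child, depth + 1))
    return lines


# Hypothetical ErrorResponse payload (placeholder codes and messages).
error_response = {
    "error": {
        "code": "ExampleOuterCode",
        "message": "The request is invalid.",
        "target": "analyzer",
        "details": [
            {"code": "ExampleInnerCode", "message": "Unknown analyzer name."}
        ],
    }
}

error_lines = flatten_error(error_response["error"])
```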

LexicalAnalyzerName

Enumeration

Defines the names of all text analyzers supported by the search engine.

Value	Description
ar.microsoft	Microsoft analyzer for Arabic.
ar.lucene	Lucene analyzer for Arabic.
hy.lucene	Lucene analyzer for Armenian.
bn.microsoft	Microsoft analyzer for Bangla.
eu.lucene	Lucene analyzer for Basque.
bg.microsoft	Microsoft analyzer for Bulgarian.
bg.lucene	Lucene analyzer for Bulgarian.
ca.microsoft	Microsoft analyzer for Catalan.
ca.lucene	Lucene analyzer for Catalan.
zh-Hans.microsoft	Microsoft analyzer for Chinese (Simplified).
zh-Hans.lucene	Lucene analyzer for Chinese (Simplified).
zh-Hant.microsoft	Microsoft analyzer for Chinese (Traditional).
zh-Hant.lucene	Lucene analyzer for Chinese (Traditional).
hr.microsoft	Microsoft analyzer for Croatian.
cs.microsoft	Microsoft analyzer for Czech.
cs.lucene	Lucene analyzer for Czech.
da.microsoft	Microsoft analyzer for Danish.
da.lucene	Lucene analyzer for Danish.
nl.microsoft	Microsoft analyzer for Dutch.
nl.lucene	Lucene analyzer for Dutch.
en.microsoft	Microsoft analyzer for English.
en.lucene	Lucene analyzer for English.
et.microsoft	Microsoft analyzer for Estonian.
fi.microsoft	Microsoft analyzer for Finnish.
fi.lucene	Lucene analyzer for Finnish.
fr.microsoft	Microsoft analyzer for French.
fr.lucene	Lucene analyzer for French.
gl.lucene	Lucene analyzer for Galician.
de.microsoft	Microsoft analyzer for German.
de.lucene	Lucene analyzer for German.
el.microsoft	Microsoft analyzer for Greek.
el.lucene	Lucene analyzer for Greek.
gu.microsoft	Microsoft analyzer for Gujarati.
he.microsoft	Microsoft analyzer for Hebrew.
hi.microsoft	Microsoft analyzer for Hindi.
hi.lucene	Lucene analyzer for Hindi.
hu.microsoft	Microsoft analyzer for Hungarian.
hu.lucene	Lucene analyzer for Hungarian.
is.microsoft	Microsoft analyzer for Icelandic.
id.microsoft	Microsoft analyzer for Indonesian (Bahasa).
id.lucene	Lucene analyzer for Indonesian.
ga.lucene	Lucene analyzer for Irish.
it.microsoft	Microsoft analyzer for Italian.
it.lucene	Lucene analyzer for Italian.
ja.microsoft	Microsoft analyzer for Japanese.
ja.lucene	Lucene analyzer for Japanese.
kn.microsoft	Microsoft analyzer for Kannada.
ko.microsoft	Microsoft analyzer for Korean.
ko.lucene	Lucene analyzer for Korean.
lv.microsoft	Microsoft analyzer for Latvian.
lv.lucene	Lucene analyzer for Latvian.
lt.microsoft	Microsoft analyzer for Lithuanian.
ml.microsoft	Microsoft analyzer for Malayalam.
ms.microsoft	Microsoft analyzer for Malay (Latin).
mr.microsoft	Microsoft analyzer for Marathi.
nb.microsoft	Microsoft analyzer for Norwegian (Bokmål).
no.lucene	Lucene analyzer for Norwegian.
fa.lucene	Lucene analyzer for Persian.
pl.microsoft	Microsoft analyzer for Polish.
pl.lucene	Lucene analyzer for Polish.
pt-BR.microsoft	Microsoft analyzer for Portuguese (Brazil).
pt-BR.lucene	Lucene analyzer for Portuguese (Brazil).
pt-PT.microsoft	Microsoft analyzer for Portuguese (Portugal).
pt-PT.lucene	Lucene analyzer for Portuguese (Portugal).
pa.microsoft	Microsoft analyzer for Punjabi.
ro.microsoft	Microsoft analyzer for Romanian.
ro.lucene	Lucene analyzer for Romanian.
ru.microsoft	Microsoft analyzer for Russian.
ru.lucene	Lucene analyzer for Russian.
sr-cyrillic.microsoft	Microsoft analyzer for Serbian (Cyrillic).
sr-latin.microsoft	Microsoft analyzer for Serbian (Latin).
sk.microsoft	Microsoft analyzer for Slovak.
sl.microsoft	Microsoft analyzer for Slovenian.
es.microsoft	Microsoft analyzer for Spanish.
es.lucene	Lucene analyzer for Spanish.
sv.microsoft	Microsoft analyzer for Swedish.
sv.lucene	Lucene analyzer for Swedish.
ta.microsoft	Microsoft analyzer for Tamil.
te.microsoft	Microsoft analyzer for Telugu.
th.microsoft	Microsoft analyzer for Thai.
th.lucene	Lucene analyzer for Thai.
tr.microsoft	Microsoft analyzer for Turkish.
tr.lucene	Lucene analyzer for Turkish.
uk.microsoft	Microsoft analyzer for Ukrainian.
ur.microsoft	Microsoft analyzer for Urdu.
vi.microsoft	Microsoft analyzer for Vietnamese.
standard.lucene	Standard Lucene analyzer.
standardasciifolding.lucene	Standard ASCII Folding Lucene analyzer. See https://learn.microsoft.com/rest/api/searchservice/Custom-analyzers-in-Azure-Search#Analyzers
keyword	Treats the entire content of a field as a single token. This is useful for data like zip codes, IDs, and some product names. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/core/KeywordAnalyzer.html
pattern	Flexibly separates text into terms via a regular expression pattern. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/miscellaneous/PatternAnalyzer.html
simple	Divides text at non-letters and converts the resulting tokens to lower case. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/core/SimpleAnalyzer.html
stop	Divides text at non-letters and applies the lowercase and stopword token filters. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/core/StopAnalyzer.html
whitespace	An analyzer that uses the whitespace tokenizer. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/core/WhitespaceAnalyzer.html

LexicalTokenizerName

Enumeration

Defines the names of all tokenizers supported by the search engine.

Value	Description
classic	Grammar-based tokenizer that is suitable for processing most European-language documents. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/standard/ClassicTokenizer.html
edgeNGram	Tokenizes the input from an edge into n-grams of the given size(s). See https://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/ngram/EdgeNGramTokenizer.html
keyword_v2	Emits the entire input as a single token. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/core/KeywordTokenizer.html
letter	Divides text at non-letters. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/core/LetterTokenizer.html
lowercase	Divides text at non-letters and converts the resulting tokens to lower case. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/core/LowerCaseTokenizer.html
microsoft_language_tokenizer	Divides text using language-specific rules.
microsoft_language_stemming_tokenizer	Divides text using language-specific rules and reduces words to their base forms.
nGram	Tokenizes the input into n-grams of the given size(s). See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/ngram/NGramTokenizer.html
path_hierarchy_v2	Tokenizer for path-like hierarchies. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/path/PathHierarchyTokenizer.html
pattern	Tokenizer that uses regex pattern matching to construct distinct tokens. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/pattern/PatternTokenizer.html
standard_v2	Breaks text following the Unicode Text Segmentation rules. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/standard/StandardTokenizer.html
uax_url_email	Tokenizes URLs and email addresses as one token. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/standard/UAX29URLEmailTokenizer.html
whitespace	Divides text at whitespace. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/core/WhitespaceTokenizer.html

TokenFilterName

Enumeration

Defines the names of all token filters supported by the search engine.

Value	Description
arabic_normalization	A token filter that applies the Arabic normalizer to normalize the orthography. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/ar/ArabicNormalizationFilter.html
apostrophe	Strips all characters after an apostrophe (including the apostrophe itself). See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/tr/ApostropheFilter.html
asciifolding	Converts alphabetic, numeric, and symbolic Unicode characters which are not in the first 127 ASCII characters (the "Basic Latin" Unicode block) into their ASCII equivalents, if such equivalents exist. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/miscellaneous/ASCIIFoldingFilter.html
cjk_bigram	Forms bigrams of CJK terms that are generated from the standard tokenizer. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/cjk/CJKBigramFilter.html
cjk_width	Normalizes CJK width differences. Folds fullwidth ASCII variants into the equivalent basic Latin, and half-width Katakana variants into the equivalent Kana. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/cjk/CJKWidthFilter.html
classic	Removes English possessives, and dots from acronyms. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/standard/ClassicFilter.html
common_grams	Constructs bigrams for frequently occurring terms while indexing. Single terms are still indexed too, with bigrams overlaid. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/commongrams/CommonGramsFilter.html
edgeNGram_v2	Generates n-grams of the given size(s) starting from the front or the back of an input token. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/ngram/EdgeNGramTokenFilter.html
elision	Removes elisions. For example, "l'avion" (the plane) will be converted to "avion" (plane). See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/util/ElisionFilter.html
german_normalization	Normalizes German characters according to the heuristics of the German2 snowball algorithm. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/de/GermanNormalizationFilter.html
hindi_normalization	Normalizes text in Hindi to remove some differences in spelling variations. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/hi/HindiNormalizationFilter.html
indic_normalization	Normalizes the Unicode representation of text in Indian languages. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/in/IndicNormalizationFilter.html
keyword_repeat	Emits each incoming token twice, once as keyword and once as non-keyword. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/miscellaneous/KeywordRepeatFilter.html
kstem	A high-performance kstem filter for English. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/en/KStemFilter.html
length	Removes words that are too long or too short. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/miscellaneous/LengthFilter.html
limit	Limits the number of tokens while indexing. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/miscellaneous/LimitTokenCountFilter.html
lowercase	Normalizes token text to lower case. See https://lucene.apache.org/core/6_6_1/analyzers-common/org/apache/lucene/analysis/core/LowerCaseFilter.html
nGram_v2	Generates n-grams of the given size(s). See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/ngram/NGramTokenFilter.html
persian_normalization	Applies normalization for Persian. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/fa/PersianNormalizationFilter.html
phonetic	Creates tokens for phonetic matches. See https://lucene.apache.org/core/4_10_3/analyzers-phonetic/org/apache/lucene/analysis/phonetic/package-tree.html
porter_stem	Uses the Porter stemming algorithm to transform the token stream. See http://tartarus.org/~martin/PorterStemmer
reverse	Reverses the token string. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/reverse/ReverseStringFilter.html
scandinavian_normalization	Normalizes use of the interchangeable Scandinavian characters. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/miscellaneous/ScandinavianNormalizationFilter.html
scandinavian_folding	Folds Scandinavian characters åÅäæÄÆ->a and öÖøØ->o. It also discriminates against use of double vowels aa, ae, ao, oe and oo, leaving just the first one. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/miscellaneous/ScandinavianFoldingFilter.html
shingle	Creates combinations of tokens as a single token. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/shingle/ShingleFilter.html
snowball	A filter that stems words using a Snowball-generated stemmer. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/snowball/SnowballFilter.html
sorani_normalization	Normalizes the Unicode representation of Sorani text. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/ckb/SoraniNormalizationFilter.html
stemmer	Language-specific stemming filter. See https://learn.microsoft.com/rest/api/searchservice/Custom-analyzers-in-Azure-Search#TokenFilters
stopwords	Removes stop words from a token stream. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/core/StopFilter.html
trim	Trims leading and trailing whitespace from tokens. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/miscellaneous/TrimFilter.html
truncate	Truncates the terms to a specific length. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/miscellaneous/TruncateTokenFilter.html
unique	Filters out tokens with the same text as the previous token. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/miscellaneous/RemoveDuplicatesTokenFilter.html
uppercase	Normalizes token text to upper case. See https://lucene.apache.org/core/6_6_1/analyzers-common/org/apache/lucene/analysis/core/UpperCaseFilter.html
word_delimiter	Splits words into subwords and performs optional transformations on subword groups.
