Understand how OData collection filters work in Azure AI Search
This article provides background for developers who are writing advanced filters with complex lambda expressions. The article explains why the rules for collection filters exist by exploring how Azure AI Search executes these filters.
When you build a filter on collection fields in Azure AI Search, you can use the `any` and `all` operators together with lambda expressions. Lambda expressions are Boolean expressions that refer to a range variable. In filters that use a lambda expression, the `any` and `all` operators are analogous to a `for` loop in most programming languages, with the range variable taking the role of loop variable, and the lambda expression as the body of the loop. The range variable takes on the "current" value of the collection during iteration of the loop.
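The loop analogy can be sketched in Python. This is only the conceptual model, not how Azure AI Search executes filters, and the helper names `odata_any` and `odata_all` are made up for this illustration:

```python
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")


def odata_any(collection: Iterable[T], body: Callable[[T], bool]) -> bool:
    """Conceptual model of `collection/any(x: body)`."""
    for item in collection:  # `item` plays the role of the range variable
        if body(item):       # `body` plays the role of the lambda expression
            return True
    return False


def odata_all(collection: Iterable[T], body: Callable[[T], bool]) -> bool:
    """Conceptual model of `collection/all(x: body)`."""
    for item in collection:
        if not body(item):
            return False
    return True


seasons = ["spring", "summer", "fall"]

# seasons/any(s: s eq 'fall')
has_fall = odata_any(seasons, lambda s: s == "fall")
# seasons/all(s: s ne 'winter')
no_winter = odata_all(seasons, lambda s: s != "winter")

print(has_fall, no_winter)  # True True
```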
At least that's how it works conceptually. In reality, Azure AI Search implements filters in a very different way to how `for` loops work. Ideally, this difference would be invisible to you, but in certain situations it isn't. The end result is that there are rules you have to follow when writing lambda expressions.
Note
For information on what the rules for collection filters are, including examples, see Troubleshooting OData collection filters in Azure AI Search.
Why collection filters are limited
There are three underlying reasons why filter features aren't fully supported for all types of collections:
- Only certain operators are supported for certain data types. For example, it doesn't make sense to compare the Boolean values `true` and `false` using `lt`, `gt`, and so on.
- Azure AI Search doesn't support correlated search on fields of type `Collection(Edm.ComplexType)`.
- Azure AI Search uses inverted indexes to execute filters over all types of data, including collections.
The first reason is just a consequence of how the OData language and EDM type system are defined. The last two are explained in more detail in the rest of this article.
Correlated versus uncorrelated search
When you apply multiple filter criteria over a collection of complex objects, the criteria are correlated because they apply to each object in the collection. For example, the following filter returns hotels that have at least one deluxe room with a rate less than 100:
Rooms/any(room: room/Type eq 'Deluxe Room' and room/BaseRate lt 100)
If filtering were uncorrelated, the above filter might return hotels where one room is deluxe and a different room has a base rate less than 100. That wouldn't make sense, since both clauses of the lambda expression apply to the same range variable, namely `room`. This is why such filters are correlated.
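To make the difference concrete, here's a minimal Python sketch that evaluates the filter both ways against a single document. The hotel and its room values are hypothetical sample data, and the two functions only model the semantics; they don't reflect the service's implementation:

```python
# Hypothetical hotel: the deluxe room is expensive, the cheap room isn't deluxe.
hotel = {
    "Rooms": [
        {"Type": "Deluxe Room", "BaseRate": 180.0},
        {"Type": "Budget Room", "BaseRate": 80.0},
    ],
}


def correlated_match(doc: dict) -> bool:
    """Both clauses are tested against the same room (the same range variable)."""
    return any(
        room["Type"] == "Deluxe Room" and room["BaseRate"] < 100
        for room in doc["Rooms"]
    )


def uncorrelated_match(doc: dict) -> bool:
    """Each clause may be satisfied by a different room."""
    has_deluxe = any(room["Type"] == "Deluxe Room" for room in doc["Rooms"])
    has_cheap = any(room["BaseRate"] < 100 for room in doc["Rooms"])
    return has_deluxe and has_cheap


print(correlated_match(hotel))    # False
print(uncorrelated_match(hotel))  # True
```

The correlated version rejects the hotel because no single room satisfies both clauses, which is the behavior the filter above describes.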
However, for full-text search, there's no way to refer to a specific range variable. If you use fielded search to issue a full Lucene query like this one:
Rooms/Type:deluxe AND Rooms/Description:"city view"
you might get hotels back where one room is deluxe, and a different room mentions "city view" in the description. For example, the document below with `Id` of `1` would match the query:
{
  "value": [
    {
      "Id": "1",
      "Rooms": [
        { "Type": "deluxe", "Description": "Large garden view suite" },
        { "Type": "standard", "Description": "Standard city view room" }
      ]
    },
    {
      "Id": "2",
      "Rooms": [
        { "Type": "deluxe", "Description": "Courtyard motel room" }
      ]
    }
  ]
}
The reason is that `Rooms/Type` refers to all the analyzed terms of the `Rooms/Type` field in the entire document, and similarly for `Rooms/Description`, as shown in the tables below.
How `Rooms/Type` is stored for full-text search:
Term in Rooms/Type | Document IDs |
---|---|
deluxe | 1, 2 |
standard | 1 |
How `Rooms/Description` is stored for full-text search:
Term in Rooms/Description | Document IDs |
---|---|
courtyard | 2 |
city | 1 |
garden | 1 |
large | 1 |
motel | 2 |
room | 1, 2 |
standard | 1 |
suite | 1 |
view | 1 |
So unlike the filter above, which basically says "match documents where a room has `Type` equal to 'Deluxe Room' and that same room has `BaseRate` less than 100", the search query says "match documents where `Rooms/Type` has the term 'deluxe' and `Rooms/Description` has the phrase 'city view'". There's no concept of individual rooms whose fields can be correlated in the latter case.
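The following Python sketch rebuilds the two term tables from the sample documents and runs a simplified version of the query against them. It assumes a naive analyzer (lowercase, split on whitespace) and treats the phrase "city view" as "both terms are present", since a real phrase query also checks term positions. The function name `build_field_index` is invented for this illustration:

```python
from collections import defaultdict

documents = [
    {"Id": "1", "Rooms": [
        {"Type": "deluxe", "Description": "Large garden view suite"},
        {"Type": "standard", "Description": "Standard city view room"},
    ]},
    {"Id": "2", "Rooms": [
        {"Type": "deluxe", "Description": "Courtyard motel room"},
    ]},
]


def build_field_index(docs: list, subfield: str) -> dict:
    """Map each term of Rooms/<subfield> to the IDs of documents containing it.

    Terms from every room are pooled per document, so the index doesn't
    record which room a term came from.
    """
    index = defaultdict(set)
    for doc in docs:
        for room in doc["Rooms"]:
            for term in room[subfield].lower().split():
                index[term].add(doc["Id"])
    return index


type_index = build_field_index(documents, "Type")
description_index = build_field_index(documents, "Description")

# Simplified form of: Rooms/Type:deluxe AND Rooms/Description:"city view"
matches = (
    type_index["deluxe"]
    & description_index["city"]
    & description_index["view"]
)
print(sorted(matches))  # ['1']
```

Document 1 matches even though its deluxe room and its city view room are different rooms, because room boundaries are gone by the time the terms reach the index.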
Inverted indexes and collections
You might have noticed that there are far fewer restrictions on lambda expressions over complex collections than there are for simple collections like `Collection(Edm.Int32)`, `Collection(Edm.GeographyPoint)`, and so on. This is because Azure AI Search stores complex collections as actual collections of subdocuments, while simple collections aren't stored as collections at all.
For example, consider a filterable string collection field like `seasons` in an index for an online retailer. Some documents uploaded to this index might look like this:
{
  "value": [
    {
      "id": "1",
      "name": "Hiking boots",
      "seasons": ["spring", "summer", "fall"]
    },
    {
      "id": "2",
      "name": "Rain jacket",
      "seasons": ["spring", "fall", "winter"]
    },
    {
      "id": "3",
      "name": "Parka",
      "seasons": ["winter"]
    }
  ]
}
The values of the `seasons` field are stored in a structure called an inverted index, which looks something like this:
Term | Document IDs |
---|---|
spring | 1, 2 |
summer | 1 |
fall | 1, 2 |
winter | 2, 3 |
This data structure is designed to answer one question with great speed: In which documents does a given term appear? Answering this question works more like a plain equality check than a loop over a collection. In fact, this is why for string collections, Azure AI Search only allows `eq` as a comparison operator inside a lambda expression for `any`.
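As a rough illustration, the inverted index for `seasons` can be modeled in Python as a dictionary from term to document IDs. This is a toy model under simplified assumptions, not the service's actual storage format, and `build_inverted_index` and `any_eq` are names invented for the example:

```python
from collections import defaultdict

products = [
    {"id": "1", "name": "Hiking boots", "seasons": ["spring", "summer", "fall"]},
    {"id": "2", "name": "Rain jacket", "seasons": ["spring", "fall", "winter"]},
    {"id": "3", "name": "Parka", "seasons": ["winter"]},
]


def build_inverted_index(docs: list, field: str) -> dict:
    """Map each term in a string collection field to the IDs of documents containing it."""
    index = defaultdict(set)
    for doc in docs:
        for term in doc[field]:
            index[term].add(doc["id"])
    return index


seasons_index = build_inverted_index(products, "seasons")


def any_eq(index: dict, term: str) -> set:
    """Model of `seasons/any(s: s eq term)`: one lookup, no loop over each collection."""
    return set(index.get(term, set()))


print(sorted(any_eq(seasons_index, "winter")))  # ['2', '3']
```

Notice that the lookup never visits the individual `seasons` arrays. Once the index is built, the collections themselves are no longer needed to answer the question.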
Next, we look at how it's possible to combine multiple equality checks on the same range variable with `or`. It works thanks to algebra and the distributive property of quantifiers. This expression:
seasons/any(s: s eq 'winter' or s eq 'fall')
is equivalent to:
seasons/any(s: s eq 'winter') or seasons/any(s: s eq 'fall')
and each of the two `any` sub-expressions can be efficiently executed using the inverted index. Also, thanks to the negation law of quantifiers, this expression:
seasons/all(s: s ne 'winter' and s ne 'fall')
is equivalent to:
not seasons/any(s: s eq 'winter' or s eq 'fall')
which is why it's possible to use `all` with `ne` and `and`.
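Both equivalences can be checked with a small Python sketch over the sample `seasons` data. Here `or` becomes a union of term lookups and `all` with `ne` becomes a set complement. The function names are hypothetical, and the loop-based versions exist only to confirm that the index-based versions return the same documents:

```python
from collections import defaultdict

docs = {
    "1": ["spring", "summer", "fall"],
    "2": ["spring", "fall", "winter"],
    "3": ["winter"],
}

# Toy inverted index: term -> IDs of documents containing it.
seasons_index = defaultdict(set)
for doc_id, seasons in docs.items():
    for season in seasons:
        seasons_index[season].add(doc_id)


def any_eq_or(*terms: str) -> set:
    """seasons/any(s: s eq t1 or s eq t2 ...) as a union of term lookups."""
    result = set()
    for term in terms:
        result |= seasons_index.get(term, set())
    return result


def all_ne_and(*terms: str) -> set:
    """seasons/all(s: s ne t1 and s ne t2 ...) as the complement of any_eq_or."""
    return set(docs) - any_eq_or(*terms)


def loop_any(*terms: str) -> set:
    """Reference version that iterates over each collection."""
    return {d for d, ss in docs.items() if any(s in terms for s in ss)}


def loop_all(*terms: str) -> set:
    """Reference version that iterates over each collection."""
    return {d for d, ss in docs.items() if all(s not in terms for s in ss)}


assert any_eq_or("winter", "fall") == loop_any("winter", "fall")
assert all_ne_and("winter", "fall") == loop_all("winter", "fall")

print(sorted(any_eq_or("winter", "fall")))     # ['1', '2', '3']
print(sorted(all_ne_and("winter", "fall")))    # []
print(sorted(all_ne_and("spring", "summer")))  # ['3']
```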
Note
Although the details are beyond the scope of this document, these same principles extend to distance and intersection tests for collections of geo-spatial points as well. This is why, in `any`:
- `geo.intersects` cannot be negated
- `geo.distance` must be compared using `lt` or `le`
- expressions must be combined with `or`, not `and`
The converse rules apply for `all`.
A wider variety of expressions are allowed when filtering on collections of data types that support the `lt`, `gt`, `le`, and `ge` operators, such as `Collection(Edm.Int32)` for example. Specifically, you can use `and` as well as `or` in `any`, as long as the underlying comparison expressions are combined into range comparisons using `and`, which are then further combined using `or`. This structure of Boolean expressions is called Disjunctive Normal Form (DNF), otherwise known as "ORs of ANDs". Conversely, lambda expressions for `all` for these data types must be in Conjunctive Normal Form (CNF), otherwise known as "ANDs of ORs". Azure AI Search allows such range comparisons because it can execute them using inverted indexes efficiently, just like it can do fast term lookup for strings.
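One way to picture why range comparisons fit an inverted index is the sketch below. It assumes the terms of a numeric field can be kept in sorted order, so that a range comparison becomes a scan over a contiguous run of terms. The field name `ratings`, the sample values, and the function names are all hypothetical, and the real index structures are certainly more sophisticated than this:

```python
from bisect import bisect_left
from collections import defaultdict

# Hypothetical Collection(Edm.Int32) field named `ratings`.
docs = {"1": [1, 4], "2": [2, 9], "3": [7]}

# Toy inverted index: value -> IDs of documents containing it.
index = defaultdict(set)
for doc_id, values in docs.items():
    for value in values:
        index[value].add(doc_id)
sorted_values = sorted(index)  # terms kept in sorted order


def any_in_range(low: int, high: int) -> set:
    """ratings/any(r: r ge low and r lt high) as one scan over adjacent terms."""
    start = bisect_left(sorted_values, low)
    end = bisect_left(sorted_values, high)
    result = set()
    for value in sorted_values[start:end]:
        result |= index[value]
    return result


def any_dnf(ranges: list) -> set:
    """ORs of ANDs: the union of several range scans."""
    result = set()
    for low, high in ranges:
        result |= any_in_range(low, high)
    return result


# ratings/any(r: (r ge 3 and r lt 5) or (r ge 7 and r lt 8))
print(sorted(any_dnf([(3, 5), (7, 8)])))  # ['1', '3']
```

Document 2 is correctly excluded. It has one value that is at least 3 and another that is less than 5, but no single value in the range, and the range scan only finds documents that contain a term inside the range.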
In summary, here are the rules of thumb for what's allowed in a lambda expression:
- Inside `any`, positive checks are always allowed, like equality, range comparisons, `geo.intersects`, or `geo.distance` compared with `lt` or `le` (think of "closeness" as being like equality when it comes to checking distance).
- Inside `any`, `or` is always allowed. You can use `and` only for data types that can express range checks, and only if you use ORs of ANDs (DNF).
- Inside `all`, the rules are reversed. Only negative checks are allowed, you can use `and` always, and you can use `or` only for range checks expressed as ANDs of ORs (CNF).
In practice, these are the types of filters you're most likely to use anyway. It's still helpful to understand the boundaries of what's possible though.
For specific examples of which kinds of filters are allowed and which aren't, see How to write valid collection filters.