There is no formal support for non-equality operations on encrypted values in Cosmos DB.
Due to the way Always Encrypted works, the service never gets enough information about the string to index it, or even scan it, to fulfill range or partial comparisons. This is by design: it ensures Cosmos DB never has your data in plaintext, even in memory. Any solution to this would need to be orchestrated by your application layer.
Broadly, there are a few options you could consider for this type of problem. None of them is particularly straightforward, and you'll need to do additional research and verify the approach with security experts. If this is not a fundamental requirement, the effort and cost may not be worth it.
- Tokenization: If you need to perform "LIKE" expressions on an encrypted string, you could tokenize that string into several substrings (for example, splitting on whitespace). You'd then insert a document for each substring, containing the encrypted substring and a pointer back to the original document. When you run a search, you pass the search term through the same tokenizer and then do exact matches on all (or some) of the tokens.
```js
// pseudo-code - not tested, no error handling, etc.
const source = database['source'].items // assume "encrypted_string" is encrypted
const index = database['index'].items   // assume "encrypted_token" is encrypted

// Simple tokenizer: lower-cases all letters, removes symbols like ".,/()", and splits on whitespace
function tokenize(plaintext_string) {
  return plaintext_string.toLowerCase().replace(SYMBOLS, '').split(WHITESPACE)
}

// Each insert creates 1 doc in source and N docs in index, depending on tokenized length
function insert(doc) {
  const tokens = tokenize(doc.encrypted_string)
  source.create(doc)
  for (const token of tokens) {
    index.create({ pk: token, id: `${doc.pk}##${doc.id}`, source_pk: doc.pk, source_id: doc.id })
  }
}

const SEARCH_PREFIX = 'select * from index where '

// Returns an array of results from source; each result includes a "match_terms" field listing the terms that matched
function search(query_str) {
  // In a real implementation, use a parameterized query so the client SDK can encrypt the parameter values
  const query_tokens = tokenize(query_str).map(token => `encrypted_token = "${token}"`)
  const sql_filter = SEARCH_PREFIX + query_tokens.join(' or ')
  const matches = index.query(sql_filter).fetchAll()
  const results = {}
  for (const match of matches) {
    const result = source.item(match.source_pk, match.source_id)
    const match_term = match.pk
    if (results[match.id] !== undefined) {
      results[match.id].match_terms.push(match_term)
    } else {
      results[match.id] = { ...result, match_terms: [match_term] }
    }
  }
  return Object.values(results)
}

insert({ pk: 'chris@example.com', id: '12345', encrypted_string: 'very secret string...' })
// > inserts 1 doc into source and 3 docs into index for 'very', 'secret', and 'string'; ignores the '...' at the end of the string

insert({ pk: 'topher@example.com', id: '007', encrypted_string: "don't string me along!" })
// > inserts 1 doc into source and 4 docs into index for 'dont' (' removed), 'string', 'me', and 'along'; ignores the '!' at the end of the string

const results = search('secret string')
console.log(results.map(i => [i.id, i.match_terms.length]))
// > prints [['12345', 2], ['007', 1]] because 12345 matches two terms ('string' and 'secret') while 007 matches only one ('string')
```
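For illustration, here is a runnable version of the same idea using plain JavaScript objects in place of the Cosmos DB containers. The `sourceStore`/`indexStore` names and the regex-based tokenizer are assumptions made for this sketch, not SDK APIs:

```javascript
// In-memory sketch of the tokenize-and-index scheme above.
const sourceStore = new Map(); // key: `${pk}##${id}` -> document
const indexStore = [];         // entries: { pk: token, source_pk, source_id }

function tokenize(plaintext) {
  return plaintext
    .toLowerCase()
    .replace(/[^\w\s]/g, '') // strip symbols like .,/()!'
    .split(/\s+/)
    .filter(t => t.length > 0);
}

// One source document plus one index entry per token
function insert(doc) {
  sourceStore.set(`${doc.pk}##${doc.id}`, doc);
  for (const token of tokenize(doc.encrypted_string)) {
    indexStore.push({ pk: token, source_pk: doc.pk, source_id: doc.id });
  }
}

// Tokenize the query, collect matching index entries, group by source doc
function search(queryStr) {
  const terms = new Set(tokenize(queryStr));
  const results = new Map();
  for (const entry of indexStore) {
    if (!terms.has(entry.pk)) continue;
    const key = `${entry.source_pk}##${entry.source_id}`;
    if (results.has(key)) {
      results.get(key).match_terms.push(entry.pk);
    } else {
      results.set(key, { ...sourceStore.get(key), match_terms: [entry.pk] });
    }
  }
  return [...results.values()];
}

insert({ pk: 'chris@example.com', id: '12345', encrypted_string: 'very secret string...' });
insert({ pk: 'topher@example.com', id: '007', encrypted_string: "don't string me along!" });
console.log(search('secret string').map(r => [r.id, r.match_terms.length]));
// [['12345', 2], ['007', 1]]
```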
Notes:
- Throttling: In this design, since the token is the pk, you could run into throttling if a common term is frequently created or read. You could consider prefixing the token with the source pk (but then your searches will always need to be pk-scoped) or using a tokenizer algorithm that removes common words.
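  Both mitigations can be sketched as small helpers (the stop-word list and the helper names are illustrative assumptions, not Cosmos DB APIs):

  ```javascript
  // 1) Drop very common words before indexing so no single token
  //    becomes a hot logical partition.
  const STOP_WORDS = new Set(['the', 'a', 'an', 'and', 'or', 'of', 'to', 'in']);

  function removeStopWords(tokens) {
    return tokens.filter(token => !STOP_WORDS.has(token));
  }

  // 2) Prefix the token with the source pk, spreading writes across
  //    partitions. The trade-off: every search must then be scoped
  //    to a single source pk.
  function scopedIndexKey(sourcePk, token) {
    return `${sourcePk}#${token}`;
  }

  console.log(removeStopWords(['the', 'secret', 'string'])); // → ['secret', 'string']
  console.log(scopedIndexKey('chris@example.com', 'string')); // → 'chris@example.com#string'
  ```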
- Error handling: Since the insert creates multiple documents, an index insert can fail while the source insert succeeds. One way to address this is to have an Azure Function process the change feed and do the indexing work for you, at the cost of some search inconsistency. You could also combine the two approaches, where the Azure Function only backfills index entries when the application itself hit an error. Even in the example above, there is a race condition between the source insert and the index inserts where searches could return partial results.
- Functionality: This only covers new inserts, not updates. Updates would work similarly, but would need to identify which tokens to delete and, ideally, only create index entries for new tokens.
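  A minimal sketch of the token diffing an update would need, assuming you can re-derive the old tokens from the previous plaintext (the `diffTokens` name is an assumption for this sketch):

  ```javascript
  // Given the old and new token lists, compute which index entries to
  // delete and which to create, so an update only touches what changed.
  function diffTokens(oldTokens, newTokens) {
    const oldSet = new Set(oldTokens);
    const newSet = new Set(newTokens);
    return {
      toDelete: [...oldSet].filter(t => !newSet.has(t)),
      toCreate: [...newSet].filter(t => !oldSet.has(t)),
    };
  }

  const { toDelete, toCreate } = diffTokens(
    ['very', 'secret', 'string'],
    ['very', 'public', 'string']
  );
  console.log(toDelete, toCreate); // → ['secret'] ['public']
  ```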
- Sub-token: If you want support for sub-token search, you'd need a tokenizer algorithm which created index items for each subtoken. For example, you might want searches on 'Chris' to include tokens like 'Christopher'. You'd either need some rules engine which created a 'Christopher' and 'Chris' token whenever you saw 'Christopher' or you could split each token into a permutation of substrings ('abc 123' => 'abc', 'ab', 'a', '123', '12', '1'). Generally, there are a lot of existing tokenizer algorithms out there that I'd recommend using off the shelf before writing your own if you're getting into these kinds of complex support needs.
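  As one illustration of the substring-permutation idea, here is a sketch that expands each token into its prefixes; the `prefixTokens` helper and its `minLength` cap are assumptions for this sketch, not part of any library:

  ```javascript
  // Expand a token into all of its prefixes so a search for 'chris'
  // can match an indexed 'christopher'. This multiplies index size,
  // so a real system would likely cap or filter the prefixes.
  function prefixTokens(token, minLength = 1) {
    const prefixes = [];
    for (let len = token.length; len >= minLength; len--) {
      prefixes.push(token.slice(0, len));
    }
    return prefixes;
  }

  console.log(prefixTokens('abc')); // → ['abc', 'ab', 'a']
  console.log(prefixTokens('123')); // → ['123', '12', '1']
  ```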
- Cost: Because this approach creates a lot of extra items and storage, you should only use it when absolutely necessary. You would definitely want to consider a tokenizer algorithm that only indexes fairly distinctive keywords.
- Security: If you take an encrypted string and split it into encrypted tokens, you're sharing more information about the contents of that string: anyone with access to the data can now see which encrypted strings share the same token. If those tokens are simple, a bad actor may also be able to guess the plaintext value of a token through frequency analysis. Whether this gives a bad actor enough information to reveal something you'd rather keep hidden is very context dependent, so you must review any approach like this with a security expert.
- Homomorphic Encryption: If you forgo Cosmos DB managed encryption, which only supports equality operations, you could use a homomorphic encryption library to compare data. The challenge is that since Cosmos DB doesn't support those operations natively, you'd still likely need to pull more data locally to analyze than is optimal. Homomorphic encryption is also still relatively new and may not meet specific compliance requirements. Generally, I'd recommend against this approach unless you're confident at the outset that it will be worth the effort and meet your requirements. Microsoft's SEAL library is one example of how to do this: https://github.com/microsoft/SEAL/tree/main