CosmosDB encrypted data queries other than equality filters

KT 190 Reputation points
2024-08-02T04:23:41.5466667+00:00

Hi

Currently, encrypted properties in Azure Cosmos DB can only be used with equality filters (e.g., WHERE c.property = @Value). I need to perform queries using the LIKE operator.

https://learn.microsoft.com/en-us/azure/cosmos-db/how-to-always-encrypted?tabs=dotnet#filter-queries-on-encrypted-properties

Encrypted properties can only be used in equality filters (WHERE c.property = @Value). Any other usage will return unpredictable and wrong query results. This constraint will be better enforced in next versions of the SDK.

Given that encryption restricts query operations to equality filters, what strategies or best practices can be used to handle scenarios where LIKE or similar queries are necessary? Are there any recommended approaches to achieve this while maintaining data security?

https://devblogs.microsoft.com/cosmosdb/like-keyword-cosmosdb/

SELECT *
FROM c
WHERE c.description LIKE "%cereal%"

Answer accepted by question author
  1. Chris Anderson 75 Reputation points Microsoft Employee
    2024-08-14T18:25:57.18+00:00

    There is no formal support for non-equality operations on encrypted values in Cosmos DB.

    Due to the nature of how Always Encrypted works, the service never receives enough information about the string to index it, or even scan it, to fulfill range or partial comparisons. This is by design: it ensures that Cosmos DB never holds your data in plaintext, even in memory. Any solution for this would need to be orchestrated by your application layer.
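    To see why equality is the only operation that survives, here is a minimal sketch (this is not the Always Encrypted implementation, just an illustration using Node's built-in crypto module): deterministic encryption makes equal plaintexts produce identical ciphertexts, so byte-for-byte matching works, but all ordering and substring structure is destroyed.

    ```javascript
    const crypto = require('crypto');

    const key = crypto.randomBytes(32);
    const fixedIv = Buffer.alloc(16); // deterministic: same IV every time

    // Encrypt with a fixed IV so identical plaintexts yield identical ciphertexts
    function encryptDeterministic(plaintext) {
        const cipher = crypto.createCipheriv('aes-256-cbc', key, fixedIv);
        return Buffer.concat([cipher.update(plaintext, 'utf8'), cipher.final()]).toString('hex');
    }

    const a = encryptDeterministic('cereal');
    const b = encryptDeterministic('cereal');
    const c = encryptDeterministic('cereals');

    console.log(a === b);         // true  -> equality filters can work on ciphertext
    console.log(c.startsWith(a)); // false -> substring/prefix structure is gone
    ```

    Since the server only ever sees the hex blobs, `WHERE c.property = @Value` can be answered by comparing blobs, but a `LIKE` or range predicate has nothing to work with.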

    There are options you could consider, broadly, for this type of problem. None of them are particularly straightforward and you'll need to do additional research and verification with security experts. If this is not a fundamental requirement, the effort and cost may not be worth it.

    1. Tokenization: If you need to perform "LIKE" expressions on an encrypted string, you could tokenize that string into several substrings (for example, splitting on whitespace). You'd then insert a document for each substring with a pointer back to the original document and the encrypted substring. When you search, you'd pass the search term through the same tokenizer, then do exact matches on all (or some) of the tokens.
    //pseudo-code - not tested, no error handling, etc.
    
    source = database['source'].items // assume "encrypted_string" is encrypted  
    index = database['index'].items // assume "encrypted_token" is encrypted  
    
    // simple tokenizer which lower-cases all letters, removes symbols like ".,/()'!", and splits on whitespace
    function tokenize(plaintext_string) {
        return plaintext_string.toLowerCase().replace(SYMBOLS, '').split(WHITESPACE)
    }
    
    // Each insert creates 1 doc in source and N docs in index, depending on tokenized length
    function insert(doc) {
       const tokens = tokenize(doc.encrypted_string)
       source.create(doc)
       for (const token of tokens) {
           index.create({ pk: token, id: `${doc.pk}##${doc.id}`, source_pk: doc.pk, source_id: doc.id })
       }
    }
    
    const SEARCH_PREFIX = 'select * from index where '
    
    // Returns a map of results from source, keyed by source document, each with a "match_terms" field listing which terms matched
    function search(query_str) {
        const query_tokens = tokenize(query_str).map(token => `encrypted_token = "${token}"`)
        const sql_filter = SEARCH_PREFIX + query_tokens.join(' or ')
        const matches = index.query(sql_filter).fetchAll()
        const results = {}
        for (const match of matches) {
            const result = source.item(match.source_pk, match.source_id)
            const match_term = match.pk
            if (results[match.id] !== undefined) {
                results[match.id].match_terms.push(match_term)
            } else {
                results[match.id] = { ...result, match_terms: [match_term] }
            }
        }
        return results
    }
    
    
    insert({pk: 'chris@example.com', id: '12345', encrypted_string: 'very secret string...'})
    // > inserts 1 doc into source and 3 docs into index for 'very', 'secret', and 'string'. Ignores the '...' at the end of the string
    
    insert({pk: 'topher@example.com', id: '007', encrypted_string: "don't string me along!"})
    // > inserts 1 doc into source and 4 docs into index for 'dont' (' removed), 'string', 'me' and 'along'. Ignores the '!' at the end of the string
    
    const results = search('secret string')
    console.log(Object.values(results).map(i => [i.id, i.match_terms.length]))
    // > prints [['12345', 2], ['007', 1]] because 12345 matches two terms ('string' and 'secret') while 007 matches only 1 ('string')
    

    Notes:

    • Throttling: In this design, since the term is the pk, you could end up with throttling if a common term is frequently created or read. You could consider prefixing with the source pk (but then your searches are always going to need to be pk scoped) or using a tokenizer algorithm which removes common words.
    • Error handling: Since the insert creates multiple documents, you could fail to insert into the index yet still want to report success for the source write. One way to address this is to have an Azure Function process the change feed and do the indexing work for you, though that means searches would have some inconsistency. You could also combine both approaches, where the Azure Function only performs the insert if the application itself hit an error. Even in the example above, there is a race condition between the source insert and the index inserts where searches could return partial results.
    • Functionality: This only covers new inserts, not updates. Updates would work similarly, but would need to be able to identify which tokens needed to be deleted and, optimally, only create new indexes for new tokens.
    • Sub-token: If you want to support sub-token search, you'd need a tokenizer algorithm which creates index items for each subtoken. For example, you might want searches on 'Chris' to include tokens like 'Christopher'. You'd either need a rules engine which creates both a 'Christopher' and a 'Chris' token whenever it sees 'Christopher', or you could split each token into a permutation of substrings ('abc 123' => 'abc', 'ab', 'a', '123', '12', '1'). Generally, there are many existing tokenizer algorithms, and I'd recommend using one off the shelf before writing your own if you're getting into these kinds of complex requirements.
    • Cost: Because this is going to be creating a lot of extra items and storage, you should only really use this approach when absolutely necessary. You would definitely want to consider a tokenizer algorithm which only used fairly unique keywords.
    • Security: If you take an encrypted string and split it up into encrypted tokens, you're sharing more information about the contents of that string. You can now see which encrypted strings share the same token. If those tokens are simple, it is also possible for a bad actor to guess the plaintext value of a token because of frequency/etc. It is possible this could give enough information to a bad actor to reveal information you'd rather not reveal, but this would be very context dependent. You must review any approach with a security expert to consider the risk presented with this approach.
    2. Homomorphic Encryption: If you forgo Cosmos DB managed encryption, which only supports equality operations, you could use a homomorphic encryption library to compare data. The challenge is that since Cosmos DB doesn't support those operations natively, you'd still likely need to pull more data locally to analyze than would be optimal. Homomorphic encryption is also still relatively new and may not meet specific compliance requirements. Generally, I'd recommend against this approach unless you're confident at the outset that it would be worth the effort and meet your requirements. Microsoft has a library, SEAL, which is one example of how to do this: https://github.com/microsoft/SEAL/tree/main
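    Returning to option 1, the tokenized-index flow can be exercised end to end with plain in-memory stores standing in for the Cosmos containers. This is a runnable sketch only: encryption is omitted so the matching logic is visible, and the names (`sourceStore`, `indexStore`, the `secret` field, the regex tokenizer) are illustrative, not part of any SDK.

    ```javascript
    const sourceStore = new Map(); // stands in for the 'source' container; key: `${pk}##${id}`
    const indexStore = [];         // stands in for the 'index' container; entries: { pk: token, source_pk, source_id }

    // Lower-case, strip symbols like . , / ( ) ' !, split on whitespace
    function tokenize(text) {
        return text
            .toLowerCase()
            .replace(/[^a-z0-9\s]/g, '')
            .split(/\s+/)
            .filter(t => t.length > 0);
    }

    // 1 source doc plus N index docs per insert
    function insert(doc) {
        sourceStore.set(`${doc.pk}##${doc.id}`, doc);
        for (const token of tokenize(doc.secret)) {
            indexStore.push({ pk: token, source_pk: doc.pk, source_id: doc.id });
        }
    }

    // Tokenize the query, exact-match tokens against the index, group hits by source doc
    function search(queryStr) {
        const wanted = new Set(tokenize(queryStr));
        const results = {};
        for (const entry of indexStore) {
            if (!wanted.has(entry.pk)) continue;
            const key = `${entry.source_pk}##${entry.source_id}`;
            if (results[key]) {
                results[key].match_terms.push(entry.pk);
            } else {
                results[key] = { ...sourceStore.get(key), match_terms: [entry.pk] };
            }
        }
        return Object.values(results);
    }

    insert({ pk: 'chris@example.com', id: '12345', secret: 'very secret string...' });
    insert({ pk: 'topher@example.com', id: '007', secret: "don't string me along!" });

    console.log(search('secret string').map(r => [r.id, r.match_terms.length]));
    // -> [ [ '12345', 2 ], [ '007', 1 ] ]
    ```

    In the real design, `entry.pk` would hold the encrypted token and the lookup would be an equality filter the service can evaluate on ciphertext; the grouping and ranking logic stays in the application either way.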
