Indexing DocumentDB with Azure Search

Article
03/09/2015

This post will show how to configure DocumentDB as a data source to be indexed using Azure Search.

Background

I wrote a post, “Indexing Azure SQL Database with Azure Search”, that shows how to index an Azure SQL Database using Azure Search. Once I started playing around with it and reading through the documentation, I noticed that the documentation for the Create Data Source API mentions DocumentDB as a data source. I thought I would write a quick post showing how to do this, since I haven’t seen any documentation on it yet.

Create the DocumentDB Database

Go to the new Azure portal (https://portal.azure.com) and create a new DocumentDB service.

Provide the name, resource group, and location.

Click OK. Provisioning takes around 10 minutes. Once the service is created, create a database named “SampleDatabase”.

Once created, scroll down and click on the newly created database.

Click on the Add Collection button and name the new collection “families”.

Click on the newly added collection.

Click “Create Document”.

You will be able to add JSON into the editor. We create two documents:

FlintstoneFamily

{
"id": "FlintstoneFamily",
"lastName": "Flintstone",
"parents": [
{ "firstName": "Fred" },
{ "firstName": "Wilma"}
],
"children": [
{
"firstName": "Pebbles",
"gender": "female"
}
],
"pets": [{ "firstName": "Dino" }]
}

And:

RubbleFamily

{
"id": "RubbleFamily",
"lastName": "Rubble",
"parents": [
{ "firstName": "Barney" },
{ "firstName": "Betty" }
],
"children": [
{
"firstName": "Bamm Bamm",
"gender": "male"
}
],
"pets": [{ "firstName": "Hoppy" }]
}

We can run a quick SQL query against the documents to confirm they are in place:

DocumentDB Query

SELECT c.firstName from Families f join c IN f.children

The query returns one result per child, so we can see that both documents are in place.
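To make the JOIN semantics concrete, here is a minimal Python sketch that emulates what the query above computes over the two sample documents. The function name `join_children` is hypothetical, and this is plain Python rather than DocumentDB code: it only mirrors how the JOIN pairs each document with the elements of its own “children” array.

```python
# Emulates: SELECT c.firstName FROM Families f JOIN c IN f.children
# Plain Python for illustration only; this does not call DocumentDB.

families = [
    {
        "id": "FlintstoneFamily",
        "lastName": "Flintstone",
        "children": [{"firstName": "Pebbles", "gender": "female"}],
    },
    {
        "id": "RubbleFamily",
        "lastName": "Rubble",
        "children": [{"firstName": "Bamm Bamm", "gender": "male"}],
    },
]


def join_children(documents):
    """Pair each document with each of its own children (an intra-document join)."""
    return [
        {"firstName": child["firstName"]}
        for doc in documents
        for child in doc.get("children", [])
    ]


print(join_children(families))
```

Each family contributes one row per child, which is why the two sample documents produce two results.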

Now let’s configure Azure Search.

Create the Search Service

Create a new Azure Search service.

Once created, you will need the key for the service in order to issue REST calls.

Create the Azure Search Data Source

Just as we did in the previous post, we will use the REST API to create a data source using the Create Data Source API. You need to provide the name of your service, the API Key for your Search service, the name of your DocumentDB service, and the account key for your DocumentDB service.

Create Data Source

POST https://<Your Search Service>.search.windows.net/datasources?api-version=2015-02-28 HTTP/1.1
Content-Type: application/json
api-key: <Your API Key>
{
"name" : "docdbsource",
"type" : "documentdb",
"credentials" :
{
"connectionString": "AccountEndpoint=https://<Your DocumentDB Service>.documents.azure.com;Database=SampleDatabase;AccountKey=<Your Account Key>"
},
"container" : { "name" : "families" }
}

The connection string uses the name of your DocumentDB service as well as an AccountKey. The key can be obtained in the portal.
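If you would rather script this step than paste it into an HTTP tool, the same request can be assembled in a few lines. This is only a sketch using Python’s standard library; the function name and its parameters are my own placeholders, and the data source name is passed in so that it matches whatever name your indexer refers to. The request is built but not sent.

```python
import json
import urllib.request

API_VERSION = "2015-02-28"


def build_create_data_source_request(search_service, api_key, docdb_account,
                                     docdb_key, data_source_name):
    """Build (but do not send) the Create Data Source request shown above."""
    # Same connection string layout as in the raw request.
    connection_string = (
        f"AccountEndpoint=https://{docdb_account}.documents.azure.com;"
        f"Database=SampleDatabase;AccountKey={docdb_key}"
    )
    body = {
        "name": data_source_name,
        "type": "documentdb",
        "credentials": {"connectionString": connection_string},
        "container": {"name": "families"},
    }
    return urllib.request.Request(
        url=f"https://{search_service}.search.windows.net/datasources"
            f"?api-version={API_VERSION}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "api-key": api_key},
        method="POST",
    )


# Placeholder values; substitute your own service names and keys.
request = build_create_data_source_request(
    "my-search-service", "my-api-key", "my-docdb-account", "my-account-key",
    data_source_name="docdbsource",
)
# To actually send it: urllib.request.urlopen(request)
```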

Create an Azure Search Index

Now that we have the data source, we create an index. You can use the portal to do this, or we can just use the API again. Azure Search does not currently support complex JSON types; it only supports simple types and collections of strings. Note that the “parents”, “children”, and “pets” fields here are represented as string collections.

Create Index

POST https://<Your Search Service>.search.windows.net/indexes/?api-version=2015-02-28 HTTP/1.1
Content-Type: application/json
api-key: <Your API Key>
{
"name":"familyindex",
"fields":[
{"name":"id","type":"Edm.String","searchable":false,"filterable":false,"retrievable":true,"sortable":false,"facetable":false,"key":true},
{"name":"lastName","type":"Edm.String","searchable":true,"filterable":false,"retrievable":true,"sortable":true,"facetable":true,"key":false},
{"name":"parents","type":"Collection(Edm.String)","searchable":true,"filterable":true,"retrievable":true,"sortable":false,"facetable":false,"key":false},
{"name":"children","type":"Collection(Edm.String)","searchable":true,"filterable":true,"retrievable":true,"sortable":false,"facetable":false,"key":false},
{"name":"pets","type":"Collection(Edm.String)","searchable":true,"filterable":true,"retrievable":true,"sortable":false,"facetable":false,"key":false}
]
}
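Spelling out every attribute on every field gets repetitive, so here is one way the same schema could be generated. This is a sketch, not part of any SDK: the helpers `field` and `string_collection` are hypothetical names, and their defaults are simply chosen to reproduce the JSON above.

```python
import json


def field(name, type_="Edm.String", *, searchable=True, filterable=False,
          sortable=False, facetable=False, key=False):
    """Return one field definition with every attribute spelled out."""
    return {
        "name": name,
        "type": type_,
        "searchable": searchable,
        "filterable": filterable,
        "retrievable": True,
        "sortable": sortable,
        "facetable": facetable,
        "key": key,
    }


def string_collection(name):
    """Field that holds the nested JSON values as a collection of strings."""
    return field(name, "Collection(Edm.String)", filterable=True)


family_index = {
    "name": "familyindex",
    "fields": [
        field("id", searchable=False, key=True),
        field("lastName", sortable=True, facetable=True),
        string_collection("parents"),
        string_collection("children"),
        string_collection("pets"),
    ],
}

# This JSON text is what would go in the body of the Create Index request.
print(json.dumps(family_index, indent=2))
```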

Create the Indexer

The last step is to create an indexer. This is the part that connects the index to the data source, and allows us to run the indexer manually or on a scheduled basis.

Create Indexer

POST https://<Your Search Service>.search.windows.net/indexers?api-version=2015-02-28 HTTP/1.1
Content-Type: application/json
api-key: <Your API Key>
{
"name" : "familyindexer",
"dataSourceName" : "docdbsource",
"targetIndexName" : "familyindex"
}
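The request above creates the indexer without a schedule. As a rough sketch of the two options mentioned, the Python below builds a Create Indexer request with an optional schedule and a separate request that runs the indexer on demand. Treat the details as assumptions to verify against the indexer documentation: as I understand it, the schedule is a `schedule` object whose `interval` is an ISO 8601 duration, and a manual run is a POST to the indexer’s `/run` endpoint. The function names are hypothetical, and nothing is sent.

```python
import json
import urllib.request

API_VERSION = "2015-02-28"


def build_create_indexer_request(search_service, api_key, data_source_name,
                                 interval=None):
    """Build the Create Indexer request, optionally with a schedule."""
    body = {
        "name": "familyindexer",
        "dataSourceName": data_source_name,
        "targetIndexName": "familyindex",
    }
    if interval is not None:
        # Assumption: ISO 8601 duration, e.g. "PT1H" for hourly runs.
        body["schedule"] = {"interval": interval}
    return urllib.request.Request(
        url=f"https://{search_service}.search.windows.net/indexers"
            f"?api-version={API_VERSION}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "api-key": api_key},
        method="POST",
    )


def build_run_indexer_request(search_service, api_key):
    """Build the request that asks the service to run the indexer now."""
    return urllib.request.Request(
        url=f"https://{search_service}.search.windows.net/indexers/familyindexer/run"
            f"?api-version={API_VERSION}",
        headers={"api-key": api_key},
        method="POST",
    )


# Placeholder values; send either request with urllib.request.urlopen(...).
hourly = build_create_indexer_request("my-search-service", "my-api-key",
                                      "docdbsource", interval="PT1H")
run_now = build_run_indexer_request("my-search-service", "my-api-key")
```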

Once the indexer is created, you can check its status.

Get Indexer Status

GET https://<Your Search Service>.search.windows.net/indexers/familyindexer/status?api-version=2015-02-28 HTTP/1.1
Content-Type: application/json
api-key: <Your API Key>

The result shows how many items were processed and the last time the indexer executed. As an example, here is the result returned from the status method for my service. Note there are 2 items processed, one for each document.

Status Response

{
"@odata.context":"https://kirkesearch.search.windows.net/$metadata#Microsoft.Azure.Search.V2015_02_28.IndexerExecutionInfo",
"status":"running",
"lastResult":
{
"status":"success",
"errorMessage":null,
"startTime":"2015-03-09T18:13:06.928Z",
"endTime":"2015-03-09T18:13:07.552Z",
"errors":[],
"itemsProcessed":2,
"itemsFailed":0,
"initialTrackingState":null,
"finalTrackingState":null
},
"executionHistory":
[
{
"status":"success",
"errorMessage":null,
"startTime":"2015-03-09T18:13:06.928Z",
"endTime":"2015-03-09T18:13:07.552Z",
"errors":[],
"itemsProcessed":2,
"itemsFailed":0,
"initialTrackingState":null,
"finalTrackingState":null
}
]
}
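If you poll the status from a script, you usually only care about a handful of these fields. The sketch below pulls them out of a status response; `summarize_last_run` is a hypothetical helper, and the sample text is a trimmed copy of the response above.

```python
import json


def summarize_last_run(status_json):
    """Extract the fields of interest from a Get Indexer Status response."""
    status = json.loads(status_json)
    last = status.get("lastResult") or {}  # no lastResult before the first run
    return {
        "indexerStatus": status.get("status"),
        "lastRunStatus": last.get("status"),
        "itemsProcessed": last.get("itemsProcessed", 0),
        "itemsFailed": last.get("itemsFailed", 0),
    }


# Trimmed copy of the status response shown above.
sample = """
{
  "status": "running",
  "lastResult": {
    "status": "success",
    "errorMessage": null,
    "itemsProcessed": 2,
    "itemsFailed": 0
  }
}
"""

print(summarize_last_run(sample))
```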

Testing It Out

Now that we have the index, data source, and indexer, we can test the application. While we could use an SDK to write an application, it makes sense to show how easy it is to query using the REST API and a simple HTTP GET request.

GET Search Results

GET https://<Your Search Service>.search.windows.net/indexes/familyindex/docs?api-version=2015-02-28&search=Flintstone HTTP/1.1
Content-Type: application/json
api-key: <Your API Key>

The response includes our data. Our application could then parse the results and present them to the end user. Since this is just an HTTP GET request that returns JSON, we can easily use it from any platform!
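As one illustration of calling this from another platform, here is a small Python sketch that builds the same search request with the standard library. The function name and the placeholder service name and key are mine; the request is constructed but not sent.

```python
import urllib.parse
import urllib.request

API_VERSION = "2015-02-28"


def build_search_request(search_service, api_key, search_text,
                         index_name="familyindex"):
    """Build the GET request that searches the index for search_text."""
    # urlencode escapes the search text so that spaces and symbols are safe.
    query = urllib.parse.urlencode(
        {"api-version": API_VERSION, "search": search_text}
    )
    return urllib.request.Request(
        url=f"https://{search_service}.search.windows.net/indexes/"
            f"{index_name}/docs?{query}",
        headers={"api-key": api_key},
        method="GET",
    )


request = build_search_request("my-search-service", "my-api-key", "Flintstone")
print(request.full_url)
# To send it and read the JSON text:
#   urllib.request.urlopen(request).read().decode("utf-8")
```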

Search Response

{
"id":"FlintstoneFamily",
"isRegistered":false,
"lastName":"Flintstone",
"parents":
[
"{\r\n \"firstName\": \"Fred\"\r\n}",
"{\r\n \"firstName\": \"Wilma\"\r\n}"
],
"children":
[
"{\r\n \"firstName\": \"Pebbles\",\r\n \"gender\": \"female\"\r\n}"
],
"pets":
[
"{\r\n \"firstName\": \"Dino\"\r\n}"
]
}

We can see that the properties are returned just as we expect: the “id” property holds the value “FlintstoneFamily”, and the “lastName” property holds the value “Flintstone”.

Flattening Data

Something to call out: notice the “parents”, “children”, and “pets” properties in the previous search response, where the complex types are represented as string arrays instead of JSON objects. This happens because we specified in the index schema that these values were of type “Collection(Edm.String)”, since Azure Search doesn’t yet understand complex JSON types. If you have a complex type and need to search against its contents, a “query” property can be applied to the data source’s container to flatten the data so that Azure Search can index it, as shown in the “query” line of the request below. We can use an HTTP PUT to update the data source with our query:

Code Snippet

PUT https://<Your Search Service>.search.windows.net/datasources/docdbsource?api-version=2015-02-28 HTTP/1.1
Content-Type: application/json
api-key: <Your API Key>
{
"name" : "docdbsource",
"type" : "documentdb",
"credentials" :
{
"connectionString": "AccountEndpoint=https://<Your DocDB Service>.documents.azure.com;AccountKey=<Your Account Key>;Database=SampleDatabase"
},
"container" :
{
"name" : "families",
"query" : "SELECT f.id, c.firstName, f.lastName, c.gender FROM families f join c IN f.children"
}
}

If we flatten the data in this manner (changing the shape of the data returned), we need to delete the index and create it again using the four string fields “id”, “firstName”, “lastName”, and “gender”. Notice I said delete the index: updates to the index currently only allow you to add new fields, not modify existing ones.

Flattened Schema

POST https://<Your Search Service>.search.windows.net/indexes/?api-version=2015-02-28 HTTP/1.1
Content-Type: application/json
api-key: <Your API Key>
{
"name":"familyindex",
"fields":[
{"name":"id","type":"Edm.String","searchable":false,"filterable":false,"retrievable":true,"sortable":false,"facetable":false,"key":true},
{"name":"lastName","type":"Edm.String","searchable":true,"filterable":false,"retrievable":true,"sortable":true,"facetable":true,"key":false},
{"name":"firstName","type":"Edm.String","searchable":true,"filterable":false,"retrievable":true,"sortable":true,"facetable":true,"key":false},
{"name":"gender","type":"Edm.String","searchable":true,"filterable":false,"retrievable":true,"sortable":true,"facetable":true,"key":false}
]
}
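The delete-then-recreate sequence described above can be sketched as two requests issued in order. This is a minimal illustration with Python’s standard library; `build_recreate_index_requests` is a hypothetical helper, and it assumes the index is deleted with an HTTP DELETE against its name, mirroring the other calls in this post. The requests are built but not sent.

```python
import json
import urllib.request

API_VERSION = "2015-02-28"


def build_recreate_index_requests(search_service, api_key, index_definition):
    """Return [delete, create] requests for replacing an index definition."""
    base = f"https://{search_service}.search.windows.net/indexes"
    name = index_definition["name"]
    delete = urllib.request.Request(
        url=f"{base}/{name}?api-version={API_VERSION}",
        headers={"api-key": api_key},
        method="DELETE",
    )
    create = urllib.request.Request(
        url=f"{base}?api-version={API_VERSION}",
        data=json.dumps(index_definition).encode("utf-8"),
        headers={"Content-Type": "application/json", "api-key": api_key},
        method="POST",
    )
    return [delete, create]


# The field list is elided here; use the flattened schema shown above.
requests_in_order = build_recreate_index_requests(
    "my-search-service", "my-api-key", {"name": "familyindex", "fields": []}
)
# To apply them:
#   for request in requests_in_order:
#       urllib.request.urlopen(request)
```

Keeping the two requests in one list makes the ordering explicit: the create call would fail while an index with the same name still exists.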

When we execute a search query now, we can see the individual properties of the object returned, and “gender” and “firstName” are part of the index.

Search Results

{
"@odata.context":"https://kirkesearch.search.windows.net/indexes('familyindex')/$metadata#docs(id,lastName,firstName,gender)",
"value":
[
{
"@search.score":0.38537163,
"id":"FlintstoneFamily",
"lastName":"Flintstone",
"firstName":"Pebbles",
"gender":"female"
}
]
}
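Because the flattened results are plain properties, an application can consume them directly. The sketch below reads the “value” array from a response like the one above and drops the “@search.score” metadata; `extract_hits` is a hypothetical helper name, and the sample text is a trimmed copy of that response.

```python
import json


def extract_hits(response_text):
    """Return the matching documents without the '@'-prefixed metadata keys."""
    payload = json.loads(response_text)
    return [
        {key: value for key, value in hit.items() if not key.startswith("@")}
        for hit in payload.get("value", [])
    ]


# Trimmed copy of the search results shown above.
sample = """
{
  "value": [
    {
      "@search.score": 0.38537163,
      "id": "FlintstoneFamily",
      "lastName": "Flintstone",
      "firstName": "Pebbles",
      "gender": "female"
    }
  ]
}
"""

for hit in extract_hits(sample):
    print(hit["firstName"], hit["lastName"], hit["gender"])
```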


A word of caution here… there is only one child per household in my data set. If I had instead used “parents” where there are multiple results per household, I would not have received data for the “firstName” property because the indexer was unable to differentiate between results based on the key in the index. If this is the case for you, you will need to structure your flattening query differently to create a unique key.
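One way to catch this before reindexing is to check whether the rows a flattening query would produce have unique keys. The sketch below is plain Python over the sample documents, not DocumentDB or Azure Search code; `flatten` and `duplicate_keys` are hypothetical helpers that only model the shape of the flattened rows.

```python
from collections import Counter

families = [
    {
        "id": "FlintstoneFamily",
        "lastName": "Flintstone",
        "parents": [{"firstName": "Fred"}, {"firstName": "Wilma"}],
        "children": [{"firstName": "Pebbles", "gender": "female"}],
    },
    {
        "id": "RubbleFamily",
        "lastName": "Rubble",
        "parents": [{"firstName": "Barney"}, {"firstName": "Betty"}],
        "children": [{"firstName": "Bamm Bamm", "gender": "male"}],
    },
]


def flatten(documents, array_name):
    """Model the rows produced by joining each document with one of its arrays."""
    return [
        {"id": doc["id"], "lastName": doc["lastName"],
         "firstName": item["firstName"]}
        for doc in documents
        for item in doc.get(array_name, [])
    ]


def duplicate_keys(rows, key="id"):
    """Return the key values that appear on more than one row."""
    counts = Counter(row[key] for row in rows)
    return sorted(value for value, count in counts.items() if count > 1)


print(duplicate_keys(flatten(families, "children")))  # no repeated keys
print(duplicate_keys(flatten(families, "parents")))   # both family ids repeat
```

Flattening on “children” gives one row per “id” in this data set, while flattening on “parents” gives two rows per “id”, which is exactly the situation the index key cannot represent.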

To learn more about DocumentDB queries, see the article Query DocumentDB.

The Azure Search team uses UserVoice to track suggestions for service improvements, and a suggestion for the ability to model complex types in the index is already listed there. If you want to see this feature, then add some votes!

Summary

I have been playing around with DocumentDB lately and I think it is a fantastic service (more on that later). Coupling it with Azure Search just feels natural. Connecting the two is easy, and Azure Search does a fantastic job of indexing top-level content. If your documents are flat, then you will find this is a very simple win for your application. For documents with nested complexity, you will need to evaluate how you can flatten the structure.