August 2011
Volume 26 Number 08
Forecast: Cloudy - Searching Microsoft Azure Storage with Lucene.Net
By Joseph Fultz | August 2011
You know what you need is buried in your data somewhere in the cloud, but you don't know where. This happens so often to so many, and typically the response is to implement search in one of two ways. The simplest is to put key metadata into a SQL Azure database and then use a WHERE clause with a LIKE query against that metadata to find the Uri (a minimal sketch follows the list below). This approach has obvious shortcomings: matching is limited to the key pieces of metadata rather than the document content, database size can become an issue for SQL Azure, storing metadata in SQL Azure adds premium costs, and a specialized indexing mechanism often ends up being built into the persistence layer. Beyond those shortcomings, more specialized search capabilities simply won't be there, such as:
- Relevance ranking
- Language tokenization
- Phrase matching and near matching
- Stemming and synonym matching
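For reference, here's a minimal sketch of that metadata-plus-LIKE approach; the table, column and method names are hypothetical and not part of the sample (requires System.Data.SqlClient and System.Collections.Generic):
public static List<string> FindDocumentUris(string connectionString, string searchTerms)
{
    // Hypothetical DocumentMetadata table holding the blob Uri plus a few key fields.
    List<string> uris = new List<string>();
    using (SqlConnection connection = new SqlConnection(connectionString))
    using (SqlCommand command = new SqlCommand(
        "SELECT Uri FROM DocumentMetadata WHERE Title LIKE @terms OR Keywords LIKE @terms",
        connection))
    {
        command.Parameters.AddWithValue("@terms", "%" + searchTerms + "%");
        connection.Open();
        using (SqlDataReader reader = command.ExecuteReader())
        {
            while (reader.Read())
            {
                uris.Add(reader.GetString(0)); // the blob Uri
            }
        }
    }
    return uris;
}
It works, but only as well as the metadata you remembered to store.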
A second approach I've seen is to index the cloud content from the local search indexer, and this has its own problems. Either the document is indexed locally and the Uri fixed up afterward, which complicates persistence and indexing because the file must exist both locally and in the cloud; or the local indexing service reaches out to the cloud, which hurts performance and increases bandwidth consumption and cost. Using your local search engine could also mean increased licensing costs. I've seen a hybrid of the two approaches using a local SQL Server and full-text indexing.
Theoretically, when SQL Azure adds full-text indexing, you’ll be able to use the first method more satisfactorily, but it will still require a fair amount of database space to hold the indexed content. I wanted to address the problem and meet the following criteria:
- Keep costs relating to licensing, space and bandwidth low.
- Have a real search (not baling wire and duct tape wrapped around SQL Server).
- Design and implement an architecture analogous to what I might implement in the corporate datacenter.
Search Architecture Using Lucene.Net
I want a real search architecture and, thus, need a real indexing and search engine. Fortunately, many others wanted the same thing and created a nice .NET version of the open source Lucene search and indexing library, which you’ll find at https://lucene.apache.org/core/. Moreover, Tom Laird-McConnell created a fantastic library for using Azure Storage with Lucene.Net; you’ll find it at code.msdn.microsoft.com/AzureDirectory. With these two libraries I only need to write code to crawl and index the content and a search service to find the content. The architecture will mimic typical search architecture, with index storage, an indexing service, a search service and some front-end Web servers to consume the search service (see Figure 1).
Figure 1 Search Architecture
The Lucene.Net and AzureDirectory libraries will run on a Worker Role that will serve as the Indexing Service, but the front-end Web Role only needs to consume the search service and doesn’t need the search-specific libraries. Configuring the storage and compute instances in the same region should keep bandwidth use—and costs—down during indexing and searching.
Crawling and Indexing
The Worker Role is responsible for crawling the documents in storage and indexing them. I've narrowed the scope to handle only Word .docx documents, using the OpenXML SDK 2.0, available at msdn.microsoft.com/library/bb456488. I've chosen to include the latest code release for both AzureDirectory and Lucene.Net in my project, rather than just referencing the compiled libraries.
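The snippets that follow refer to a storageAccount and a BlobClient; the article doesn't show their setup, but a minimal sketch, assuming a connection-string setting named DataConnectionString in the role configuration (the setting name is an assumption), might look like this:
// Assumed setup for the storage objects used by the indexing code.
// Requires Microsoft.WindowsAzure, Microsoft.WindowsAzure.ServiceRuntime
// and Microsoft.WindowsAzure.StorageClient.
CloudStorageAccount storageAccount = CloudStorageAccount.Parse(
    RoleEnvironment.GetConfigurationSettingValue("DataConnectionString"));
CloudBlobClient BlobClient = storageAccount.CreateCloudBlobClient();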
Within the Run method, I do a full index at the start and then fire an incremental update within the sleep loop that’s set up, like so:
Index(true);
while (true)
{
    Thread.Sleep(18000);
    Trace.WriteLine("Working", "Information");
    Index(false);
}
For my sample, I keep the loop going at a reasonable frequency by sleeping for 18,000 ms. I haven't created a method for triggering an index, but it would be easy enough to add a simple service that calls this same Index method on demand, whether from an admin console or from a process that monitors updates and additions to the storage container. In any case, I still want a scheduled crawl, and this loop can serve as a simple implementation of it.
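One lightweight possibility, not part of the sample, is to poll an Azure queue inside the same loop and treat any message as an index request; the queue name here is an assumption:
// Hypothetical on-demand trigger: a message on an "index-requests" queue
// (name is an assumption) kicks off an incremental index pass.
CloudQueueClient queueClient = storageAccount.CreateCloudQueueClient();
CloudQueue indexRequests = queueClient.GetQueueReference("index-requests");
indexRequests.CreateIfNotExist();
while (true)
{
    CloudQueueMessage request = indexRequests.GetMessage();
    if (request != null)
    {
        // On-demand request from an admin console or monitoring process.
        indexRequests.DeleteMessage(request);
        Index(false);
    }
    else
    {
        // Fall back to the scheduled crawl.
        Thread.Sleep(18000);
        Index(false);
    }
}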
Within the Index(bool) method, I first check whether the index exists. If it does, I won't generate a new one, because doing so would obliterate the existing index and force a full, unnecessary indexing run:
DateTime LastModified = new DateTime(IndexReader.LastModified(azureDirectory), DateTimeKind.Utc);
bool GenerateIndex = !IndexReader.IndexExists(azureDirectory);
DoFull = GenerateIndex;
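One caveat worth noting: if no index exists yet (for instance, on the very first run), asking the IndexReader for a last-modified time is likely to fail. A slightly safer ordering, as a sketch, checks for the index first:
// Sketch: only read the index timestamp when an index is already there.
bool GenerateIndex = !IndexReader.IndexExists(azureDirectory);
DoFull = GenerateIndex;
DateTime LastModified = GenerateIndex
    ? DateTime.MinValue   // first run: treat every document as new
    : new DateTime(IndexReader.LastModified(azureDirectory), DateTimeKind.Utc);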
Once I determine the conditions of the index run, I have to open the index and get a reference to the container that holds the documents being indexed. I’m dealing with a single container and no folders, but in a production implementation I’d expect multiple containers and subfolders. This would require a bit of looping and recursion, but that code should be simple enough to add later:
// Open AzureDirectory, which contains the index
AzureDirectory azureDirectory = new AzureDirectory(storageAccount, "CloudIndex");
// Loop and fetch the information for each one.
// This needs to be managed for memory pressure,
// but for the sample I'll do all in one pass.
IndexWriter indexWriter = new IndexWriter(azureDirectory,
    new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_29),
    GenerateIndex, IndexWriter.MaxFieldLength.UNLIMITED);
// Get container to be indexed.
CloudBlobContainer Container = BlobClient.GetContainerReference("documents");
Container.CreateIfNotExist();
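As noted above, a production crawl would likely span multiple containers and virtual folders. A rough sketch of that outer loop, using a flat blob listing so "subfolders" don't require recursion, might look like this:
// Sketch: crawl every container, flattening the virtual folder structure.
BlobRequestOptions options = new BlobRequestOptions { UseFlatBlobListing = true };
foreach (CloudBlobContainer container in BlobClient.ListContainers())
{
    foreach (IListBlobItem item in container.ListBlobs(options))
    {
        // Same per-blob work as the loop shown later:
        // fetch attributes, compare timestamps, index if needed.
    }
}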
Using the AzureDirectory library lets me use an Azure Storage container as the directory for the Lucene.Net index without writing any storage code of my own, so I can focus solely on the code to crawl and index. For this article, the two most interesting parameters of the IndexWriter constructor are the Analyzer and the GenerateIndex flag. The Analyzer is responsible for taking the data passed to the IndexWriter, tokenizing it and creating the index. GenerateIndex is important because if it isn't set properly, the index will get overwritten on each pass and cause a lot of churn. Before getting to the code that does the indexing, I define a simple object to hold the content of the document:
public class DocumentToIndex
{
    public string Body;
    public string Name;
    public string Uri;
    public string Id;
}
As I loop through the container, I grab each blob reference and create an analogous DocumentToIndex object for it. Before adding the document to the index, I check whether it has been modified since the last indexing run by comparing its last-modified time to the LastModified time of the index that I grabbed at the start of the run. I'll also index it if the DoFull flag is true:
foreach (IListBlobItem currentBlob in Container.ListBlobs(options))
{
    CloudBlob blobRef = Container.GetBlobReference(currentBlob.Uri.ToString());
    blobRef.FetchAttributes(options);
    // Add doc to index if it is newer than index or doing a full index
    if (LastModified < blobRef.Properties.LastModifiedUtc || DoFull)
    {
        DocumentToIndex curBlob = GetDocumentData(currentBlob.Uri.ToString());
        //docs.Add(curBlob);
        AddToCatalog(indexWriter, curBlob);
    }
}
For this simple example, checking that the last-modified time of the index is earlier than that of the document works well enough. There is room for error, however: the index could have been updated by an optimization pass, making it look newer than a given document that was never actually indexed. I avoid that possibility by tying an optimization call to a full index run; in a real implementation you'd want to revisit this decision. Within the loop I call GetDocumentData to fetch the blob and AddToCatalog to add the data and fields I'm interested in to the Lucene.Net index. Within GetDocumentData, I use fairly typical code to fetch the blob and set a couple of properties for my representative object:
// Stream stream = File.Open(docUri, FileMode.Open);
var response = WebRequest.Create(docUri).GetResponse();
Stream stream = response.GetResponseStream();
// Can't open directly from URI, because it won't support seeking,
// so move it to a "local" memory stream
Stream localStream = new MemoryStream();
stream.CopyTo(localStream);
// Parse doc name
doc.Name = docUri.Substring(docUri.LastIndexOf(@"/") + 1);
doc.Uri = docUri;
Getting the body is a bit more work. Here, I set up a switch statement on the file extension and then use OpenXML to pull the contents out of the .docx (see Figure 2). OpenXML requires a stream that supports seek operations, so I can't use the response stream directly. Instead, I copy the response stream to a memory stream and work with that. Make a note of this operation: if the documents are exceptionally large, it could put memory pressure on the worker and would require somewhat fancier handling of the blob.
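If very large documents do become a concern, one option, not part of the sample, is to spill the blob to a temporary file instead of a MemoryStream; a FileStream also supports seeking:
// Sketch: copy the blob to a temp file to avoid holding it all in memory.
string tempPath = Path.GetTempFileName();
using (Stream responseStream = WebRequest.Create(docUri).GetResponse().GetResponseStream())
using (FileStream spill = File.Create(tempPath))
{
    responseStream.CopyTo(spill);
}
// Hand this seekable stream to the OpenXML SDK instead of the MemoryStream;
// delete the temp file once the document has been indexed.
Stream localStream = File.OpenRead(tempPath);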
Figure 2 Pulling Out the Contents of the .docx File
switch (doc.Name.Substring(doc.Name.LastIndexOf(".") + 1))
{
    case "docx":
        WordprocessingDocument wordprocessingDocument =
            WordprocessingDocument.Open(localStream, false);
        doc.Body = wordprocessingDocument.MainDocumentPart.Document.Body.InnerText;
        wordprocessingDocument.Close();
        break;
    // TODO: Still incomplete
    case "pptx":
        // Probably want to create a generic for DocToIndex and use it
        // to create a pptx-specific type that allows slide-specific indexing.
        PresentationDocument pptDoc = PresentationDocument.Open(localStream, false);
        foreach (SlidePart slide in pptDoc.PresentationPart.SlideParts)
        {
            // Iterate through slides
        }
        break;
    default:
        break;
}
My additional stub and comments show where to put the code to handle other formats. In a production implementation, I’d pull the code out for each document type and put it in a separate library of document adapters, then use configuration and document inspection to resolve the document type to the proper adapter library. Here I’ve placed it right in the switch statement.
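A hypothetical shape for that adapter library (the names here are mine, not part of the sample) might be:
// Hypothetical adapter abstraction; each document type gets its own implementation.
public interface IDocumentAdapter
{
    // Extensions this adapter understands, for example "docx" or "pptx".
    bool CanHandle(string extension);
    // Extract the indexable text from a seekable stream.
    string ExtractBody(Stream documentStream);
}

public class DocxAdapter : IDocumentAdapter
{
    public bool CanHandle(string extension)
    {
        return string.Equals(extension, "docx", StringComparison.OrdinalIgnoreCase);
    }

    public string ExtractBody(Stream documentStream)
    {
        using (WordprocessingDocument wordDoc = WordprocessingDocument.Open(documentStream, false))
        {
            return wordDoc.MainDocumentPart.Document.Body.InnerText;
        }
    }
}
The crawler would then resolve an adapter by extension (or by inspecting the document) instead of growing the switch statement.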
Now the populated DocumentToIndex object can be passed to AddToCatalog to get it into the index (Figure 3).
Figure 3 Passing DocumentToIndex to the AddToCatalog Method
public void AddToCatalog(IndexWriter indexWriter, DocumentToIndex currentDocument)
{
    Term deleteTerm = new Term("Uri", currentDocument.Uri);
    LuceneDocs.Document doc = new LuceneDocs.Document();
    doc.Add(new LuceneDocs.Field("Uri", currentDocument.Uri, LuceneDocs.Field.Store.YES,
        LuceneDocs.Field.Index.NOT_ANALYZED, LuceneDocs.Field.TermVector.NO));
    doc.Add(new LuceneDocs.Field("Title", currentDocument.Name, LuceneDocs.Field.Store.YES,
        LuceneDocs.Field.Index.ANALYZED, LuceneDocs.Field.TermVector.NO));
    doc.Add(new LuceneDocs.Field("Body", currentDocument.Body, LuceneDocs.Field.Store.YES,
        LuceneDocs.Field.Index.ANALYZED, LuceneDocs.Field.TermVector.NO));
    indexWriter.UpdateDocument(deleteTerm, doc);
}
I decided to index three fields: Title, Uri and Body (the actual content). Note that for Title and Body I use the ANALYZED flag, which tells the Analyzer to tokenize the content and store the tokens. I want that for the body especially, or my index will grow larger than all the documents combined. The Uri, by contrast, is set to NOT_ANALYZED. I want this field stored in the index directly, because it's a unique value by which I can retrieve a specific document; in fact, I use it in this method to create a Term (a construct used for finding documents) that's passed to the UpdateDocument method of the IndexWriter. Any other fields I might want to add to the index, whether to support a document preview or faceted searching (such as an author field), I'd add here, deciding whether to tokenize the text based on how I plan to use the field.
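For example, a hypothetical author field to support faceting could be added alongside the others (this assumes DocumentToIndex were extended with an Author property, which the sample doesn't have); NOT_ANALYZED keeps the value as a single token that can be filtered on exactly:
// Hypothetical faceting field; Author is not part of the sample's DocumentToIndex.
doc.Add(new LuceneDocs.Field("Author", currentDocument.Author,
    LuceneDocs.Field.Store.YES,
    LuceneDocs.Field.Index.NOT_ANALYZED,
    LuceneDocs.Field.TermVector.NO));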
Implementing the Search Service
Once I had the indexing service going and could see the segment files in the index container, I was anxious to see how well it worked. I cracked open the IService1.cs file for the search service and made some changes to the interfaces and data contracts. Because this file generates SOAP services by default, I decided to stick with those for the first pass. I needed a return type for the search results, but the document title and Uri were enough for the time being, so I defined a simple class to be used as the DataContract:
[DataContract]
public class SearchResult
{
    [DataMember]
    public string Title;
    [DataMember]
    public string Uri;
}
Using the SearchResult type, I defined a simple search method as part of the ISearchService ServiceContract:
[ServiceContract]
public interface ISearchService
{
    [OperationContract]
    List<SearchResult> Search(string SearchTerms);
}
Next, I opened SearchService.cs and added an implementation for the Search operation. Once again AzureDirectory comes into play: I instantiate a new one from the configuration and pass it to the IndexSearcher object. The AzureDirectory library not only provides a directory interface for Lucene.Net, it also adds an intelligent layer of indirection that caches and compresses; operations happen against local storage, and writes are moved to Azure Storage when committed, using compression to reduce latency and cost. At this point a number of Lucene.Net objects come into play. The IndexSearcher takes a Query object and searches the index. However, I only have a set of terms passed in as a string, so to get them into a Query object I have to use a QueryParser. I need to tell the QueryParser which fields the terms apply to and provide the terms themselves. In this implementation I'm only searching the content of the document:
// Open index
AzureDirectory azureDirectory = new AzureDirectory(storageAccount, "cloudindex");
IndexSearcher searcher = new IndexSearcher(azureDirectory);
// For the sample I'm just searching the body.
QueryParser parser = new QueryParser("Body", new StandardAnalyzer());
Query query = parser.Parse("Body:(" + SearchTerms + ")");
Hits hits = searcher.Search(query);
If I wanted to provide a faceted search, I'd need a way to select the field and build the query against that field, and that field would also have to be added to the index in the earlier code where Title, Uri and Body were added (a sketch follows the result loop below). The only thing left to do in the service is to iterate over the hits and populate the return list:
for (int idxResults = 0; idxResults < hits.Length(); idxResults++)
{
    SearchResult newSearchResult = new SearchResult();
    Document doc = hits.Doc(idxResults);
    newSearchResult.Title = doc.GetField("Title").StringValue();
    newSearchResult.Uri = doc.GetField("Uri").StringValue();
    retval.Add(newSearchResult);
}
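Coming back to the faceted-search idea mentioned above, a sketch of a field-specific query against an untokenized field (again assuming a hypothetical Author field had been added to the index) could use a TermQuery for an exact match:
// Sketch: exact-match filter on an untokenized field via a TermQuery.
Query authorQuery = new TermQuery(new Term("Author", "Joseph Fultz"));
Hits authorHits = searcher.Search(authorQuery);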
Because I’m a bit impatient, I don’t want to wait to finish the Web front end to test it out, so I run the project, fire up WcfTestClient, add a reference to the service and search for the term “cloud” (see Figure 4).
Figure 4 WcfTestClient Results
I’m quite happy to see it come back with the expected results.
Note that while I’m running the search service and indexing roles from the Azure compute emulator, I’m using actual Azure Storage.
Search Page
Switching to my front-end Web role, I make some quick modifications to the default.aspx page, adding a text box and some labels.
As Figure 5 shows, the most significant markup change is in the data grid, where I define databound columns for Title and Uri that should be available in the result set.
Figure 5 Defining Databound Columns for the Data Grid
I quickly add a reference to the search service project, along with a bit of code behind the Search button to call the search service and bind the results to the datagrid:
protected void btnSearch_Click(object sender, EventArgs e)
{
    refSearchService.SearchServiceClient searchService =
        new refSearchService.SearchServiceClient();
    IList<SearchResult> results = searchService.Search(txtSearchTerms.Text);
    gvResults.DataSource = results;
    gvResults.DataBind();
}
Simple enough, so with a quick tap of F5 I enter a search term to see what I get. Figure 6 shows the results. As expected, entering “neudesic” returned two hits from various documents I had staged in the container.
Figure 6 Search Results
Final Thoughts
I didn't cover the kinds of advanced topics you might need to address if the catalog and its related index grow large enough. Still, with Azure, Lucene.Net and a sprinkle of OpenXML, just about any searching requirement can be met. Because there isn't a lot of support yet for a cloud-deployed search solution, especially one that can respect a custom security implementation on top of Azure Storage, Lucene.Net may be the best option out there, as it can be bent to fit the requirements of the implementer.
Joseph Fultz is a software architect at AMD, helping to define the overall architecture and strategy for portal and services infrastructure and implementations. Previously he was a software architect for Microsoft working with its top-tier enterprise and ISV customers defining architecture and designing solutions.
Thanks to the following technical expert for reviewing this article: Tom Laird-McConnell