CDN Search, Part 1: New Search Engine
A few days ago we deployed a new search engine on all major CodeGear sites. At the heart of the engine is the Apache Lucene library, which I highly recommend. We’ll be providing more information about our overall implementation in the near future (update: John Kaster published an article about this), but I wanted to discuss some of the implementation details specific to GetPublished. There’s a lot to cover, so I’ll post this as a multi-part series.
The previous search engine was based on keywords rather than full text, and had several limitations:
- The keyword parser did not correctly handle Unicode text, particularly east-Asian languages.
- Lack of a full-text index prevented phrase and proximity searches, and did not allow correct scoring of search results.
- The use of multiple databases by CDN applications prevented unified searches across all web sites.
- It was slow.
Obviously, the engine needed to be replaced. We’ve looked at several search engines in the past, but didn’t find one that matched our needs. We started rewriting our parser, but decided to take another look at the latest version of Lucene - and were pleasantly surprised. Lucene seemed more than capable of handling our needs, and since it is open source, we knew we could tweak it if necessary.
The first problem we needed to solve was how to integrate the Lucene index into our applications. CDN applications are written in several languages (Delphi, Java, C#, and PHP), run on multiple platforms (Native Windows, ASP.NET, and Linux), and use several databases (InterBase, Blackfish SQL, Oracle, and Microsoft SQL Server). In addition, most CDN applications run on multiple load-balanced application servers.
One solution would have been to let Lucene access our databases directly. This would have involved setting up special tables that match the index structure, and maintaining application-specific information on the indexing server. For multiple applications and databases, this can become a serious headache. Instead, we wrapped the search engine in a web service. Each application already controls its content, and notifies the search engine of content changes - indexing new content, re-indexing updated content, and removing deleted content.
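To make the notification model concrete, here is a minimal sketch in Python. It is not the actual CDN web service: the class name, method names, and ID format are all assumptions made for illustration, and a plain dictionary stands in for the Lucene index. It also assumes that indexing and re-indexing can be treated as the same "replace whatever is stored under this ID" operation.

```python
from dataclasses import dataclass, field


@dataclass
class SearchService:
    """Hypothetical stand-in for the search web service (a dict plays the index)."""
    index: dict = field(default_factory=dict)

    def index_document(self, doc_id: str, fields: dict) -> None:
        # Indexing and re-indexing are treated as one operation here:
        # the latest content replaces whatever was stored under this ID.
        self.index[doc_id] = dict(fields)

    def remove_document(self, doc_id: str) -> None:
        # Removing an ID that was never indexed is a no-op in this sketch.
        self.index.pop(doc_id, None)

    def search(self, term: str) -> list:
        # Naive substring match, only so the effect of notifications is visible.
        term = term.lower()
        return sorted(
            doc_id
            for doc_id, fields in self.index.items()
            if any(term in str(value).lower() for value in fields.values())
        )


service = SearchService()
# New content, then an update to the same document, then a delete notification
# for a document that was never indexed.
service.index_document("blogs:1", {"title": "New search engine", "body": "Lucene"})
service.index_document("blogs:1", {"title": "New search engine", "body": "Lucene, updated"})
service.remove_document("blogs:2")
```

The point of the design is that the service knows nothing about any application's database schema; each application decides what to send and when.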
Next, we had to decide what constitutes a "document" in the index. A document is a single entry, identified by a unique ID, that can be returned as a search result. A single document can have multiple fields, but is included no more than once in search results. Each CDN application has a different concept of a document, which may or may not correspond to what users may consider a single piece of content. Here’s what we came up with:
- For blogs, a single blog post is a document.
- For CodeCentral, both a complete submission and a single source code file within a submission are considered documents. They are indexed using different IDs, so we can search for submissions, source code, or both.
- For GetPublished, a single version of an article is considered a document. GetPublished articles can have multiple versions, and GetPublished uses a complex set of rules to determine actual content visibility on its sites. Instead of indexing all possible visibility combinations, we index each version once, and let GetPublished convert search results into a set of articles visible to the user.
- For QualityCentral, a single report is a document.
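The document rules above can be sketched roughly as follows. The ID scheme (`application:kind:key`), the function names, and the shape of the visibility table are all hypothetical; the real GetPublished visibility rules are more complex than a single lookup. The sketch only shows two ideas: different document kinds get distinct IDs so they can be searched separately or together, and version-level hits are converted into a list of visible articles, each appearing at most once.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    doc_id: str   # unique ID of the indexed document
    score: float  # relevance score reported by the search engine


def make_doc_id(app: str, kind: str, key: str) -> str:
    # Hypothetical scheme: application, document kind, and the
    # application's own key, so different kinds never collide.
    return f"{app}:{kind}:{key}"


def filter_by_kind(hits, app, kind):
    # Restrict results to one document kind, e.g. only source code files.
    prefix = f"{app}:{kind}:"
    return [hit for hit in hits if hit.doc_id.startswith(prefix)]


def visible_articles(hits, visible_version):
    # visible_version maps article key -> the version this user may see
    # (an assumed simplification of the real visibility rules).
    articles = []
    version_hits = filter_by_kind(hits, "getpublished", "version")
    for hit in sorted(version_hits, key=lambda h: h.score, reverse=True):
        article, version = hit.doc_id.rsplit(":", 1)[1].split("/")
        if visible_version.get(article) == version and article not in articles:
            articles.append(article)
    return articles


hits = [
    Hit(make_doc_id("codecentral", "submission", "100"), 0.9),
    Hit(make_doc_id("codecentral", "file", "100/main.pas"), 0.7),
    Hit(make_doc_id("getpublished", "version", "42/1"), 0.8),
    Hit(make_doc_id("getpublished", "version", "42/2"), 0.6),
    Hit(make_doc_id("getpublished", "version", "43/1"), 0.5),
]

source_files = filter_by_kind(hits, "codecentral", "file")
articles = visible_articles(hits, {"42": "2", "43": "1"})
```

In this example, version 1 of article 42 scores highest but is not the version visible to the user, so article 42 is returned on the strength of version 2 instead.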
GetPublished uses DocAdapter to process submitted files and convert them into HTML articles. Among other things, DocAdapter extracts source code snippets from the submitted document and uses YAPP to automatically syntax-highlight them. Since DocAdapter already knew how to extract source code from an article, it was fairly simple to extend the code to return the original snippets to GetPublished, so they could be indexed by Lucene.
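As a rough illustration of the extract, highlight, and merge idea, here is a small Python sketch. It is not DocAdapter or YAPP: it assumes snippets arrive as `<pre>` blocks in already-converted HTML, and the `highlight` function is a placeholder that only escapes the text. The part that matters is that the original snippet text is kept alongside the highlighted HTML, so it can be handed back for indexing.

```python
import html
import re

# Assumption: code snippets are marked up as <pre>...</pre> in the input.
PRE_BLOCK = re.compile(r"<pre>(.*?)</pre>", re.DOTALL)


def highlight(snippet: str) -> str:
    # Placeholder for a real syntax highlighter: escape and wrap the snippet.
    return '<pre class="code">' + html.escape(snippet) + "</pre>"


def process(article_html: str):
    """Return (merged HTML, original snippets) for one article."""
    snippets = []

    def replace(match):
        original = html.unescape(match.group(1))
        snippets.append(original)    # keep the original text for indexing
        return highlight(original)   # merge the highlighted form into the HTML

    merged = PRE_BLOCK.sub(replace, article_html)
    return merged, snippets


article = "<p>Intro</p><pre>if a &lt; b then Exit;</pre>"
merged, snippets = process(article)
```

Indexing the plain snippet rather than the highlighted markup keeps the highlighter's tags and attributes out of the search index.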
Here is what happens when a user submits an article to GetPublished:
Roughly, the steps are:
1. GetPublished checks the submitted files and form fields and generates the necessary data for the article (for example, a unique article ID for a new article).
2. GetPublished then sends the content to DocAdapter for processing.
3. DocAdapter converts the content to HTML, and extracts all code snippets.
4. Each code snippet is stored, then sent to YAPP for syntax highlighting.
5. DocAdapter merges the syntax-highlighted snippets into the final HTML.
6. DocAdapter returns the merged HTML, all source code snippets, and any additional required information to GetPublished.
7. GetPublished indexes the content by calling the search web service.
8. GetPublished stores all data in the database.
You’ll notice I painted the line from GetPublished to the web service red. This is because there’s a problem here: the process I just described only works when there’s no error. If either step 7 (indexing) or step 8 (data storage) fails, we’re left with inconsistent data.
One possible solution would be to switch steps 7 and 8. After we store the data (and commit the transaction) successfully, we can safely call the web service, knowing we’re indexing valid content. This still doesn’t solve the problem of index failure: if the web service call then fails, we’re left with new content that’s not indexed.
Another option is to roll back the entire transaction if step 7 fails. This works in case of an indexing error, but doesn’t help us if indexing succeeded but the database commit failed. Once again, the database and index won’t match.
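The two failure modes are easy to reproduce in a toy model. The Python sketch below uses hypothetical `Database` and `Index` classes with switches that force a failure; neither resembles the real components. It runs steps 7 and 8 in both orders and reports what each side ends up holding.

```python
class StepFailed(Exception):
    pass


class Database:
    def __init__(self, fail_on_commit=False):
        self.rows = {}
        self.fail_on_commit = fail_on_commit

    def store(self, doc_id, content):
        # Store and commit in one call; a failed commit leaves no row behind.
        if self.fail_on_commit:
            raise StepFailed("database commit failed")
        self.rows[doc_id] = content


class Index:
    def __init__(self, fail_on_index=False):
        self.docs = {}
        self.fail_on_index = fail_on_index

    def add(self, doc_id, content):
        if self.fail_on_index:
            raise StepFailed("indexing failed")
        self.docs[doc_id] = content


def index_then_store(db, index, doc_id, content):
    index.add(doc_id, content)   # step 7
    db.store(doc_id, content)    # step 8


def store_then_index(db, index, doc_id, content):
    db.store(doc_id, content)    # step 8 first
    index.add(doc_id, content)   # then step 7


def run(order, db, index):
    # Run one ordering, swallow the failure, and report both sides' contents.
    try:
        order(db, index, "article-1", "text")
    except StepFailed:
        pass
    return sorted(db.rows), sorted(index.docs)


# Original order, commit fails: the index holds content the database lacks.
indexed_only = run(index_then_store, Database(fail_on_commit=True), Index())
# Swapped order, indexing fails: the database holds content the index lacks.
stored_only = run(store_then_index, Database(), Index(fail_on_index=True))
```

Whichever step runs second can fail after the first has already taken effect, so reordering only changes which side ends up ahead.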
Since we don’t have distributed transactions with the Lucene index, what we need is a way to tie an indexing request to a successful commit, and the ability to ensure the indexing request succeeds - or at least, notify an administrator if there’s a problem.
In my next post, I’ll describe how GetPublished accomplishes this task.
Posted by Yorai Aminov on September 21st, 2008 under CDN, GetPublished

