CDN Search, Part 2: Background Jobs
The previous post in this series described how GetPublished processes content and passes it to the Lucene index. I noted there was a problem with the final steps of the process, which were:
- GetPublished indexes the content by calling the search web service.
- GetPublished stores all data in the database.
The problem is that since we’re updating multiple data stores (the Lucene index and GetPublished’s database), we have no way of controlling the entire process as a single transaction. Writing a distributed transaction manager and extending both GetPublished and Lucene to use it wasn’t really an option, but we needed a way to make sure the database and the Lucene index matched.
When faced with complex design decisions, it helps to analyze the risks involved. Here’s my initial risk assessment. It estimates the risk of receiving incorrect search results for index-only and database-only failures (if both operations succeed or both fail, the database and index match, and we have no problem).
| | Index-only failure | Database-only failure |
| --- | --- | --- |
| New content | Medium | None |
| Changed content | Low | High |
| Deleted content | None | High |
The risks are based on the following assumptions:
- Only content in the database is "real" - this is what users see.
- GetPublished rarely modifies existing content - most edits generate new versions.
- All search results have to be filtered for visibility by GetPublished.
Based on these assumptions, we can determine that there’s generally little risk in index-only failures. That risk can be further mitigated by retrying the indexing operation later. On the other hand, database-only failures are usually a high risk, so we’d better find a way to avoid them. I therefore based the design on the following requirements:
- If we were unable to commit changes to the database, we must not update the index.
- If we were unable to update the index, we should retry later.
The first part is easy: all we have to do is make sure all processing is wrapped in a database transaction. We can index the content only if the transaction is successfully committed. By reading only committed data and running the indexing code outside the content processing transaction, we can make sure we’re only indexing successfully committed data.
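As a rough illustration of this ordering, here is a minimal sketch in Python with SQLite. It is not GetPublished's actual code: the `articles` table, the `store_and_index` function, and the `index_fn` callback (a stand-in for the call to the search web service) are all assumptions made for the example.

```python
import sqlite3


def store_and_index(conn, article_id, body, index_fn):
    """Store an article, then index it only if the commit succeeded.

    `index_fn` is a hypothetical stand-in for the search web service call.
    Returns True if the article was committed and indexing was attempted.
    """
    try:
        with conn:  # commits on success, rolls back if the block raises
            conn.execute(
                "INSERT INTO articles (id, body) VALUES (?, ?)",
                (article_id, body),
            )
    except sqlite3.Error:
        return False  # nothing was committed, so the index is left alone

    # Outside the transaction: re-read the committed row and index that.
    row = conn.execute(
        "SELECT id, body FROM articles WHERE id = ?", (article_id,)
    ).fetchone()
    index_fn(row[0], row[1])
    return True


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY, body TEXT)")

indexed = []  # records what the fake index service received


def record(doc_id, text):
    indexed.append((doc_id, text))


first = store_and_index(conn, 1, "hello", record)
duplicate = store_and_index(conn, 1, "again", record)  # primary key clash
```

The second call fails inside the transaction, so it never reaches the indexing step. Note that in this sketch the indexing call still blocks the caller, which is the part that remains unsolved at this point.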
The second part is the real problem. Once data is committed, we have to try to index it. But what if we can’t? We can’t keep trying indefinitely - we need to respond to user requests. So, we need to run the indexing request in another thread, one that can keep trying while we return control to the user.
Background threads are a particular problem in web applications. Code in a web application usually runs only for the duration of a single request. The web server software (IIS, in GetPublished’s case) can - and will - terminate the process, including any background threads, at any time. Additional complications arise from GetPublished’s distributed architecture and ASP.NET’s application model. GetPublished is designed to run on multiple servers concurrently (a web farm). ASP.NET applications run in application pools, which may be recycled (manually by administrators, or automatically by the server or external monitoring tools), terminating all running threads.
GetPublished solves this problem by implementing a system of background jobs. Instead of arbitrary threads, jobs are special classes that can be tracked and monitored by GetPublished. Since the application runs on multiple servers, and can be restarted at any time, jobs are stored in the database. When GetPublished starts, it creates a job monitoring thread, which periodically checks the database and starts job threads as necessary.
Requests that require processing that may take longer than a normal web request can use background jobs. Instead of spawning worker threads, requests can create job records in the database. The job monitoring thread will then pick up the job and spawn a worker thread the next time it checks the database. GetPublished even lets administrators monitor, stop, and restart running jobs.
Using background jobs, we can now change the article submission process:
The indexing job is created as part of the same transaction that stores the content. This means the job will only run if the data has been successfully committed. The job calls the web service to index the content, and can safely retry the operation if necessary without blocking the user.
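To make the two halves concrete, here is a small sketch under the same assumptions as before (Python and SQLite, with hypothetical table and function names). `submit_article` writes the content and the job record in one transaction, and `run_index_job` shows a bounded retry loop around a simulated web service call; the retry limit and delay are arbitrary choices for the example.

```python
import sqlite3
import time


def submit_article(conn, article_id, body):
    """Store the article and queue its indexing job in one transaction."""
    with conn:  # both rows are committed together, or neither is
        conn.execute(
            "INSERT INTO articles (id, body) VALUES (?, ?)", (article_id, body)
        )
        conn.execute(
            "INSERT INTO jobs (kind, article_id) VALUES ('index', ?)",
            (article_id,),
        )


def run_index_job(index_fn, article_id, max_attempts=5, delay=0.0):
    """Call the (hypothetical) index service until it succeeds or we give up."""
    for attempt in range(1, max_attempts + 1):
        try:
            index_fn(article_id)
            return attempt  # number of attempts it took
        except ConnectionError:
            time.sleep(delay)  # a real job would wait longer between tries
    return None  # still failing; leave the job for a later run


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY, body TEXT)")
conn.execute(
    "CREATE TABLE jobs (id INTEGER PRIMARY KEY, kind TEXT, article_id INTEGER)"
)

submit_article(conn, 1, "first post")
try:
    submit_article(conn, 1, "duplicate id")  # fails, so no second job either
except sqlite3.IntegrityError:
    pass
job_count = conn.execute("SELECT COUNT(*) FROM jobs").fetchone()[0]

calls = []


def flaky_index(article_id):
    """Simulated web service that is unreachable for the first two calls."""
    calls.append(article_id)
    if len(calls) < 3:
        raise ConnectionError("search service unavailable")


attempts = run_index_job(flaky_index, 1)
```

The retry loop runs in the job's worker thread, so the user's request returns as soon as the transaction commits.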
So far, we’ve covered indexing. That’s only half the job of a search engine. Next time, I’ll describe the actual search.
Posted by Yorai Aminov on September 23rd, 2008 under CDN, GetPublished