CDN Search, Part 3: Search Results
Previously on this blog: we deployed a new search engine and used background jobs to index content.
When you run a search query on CDN, you’ll see a line of text just below your search results that looks something like this:
Query processed in 375ms (31.25ms server/network, 343.75ms database)
This text describes exactly how long it took to prepare the query, run it, retrieve all information required for display from the database, check the visibility of every result, sort the results according to the user’s preference, and prepare the set of results in the current page for display.
The times I quoted above came from a run of this query, which searches all indexed sites for C++ source code that contains the word "TClientDataSet". The text is slightly inaccurate: although the total processing time was indeed 375ms, there are other valid methods of counting the other times.
The "server/network" time represents the time it took to send the query to Lucene, get the results from the web service, and convert them to classes and records GetPublished can process. Over 95% of that time is used for transferring the results over the network. The rest is taken by the actual search engine on the Lucene server and the result processing code on the server running GetPublished.
Almost everything else GetPublished does with the results involves the database, so we say the rest of the time belongs to "database". However, for large result sets the Lucene engine processing time - and certainly network transfer time - will be longer. GetPublished actually uses that time to execute additional queries, by running the code in multiple threads.
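The arithmetic behind the status line can be sketched in a few lines. This is a minimal Python illustration, not GetPublished’s code: the function name `format_timing` and its parameters are assumptions. It attributes everything that is not web service time to "database", which is the simplification described above.

```python
def format_timing(total_ms: float, service_ms: float) -> str:
    """Build the status line, attributing all non-service time to the database."""
    database_ms = total_ms - service_ms
    return (f"Query processed in {total_ms:g}ms "
            f"({service_ms:g}ms server/network, {database_ms:g}ms database)")


status_line = format_timing(375, 31.25)
print(status_line)
```

Because the database work partly overlaps the web service call, a subtraction like this is only one of the valid ways to split the total.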
Here’s the actual processing log from GetPublished for the same query:
Search expression started
Getting list of sites to include in the search
Search includes one or more GetPublished sites
Retrieving external sites
Search includes one or more external sites
Building Lucene query
Lucene query: +appid:(gp blogs blogsteamb cc qc) AND (cpp.source:(+TClientDataSet))
Search sites: 1,5,7,9,10,11,12
Search thread starting
Retrieving language ID
Retrieving site names
Retrieving visible version IDs
Retrieving staged visible version IDs
Retrieving welcome content types for welcome page queries
Welcome content filter: IN (26,314,322)
Query completed in 31.25ms
The web service returned in 31.25ms
The search engine ran the query in 0ms
Total number of results: 59
Sorting results by AppID and version ID
Adding information for live versions in search results
Adding information for staged versions in search results
Removing versions that are not visible to the user
Updated number of items in search results: 59
Sorting results for welcome page processing
Query ran before retrieving live welcome versions - retrieving now
Live welcome versions:
Language ID: 1
Country ID: 239
Removing welcome pages that are not visible to the user
Final number of items in search results: 59
Parsing dates for sorting
Sorting results
Creating result list
Updating site names for visible results
Creating article headers for display and loading additional information
Search expression processing completed in 375ms
As you can see, GetPublished does a lot more than just run the search query. The reason it does so is that it needs to convert the search results - a list of version records - to a list of articles the user can see. Here are some of the things that affect the visibility of articles and versions:
- Articles can have multiple versions, but only one version per article can be "live" (that is, visible to users).
- Articles can be mapped to multiple sites, and have different publishing and expiration dates on each site.
- GetPublished supports "staging sites", which use a different set of versions.
- Articles are filtered based on the user’s preferred language.
- Certain articles may only be visible in specific countries or regions.
- Articles of a special type, known as a "welcome page", are shown on specific site areas instead of in article lists. On such pages, only one "welcome page" can be visible.
The list of results returned by the search web service is trimmed and expanded based on the search criteria, user preferences, and visibility settings:
- Versions that are not visible to the user are removed from the list.
- If an article is mapped to multiple sites that are included in the search, additional result records are created for each site.
- Welcome pages that are mapped to multiple areas are stored once per site, but all valid links are stored in the result record to be displayed later.
- Of multiple welcome pages mapped to a single area, only the one visible to the user is kept.
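The first two rules in this list (drop invisible versions, then emit one result per matching site) can be sketched like this. It is an illustration only: the `Hit` and `Result` types, the function name, and the shape of the inputs are all assumptions, and the welcome page handling is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    version_id: int   # one search engine document = one article version
    article_id: int


@dataclass(frozen=True)
class Result:
    article_id: int
    version_id: int
    site_id: int


def to_visible_results(hits, live_versions, article_sites, search_sites):
    """Drop versions the user cannot see, then emit one result per matching site.

    live_versions: set of version IDs visible to the user (hypothetical input).
    article_sites: dict mapping article ID -> list of site IDs it is mapped to.
    search_sites:  set of site IDs included in the search.
    """
    results = []
    for hit in hits:
        if hit.version_id not in live_versions:
            continue  # this version is not visible to the user
        for site_id in article_sites.get(hit.article_id, []):
            if site_id in search_sites:
                results.append(Result(hit.article_id, hit.version_id, site_id))
    return results


hits = [Hit(101, 1), Hit(102, 1), Hit(201, 2)]
results = to_visible_results(
    hits,
    live_versions={102, 201},
    article_sites={1: [1, 5], 2: [5, 9]},
    search_sites={1, 5},
)
```

Here version 101 is dropped, article 1 yields two results (sites 1 and 5), and article 2 yields one (site 5 only, since site 9 is not part of the search).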
Many of these checks can run concurrently. For example, if the search engine takes a while to process the query, GetPublished retrieves all visible articles and welcome pages from the database. If the engine returns quickly, GetPublished only checks the database for versions that are included in the search results. Similarly, GetPublished tries to reduce the working result set as much as possible. For example, when sorting by site name, GetPublished has to set the site name of every result before sorting and paging. When sorting by other fields, GetPublished waits until the results are sorted and only sets the site name for results that are going to be displayed.
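The site name example can be sketched as follows. This is a simplified Python illustration under my own assumptions: results are plain dictionaries, `site_names` stands in for a database lookup, and the site names in the usage lines are made up.

```python
def page_of_results(results, sort_key, site_names, page, page_size):
    """Sort and page results, looking up site names as late as possible.

    results:    list of dicts with at least 'site_id' and the sort field.
    site_names: dict of site ID -> name; stands in for a database lookup.
    """
    if sort_key == "site_name":
        # Sorting by site name needs the name on every result up front.
        for r in results:
            r["site_name"] = site_names[r["site_id"]]
    ordered = sorted(results, key=lambda r: r[sort_key])
    start = page * page_size
    visible = ordered[start:start + page_size]
    for r in visible:
        # For any other sort order, only the displayed page gets a name.
        if "site_name" not in r:
            r["site_name"] = site_names[r["site_id"]]
    return visible


results = [
    {"title": "b", "site_id": 5},
    {"title": "a", "site_id": 1},
    {"title": "c", "site_id": 5},
]
names = {1: "Site One", 5: "Site Five"}  # hypothetical names
first_page = page_of_results(results, "title", names, page=0, page_size=2)
```

When sorting by title, the third result never gets a site name because it is not on the displayed page.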
Posted by Yorai Aminov on September 24th, 2008 under CDN, GetPublished
CDN Search, Part 2: Background Jobs
The previous post in this series described how GetPublished processes content and passes it to the Lucene index. I noted there was a problem with the final steps of the process, which were:
- GetPublished indexes the content by calling the search web service.
- GetPublished stores all data in the database.
The problem is that since we’re updating multiple data stores (the Lucene index and GetPublished’s database), we have no way of controlling the entire process as a single transaction. Writing a distributed transaction manager and extending both GetPublished and Lucene to use it wasn’t really an option, but we needed a way to make sure the database and the Lucene index matched.
When faced with complex design decisions, it helps to analyze the risks involved. Here’s my initial risk assessment. It estimates the risk of receiving incorrect search results for index-only and database-only failures (if both operations succeed or both fail, the database and index match, and we have no problem).
| | Index-only failure | Database-only failure |
| New content | Medium | None |
| Changed content | Low | High |
| Deleted content | None | High |
The risks are based on the following assumptions:
- Only content in the database is "real" - this is what users see.
- GetPublished rarely modifies existing content - most edits generate new versions.
- All search results have to be filtered for visibility by GetPublished.
Based on these assumptions, we can determine that there’s generally little risk in index failures. That risk can be further mitigated by retrying to index later. On the other hand, database-only failures are usually a high risk, so we’d better find a way to avoid them. I therefore based the design on the following requirements:
- If we were unable to commit changes to the database, we must not update the index.
- If we were unable to update the index, we should retry later.
The first part is easy: all we have to do is make sure all processing is wrapped in a database transaction. Only if the transaction is successfully committed can we index the content. By only reading committed data and running the indexing code outside the content processing transaction, we can make sure we’re only indexing successfully committed data.
The second part is the real problem. Once data is committed, we have to try to index it. But what if we can’t? We can’t keep trying indefinitely - we need to respond to user requests. So, we need to run the indexing request in another thread, one that can keep trying while we return control to the user.
Background threads are a particular problem in web applications. Web applications usually only live for the duration of a single request. The web server software (IIS, in GetPublished’s case) can - and will - terminate the process, including background threads, at any time. Additional complications arise from GetPublished’s distributed architecture and ASP.NET’s application model. GetPublished is designed to run on multiple servers concurrently (a web farm). ASP.NET applications run in application pools, which may be recycled (manually by administrators, or automatically by the server or external monitoring tools), terminating all running threads.
GetPublished solves this problem by implementing a system of background jobs. Instead of arbitrary threads, jobs are special classes that can be tracked and monitored by GetPublished. Since the application runs on multiple servers, and can be restarted at any time, jobs are stored in the database. When GetPublished starts, it creates a job monitoring thread, whose job is to periodically check the database and start job threads as necessary.
Requests that require processing that may take longer than a normal web request can use background jobs. Instead of spawning worker threads, requests can create job records in the database. The job monitoring thread will then pick up the request and spawn a worker thread next time it checks the database. GetPublished even lets administrators monitor, stop, and restart running jobs.
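A single pass of such a job monitor might look roughly like this. It is a Python sketch under stated assumptions: `JobStore` is an in-memory stand-in for the jobs table, and every name in it is hypothetical rather than taken from GetPublished. A failed job simply goes back to "pending", so the next check retries it.

```python
import threading


class JobStore:
    """In-memory stand-in for the jobs table (an assumption for this sketch)."""

    def __init__(self):
        self._lock = threading.Lock()
        self.jobs = {}          # job ID -> {"status": ..., "work": callable}
        self._next_id = 1

    def add(self, work):
        with self._lock:
            job_id = self._next_id
            self._next_id += 1
            self.jobs[job_id] = {"status": "pending", "work": work}
            return job_id

    def claim_pending(self):
        """Mark every pending job as running and return their IDs."""
        with self._lock:
            claimed = [i for i, j in self.jobs.items() if j["status"] == "pending"]
            for i in claimed:
                self.jobs[i]["status"] = "running"
            return claimed


def run_job(store, job_id):
    job = store.jobs[job_id]
    try:
        job["work"]()
        job["status"] = "done"
    except Exception:
        job["status"] = "pending"   # leave it for the next check to retry


def check_jobs(store):
    """One pass of the monitor: start a worker thread per claimed job."""
    threads = [threading.Thread(target=run_job, args=(store, i))
               for i in store.claim_pending()]
    for t in threads:
        t.start()
    return threads


attempts = []


def flaky_index():
    """Pretend indexing call that fails on its first attempt."""
    attempts.append(1)
    if len(attempts) < 2:
        raise RuntimeError("search service unavailable")


store = JobStore()
job_id = store.add(flaky_index)
for _ in range(2):                  # a real monitor would loop on a timer
    for t in check_jobs(store):
        t.join()
```

Keeping the job state outside the worker thread is the point: if the process is recycled, the record survives and any server in the farm can pick it up again.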
Using background jobs, we can now change the article submission process.
The indexing job is created as part of the same transaction that stores the content. This means the job will only run if the data has been successfully committed. The job calls the web service to index the content, and can safely retry the operation if necessary without blocking the user.
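The "job in the same transaction" idea can be sketched with SQLite. This is an illustration only: the table names, columns, and functions are assumptions, and GetPublished itself does not use SQLite. The job row and the content row are written in one transaction, so a failed commit leaves neither behind.

```python
import sqlite3


def open_db():
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE versions (id INTEGER PRIMARY KEY, body TEXT NOT NULL)")
    db.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, version_id INTEGER, status TEXT)")
    db.commit()
    return db


def submit_version(db, version_id, body):
    """Store the content and queue its indexing job in one transaction."""
    try:
        with db:  # commits on success, rolls back if anything raises
            # The order of the two inserts does not matter:
            # both rows are committed together, or neither is.
            db.execute("INSERT INTO jobs (version_id, status) VALUES (?, 'pending')",
                       (version_id,))
            db.execute("INSERT INTO versions (id, body) VALUES (?, ?)",
                       (version_id, body))
        return True
    except sqlite3.Error:
        return False


def pending_jobs(db):
    return [row[0] for row in db.execute(
        "SELECT version_id FROM jobs WHERE status = 'pending' ORDER BY id")]


db = open_db()
ok = submit_version(db, 1, "<p>first article</p>")
failed = submit_version(db, 1, "<p>duplicate ID, violates the primary key</p>")
```

The second submission fails while storing the content, so its job row is rolled back as well and only one indexing job is left pending.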
So far, we’ve covered indexing. That’s only half the job of a search engine. Next time, I’ll describe the actual search.
Posted by Yorai Aminov on September 23rd, 2008 under CDN, GetPublished
CDN Search, Part 1: New Search Engine
A few days ago we deployed a new search engine on all major CodeGear sites. At the heart of the engine is the Apache Lucene library, which I highly recommend. We’ll be providing more information about our overall implementation in the near future (update: John Kaster published an article about this), but I wanted to discuss some of the implementation details specific to GetPublished. There’s a lot to cover, so I’ll post this as a multi-part series.
The previous search engine was based on keywords rather than full text, and had several limitations:
- The keyword parser did not correctly handle Unicode text, particularly east-Asian languages.
- Lack of a full-text index prevented phrase and proximity searches, and did not allow correct scoring of search results.
- The use of multiple databases by CDN applications prevented unified searches across all web sites.
- It was slow.
Obviously, the engine needed to be replaced. We’ve looked at several search engines in the past, but didn’t find one that matched our needs. We started rewriting our parser, but decided to take another look at the latest version of Lucene - and were pleasantly surprised. Lucene seemed more than capable of handling our needs, and since it is open source, we knew we could tweak it if necessary.
The first problem we needed to solve was how to integrate the Lucene index into our applications. CDN applications are written in several languages (Delphi, Java, C#, and PHP), run on multiple platforms (Native Windows, ASP.NET, and Linux), and use several databases (InterBase, Blackfish SQL, Oracle, and Microsoft SQL Server). In addition, most CDN applications run on multiple load-balanced application servers.
One solution would have been to let Lucene access our databases directly. This would have involved setting up special tables that match the index structure, and maintaining application-specific information on the indexing server. For multiple applications and databases, this can become a serious headache. Instead, we wrapped the search engine in a web service. Each application already controls its content, and notifies the search engine of content changes - indexing new content, re-indexing updated content, and removing deleted content.
Next, we had to decide what constitutes a "document" in the index. A document is a single entry, identified by a unique ID, that can be returned as a search result. A single document can have multiple fields, but is included no more than once in search results. Each CDN application has a different concept of a document, which may or may not correspond to what users may consider a single piece of content. Here’s what we came up with:
- For blogs, a single blog post is a document.
- For CodeCentral, both a complete submission and a single source code file within a submission are considered documents. They are indexed using different IDs, so we can search for submissions, source code, or both.
- For GetPublished, a single version of an article is considered a document. GetPublished articles can have multiple versions, and GetPublished uses a complex set of rules to determine actual content visibility on its sites. Instead of indexing all possible visibility combinations, we index each version once, and let GetPublished convert search results into a set of articles visible to the user.
- For QualityCentral, a single report is a document.
GetPublished uses DocAdapter to process submitted files and convert them into HTML articles. Among other things, DocAdapter extracts source code snippets from the submitted document and uses YAPP to automatically syntax-highlight them. Since DocAdapter already knew how to extract source code from an article, it was fairly simple to extend the code to return the original snippets to GetPublished, so they could be indexed by Lucene.
Here is what happens when a user submits an article to GetPublished. Roughly, the steps are:
1. GetPublished checks the submitted files and form fields and generates the necessary data for the article (for example, a unique article ID for a new article).
2. GetPublished then sends the content to DocAdapter for processing.
3. DocAdapter converts the content to HTML, and extracts all code snippets.
4. Each code snippet is stored, then sent to YAPP for syntax highlighting.
5. DocAdapter merges the syntax-highlighted snippets into the final HTML.
6. DocAdapter returns the merged HTML, all source code snippets, and any additional required information to GetPublished.
7. GetPublished indexes the content by calling the search web service.
8. GetPublished stores all data in the database.
You’ll notice I painted the line from GetPublished to the web service red. This is because there’s a problem here: the process I just described only works when there’s no error. If either step 7 (indexing) or step 8 (data storage) fails, we’re left with inconsistent data.
One possible solution would be to switch steps 7 and 8. After we store the data (and commit the transaction) successfully, we can safely call the web service, knowing we’re indexing valid content. This still doesn’t solve the problem of index failure: if the web service call then fails, we’re left with new content that’s not indexed.
Another option is to roll back the entire transaction if step 7 fails. This works in case of an indexing error, but doesn’t help us if indexing succeeded but the database commit failed. Once again, the database and index won’t match.
Since we don’t have distributed transactions with the Lucene index, what we need is a way to tie an indexing request to a successful commit, and the ability to ensure the indexing request succeeds - or at least, notify an administrator if there’s a problem.
In my next post, I’ll describe how GetPublished accomplishes this task.
Posted by Yorai Aminov on September 21st, 2008 under CDN, GetPublished
PageRequestManagerServerErrorException in FireFox
One of the problems we had to face when adding event support to CDN was the additional time it takes to check if there are any events scheduled in days displayed in the navigation calendar. To improve client-side performance, I used an UpdatePanel and a Timer to delay loading that information. The calendar is first rendered without highlighting days with events, but when the Timer’s Tick event is triggered, the server searches for events in the specified time range and refreshes the calendar.
A lot of us on the CDN team use FireFox 3 as our default browser, and we noticed something weird. If we navigated away from a CDN page before the calendar had been updated, we’d get the following error:
Sys.WebForms.PageRequestManagerServerErrorException: An unknown error occurred while processing the request on the server. The status code returned from the server was: 0
Now, the error message is misleading. This isn’t really a server-side error, but a client-side error. You can trap ASP.NET AJAX client-side errors using the PageRequestManager class, but I didn’t want to ignore other errors. I ended up adding the following script to the page:
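<script type="text/javascript">
// Skip browsers that expose document.all (Internet Explorer at the time).
if (!document.all) {
    window.onbeforeunload = function() {
        // Register the handler only once the page is about to unload.
        Sys.WebForms.PageRequestManager.getInstance().add_endRequest(endRequest);
    };
}
function endRequest(sender, e) {
    // Declared with var so it does not leak into the global scope.
    var err = e.get_error();
    // Mark only this specific error as handled; other errors still surface.
    if (err && err.name == "Sys.WebForms.PageRequestManagerServerErrorException") {
        e.set_errorHandled(true);
    }
}
</script>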
The script adds an event handler for the window’s beforeunload event, which FireFox supports and - fortunately - runs before the error (the unload and pagehide events run too late). The handler registers a handler for the endRequest event, which is raised after the AJAX request is finished and can be used to handle errors. Since at this point we’re already navigating away from the page, it should be safe to simply ignore the error.
A Google search returned only one mention of this problem, but no solutions. It was mentioned that this problem might have been fixed in .NET 3.5, but unfortunately most of our ASP.NET applications are still on 2.0.
Posted by Yorai Aminov on July 30th, 2008 under .NET, AJAX, JavaScript
Events on the Developer Network
We’ve just deployed another major feature on CDN - support for events. The Developer Network already had an events system, EventCentral, but it became difficult to maintain and enhance it. In particular, we wanted to add the rich document support and role-based workflow features of GetPublished. By integrating this functionality into GetPublished, events also gained support for other GetPublished features, such as multi-site support, country-specific content, mapping events to multiple areas, and tagging. As with EventCentral, the new calendar is open to the public: you can post any event you think is useful to the developer community using GetPublished.
The "Events on the Developer Network" article on CDN provides a complete overview of this new functionality, but I wanted to mention some of the other features and improvements we added either to support this functionality or just while working on it:
- We already had product information in GetPublished, which is used to display shop links. We now use this information for events, and have added icons for most products. You can expect more product-related functionality in the future.
- When displaying calendars and calculating date ranges for events, we needed to know the first day of the week in the user’s location. This is now a setting in GetPublished’s location record. Right now, most locations have their first day of the week set to Monday. If this isn’t correct for your location, let us know.
- It is now possible to easily copy an article or event simply by clicking a "Copy Article" button in GetPublished’s "Edit Article" page. This one took us a while to implement, partly because articles have lots of associated data to copy, but mostly because articles can also have attachments and embedded images. We needed to correctly parse the HTML and fix all the links without damaging the markup.
- Users can now set their preferred time zone in their account settings, or directly on CDN. This allows us to display events in the user’s time zone. We wrote some classes for listing and converting times between time zones that work in both .NET and Win32, so we’ll be able to support this functionality in other CDN applications as needed.
- At some point, the borland.com links to our sites will probably stop working. We’ve expanded our redirection logic to support permanent redirects to let search engines know the old URLs should no longer be used.
- We’ve also made some significant performance improvements. First, we moved the database to a new machine with more memory. The old machine was having a hard time coping with some of our more complicated queries, and this was slowing down all of the sites. Second, we added lots of caching for frequently accessed data. This meant logging certain data changes so we can determine whether a cache has gone stale. Since we already had an audit log for certain critical data elements, we used the same logic to track changes in cached data.
- Even with these performance improvements, events do require more data to be retrieved from the database. We now use AJAX to delay loading certain elements, so users can see the page without waiting for all queries to finish. The event navigation calendar, for example, highlights days with events, but this is handled in a separate AJAX request that is only sent after the full page has loaded.
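To show why the first day of the week matters for date ranges, here is a small Python sketch of the week calculation. The function name and the numbering convention are assumptions of mine, not GetPublished’s location record format.

```python
from datetime import date, timedelta


def week_range(day: date, first_weekday: int):
    """Return (start, end) of the week containing `day`.

    first_weekday follows date.weekday(): 0 = Monday ... 6 = Sunday.
    """
    offset = (day.weekday() - first_weekday) % 7
    start = day - timedelta(days=offset)
    return start, start + timedelta(days=6)


d = date(2008, 7, 30)            # a Wednesday
monday_week = week_range(d, 0)   # week starting on Monday
sunday_week = week_range(d, 6)   # week starting on Sunday
```

For the same day, a Monday-first location gets July 28 to August 3, while a Sunday-first location gets July 27 to August 2, so the calendar queries a different range of events.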
Posted by Yorai Aminov on July 29th, 2008 under CDN, GetPublished
Tagging
Another feature, another article:
We’ve wanted to support tagging for a long time, but the CDN team is pretty small and a lot of other stuff kept getting in the way. We did support bookmarking on other sites using the "Share This" module, but now we have our own internal tagging system.
Tagging is an interesting concept. It can be a great way for users to share and find information, but the lack of a controlled vocabulary may generate a low signal-to-noise ratio. Another problem is tag spam, which we hope to reduce by requiring users to log in before they can add tags.
We’ve also taken the tagging concept a couple of steps further. First, most tagging systems treat tags as simple strings. We’ve added options for localization, using the same model as other content on the site. Second, we’ve implemented a sort of open content model for tags. Any user can edit and translate any tag, fix typos, or remove inappropriate language.
I’d be very interested to know what you think about this new feature. If you have anything to say about the concepts, the user interface, or the implementation, please leave a comment here or post a message to the borland.public.bdn.website newsgroup.
Posted by Yorai Aminov on May 30th, 2008 under CDN
Two Years Old
The ASP.NET version of what was then the Borland Developer Network went live two years ago today (actually, it first went online about a week earlier, but some DNS issues forced us to roll back). The original version, called "BDN 2" at the time, replicated almost all of the original community site’s content and functionality (conference proceedings were migrated a little later). It also included a new content management system, GetPublished, named after a section of the old site that allowed community members to submit content (and get paid), and a new membership system.
The most significant improvements of the new system were:
- Support for Unicode, multi-lingual content, and localized user interface.
- Ability to upload Word and HTML files and have the server automatically generate thumbnails, tables of contents, and printer-friendly output.
- Completely web-based content management system.
- Integration with other Developer Network services, such as membership and comments.
- Electronic signatures for legal agreements.
- Role-based security and workflow for content submission and publishing.
Internally, the biggest change was probably having full control over the site’s appearance and behavior - after all, we now had the source code to everything. The new system was also load-balanced on several servers, improving performance and reliability.
It’s really interesting to work on a live site. Unlike a "normal" application, you don’t have the luxury of building everything in the background, freezing code for testing, and deploying new versions every now and then. A live site needs constant updating, and we regularly add new features and capabilities. We haven’t really kept a detailed change log (hey, that’s not a bad idea), but here are some of the major features we’ve added and changes we made over the last couple of years. Many of these are internal, and some are only visible to administrators or other users with sufficient privileges, but all affect the way the site looks and behaves:
- Online reviewing of unpublished content.
- Online negotiations for paid articles.
- Support for multiple sites managed completely within GetPublished. The CodeGear, Support, Turbo Explorer, Conferences, and TeamB sites all run on this platform, in addition to CDN.
- Support for multiple URLs for each site.
- Content-specific site areas, such as CDN TV.
- Automatically generated site maps.
- Article list sorting and paging.
- Keyword search.
- Automatic syntax highlighting and source code language filtering.
- RSS and Atom feeds.
- Exception trapping and email notifications for server errors.
- Dynamically loaded modules that can be placed (almost) anywhere on the page.
- Support for posting external links, not just articles.
- Virtually unlimited depth for the path hierarchy, replacing the old community/neighborhood/street model.
- Support for user locations, including location-specific content.
- Upgrade to ASP.NET 2.0 on 64-bit servers.
- AJAX controls and dialogs in GetPublished.
- Email notifications for publishing workflow events.
- Ability to consume RSS and Atom feeds (for example, the "CodeCentral Items" box on http://dn.codegear.com/).
- Support for static "micro" sites, similar to regular static sites, but managed online using GetPublished.
- The ability to stage entire sites, including content, appearance, and modules. This was used to stage the new www.codegear.com site.
- Management of product lists and links, used to generate location-specific shop links on product pages.
- Support for feature lists on product and article pages.
Some of these features were live long before they were actually used (for example, modules could be placed on the page banner since June last year, but this wasn’t used until the latest site update earlier this month). In fact, there are still some features you won’t know about until they’re used. These features have been written, tested, and deployed, but unless you have sufficient permissions to access them in GetPublished, you’ll have to wait for new content to make use of them. I realize that sounds like teasing, but my point is we’re constantly adding features, and sometimes it takes a while for these things to become useful.
Posted by Yorai Aminov on May 24th, 2008 under CDN, GetPublished
Help Insight
A user in the newsgroups asked how to insert line breaks in Delphi’s XML documentation comments, used by HelpInsight to display information about types and members. This area has very little documentation, but as it turns out Delphi uses the same syntax as other .NET languages (so it can display HelpInsight for .NET assemblies created by other tools), which is documented by Microsoft:
http://msdn.microsoft.com/en-us/library/b2s063f7.aspx
So, to insert a line break in the comment, use the <para> element.
Posted by Yorai Aminov on May 24th, 2008 under Delphi
Add a Feature, Write an Article
I just implemented support for client-side caching for the CodeGear sites. Here’s how it works:
Enabling Client-Side Caching of Generated Content in ASP.NET.
Posted by Yorai Aminov on May 6th, 2008 under .NET, Delphi
Document Adapters
One of the core elements of GetPublished is the Document Adapter, or DocAdapter. DocAdapter is a set of web services and .NET assemblies for converting rich documents into standard HTML articles that can be displayed on the CodeGear Developer Network sites. By using DocAdapter, we can accept articles in multiple formats, and ensure we get valid HTML that works with the site’s overall look and feel.
DocAdapter is accessible in two ways: as a set of web services and as a set of .NET assemblies/Delphi packages. The web services and assemblies reference each other (the web service calls the assemblies to perform its tasks, and the assemblies can reference the web service to perform their tasks remotely). Because the web service and the assemblies know each other, we don’t have to worry about type matching (making sure the types used by the web service match those used by the assemblies).
Client applications can use either technology, or both. GetPublished, for example, uses the DocAdapter web service to perform the conversion from the source format to XHTML, and the CDN.Documents assembly to generate HTML, thumbnails, and other document elements.
Format Conversion
The CDN.DocumentConverters assembly (and its web service wrapper, cleverly titled DocAdapterService) contains conversion classes capable of reading files in several formats. The classes convert document text to XHTML, which is a convenient format for additional processing. Depending on the format, they can also extract additional data. For example, the Word conversion class extracts embedded images from Word documents, stores them as separate files, and creates <img> elements in the XHTML that refer to these files. A conversion class is any implementation of the IDocumentImport interface:
IDocumentImport = interface
  procedure Import(inputStream: Stream; docAdapter: DocumentAdapter; extractFields: Boolean);
end;
Because we’re using an interface, the DocumentAdapter class doesn’t need to know anything about the conversion class other than the fact it implements the interface. This means we can implement converters without recompiling the CDN.Documents assembly, and add them as plug-ins to the calling application.
The WebServiceConversion class is a special implementation of the IDocumentImport interface that calls DocAdapter web services. All DocAdapter services are based on the same definition, expressed in WSDL. All a client application needs in order to convert a document to XHTML is to pass the URL of such a web service to the WebServiceConversion class. GetPublished stores the URLs of the DocAdapter services in the database, so new formats can be supported by simply deploying a web service and adding a single record to GetPublished’s database.
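The plug-in idea can be sketched in Python. This is an analogy, not the DocAdapter API: the class names, the `import_document` method, the record format, and the example URL are all assumptions. The web service call is injected as a plain function so the sketch runs without a network.

```python
import html
from typing import Callable, Dict, Protocol


class DocumentImport(Protocol):
    """Rough Python analogue of a converter interface like IDocumentImport."""

    def import_document(self, data: bytes) -> str: ...


class PlainTextImport:
    """A local converter that wraps plain text in a paragraph."""

    def import_document(self, data: bytes) -> str:
        return "<p>" + html.escape(data.decode("utf-8")) + "</p>"


class WebServiceImport:
    """Delegates conversion to a remote service identified by its URL.

    `post` is injected; a real client would call the service over the network.
    """

    def __init__(self, url: str, post: Callable[[str, bytes], str]):
        self.url = url
        self.post = post

    def import_document(self, data: bytes) -> str:
        return self.post(self.url, data)


def build_registry(records, post) -> Dict[str, DocumentImport]:
    """records: (format, service URL) rows, standing in for a database table."""
    registry: Dict[str, DocumentImport] = {"txt": PlainTextImport()}
    for fmt, url in records:
        registry[fmt] = WebServiceImport(url, post)
    return registry


def fake_post(url: str, data: bytes) -> str:
    """Stand-in for the remote call, so the example is self-contained."""
    return f"<p>converted by {url}</p>"


registry = build_registry([("doc", "http://example.invalid/word")], fake_post)
xhtml_local = registry["txt"].import_document(b"a < b")
xhtml_remote = registry["doc"].import_document(b"...")
```

The caller only sees the shared interface, so supporting a new format means adding one more record rather than changing the calling code.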
Document Processing
Once the text and images are extracted, DocAdapter can create the HTML and necessary supporting files that can be displayed on a web site. This is done by calling a single method, CreateDocumentArchive, which returns a DocumentArchive object. The DocumentArchive object contains all the necessary information, such as the final HTML (including syntax highlighting), all referenced images, thumbnails, a table of contents, keywords, and other information that may be useful. The generation of these elements is controlled by parameters passed to the CreateDocumentArchive method. Some of these parameters are:
- Maximum image width. Images wider than this value will be replaced by thumbnails that link to the full-size image.
- Maximum image height. Images taller than this value will be replaced by thumbnails that link to the full-size image.
- The width of thumbnail images generated for images wider than the maximum specified width or taller than the maximum specified height.
- Whether to keep embedded images in the final document (external image references remain unmodified).
- The depth of the table of contents to generate.
- Whether to produce printer-friendly output, which doesn’t include JavaScript elements for dynamically hiding and showing images and sections.
In GetPublished, most of these parameters are associated with specific content types configurable by system administrators.
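The image size parameters can be illustrated with a short Python sketch. The function and its return format are hypothetical, and keeping the aspect ratio for the thumbnail height is my assumption; the description above only specifies the thumbnail width.

```python
def plan_image(width, height, max_width, max_height, thumb_width):
    """Decide whether an image is shown inline or replaced by a thumbnail.

    Returns ("inline", width, height) or ("thumbnail", w, h). The thumbnail
    height is scaled to keep the aspect ratio (an assumption of this sketch).
    """
    if width <= max_width and height <= max_height:
        return ("inline", width, height)
    thumb_height = max(1, round(height * thumb_width / width))
    return ("thumbnail", thumb_width, thumb_height)


small = plan_image(400, 300, max_width=600, max_height=800, thumb_width=200)
wide = plan_image(1200, 600, max_width=600, max_height=800, thumb_width=200)
tall = plan_image(500, 1000, max_width=600, max_height=800, thumb_width=200)
```

An image within both limits stays inline; one that is too wide or too tall is replaced by a thumbnail of the configured width that would link to the full-size image.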
Posted by Yorai Aminov on May 2nd, 2008 under CDN, GetPublished