CDN Search, Part 3: Search Results
Previously on this blog: we deployed a new search engine and used background jobs to index content.
When you run a search query on CDN, you’ll see a line of text just below your search results that looks something like this:
Query processed in 375ms (31.25ms server/network, 343.75ms database)
This text describes exactly how long it took to prepare the query, run it, retrieve all information required for display from the database, check the visibility of every result, sort the results according to the user’s preference, and prepare the set of results in the current page for display.
The times I quoted above came from a run of this query, which searches all indexed sites for C++ source code that contains the word "TClientDataSet". The text is slightly inaccurate: although the total processing time was indeed 375ms, there are other valid methods of counting the other times.
The "server/network" time represents the time it took to send the query to Lucene, get the results from the web service, and convert them to classes and records GetPublished can process. Over 95% of that time is used for transferring the results over the network. The rest is taken by the actual search engine on the Lucene server and the result processing code on the server running GetPublished.
Almost everything else GetPublished does with the results involves the database, so we say the rest of the time belongs to "database". That split is not exact, though: for large result sets the Lucene engine processing time - and certainly the network transfer time - will be longer, and GetPublished uses that waiting time to execute additional database queries by running the code in multiple threads.
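To make the timing split concrete, here is a minimal Python sketch of the idea. It is not GetPublished's code: `run_search`, `prefetch_visibility`, and the helper names are hypothetical stand-ins, and the "database" figure is simply whatever remains after subtracting the search call from the total, which is how the numbers quoted above add up (375 - 31.25 = 343.75).

```python
import time
from concurrent.futures import ThreadPoolExecutor


def format_timing(total_ms, search_ms):
    # Everything that is not the search call is attributed to "database".
    database_ms = total_ms - search_ms
    return (f"Query processed in {total_ms:g}ms "
            f"({search_ms:g}ms server/network, {database_ms:g}ms database)")


def timed(fn):
    # Run fn and return its result together with the elapsed milliseconds.
    start = time.perf_counter()
    result = fn()
    return result, (time.perf_counter() - start) * 1000.0


def search_with_prefetch(run_search, prefetch_visibility):
    # Hypothetical callables: run_search calls the search web service,
    # prefetch_visibility runs database queries that do not need the hits.
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=2) as pool:
        search_future = pool.submit(timed, run_search)
        prefetch_future = pool.submit(prefetch_visibility)
        hits, search_ms = search_future.result()
        visibility = prefetch_future.result()
    total_ms = (time.perf_counter() - start) * 1000.0
    return hits, visibility, format_timing(total_ms, search_ms)
```

Because the database work overlaps the search call in this sketch, the two reported numbers do not describe strictly sequential phases, which is one reason such a breakdown can only be approximate.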
Here’s the actual processing log from GetPublished for the same query:
Search expression started
Getting list of sites to include in the search
Search includes one or more GetPublished sites
Retrieving external sites
Search includes one or more external sites
Building Lucene query
Lucene query: +appid:(gp blogs blogsteamb cc qc) AND (cpp.source:(+TClientDataSet))
Search sites: 1,5,7,9,10,11,12
Search thread starting
Retrieving language ID
Retrieving site names
Retrieving visible version IDs
Retrieving staged visible version IDs
Retrieving welcome content types for welcome page queries
Welcome content filter: IN (26,314,322)
Query completed in 31.25ms
The web service returned in 31.25ms
The search engine ran the query in 0ms
Total number of results: 59
Sorting results by AppID and version ID
Adding information for live versions in search results
Adding information for staged versions in search results
Removing versions that are not visible to the user
Updated number of items in search results: 59
Sorting results for welcome page processing
Query ran before retrieving live welcome versions - retrieving now
Live welcome versions:
Language ID: 1
Country ID: 239
Removing welcome pages that are not visible to the user
Final number of items in search results: 59
Parsing dates for sorting
Sorting results
Creating result list
Updating site names for visible results
Creating article headers for display and loading additional information
Search expression processing completed in 375ms
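The Lucene query in the log has a simple shape: a required clause restricting the application IDs, ANDed with a field clause in which every term is required. The following Python sketch reproduces that string. The function name and parameters are hypothetical, and it does not escape Lucene special characters, so it only illustrates the structure and says nothing about how GetPublished actually builds its queries.

```python
def build_lucene_query(app_ids, field, terms):
    # Restrict the search to the given application IDs.
    apps = " ".join(app_ids)
    # Mark every search term as required within the field.
    required = " ".join(f"+{term}" for term in terms)
    return f"+appid:({apps}) AND ({field}:({required}))"


query = build_lucene_query(
    ["gp", "blogs", "blogsteamb", "cc", "qc"],
    "cpp.source",
    ["TClientDataSet"],
)
print(query)
```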
As you can see, GetPublished does a lot more than just run the search query. The reason it does so is that it needs to convert the search results - a list of version records - to a list of articles the user can see. Here are some of the things that affect the visibility of articles and versions:
- Articles can have multiple versions, but only one version per article can be "live" (that is, visible to users).
- Articles can be mapped to multiple sites, and have different publishing and expiration dates on each site.
- GetPublished supports "staging sites" that use a different set of versions.
- Articles are filtered based on the user’s preferred language.
- Certain articles may only be visible in specific countries or regions.
- Articles of a special type, known as a "welcome page", are shown on specific site areas instead of in article lists. On such pages, only one "welcome page" can be visible.
The list of results returned by the search web service is trimmed and expanded based on the search criteria, user preferences, and visibility settings:
- Versions that are not visible to the user are removed from the list.
- If an article is mapped to multiple sites that are included in the search, additional result records are created for each site.
- Welcome pages that are mapped to multiple areas are stored once per site, but all valid links are stored in the result record to be displayed later.
- Of multiple welcome pages mapped to a single area, only the one visible to the user is kept.
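The first two steps in the list above (dropping invisible versions, then creating one record per matching site) can be sketched in Python as follows. The `Hit` type, the lookup tables, and the tuple layout are all assumptions made for illustration; welcome page handling is left out, and the real GetPublished data model is not shown in this post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    # Hypothetical shape of one record returned by the search service.
    version_id: int
    article_id: int


def trim_and_expand(hits, visible_version_ids, sites_by_article, searched_sites):
    results = []
    for hit in hits:
        # Trim: skip versions the user is not allowed to see.
        if hit.version_id not in visible_version_ids:
            continue
        # Expand: one result record for every searched site the article maps to.
        for site_id in sites_by_article.get(hit.article_id, ()):
            if site_id in searched_sites:
                results.append((hit.article_id, hit.version_id, site_id))
    return results
```

For example, an article with one visible version that is mapped to two searched sites yields two result records, while a hit for a version that is not visible yields none.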
Many of these checks can run concurrently. For example, if the search engine takes a while to process the query, GetPublished retrieves all visible articles and welcome pages from the database. If the engine returns quickly, GetPublished only checks the database for versions that are included in the search results. Similarly, GetPublished tries to reduce the working result set as much as possible. For example, when sorting by site name, GetPublished has to set the site name of every result before sorting and paging. When sorting by other fields, GetPublished waits until the results are sorted and only sets the site name for results that are going to be displayed.
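The site name optimization described above can be sketched in Python like this. The dictionary-based records and the `lookup_site_name` callable are hypothetical; the point is only that the lookup runs for every result when it is needed for sorting, and otherwise just for the results on the current page.

```python
def page_of_results(results, sort_key, page, page_size, lookup_site_name):
    start = page * page_size
    end = start + page_size

    if sort_key == "site_name":
        # The sort needs the site name, so every result must be resolved first.
        for result in results:
            result["site_name"] = lookup_site_name(result["site_id"])
        return sorted(results, key=lambda r: r["site_name"])[start:end]

    # Otherwise sort and page first, then resolve names only for what is shown.
    visible = sorted(results, key=lambda r: r[sort_key])[start:end]
    for result in visible:
        result["site_name"] = lookup_site_name(result["site_id"])
    return visible
```

With 59 results and a page size of 10, sorting by any field other than the site name needs at most 10 lookups in this sketch instead of 59.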
Posted by Yorai Aminov on September 24th, 2008 under CDN, GetPublished

