Before discussing how to improve relevance, it is helpful to understand what relevance is in terms of a search engine. As a simple definition, a relevant result in LucidWorks Enterprise is a search result that provides useful information to the user in response to a query. Defining "useful information" is a bit trickier, since it depends on the user's point of view. Many factors contribute to what is useful, among them the chosen query terms and query structure, as well as the user's background and past search experience. Despite this subjectivity, it is possible to describe many techniques that help improve relevance for most users in most situations.
Relevance should always be judged in the context of a specific index and a set of queries for that index. Often the best way to achieve this is through query log analysis. In a typical query log analysis, the top 50 queries (give or take) are extracted from the logs, plus ten to twenty random queries. Next, one to three users enter each query into the system and then judge the top ten (or five) results. Judgments may be done using values of relevant, somewhat relevant, not relevant and embarrassing. The goal of relevancy tuning is to maximize the number of relevant documents while minimizing the number of embarrassing ones.
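The bookkeeping for such a study can be as simple as counting judgment labels per query and watching the counts over time. The following sketch assumes a four-value judgment scale as described above; the label names and sample data are illustrative, not part of LucidWorks Enterprise.

```python
# Sketch: aggregating relevance judgments from a query-log study.
from collections import Counter

JUDGMENTS = ("relevant", "somewhat relevant", "not relevant", "embarrassing")

def summarize(judgments):
    """Count judgment labels for the top-N results of one query."""
    counts = Counter(judgments)
    return {label: counts.get(label, 0) for label in JUDGMENTS}

# One tester's judgments for a single query's top five results,
# recorded at two points in time:
run1 = ["relevant", "relevant", "somewhat relevant", "not relevant", "embarrassing"]
run2 = ["relevant", "relevant", "relevant", "somewhat relevant", "not relevant"]

# The goal over time: "relevant" goes up, "embarrassing" goes to zero.
print(summarize(run1))
print(summarize(run2))
```

Repeating this tally after each tuning change makes regressions (new "embarrassing" results) immediately visible.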
By recording these values and repeating the test over time, it becomes possible to see if relevancy is getting better or worse for the particular system in question.
An alternative method for judging relevance is to use what is commonly referred to as A/B testing. In this approach, some set of users are shown results using one version of the index while another set of users is shown the results from a different version. To judge the success of a particular approach, user clicks are tracked and analyzed to determine which approach provides better results.
|Click Scoring Relevance Framework|
One important aspect of LucidWorks Enterprise relevance scoring is the ability to boost documents that prior users have selected. This functionality is called the Click Scoring Relevance Framework and can be enabled through the Administrative User Interface.
In this section, the discussion focuses on techniques used during indexing. While these techniques almost always have to be mirrored on the query side, they are considered here because they originate during indexing.
When indexing via the Solr APIs, it is possible to mark one Document or Field as more important than others by setting a boost value during indexing. These boost factors are multiplied into the scoring weight during search, potentially moving the result higher up in the result set. This type of boosting is usually applied when a Document's importance is known a priori. Note, however, that index-time boosting only provides 255 distinct values of granularity, and if the boost value needs to change, the document must be re-indexed.
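In the Solr XML update format, such boosts are expressed as attributes on the document or field elements. The field names and boost values below are illustrative:

```xml
<add>
  <!-- document-level boost: this whole document scores higher -->
  <doc boost="2.5">
    <field name="id">doc1</field>
    <!-- field-level boost: matches in the title count for more -->
    <field name="title" boost="3.0">Press Release</field>
    <field name="body">Body text of the document.</field>
  </doc>
</add>
```

Because these values are baked into the index, changing either boost requires re-indexing the document.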
Stemming is the process of reducing a word to a base or root form, for example by removing plurals, gerunds ("ing" endings) or "ed" endings. Lemmatization is a variation of stemming that always produces a whole word (the dictionary form), while stemming need not do so. There are many stemming theories and techniques. Some are quite aggressive, stripping words down to very short roots, while others (called light stemmers) are less aggressive.
Solr has a well-defined mechanism for plugging in stemmers via the Analysis Process which makes it easy to try out different stemmers. It also has an easy to use Admin interface for testing the analysis process located at http://localhost:8888/solr/collection1/admin/analysis.jsp.
LucidWorks Enterprise currently ships with many options for stemming. It is also possible to plug in a custom analyzer or use other Solr or Lucene analyzers not included in the Solr distribution. To do this, see the Analysis Process link above. As a general rule of thumb, it is usually best to start with a light stemming approach that removes plurals and applies other basic normalizations, and then progress to more aggressive stemming only after performing some relevance testing as described in Judging Relevance.
Default stemming in LucidWorks Enterprise uses the Lucid Plural Stemmer for the default English text analysis Field Type, which simply stems plural words to their singular form. Rules can be added to a rules file to protect words, translate them specially, or add or modify stemming rules as needed (see the Stemming and Lucid Plural Stemmer Rules File Format sections). The Lucid KStem stemmer, based on the KStemmer light stemming library, is also available. More aggressive stemmers are available as well, such as Dr. Martin Porter's Snowball stemmers (choose the "text (English Snowball)" Field Type).
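To make the idea of light plural stemming with protection rules concrete, here is a simplified sketch in the spirit of a plural stemmer. The rules, protected words, and irregular mappings below are illustrative assumptions, not the shipped Lucid Plural Stemmer implementation or its rules file:

```python
# Simplified light plural stemmer: singularize common plural forms,
# with a protection list and explicit irregular translations.

PROTECTED = {"news", "species", "analysis"}          # never stemmed
IRREGULAR = {"children": "child", "geese": "goose"}  # explicit translations

def stem_plural(word):
    w = word.lower()
    if w in PROTECTED:
        return w
    if w in IRREGULAR:
        return IRREGULAR[w]
    if w.endswith("ies") and len(w) > 4:
        return w[:-3] + "y"                 # "queries" -> "query"
    if w.endswith("es") and w[-3:-2] in ("s", "x"):
        return w[:-2]                       # "boxes" -> "box"
    if w.endswith("s") and not w.endswith("ss"):
        return w[:-1]                       # "dogs" -> "dog"
    return w
```

Note how the protection list keeps "news" or "analysis" from being mangled; the real rules-file mechanism serves the same purpose.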
When indexing, it is often useful to apply several different analysis techniques to the same content. For example, providing a default case-insensitive search is often the best choice for general users, but expert users will often want to do exact match searches which additionally require a case-sensitive field. In Solr, this can be accomplished by using the <copyField> mechanism, as described in the Solr Wiki section on the Schema. This currently can be set up by editing the schema.xml file and restarting LucidWorks Enterprise.
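A schema.xml sketch of this pattern might look like the following, where the same title text is indexed both case-insensitively (for general search) and verbatim (for exact match). The field and type names here are illustrative:

```xml
<!-- analyzed, case-insensitive copy for general users -->
<field name="title" type="text_en" indexed="true" stored="true"/>
<!-- verbatim copy for exact, case-sensitive matching -->
<field name="title_exact" type="string" indexed="true" stored="false"/>

<copyField source="title" dest="title_exact"/>
</copyField-->
```

Each copy carries its own analysis chain, so the two fields can be queried independently or combined with different weights.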
Other examples of alternate fields include different stemming approaches, using character-based and word-based n-grams, and stripping punctuation, accents and other marks. As with any change, though, some time should be taken to evaluate whether it is an improvement.
Removing stopwords from the index and stripping them from queries are common techniques for reducing the size of an index and improving search results, even though they throw away information. While LucidWorks Enterprise can remove stopwords at indexing time, it does not do so by default. Instead, it excludes stopwords at query time, except in certain types of queries where they help clarify the user's intent (such as in phrases). Thus, the key with stopwords is not to throw them away, but to know when to use them. Both the Extended Dismax Query Parser and the Lucid Query Parser can take advantage of stopwords to improve results by using them in an n-gram approach.
On the search side, there are many techniques for improving relevance, the most important of which is user education. While the techniques described below can make things much easier for users, educating users on how to use the proper query syntax, when to use it, and how to refine queries can be instrumental in enhancing the relevance of search results. Obviously, not all users will read manuals or take the time to learn new query syntax, so the following techniques can be used to achieve better results in many situations.
Similar to Document/Field boosting, terms in a query can be boosted. Boosting a query term implies that the term in question is somehow more important than the other terms in the query. One advantage of query-time boosting is that an expanded level of granularity is available for expressing the boost value. Additionally, the boost value is not "baked in" to the index, so it is easier to change.
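In the Lucene query syntax used by Solr, a caret (^) followed by a number boosts a term or clause. The field names and boost values below are illustrative:

```
title:lucene^5 OR text_all:lucene
(apache solr)^4 tutorial
```

In the first query, matches in the title field count five times as much as matches in the body; because the boost lives in the query string, it can be adjusted per request without re-indexing.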
Synonym expansion is a common technique that looks up each token in the original query and expands it with synonyms. Strictly speaking, synonym expansion mostly improves recall (the ability to retrieve more of the relevant documents) rather than relevance ranking or the exclusion of irrelevant documents. For instance, a user query containing "USA" could be expanded to look like (USA OR "United States" OR "United States of America"), which will likely bring back results that the user intended to retrieve but did not fully specify. In LucidWorks Enterprise, it is easy to specify a list of synonyms to be used for expansion. Synonym lists are best created by analyzing query logs, looking up synonyms for common query terms, and then testing the results. Generic synonym lists (like those obtained from WordNet) can be useful, but care must be taken, as too many synonyms can be problematic for users, especially if they are not appropriate for the genre of the index. It is, however, quite common to produce synonym lists containing common abbreviations, numbers (e.g., 1 -> one, 2 -> two, etc.) and acronyms.
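As a sketch, query-side expansion might be implemented as below. The synonym map and whitespace tokenization are illustrative assumptions, not the LucidWorks Enterprise implementation:

```python
# Sketch of query-side synonym expansion: each query token is replaced
# by an OR group of the token plus its known synonyms.

SYNONYMS = {
    "usa": ['"United States"', '"United States of America"'],
    "tv": ["television"],
}

def expand(token):
    alts = SYNONYMS.get(token.lower())
    if not alts:
        return token                       # no synonyms: leave as-is
    return "(" + " OR ".join([token] + alts) + ")"

def expand_query(query):
    return " ".join(expand(t) for t in query.split())

print(expand_query("USA TV"))
# (USA OR "United States" OR "United States of America") (TV OR television)
```

Keeping the map small and genre-appropriate, and seeding it from query-log analysis, avoids the over-expansion problem noted above.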
Unsupervised feedback is a relevancy tuning technique that executes the user's query, takes the top five or ten documents from the result, extracts "important" terms from each of the documents and then uses those terms to create a new query which it then executes and whose results are returned to the user. This is all done automatically in the background with no interaction required by the end user. As an example, if the user searches for the word "dog" and the top three documents are (for the sake of example):
- Great big brown dogs run through the woods.
- Dogs don't like cats.
- A poodle is a type of dog.
The feedback query might look something like (dog) OR (great OR big OR brown OR dog OR run OR woods OR cat OR poodle).
Since these terms co-occur with the word "dog" in high ranking documents, the theory goes that they are terms that can help better specify the user's short query. Unsupervised feedback is often viewed as a helper, but it does rely on the assumption that the top few documents are highly relevant to the search. If they are not, then the results incorporating feedback will likely be worse than those without feedback.
Unsupervised feedback is optional in LucidWorks Enterprise and is disabled by default. It may be enabled by checking the Enable Unsupervised Feedback check box in the query settings panel of the Administration User Interface.
Since traditional unsupervised feedback is prone to introducing too many marginally relevant documents even as it boosts relevant ones, LucidWorks Enterprise supports, and defaults to, the Emphasize Relevancy option (when unsupervised feedback is enabled), which combines the feedback terms with the original user query using the AND operator. This ensures that no additional documents are added to the results, while documents containing the feedback terms are boosted.
The feedback query when Emphasize Relevancy has been selected might look something like (dog) AND (great OR big OR brown OR dog OR run OR woods OR cat OR poodle).
The option to Emphasize Recall performs traditional unsupervised feedback, as illustrated in the original example, by combining the feedback terms with the original user query using the OR operator. This option is best used when you would like to see documents that are similar to the top results but may not match the full original query.
The Emphasize Relevancy/Recall options are found under the Enable Unsupervised Feedback check box in the query settings panel of the admin UI.
Supervised feedback is similar to unsupervised feedback except that users explicitly pick which results are relevant, usually by clicking the result or checking a box indicating it is relevant. The LucidWorks Enterprise feedback component does not currently support supervised feedback.
It is often the case that particular queries cause a good deal of pain (in terms of relevance). There are several key things to keep in mind when debugging these problems. First and foremost, determine how important the query is. Is it a common query or does it only occur once in a great while? If it is a relatively rare query, it may not be worth the effort to try to "fix" it. Second, don't overtune. Fixing one query may break ten other queries. Unless there is an obvious fix, it is recommended that relevance judgments be established first so that any breakages can be quickly caught. After the need to fix a problem is established, there are some techniques to do just that.
The first thing to do when debugging is to run the query in debug mode by appending &debugQuery=true to the request. With debug mode on, the response includes a score explanation showing why each result scored the way it did. The key components of a typical explanation are:
1. The overall score for the document given the expanded query.
2. The score from the main query.
3. A boost of 5.0 applied to the title.
4. The term frequency (tf) of the word in the Document. The word "lucene" occurs once in the title Field of the Document.
5. The IDF (inverse document frequency) of the word in the title Field. The word "lucene" occurs in the title Field of 2 Documents out of a total of 213 Documents.
6. The length normalization applied to the Field.
7. The IDF of the word in the text_all Field. The word occurs in 38 documents out of 213.
8. The default query for LucidWorks Enterprise can boost more recent documents higher than older documents.
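The tf and idf components above can be reproduced numerically. This sketch assumes Lucene's classic TF-IDF formulas (DefaultSimilarity: tf = sqrt(freq), idf = 1 + ln(numDocs / (docFreq + 1))), which underlie Solr scoring:

```python
# Sketch: the tf and idf building blocks of a Lucene score explanation.
import math

def tf(freq):
    """Term-frequency factor for a term occurring `freq` times in a field."""
    return math.sqrt(freq)

def idf(doc_freq, num_docs):
    """Inverse document frequency: rarer terms score higher."""
    return 1.0 + math.log(num_docs / (doc_freq + 1.0))

# "lucene" occurs once in the title field of the document:
print(tf(1))                 # 1.0
# "lucene" occurs in the title field of 2 documents out of 213:
print(idf(2, 213))           # ~5.26
# "lucene" occurs in text_all in 38 documents out of 213 (lower idf):
print(idf(38, 213))
```

Note how the rarer title-field occurrence yields a much larger idf than the common text_all occurrence, which is why title matches dominate the score even before the title boost is applied.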
|A Word on Scoring|
Scores for results for a given query are only useful in comparison to other results for that exact same query. Trying to compare scores across queries or trying to understand what the actual score means (i.e. 2.34345 for a specific document) may not be an effective exercise. However, understanding the components that were factored in to make that score is a different story.
Another useful tool for debugging is Luke, an easy-to-use GUI that provides valuable information about the underlying Lucene index. Its features include document browsing, query testing, term browsing (including high frequency terms) and statistics about the collection as a whole. To use Luke with LucidWorks Enterprise, launch it using the script located in the luke directory under the installation.
Once Luke is launched, point it at the LucidWorks Enterprise index directory ($LWE_HOME/solr/collection1/data/index) and open the index. From there, the most useful actions are to view the high frequency terms, and to inspect particular documents (under the Documents tab) using the "Browse by term" and "Browse by document number" options. Key items to look for are missing documents or fields, and terms or words that aren't tokenized "correctly". Correctly, in this case, doesn't necessarily mean the analysis process was wrong; it may mean that the output is not what a user would expect. For instance, a word may be stemmed in an unexpected way.
|Luke in LucidWorks Enterprise|
LucidWorks Enterprise packages a version of Luke, which is provided 'as is'.
Once issues have been discovered and understood, it is best to develop a strategy for fixing or working around them. This often means changing analysis steps. To help visualize the analysis process, Solr ships with an analysis tool that shows the outcome of each analysis step on both the indexing side and the query side. To use this tool, point a browser at http://localhost:8888/solr/collection1/admin/analysis.jsp and enter the text to be analyzed. By trying out the text with different analysis configurations (by selecting different Fields or Field Types), it is possible to better understand why matches may or may not occur.
The Query Elevation Component is a Solr SearchComponent that enables editorial control of results by allowing specific results to be placed in specific positions for a given query. For example, if a particular FAQ answer is buried in the result set for a query, it can be "promoted" to the first position by configuring it in the Query Elevation Component. For details on configuring the component, see the Solr Wiki section titled QueryElevationComponent. It can be configured by editing solrconfig.xml manually, or by using the Settings REST API.
Configuring the QueryElevationComponent requires restarting LucidWorks Enterprise.
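The elevation rules themselves live in an elevate.xml file referenced from solrconfig.xml. A sketch of its format, with illustrative query text and document ids:

```xml
<elevate>
  <query text="ipod">
    <!-- this document is forced to the top position for the query -->
    <doc id="FAQ-1234"/>
    <!-- this document is removed from the results for the query -->
    <doc id="OLD-PRESS-567" exclude="true"/>
  </query>
</elevate>
```

Each `<query>` element pins (or excludes) specific documents whenever a user issues exactly that query text.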
The standard mechanism in Solr for adding external field data (which may affect ranking) is the ExternalFileField type. This mechanism works well for adding simple string or numeric values to be processed by function queries, but it cannot express more complex scoring based on other regular query types.
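A schema.xml sketch of an ExternalFileField, with illustrative field and file names; the per-document values would come from a text file of `key=value` lines in the index data directory:

```xml
<!-- numeric rank values maintained outside the index, keyed by id;
     usable in function queries without re-indexing documents -->
<fieldType name="externalRank" class="solr.ExternalFileField"
           keyField="id" defVal="1" valType="pfloat"/>
<field name="rank" type="externalRank"/>
```

Because the file can be swapped and reloaded without touching the index, this is a common way to feed frequently changing signals (such as popularity) into ranking.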