The first step to being able to search is to create an index. Modern search applications use a technique called an "inverted index" to make search more efficient. An inverted index is similar to the index found in the back of a book: words extracted during the indexing process are listed and stored with pointers to each document in which they appear, along with the total frequency of each word (which is later used in relevance ranking). The LucidWorks Platform also records the position of each word in order to support proximity searching (where you can specify queries such as dog NEAR puppy).
This example shows how an inverted index may be constructed:
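As a minimal illustrative sketch (using hypothetical documents and a hypothetical helper function; LucidWorks itself uses Lucene's on-disk index format, not Python dictionaries), a positional inverted index can be built like this:

```python
from collections import defaultdict

def build_inverted_index(documents):
    """Build a positional inverted index: term -> {doc_id: [positions]}."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for position, word in enumerate(text.lower().split()):
            # Record where each word occurs; positions enable proximity search,
            # and len(positions) gives the frequency used in relevance ranking.
            index[word].setdefault(doc_id, []).append(position)
    return index

docs = {
    1: "the quick brown fox",
    2: "the lazy brown dog",
}
index = build_inverted_index(docs)
print(index["brown"])  # {1: [2], 2: [2]} -- appears in both docs at position 2
print(index["dog"])    # {2: [3]} -- appears only in document 2
```

Looking up a query term is now a single dictionary access rather than a scan of every document, which is what makes the inverted index efficient at query time.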
Indexing documents is the slowest part of a search application. Each document must be broken into individual words and a word list built from them. As each new document is indexed, new words are added to the list, and existing entries are updated with pointers to the new document. The index will be very large (although usually not as large as the documents themselves), and various techniques are used to compress it. A smaller index saves disk space, lowering hardware costs for the search application, while also allowing faster retrieval during query processing. This compression makes adding new documents slower than inserting rows into a relational database, for example, which is why it is often most efficient to add documents in batches.
Advanced indexing processes, such as the one used with LucidWorks, take advantage of the fact that documents are not simply lists of sentences and words, but usually contain some sort of structure: an email will likely have "to" and "from" information; Word and PDF documents may have "title" and "author" information in addition to the main "body"; product descriptions may have "price", "description", or "color" information. These are known as fields within each document. Adding field information to the word list lets users search for emails from a specific person, or shoes that come in a particular color. It also allows the search application to treat the data in each field appropriately: a date, for example, should be handled differently than an author name, which in turn is handled differently than a price.
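Conceptually, a field-aware index keys its word list by (field, term) pairs rather than by term alone. The sketch below (hypothetical documents and function names, not the LucidWorks API) shows how that makes a fielded search such as author:smith possible:

```python
from collections import defaultdict

def index_fielded_document(index, doc_id, fields):
    """Add one document's fields to a field-aware inverted index.

    index maps (field_name, term) -> set of doc ids.
    """
    for field_name, text in fields.items():
        for term in str(text).lower().split():
            index[(field_name, term)].add(doc_id)

index = defaultdict(set)
index_fielded_document(index, 1, {"author": "smith", "body": "search basics by smith"})
index_fielded_document(index, 2, {"author": "jones", "body": "advanced search with smith quotes"})

# A fielded query like author:smith matches only document 1, even though
# "smith" also appears in document 2's body.
print(index[("author", "smith")])  # {1}
print(index[("body", "smith")])    # {1, 2}
```

Because each posting records which field the term came from, the application can also apply field-specific handling (date parsing for date fields, numeric comparison for prices, and so on) at index time.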
Our written language includes a lot of information that is extraneous to a search application when it comes to matching user queries to words extracted from documents. For example, we end sentences with a period, or put periods between individual letters of an acronym. To humans, there is no difference between UCLA and U.C.L.A., but a computer will treat those as two different words because they are literally different strings. However, a user searching for UCLA probably does not care much whether it is spelled with periods or not in the matching document (there are cases where it matters, but generally it does not). To overcome these differences in our written language, words are normalized in several ways during the indexing process. All terms are made lower case so differences in capitalization do not impact results. Plural words are made singular so users who enter dogs will also find dog. Punctuation, apostrophes, accent marks, and other special characters are stripped.
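The normalization steps described above can be sketched as a single token filter. This is a simplified stand-in (real analyzers use proper stemmers such as Porter rather than the naive plural-stripping shown here):

```python
import re
import unicodedata

def normalize(token):
    """Normalize a token: strip accents and punctuation, lowercase, de-pluralize."""
    # Decompose accented characters, then drop the combining accent marks
    token = unicodedata.normalize("NFKD", token)
    token = "".join(c for c in token if not unicodedata.combining(c))
    # Strip punctuation, such as the periods in acronyms and apostrophes
    token = re.sub(r"[^\w]", "", token).lower()
    # Very naive singularization; a real stemmer handles far more cases
    if token.endswith("s") and not token.endswith("ss"):
        token = token[:-1]
    return token

print(normalize("U.C.L.A."))  # ucla -- now matches normalize("UCLA")
print(normalize("Dogs"))      # dog
print(normalize("café"))      # cafe
```

After normalization, UCLA and U.C.L.A. index to the same term, so a query for either form matches documents containing either form.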
It used to be quite common for search applications to remove very common words, often called stop words (a, the, of, from, and so on), from the index to save disk space. Because these terms occurred most often, they had the largest document lists associated with them, and they are usually the least useful in actually finding the right document. However, disk space is far cheaper now than it used to be, and index compression has vastly improved, so conserving disk space is much less of a concern. And while these terms usually add little to a search, excluding them from the index meant they could never be used, even when they were the most essential part of the query (the failure of "to be or not to be" as a query was a common 1990s example of the cost of removing stop words from the index). There may be valid reasons to remove stop words from a user's search, but there are few reasons to exclude them from the index. We will discuss the impact of stop words on a user's search later.
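The classic failure case is easy to demonstrate with a toy tokenizer (an illustrative sketch, not the LucidWorks analysis chain): if stop words are stripped at index time, a query made up entirely of stop words has nothing left to match.

```python
# A small sample stop-word list; real lists vary by application
STOP_WORDS = {"a", "an", "the", "of", "to", "be", "or", "not", "and", "from"}

def tokenize(text, remove_stop_words=False):
    """Split text into lowercase tokens, optionally dropping stop words."""
    tokens = text.lower().split()
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens

# With stop-word removal, the famous query loses every term:
print(tokenize("to be or not to be", remove_stop_words=True))  # []
# Keeping stop words in the index preserves the query intact:
print(tokenize("to be or not to be"))  # ['to', 'be', 'or', 'not', 'to', 'be']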
In order for users to be able to search, LucidWorks provides a way for administrators to configure data sources to collect documents and index them. The available data source types are preconfigured to be able to parse documents and understand the fields commonly found in documents: for example, the Web data source understands the fields commonly found on web pages, so the content found there is indexed appropriately. Data Sources are set up via the Admin UI on the Index - Sources screen or with the Data Sources API. An important factor in configuring LucidWorks is to determine how often to revisit each data source for new or updated content.