The ELC Community Blog
A knowledge exchange on Ruby on Rails and Agile Development
Advanced Solr Filters with Phonetics
by Ryan Garver on July 21, 2008
Solr is a very popular full text search engine these days. One of the big benefits over text searching in mysql or some other database that is gained with Solr beside performance is the quality of it's fuzzy searching. Most databases support basic substring comparison; in mysql this is achieved with the 'column LIKE %query%' construct. There are also some databses that will handle regular expressions as well, but these tend to be slow and don't necessarily give rise to good search results with out a lot of work. Solr, more specifically Lucene, is built for this kind of thing and is optimized for string based comparisons and indexing. Solr also comes with a number of algorithms for good fuzzy searches out of the box.
If you are using acts_as_solr in your rails app the Solr configuration that is run when you type 'rake solr:start' supports some very basic query filtering. The fragment below is found in the default schema.xml
1 <fieldtype name="text" class="solr.TextField" positionincrementgap="100">
2 <analyzer type="index">
3 <tokenizer class="solr.WhitespaceTokenizerFactory">
4 </tokenizer><filter words="stopwords.txt" class="solr.StopFilterFactory" ignorecase="true">
5 </filter><filter class="solr.WordDelimiterFilterFactory" generatenumberparts="1" catenatewords="1" catenatenumbers="1" generatewordparts="1" catenateall="0">
6 </filter><filter class="solr.LowerCaseFilterFactory">
7 </filter><filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt">
8 </filter><filter class="solr.RemoveDuplicatesTokenFilterFactory">
9 </filter></analyzer>
10 <analyzer type="query">
11 <tokenizer class="solr.WhitespaceTokenizerFactory">
12 </tokenizer><filter class="solr.SynonymFilterFactory" ignorecase="true" expand="true" synonyms="synonyms.txt">
13 </filter><filter words="stopwords.txt" class="solr.StopFilterFactory" ignorecase="true">
14 </filter><filter class="solr.WordDelimiterFilterFactory" generatenumberparts="1" catenatewords="0" catenatenumbers="0" generatewordparts="1" catenateall="0">
15 </filter><filter class="solr.LowerCaseFilterFactory">
16 </filter><filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt">
17 </filter><filter class="solr.RemoveDuplicatesTokenFilterFactory">
18 </filter></analyzer>
19 </fieldtype>
What this says is that for all text fields that are indexed, and all queries that look at text fields, the text will be split in to words based on white space, add in any synonyms it can find, drop certain "stop words", breaks up the words based on a few other triggers (hyphens, camel case, apostrophes), drops the case of all of the words, trims down some basic conjugations (Porter filter), and then removes any duplicates. The Solr wiki has a full description of these tokenizers and filters.
The real gain from fuzzy searching comes with the tilde (~) operator. This operator tells Solr to base it's relevancy score on a Levenshtein distance algorithm which looks at the number of changes to a string required to arrive at another string.
Levenshtein is great, but we can improve the quality of these search results. One reason for inaccurate search queries is misspellings. Levenshtein attempts to find words that are close, but this ignores how people actually work with words. A better solution would be to look at what kinds of substitutions are required and give certain ones higher scores. This would be based on their use in language; their effective pronunciation.
There are a number of algorithms that support comparing two words based on how similar they are in pronunciation. One of the more common ones is called Soundex. This algorithm produces a hash from a given string. Other strings that have a similar pronunciation are supposed to hash to the same value. There are a few limitations with the method. One major one is that it can't handle wrong first letters. So 'psychology' (P242), 'sychology' (S240), and 'cychology' (C420) will not match at all. There are a number of variations on Soundex as well as a few alternatives.
Solr supports a number of these phonetic filters. We'll be adding support for indexing and querying on one of the Soundex variations: Double Metaphone. Looking at the same schema.xml file as above we can add a single line to both the index and the query analyzer tag content.
1 <fieldtype name="text" class="solr.TextField" positionincrementgap="100">
2 <analyzer type="index">
3 <tokenizer class="solr.WhitespaceTokenizerFactory">
4 </tokenizer><filter words="stopwords.txt" class="solr.StopFilterFactory" ignorecase="true">
5 </filter><filter class="solr.WordDelimiterFilterFactory" generatenumberparts="1" catenatewords="1" catenatenumbers="1" generatewordparts="1" catenateall="0">
6 </filter><filter class="solr.LowerCaseFilterFactory">
7 </filter><filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt">
8 <!-- Add new phonetic filter -->
9 </filter><filter class="solr.PhoneticFilterFactory" inject="true" encoder="DoubleMetaphone">
10 </filter><filter class="solr.RemoveDuplicatesTokenFilterFactory">
11 </filter></analyzer>
12 <analyzer type="query">
13 <tokenizer class="solr.WhitespaceTokenizerFactory">
14 </tokenizer><filter class="solr.SynonymFilterFactory" ignorecase="true" expand="true" synonyms="synonyms.txt">
15 </filter><filter words="stopwords.txt" class="solr.StopFilterFactory" ignorecase="true">
16 </filter><filter class="solr.WordDelimiterFilterFactory" generatenumberparts="1" catenatewords="0" catenatenumbers="0" generatewordparts="1" catenateall="0">
17 </filter><filter class="solr.LowerCaseFilterFactory">
18 </filter><filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt">
19 <!-- Add new phonetic filter -->
20 </filter><filter class="solr.PhoneticFilterFactory" inject="true" encoder="DoubleMetaphone">
21 </filter><filter class="solr.RemoveDuplicatesTokenFilterFactory">
22 </filter></analyzer>
23 </fieldtype>
We put this at the bottom of the list to make sure that the filter is applied last. Before this will affect our search queries we'll need to re-index our content. This is required so that the Metaphone hashes can be generated and indexed. Once the index is updated our queries will be processed as normal but also be hashed. When the comparisons are made the hashes of the record and the query will be compared to determine how similar they are from a phonetic standpoint.
The results you will see are usually pretty good, but as with any fuzzy algorithm it will require some tweaking before it will "feel" right. There are also a number of alternative algorithms that my index and query faster, or give better results. See the bottom of the AnalyzersTokenizersTokenFilters Solr wiki page for a full list.
Timeline
- Advanced Solr Filters with Phonetics
- Advanced db_free_solr
- "Fixing" acts_as_solr
- USPS Package Tracking
- Setting up rmagick on Ubuntu
- Spec refactoring
Comments