Using Solr to Search Magento by Partial SKU

I recently needed to implement searching by partial SKU in Magento while using Solr. A quick search on the internet turned up a pair of posts (here and here) that were nearly identical. Both posts offer a bit of a walkthrough of the code, though neither explains what all of the code does. I’m going to take the time here to break down the configuration changes needed to make this work. I will also discuss where I made changes to the configuration proposed in those posts.

First, the full code (for the TL;DR of you):

In schema.xml add the following changes:

1) Towards the bottom of the document inside the “schema” node add:

<copyField source="sku" dest="sku_partial" />

2) Inside the “fields” node add:

<field name="sku_partial" type="sku_partial" indexed="true" stored="true"/>

3) Inside the “types” node add:

<fieldType name="sku_partial" class="solr.TextField">
    <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="1000" side="front" />
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="1000" side="back" />
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.TrimFilterFactory" />
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
        <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords_en.txt"/>
    </analyzer>
    <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.TrimFilterFactory" />
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords_en.txt"/>
    </analyzer>
</fieldType>

4) Next, in solrconfig.xml find the “requestHandler” node(s) used for your store’s locale. (You can find them by searching for “magento_en” for English, and “magento_fr” for French.) Now change the following lines from:

<str name="qf">fulltext1_en^1.0 fulltext2_en^2.0 fulltext3_en^3.0 fulltext4_en^4.0 fulltext5_en^5.0</str>
<str name="pf">fulltext1_en^1.0 fulltext2_en^2.0 fulltext3_en^3.0 fulltext4_en^4.0 fulltext5_en^5.0</str>

to:

<str name="qf">fulltext1_en^1.0 fulltext2_en^2.0 fulltext3_en^3.0 fulltext4_en^4.0 fulltext5_en^5.0 sku_partial^1.0</str>
<str name="pf">fulltext1_en^1.0 fulltext2_en^2.0 fulltext3_en^3.0 fulltext4_en^4.0 fulltext5_en^5.0 sku_partial^1.0</str>

Now I’ll walk you through the configuration, line by line and explain what each line is actually telling Solr to do.

1) <copyField source="sku" dest="sku_partial" /> This tells Solr to copy the data from the SKU attribute that Magento sends it into a new field called “sku_partial”, which we will define in one of the following steps. We do this so that we can manipulate how Solr treats the copy without affecting the original data.

2) <field name="sku_partial" type="sku_partial" indexed="true" stored="true"/>
This is where we define the custom field that we copied the data into. Notice we are using a “type” of “sku_partial”. That is a custom field type that we will set up next (hint: that’s where the magic happens that allows us to search on partial values).

3) Now I will go through the fieldType definition line by line:

<fieldType name="sku_partial" class="solr.TextField">
This sets up the custom field type and inherits from solr.TextField. This is just a plain text value (alphanumeric).

<analyzer type="index">
This tells us that the enclosed lines are used during indexing, as opposed to during querying.

<tokenizer class="solr.WhitespaceTokenizerFactory"/>
This is a tokenizer. A tokenizer defined in the indexer breaks up each document into many parts. Those parts are then treated as separate pieces of information that are examined during search. In this case, Solr is breaking up the document using white space (spaces, tabs, new lines, etc…)
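The effect of whitespace tokenization can be sketched in a couple of lines of Python (an illustration only, not Solr’s actual implementation):

```python
def whitespace_tokenize(text):
    # Split on any run of whitespace (spaces, tabs, newlines),
    # roughly what WhitespaceTokenizerFactory does to a document.
    return text.split()

print(whitespace_tokenize("ABC-123 widget\tblue"))
# → ['ABC-123', 'widget', 'blue']
```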

<filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="1000" side="front" />
This line defines a filter to apply to the document. Filters allow you to further manipulate the contents of the document. In this case we are applying an EdgeNGramFilterFactory. (The reference posts used NGramFilterFactory here, but the “side” attribute belongs to the edge variant.) This breaks each token into chunks as small as 3 characters all the way up to 1000 characters, anchored at the front of the token. So text such as “supercalafragalisticexpialodcious” gets broken into:
sup supe super ....
Why would you want to do this? Well, suppose you searched for “super”. You would expect to find “supercalafragalisticexpialodcious”. Breaking up the word this way makes it easy for Solr to find the match, since one of the parts it has indexed will be exactly “super”.
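This front-anchored chunking can be sketched in Python (a simplification of what the filter produces, with the same min/max sizes as the config above):

```python
def front_ngrams(term, min_size=3, max_size=1000):
    # Emit prefixes of the term, from min_size characters long up to
    # max_size (or the full term, whichever is shorter).
    return [term[:n] for n in range(min_size, min(len(term), max_size) + 1)]

print(front_ngrams("SKU12345"))
# → ['SKU', 'SKU1', 'SKU12', 'SKU123', 'SKU1234', 'SKU12345']
```

A search for “SKU123” now matches, because “SKU123” is one of the indexed prefixes.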

<filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="1000" side="back" />
Notice the only thing different on this line is that side is set to “back”. This tells Solr to do the same chunking, but working from the end of the token. This yields terms like:
cious ious ous
This serves the same purpose as “front” by providing more possibilities for a search term to match against. (I’ll concede that my chosen word doesn’t lend itself well as an example here. Imagine a document composed of a large paragraph of text; each word in that text would be subjected to this filter, allowing it to match partial words.)
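The back-anchored variant is the mirror image of the front one, sketched the same way:

```python
def back_ngrams(term, min_size=3, max_size=1000):
    # Emit suffixes of the term, from min_size characters long up to
    # max_size (or the full term, whichever is shorter).
    return [term[-n:] for n in range(min_size, min(len(term), max_size) + 1)]

print(back_ngrams("SKU12345"))
# → ['345', '2345', '12345', 'U12345', 'KU12345', 'SKU12345']
```

Together with the front grams, this lets a query like “12345” (the tail end of a SKU) find the full value.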

<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
A StopFilter strips designated stop words from the document. Stop words are words such as “a”, “an”, and “is”. They are looked up from the file designated by the “words” attribute (in this case stopwords.txt). Why remove words? Because they are deemed irrelevant and likely to cause false positives. Consider a query for “an apple”: without stop-word removal, we would return every document containing the word “an”, in addition to those containing “apple” and “an apple”.
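Stop-word removal is simple to sketch. The word list below is a stand-in for stopwords.txt:

```python
STOPWORDS = {"a", "an", "is", "the"}  # stand-in for stopwords.txt

def remove_stopwords(tokens):
    # Drop any token found in the stop-word list, ignoring case,
    # roughly what StopFilterFactory does with ignoreCase="true".
    return [t for t in tokens if t.lower() not in STOPWORDS]

print(remove_stopwords(["An", "apple", "is", "a", "fruit"]))
# → ['apple', 'fruit']
```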

<filter class="solr.LowerCaseFilterFactory"/>
This filter causes everything in the document to be converted to lowercase before being stored in the index. This allows for case-insensitive searches.

<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
This filter removes any duplicate terms from the index. This reduces storage and simplifies the index, allowing for cleaner results. There are times you may not want to use this, though: the frequency of a term in a document can yield more relevant results. (Finding the word “red” 4 times in a block of text should make that result more relevant than one where “red” is only found once.)
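A simplified sketch of de-duplication (the real filter is a bit narrower, only dropping duplicates that land at the same token position):

```python
def remove_duplicates(tokens):
    # Keep the first occurrence of each token and drop later repeats,
    # preserving the original order.
    seen = set()
    out = []
    for t in tokens:
        if t not in seen:
            seen.add(t)
            out.append(t)
    return out

print(remove_duplicates(["red", "shirt", "red", "cotton"]))
# → ['red', 'shirt', 'cotton']
```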

<filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords_en.txt"/>
The SnowballPorterFilter stems each term down to a common root, so “apples” becomes “appl”, and “profession”, “professional”, and “professions” all reduce to the same stem. This lets a search for one form of a word match the others. Words listed in the file specified by the “protected” attribute are left untouched by the stemmer.
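Stemming reduces word forms to a shared root. Here is a deliberately crude sketch (nothing like the real Snowball rule set, which is far more sophisticated); the protected list stands in for protwords_en.txt:

```python
def crude_stem(word, protected=("sku",)):
    # Strip a few common English suffixes unless the word is protected.
    # A toy imitation of a stemmer, for illustration only.
    if word in protected:
        return word
    for suffix in ("ions", "ional", "ion", "ings", "ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print(crude_stem("professions"))  # → 'profess'
print(crude_stem("apples"))      # → 'appl'
```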

<analyzer type="query">
This tells us that the enclosed lines will be used during querying (as opposed to the indexer section we just finished).

<tokenizer class="solr.StandardTokenizerFactory"/>
This is a general purpose tokenizer.  It has some basic built-in rules for breaking apart each search term into various parts.

<filter class="solr.LowerCaseFilterFactory"/>
Just like in the indexer analyzer block, this filter allows for case-insensitive searches.  Here it is done by converting the search terms to lowercase before comparing them against the index.

<filter class="solr.TrimFilterFactory" />
This filter removes white space from both sides of each search term in the query.

<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
A SynonymFilter looks each term up in a synonym list and adds the matching terms to the search query. (GB, G, and gig are all synonyms for gigabyte; all of these would be used to search against the index, allowing a greater chance of matching.)
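Synonym expansion can be sketched like this, with a small in-memory map standing in for synonyms.txt:

```python
SYNONYMS = {
    "gb": ["gb", "g", "gig", "gigabyte"],  # stand-in for synonyms.txt
}

def expand_synonyms(tokens):
    # Replace each token with its synonym group when one exists
    # (ignoring case), so every variant is matched against the index.
    out = []
    for t in tokens:
        out.extend(SYNONYMS.get(t.lower(), [t]))
    return out

print(expand_synonyms(["16", "GB", "card"]))
# → ['16', 'gb', 'g', 'gig', 'gigabyte', 'card']
```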

<filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords_en.txt"/>
This functions just like the filter in the indexer.

<str name="qf">fulltext1_en^1.0 fulltext2_en^2.0 fulltext3_en^3.0 fulltext4_en^4.0 fulltext5_en^5.0 sku_partial^1.0</str>
<str name="pf">fulltext1_en^1.0 fulltext2_en^2.0 fulltext3_en^3.0 fulltext4_en^4.0 fulltext5_en^5.0 sku_partial^1.0</str>
Both of these lines add our new custom field with a boost of 1.0 (effectively, no boost) to the query fields (qf) and phrase fields (pf). Phrase fields come into play after the results have been generated. This is where you can affect the ranking of the results further.

While that covers what each line represents, here is what I did differently from the reference posts and why. In both of those posts, you will find they include this line in the query analyzer:
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
What this line does is tell Solr to split the search term into parts based on a guess at possible combined words. So Solr will see 58SKU2 as three separate terms: 58, SKU, and 2. This is not what we want, because Solr will then search for those individual terms and is unlikely to find the match we are looking for (if it finds one at all). So I removed the line from my configuration. In the spirit of this post, here is what each of the settings does.

generateWordParts="1"
This causes Solr to emit the alphabetic parts of the term as separate tokens. So for “56MYPRODUCT1234” it generates “MYPRODUCT”.

generateNumberParts="1"
Like word parts above, this splits out the numeric parts of the term. So “56MYPRODUCT1234” also yields “56” and “1234”.

catenateWords="0"
If this is set to “1”, runs of adjacent word parts are joined back into a single token. For example, “wi-fi” would produce the extra token “wifi”. In “56MYPRODUCT1234” there is only one word part, so nothing new is added.

catenateNumbers="0"
If this is set to “1”, runs of adjacent number parts are joined back into a single token. For example, “500-42” would produce the extra token “50042”.

catenateAll="0"
If this is set to “1”, all of the word and number parts are joined together into one token. For example, “wi-fi-4000” would produce the extra token “wifi4000”.

splitOnCaseChange="1"
This will split the term when the case of the term changes. E.g. MyProduct splits into “My” and “Product”.
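The splitting behavior described above can be roughly imitated in Python (illustrative only; the real WordDelimiterFilterFactory has many more rules):

```python
import re

def word_delimiter_parts(term):
    # Split a term into alphabetic and numeric runs, then split alphabetic
    # runs on lower-to-upper case changes -- a rough imitation of
    # WordDelimiterFilterFactory with generateWordParts, generateNumberParts,
    # and splitOnCaseChange all enabled.
    runs = re.findall(r"[0-9]+|[A-Za-z]+", term)
    parts = []
    for run in runs:
        if run.isalpha():
            parts.extend(re.findall(r"[A-Z]+[a-z]*|[a-z]+", run))
        else:
            parts.append(run)
    return parts

print(word_delimiter_parts("58SKU2"))
# → ['58', 'SKU', '2']
print(word_delimiter_parts("56MyProduct1234"))
# → ['56', 'My', 'Product', '1234']
```

Seeing the output for “58SKU2” makes it clear why this filter works against partial-SKU search: none of those fragments is the token a shopper pasted in.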
