Using Stats To Not Break Search

How do you change around pretty much everything in your search backend, but still remain confident that nothing has broken? (at least not in a way you didn’t expect).

We can use statistics to do that. In particular, a technique called Spearman’s rank correlation coefficient. Let’s have a look at it, and see how we can use it to compare search results before and after a change to make sure relevancy rankings haven’t gotten screwed up in the process.

Solr -> elasticsearch

We’ve been in the process of moving our site search engine from Solr to elasticsearch. There are many nice new features we can build on the marketplace sites using elasticsearch, and it’s a lot faster to index and query, with lower resource requirements.

Our guinea pig was search on Envato’s marketplace site Initial results were promising. Items turned up where expected, our ad-hoc random clicking looked good, and chendo was raving about how awesome it was. But it was very unscientific. We were just holding up a finger in the air, and saying “yeah, seems about right”.

I’m no expert on search relevancy, and I don’t know if what we had before was perfect, but it worked, and people were used to it. So we needed to make sure that we hadn’t rocked the boat too much. At the very least, we wanted to make sure that when you search for cute pig you don’t end up with Lego Kratos (as awesome as that might be), or vice-versa.

I will do science to it - Spearman’s Coefficient.

We need a way to calculate, in bulk, how differently the same items are ranked between the two search backends.

Given a search term, and a list of results from Solr and a matching list from elasticsearch, how have items moved around between the two? Ideally items won’t move much at all, but if they do, we want some metric that will tell us about it so we know where to focus our attention.

We can use a statistical model to help figure this out. In particular, “Spearman’s rank correlation coefficient” works well for this task. Here’s what Wikipedia has this to say about it:

In statistics, Spearman’s rank correlation coefficient or Spearman’s rho, named after Charles Spearman and often denoted by the Greek letter rho, is a non-parametric measure of statistical dependence between two variables. It assesses how well the relationship between two variables can be described using a monotonic function. If there are no repeated data values, a perfect Spearman correlation of +1 or −1 occurs when each of the variables is a perfect monotone function of the other.

Spearman’s coefficient can be used when both dependent (outcome; response) variable and independent (predictor) variable are ordinal numeric, or when one variable is an ordinal numeric and the other is a continuous variable. However, it can also be appropriate to use Spearman’s correlation when both variables are continuous.

Which is just a complicated way of saying “give me a number between 1 and -1 to indicate if two lists of items are ordered the same, in reverse, or not in the same order at all”.

Spearman’s coefficient is expressed mathematically as:

Spearman's coefficient

Where xi and yi are the rank of item i in samples X and Y respectively, and and are the mean of all ranks of items in X and Y.

Show me the code

Being coders, it makes more sense to us like this:

# xs & ys are arrays of results in order of rank
def spearman(xs, ys)
  # remove items that aren't common to both result sets
  # these are mostly outliers anyway which we want to ignore
  xs_common ={|i| ys.include? i }
  ys_common ={|i| xs.include? i }

  # calculate the mean of each set of ranks (simple sum/length calculation)
  # as both are just the sum of ranks [1,2,3,4...] and have same length,
  # we can figure it out based on an arithmetic sum
  total = 0.5 * xs_common.length * (xs_common.length + 1)
  x_mean = y_mean = total / xs_common.length

  # initialize totals that we'll need
  sum_mean_diff = 0
  sum_x_mean_diff_sq = 0
  sum_y_mean_diff_sq = 0

  # sum the differences of the items
  xs_common.each_with_index do |x, x_rank|
    x_rank = x_rank + 1 # ranking is 1-based, not 0-based
    # grab the corresponding item from the other set of ranked items
    y_rank = ys_common.index(x) + 1

    # work out the error of each item from it's mean
    x_mean_diff = x_rank - x_mean
    y_mean_diff = y_rank - y_mean

    # aggregate totals for final calc
    sum_mean_diff += x_mean_diff * y_mean_diff
    sum_x_mean_diff_sq += x_mean_diff ** 2
    sum_y_mean_diff_sq += y_mean_diff ** 2

  # final coefficient
  sum_mean_diff / Math.sqrt(sum_x_mean_diff_sq * sum_y_mean_diff_sq)

And we can drive it with some super simple test data

require 'spearman'

ranks1 = [:a, :b, :c, :d, :e]
ranks2 = [:a, :b, :c, :d, :e]
puts spearman(ranks1, ranks2) # = 1.0 (in exactly the same order)

ranks1 = [:a, :b, :c, :d, :e]
ranks2 = [:e, :d, :c, :b, :a]
puts spearman(ranks1, ranks2) # = -1.0 (in exactly reverse order)

ranks1 = [:a, :b, :c, :d, :e]
ranks2 = [:b, :a, :c, :d, :e]
puts spearman(ranks1, ranks2) # = 0.9 (a & b are out of order)

ranks1 = [:a, :b, :c, :d, :e]
ranks2 = [:b, :d, :c, :a, :e]
puts spearman(ranks1, ranks2) # = 0.3 (stuff is all over the place)

The function will return 1.0 if both lists are ordered exactly the same. -1.0 if they are exactly reversed, and something in between if they are partially ordered one way or the other.

Using Spearman’s Coefficient On Search Results

Now that we have a way of quantifying differences in order between the same search done against different search engines, can can systematically compare relevancy between our old Solr implementation, and the new elasticsearch implementation.

The rough approach is:

  1. Find the top 1000 most frequently searched terms
  2. Run each of those search terms against each search engine, record the items, and their ranking
  3. Given two lists of results for each term, feed them into our spearman function.
  4. Each term has a Spearman coefficient ideally close to 1.0. Investigate any outliers, tweak search settings
  5. Repeat till happy


So how did we go?

There were a few interesting patterns that we found. Terms for specific things, where the relevancy actually meant something were the most similar between Solr and elasticsearch. These were terms where there was a range of relevancy with some obvious matches at the top, then gradually trailing off to partial matches that were less relevant.

Terms in very broad categories that applied loosely to a large number of items, and/or terms with tight competition of a number of highly relevant items were the ones that moved the most.

Lets look at a few examples graphed out. Solr Rank for each item on horizontal axis, elasticsearch on vertical. A straight diagonal line means everything is one-to-one identical to where it was before. The more fuzziness around the line, the more things have moved.

“earth” - high Spearman coefficient

An example of a term that was virtually unchanged between the two search backends is ‘earth’. This is a term where there are a number of hits, but the top matches are obviously a better fit than the lower ranked items. The top items are models of an earth globe, then maps, then earth moving equipment, and finally down to less relevant items like rocks and soil.

‘earth’ has a Spearman coefficient of 0.99 - practically identical (remember, 1.0 indicates an absolute perfect match of rankings). Things are roughly the same order, and when they have moved, it’s only a few places up or down.


“low poly” - low coefficient

‘low poly’ is a term with very high competition. There are a lot of items that are tagged with this term. This is a hard one for a search engine to rank based on the term alone. In conjunction with other search terms it’s useful, but by itself there is a too much noise for the search algorithm to contend with.

low poly

The reason results between the two are so different is that we add a slight random boosting to item indexing. There are a number of reasons for this, but it’s partly to avoid gaming of search results by authors tweaking item tags/descriptions to artificially end up at the top of search results for popular terms.

Normally this is drowned out (by design) by the other relevancy calculations. But when there are so many items that all have very similar relevancy, the random boosting wins out and things get shuffled around quite a lot - and that shuffling is different between search engines.

“robots” - typical coefficient

At the end of the analysis, the mean coefficient for the top search terms was around0.70. This means that we’ve kept order pretty well. Some things have moved up, some have moved down, but most things are in roughly the same place.

‘robots’ was pretty typical. It has a coefficient of 0.75. Some winners, some losers, but mostly unchanged.



We’re pretty happy with the move. Doing this analysis has given us the confidence that we haven’t royally screwed things up (at least not too much…), and gives us a good starting point for further tweaking and new features built around search.

Where items have moved, it’s due to tweaks we made to relevancy boosting make results more useful where we noticed they didn’t make sense. Big differences are due to the “low poly” effect described above where things with similar relevancy move around a lot due to random boosting. And at the end of the day, both engines do things slightly differently internally, which can add differences.