How do you change around pretty much everything in your search backend, but still remain confident that nothing has broken (at least not in a way you didn’t expect)?
We can use statistics to do that. In particular, a technique called Spearman’s rank correlation coefficient. Let’s have a look at it, and see how we can use it to compare search results before and after a change to make sure relevancy rankings haven’t gotten screwed up in the process.
Solr → elasticsearch
We’ve been in the process of moving our site search engine from Solr to elasticsearch. There are many nice new features we can build on the marketplace sites using elasticsearch, and it’s a lot faster to index and query, with lower resource requirements.
Our guinea pig was search on Envato’s marketplace site 3docean.net. Initial results were promising. Items turned up where expected, our ad hoc random clicking looked good, and chendo was raving about how awesome it was. But it was very unscientific. We were just holding a finger up in the air and saying “yeah, seems about right”.
I’m no expert on search relevancy, and I don’t know if what we had before was perfect, but it worked, and people were used to it. So we needed to make sure that we hadn’t rocked the boat too much. At the very least, we wanted to make sure that when you search for cute pig you don’t end up with Lego Kratos (as awesome as that might be), or vice versa.
I will do science to it: Spearman’s Coefficient
We need a way to calculate, in bulk, how differently the same items are ranked between the two search backends.
Given a search term, and a list of results from Solr and a matching list from elasticsearch, how have items moved around between the two? Ideally items won’t move much at all, but if they do, we want some metric that will tell us about it so we know where to focus our attention.
We can use a statistical model to help figure this out. In particular, “Spearman’s rank correlation coefficient” works well for this task. Here’s what Wikipedia has to say about it:
In statistics, Spearman’s rank correlation coefficient or Spearman’s rho, named after Charles Spearman and often denoted by the Greek letter rho, is a nonparametric measure of statistical dependence between two variables. It assesses how well the relationship between two variables can be described using a monotonic function. If there are no repeated data values, a perfect Spearman correlation of +1 or −1 occurs when each of the variables is a perfect monotone function of the other.
Spearman’s coefficient can be used when both dependent (outcome; response) variable and independent (predictor) variable are ordinal numeric, or when one variable is an ordinal numeric and the other is a continuous variable. However, it can also be appropriate to use Spearman’s correlation when both variables are continuous.
Which is just a complicated way of saying “give me a number between −1 and 1 to indicate if two lists of items are ordered the same, in reverse, or not correlated at all”.
Spearman’s coefficient is expressed mathematically as:

$$\rho = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \sum_i (y_i - \bar{y})^2}}$$

Where $x_i$ and $y_i$ are the rank of item $i$ in samples $X$ and $Y$ respectively, and $\bar{x}$ and $\bar{y}$ are the means of all ranks in $X$ and $Y$.
Show me the code
Being coders, it makes more sense to us like this:
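In Ruby, that might look like the following (a sketch, assuming both lists contain exactly the same items so there are no tied ranks; it’s simply the formula above, i.e. Pearson correlation applied to the ranks):

```ruby
# Spearman's rank correlation coefficient between two ordered lists.
# Assumes both lists contain exactly the same items, so every rank
# is distinct (no ties).
def spearman(list_a, list_b)
  # the rank of an item is simply its position in the list
  ranks_a = list_a.each_with_index.to_h
  ranks_b = list_b.each_with_index.to_h

  # pair up each item's rank in the two lists
  pairs = list_a.map { |item| [ranks_a[item], ranks_b[item]] }

  n      = pairs.length.to_f
  mean_a = pairs.sum { |a, _| a } / n
  mean_b = pairs.sum { |_, b| b } / n

  # Pearson correlation applied to the ranks
  covariance = pairs.sum { |a, b| (a - mean_a) * (b - mean_b) }
  spread_a   = Math.sqrt(pairs.sum { |a, _| (a - mean_a)**2 })
  spread_b   = Math.sqrt(pairs.sum { |_, b| (b - mean_b)**2 })

  covariance / (spread_a * spread_b)
end
```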
And we can drive it with some super simple test data
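For instance, with some made-up item lists (this standalone snippet uses the simplified no-ties form of the formula, which is equivalent to the definition above when every rank is distinct):

```ruby
# Compact no-ties form of spearman, so this snippet runs on its own:
#   rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1))
# where d_i is how far item i moved between the two lists.
def spearman(list_a, list_b)
  n      = list_a.length
  rank_b = list_b.each_with_index.to_h
  d_sq   = list_a.each_with_index.sum { |item, rank| (rank - rank_b[item])**2 }
  1.0 - (6.0 * d_sq) / (n * (n**2 - 1))
end

# hypothetical item ids standing in for search results
items = %w[globe map digger rock soil]

puts spearman(items, items)          # => 1.0  (identical order)
puts spearman(items, items.reverse)  # => -1.0 (exactly reversed)
puts spearman(items, %w[globe digger map soil rock]) # => 0.8 (a few swaps)
```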
The function will return 1.0 if both lists are ordered exactly the same, −1.0 if they are exactly reversed, and something in between if they are partially ordered one way or the other.
Using Spearman’s Coefficient On Search Results
Now that we have a way of quantifying differences in order between the same search run against different search engines, we can systematically compare relevancy between our old Solr implementation and the new elasticsearch implementation.
The rough approach is:

1. Find the top 1000 most frequently searched terms
2. Run each of those search terms against each search engine, recording the items and their ranking
3. For each term, feed the two lists of results into our spearman function
4. Each term should ideally have a Spearman coefficient close to 1.0. Investigate any outliers and tweak search settings
5. Repeat till happy
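In code, the comparison loop might look something like this sketch. The search-engine calls are stubbed out with canned data (the terms and item ids here are hypothetical); the real script would query Solr and elasticsearch for each term:

```ruby
# spearman repeated in compact no-ties form so this snippet runs alone
def spearman(list_a, list_b)
  n      = list_a.length
  rank_b = list_b.each_with_index.to_h
  d_sq   = list_a.each_with_index.sum { |item, rank| (rank - rank_b[item])**2 }
  1.0 - (6.0 * d_sq) / (n * (n**2 - 1))
end

# canned results standing in for real queries against each engine
SOLR_RESULTS = {
  "earth"    => %w[globe map digger rock soil],
  "low poly" => %w[a b c d e],
}
ES_RESULTS = {
  "earth"    => %w[globe map digger soil rock],
  "low poly" => %w[e c a b d],
}

coefficients = SOLR_RESULTS.keys.map do |term|
  solr = SOLR_RESULTS[term]
  es   = ES_RESULTS[term]
  # compare only the items both engines returned, in each engine's order
  common = solr & es
  [term, spearman(solr.select { |i| common.include?(i) },
                  es.select   { |i| common.include?(i) })]
end.to_h

# terms with a low coefficient are the outliers worth investigating
outliers = coefficients.select { |_, rho| rho < 0.5 }
```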
Results
So how did we go?
There were a few interesting patterns that we found. Terms for specific things, where the relevancy actually meant something, were the most similar between Solr and elasticsearch. These were terms where there was a range of relevancy, with some obvious matches at the top, then gradually trailing off to partial matches that were less relevant.
Terms in very broad categories that applied loosely to a large number of items, and/or terms with tight competition among a number of highly relevant items, were the ones that moved the most.
Let’s look at a few examples graphed out. Solr rank for each item is on the horizontal axis, elasticsearch rank on the vertical. A straight diagonal line means everything is one-to-one identical to where it was before. The more fuzziness around the line, the more things have moved.
“earth”: high Spearman coefficient
An example of a term that was virtually unchanged between the two search backends is ‘earth’. This is a term where there are a number of hits, but the top matches are obviously a better fit than the lower ranked items. The top items are models of an earth globe, then maps, then earth moving equipment, and finally down to less relevant items like rocks and soil.
‘earth’ has a Spearman coefficient of 0.99: practically identical (remember, 1.0 indicates an absolutely perfect match of rankings). Things are roughly in the same order, and when they have moved, it’s only a few places up or down.
“low poly”: low coefficient
‘low poly’ is a term with very high competition. There are a lot of items tagged with this term. This is a hard one for a search engine to rank based on the term alone. In conjunction with other search terms it’s useful, but by itself there is too much noise for the search algorithm to contend with.
The reason results between the two are so different is that we add a slight random boosting to item indexing. There are a number of reasons for this, but it’s partly to avoid gaming of search results by authors tweaking item tags/descriptions to artificially end up at the top of search results for popular terms.
Normally this is drowned out (by design) by the other relevancy calculations. But when there are so many items that all have very similar relevancy, the random boosting wins out and things get shuffled around quite a lot, and that shuffling is different between the two search engines.
“robots”: typical coefficient
At the end of the analysis, the mean coefficient for the top search terms was around 0.70. This means that we’ve kept order pretty well. Some things have moved up, some have moved down, but most things are in roughly the same place.
‘robots’ was pretty typical, with a coefficient of 0.75. Some winners, some losers, but mostly unchanged.
Conclusion
We’re pretty happy with the move. Doing this analysis has given us the confidence that we haven’t royally screwed things up (at least not too much…), and gives us a good starting point for further tweaking and new features built around search.
Where items have moved, it’s due to tweaks we made to relevancy boosting to make results more useful where we noticed they didn’t make sense. Big differences are due to the “low poly” effect described above, where things with similar relevancy move around a lot due to random boosting. And at the end of the day, both engines do things slightly differently internally, which can add differences.