Moving the Marketplaces to Elasticsearch

TL;DR: How we got from the top chart to the bottom chart.

Solr Response ES Response

We have been playing with adding new facets to search on the Marketplaces, but Solr was not making it easy for us due to slow indexing and annoying XML schema files. I experimented with Elasticsearch during a dev iteration and decided it was worth switching.

So we did it. And I’m going to tell you how we did it.

But first, some background.

What is Elasticsearch?

Most people seem to think it’s the Amazon search product, but it’s not. Amazon’s search product is called CloudSearch and was released April 2012, whereas the Elasticsearch I’m talking about was first released in February 2010.

Like Solr, Elasticsearch is powered by Lucene, written in Java, and is licensed under Apache License 2.0. However, unlike Solr, Elasticsearch was designed to scale from the get-go.

Why is it better?

As stated above, Elasticsearch was written with scaling in mind rather than being tacked on. We were investigating sharding with Solr and found out that you lose some features like MoreLikeThis support which we are using for PhotoDune’s similar search.

The list below is indicative of why I personally think Elasticsearch is better.

Super easy to get set up and started

It was super easy to set up and get started. For example, on OS X:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
$ brew update                  # To ensure latest version
$ brew install elasticsearch   # Follow launchd steps if you want to start on boot
$ elasticsearch                # If you aren't running it with launchd already
$ sleep 5                      # Let's give it some time to spin up
$ curl localhost:9200          # Make sure it's working
{
  "ok" : true,
  "status" : 200,
  "name" : "Baron Brimstone",
  "version" : {
    "number" : "0.19.9",
    "snapshot_build" : false
  },
  "tagline" : "You Know, for Search"
}

REST API

This is one of my favourite features. It’s super easy to interact with Elasticsearch.

Let’s add “some_item” with the id of 1 to “some_index”:

1
2
3
4
5
6
7
8
$ curl -XPOST localhost:9200/some_index/some_item/1 -d '{"foo": "bar"}'
{
  "ok": true,
  "_index": "some_index",
  "_type": "some_item",
  "_id": "1",
  "_version": 1
}

You just take your documents, convert them to JSON and throw it at Elasticsearch. Searching looks like this:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
$ curl localhost:9200/some_index/_search -d '{"query": {"term": {"foo": "bar"}}}'
{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.30685282,
    "hits": [
      {
        "_index": "some_index",
        "_type": "some_item",
        "_id": "1",
        "_score": 0.30685282,
        "_source": {
          "foo": "bar"
        }
      }
    ]
  }
}

Deletes are simply DELETEs.

1
2
3
4
5
6
7
8
9
$ curl -XDELETE localhost:9200/some_index/some_item/1
{
  "ok": true,
  "found": true,
  "_index": "some_index",
  "_type": "some_item",
  "_id": "1",
  "_version": 2
}

Of course, you shouldn’t actually use curl to implement searching in any application.

Super easy replication and sharding

This is probably the main reason why we decided to switch to Elasticsearch. When I was experimenting with it, I was able to get another node to automatically join a cluster by simply running elasticsearch on the same machine (not that this would provide any actual benefits).

With the default shard count of 5 and replica count of 1 per index, it immediately started to copy over the data to the new node. Once it was ready, it started to handle indexing and search queries.

I decided to kill off the new node I booted up. Elasticsearch shrugged and kept on running. This is while it was indexing and serving queries.

Scaling Elasticsearch horizontally is as simple as just spinning up another Elasticsearch instance on another box. With the right config, it’ll automatically join the cluster, get some data and start serving requests.

We’ve found that people tend to scale Solr vertically due to replicating being horrendously painful to set up or didn’t work. Replicating with Elasticsearch was ridiculously easy.

Solr 4.0-beta can scale and replicate much easier than 3.x, but I’m personally a bit wary of retrofitting something as complicated as scaling and replication.

Near real-time

One of our major annoyances with Solr was how long newly pushed documents would take to appear in the index. In Solr, you need to perform a commit which creates a new “searcher”, but with empty caches, so performance takes a hit while the caches warm up.

We decided on an autoCommit time of 5 minutes, so it would take up to 5 minutes for an updated item to appear in the index, which confused some authors.

Elasticsearch has a default refresh interval of one second.

That means within a second of pushing a document to an index, it would be searchable. Which is pretty cool.

Performance

Preliminary testing showed Elasticsearch to be significantly faster than Solr on the same hardware. I’ll be discussing what the performance was like for us further down in the blog post.

Scripting functionality

Elasticsearch has the ability to let you use scripts from anything from custom score queries to faceting. It defaults to mvel, but there are plugins that let you code in Javascript, Python or Groovy.

Real-time GET

Elasticsearch has the ability to let you fetch the document you stored in the index by ID without needing to do a refresh. This is useful for using Elasticsearch as a (secondary) data store.

Note: Solr 4.0-beta has real-time GET now.

Aliases

You can add as many aliases as you want to an index in Elasticsearch, and it will resolve the index to the appropriate index(es) so implementing cross-index searches is as easy as adding aliases. We use it to easily swap live indexes without the search layer needing to care which index is live or not.

Downsides

Unfortunately, perfect software does not exist. There are downsides to Elasticsearch.

At the time of writing, Elasticsearch is at 0.19.9, which is pre-1.0. Initially, the lack of a ‘stable’ release was an issue for us, but after talking to a few people who have been using Elasticsearch in production, we decided to go ahead with the switch.

The documentation is pretty sparse and lacks good examples, but I think it’s better organised better than Solr’s wiki. The Elasticsearch community is definitely much smaller and there are far fewer resources online compared to Solr, which is probably due to the massive age difference between the two. Solr has been around since 2004 whereas Elasticsearch was first released in 2010.

However, the speed that bugs get fixed is pretty astounding. We found two bugs when we were adding it to our stack (#2045, #2197), and they were fixed within two days. Which is pretty bloody amazing. That said, I don’t know how fast the Solr team fixes bugs.

How we did it

The first step was getting our data into Elasticsearch. This was fairly easy since all we needed to do was turn our data into a hash and peg it at Elasticsearch. The hardest part was figuring out how we wanted to have our indexes and how many shards we wanted. We settled on having an index per site for easier maintenance and selective scaling, and kept the shard count to five as per defaults.

The next step was re-implementing the search functionality we had with Solr. Figuring out what our existing searches mapped to in Elasticsearch took a while, mainly because the documentation was a bit sparse, but Elasticsearch Head is a great tool to quickly browse and query an ES cluster.

Once we were able to do an Elasticsearch query for every marketplace, we wanted to verify that it can handle the existing load. We added a shim where it would push every single search with Solr into a Redis queue so workers outside of the web request can replay the same search against Elasticsearch.

This worked pretty well, apart from a bug in the shim that actually caused the workers to replay three times the production load. Elasticsearch handled it gallantly. Performance details are below.

The next step was improving the search interface and adding new facets for 3DOcean. This was by far the hardest part, and not a Solr or Elasticsearch problem.

After some internal testing, we switched over to Elasticsearch on 3DOcean for everyone.

Monitoring

We tried out Sematext’s SPM for monitoring ES since it was the easiest to set up. However, we were seeing one to two hour delays for the graphs to update and it left logs on the disk which it didn’t clean up, so we ended up ditching it.

We added app-level metrics on how long search requests were taking in to NewRelic in the same way that we do for Solr to have a good comparison point.

Each metric is broken down by site (AudioJungle, PhotoDune, etc) and type of search (standard search or a similar item search).

Performance

Take everything below with a grain of salt since this is the performance we get with our dataset. Your mileage may vary. For these tests, we had one Solr instance and one Elasticsearch node.

Both the Elasticsearch and Solr boxes are specced the same:

  • 4 cores on a Intel Xeon CPU E5645 @ 2.40GHz
  • 12GB RAM
  • 15GB of SAN-backed storage

With Solr, we’ve tuned the caches a bit to ensure not too many evictions happen (field cache tends to not evict for us). We are using an autoCommit time of 5 minutes and 100k documents (which we never reach). An optimize happens every 20 minutes.

With Elasticsearch, we are using a refresh interval of 1 second. We haven’t added automated optimizing yet mainly because it seems to perform fine without having to do it all the time, where Solr definitely needs it. We left the cache stuff at defaults. Each index uses the default 5 shards.

Query performance

Solr Search Response ES Search Response

The graph above shows the response times of Solr and Elasticsearch of a day’s worth of production search traffic. Each colour is a different site. There is a tiny delay (<5s) between a search happening on Solr and that search being replayed on Elasticsearch.

Solr tends to average between 100ms to 190ms between sites. With Elasticsearch, most sites average between 25ms to 45ms, with the 120ms to 190ms for the biggest index.

MoreLikeThis performance

The graph below shows the MoreLikeThis performance between Solr and Elasticsearch.

Solr MoreLikeThis Response ES MoreLikeThis Response

As you can see, there is a massive difference. Solr tends to average 400ms, with the fastest averaging 60ms. Elasticsearch handles everything in under 100ms and as low as 25ms!

Another interesting chart is the CPU usage on each box. The only things running on each box is Solr/Elasticsearch and the Scout agent.

One day of Solr vs Elasticsearch CPU usage

Elasticsearch averages between 3-5% CPU but Solr uses 20%-50%! This could be attributed to the optimize calls happening for every data point, but if we look at a smaller timescale…

Five hours of Solr vs Elasticsearch CPU usage

… We see a similar trend. ES is 2-5% for the most part, with a spike to 13%. Solr averages around 25% with a spike to 42%.

Indexing performance

Indexing performance was significantly improved with Elasticsearch. We started with a request per document to index (very very sub-optimal) and found it was already matching our bulk index strategy for Solr.

Once we moved to using _bulk to submit updates in batches of 1000, we were indexing 4x faster than Solr. Parallelisation brought this up to 16x. Given we weren’t parallelising indexing for Solr, this is an unfair comparison, but 4x is still a significant improvement.

We use a refresh interval of 60s when we’re doing bulk indexing.

Infrastructure

We are running two ES data nodes on same-specced boxes as stated above. Each one is configured to use the local gateway for simplicity, and each one can become the master.

For accessing the ES data nodes, we had a few options:

  • HAProxy to load balance between the two nodes
  • Implement our own failover code
  • Run a client node instance on every machine that needs to access the data nodes

We decided against HAProxy because it would become a single point of failure. Implementing our own failover code would be re-inventing the wheel. Both of these would require managing the list of ES nodes, which was not ideal.

Using client nodes means running an instance of Elasticsearch on each machine you would want to use to talk to data nodes, which means losing 100-250MB RAM to Java, but it makes everything much easier.

Rather than having to define a list of ES hosts to talk to, you simply talk to the client node on localhost:9200. The client nodes become part of the cluster but don’t store any data or perform searches, but they know which node has what shard, and are notified of nodes going up or down.

This means that when one of the data nodes go down, all the clients are notified and will direct queries to another node with minimal interruption. Also, when there are shards that only exist on a particular node, the client will talk to that node directly rather and avoid the potential extra hop.

What’s next?

The next step is to flip other sites to Elasticsearch one by one until we can ditch Solr. The ultimate goal is to have the browsing aspect of the Marketplaces be backed by Elasticsearch completely with its real time GET so we can take some load off our DB and have the site usable when we do migrations.

Conclusion

If you’re looking for a straightforward and scaleable search solution but don’t mind the smaller community, give Elasticsearch a go.