
Valuable Lessons Learned in ElasticPress

ElasticPress is a 10up WordPress plugin project that integrates Elasticsearch with WordPress. As we all know, search in WordPress is not a great experience. Why? MySQL is not a database optimized for search. Thus ElasticPress was born.

1. Search result relevancy scores on sites with few posts per shard can vary depending on the order of indexing.

We first noticed this in our integration testing suite. We were using three shards on a single node. Depending on the order in which posts were indexed, different relevancy scores were returned for the same search.

Elasticsearch relevancy scores are based on TF/IDF: term frequency multiplied by inverse document frequency. Term frequency is the number of times a term appears in the queried field of the current document (or post). Inverse document frequency reflects how rare the term is in that field across all documents on the current shard: the more documents contain the term, the less weight it carries. Notice I said shard, NOT index. The shard a post lives on is determined by hashing its routing value (the document ID by default) modulo the number of primary shards, not by anything related to its content. In practice, we can’t easily predict relevancy scores for a search on an index spread across more than one shard. The Elasticsearch documentation has a great article on this.
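
To make the per-shard effect concrete, here is a minimal sketch in Python. It uses a simplified form of the classic Lucene formulas (square-root term frequency, smoothed logarithmic IDF) and ignores field norms and query normalization, so the numbers are illustrative only. The function names and the two shard placements are hypothetical.

```python
import math

def tf(term, doc):
    # Term frequency: square root of how often the term occurs in the document.
    return math.sqrt(doc.split().count(term))

def idf(term, shard_docs):
    # Inverse document frequency over the documents of ONE shard only.
    doc_freq = sum(1 for d in shard_docs if term in d.split())
    return 1.0 + math.log(len(shard_docs) / (doc_freq + 1))

def score(term, doc, shard_docs):
    # Simplified TF * IDF score of `doc` for a single-term query.
    return tf(term, doc) * idf(term, shard_docs)

post = "wordpress search plugin"
others = ["wordpress theme", "wordpress hosting", "elasticsearch cluster"]

# Placement A: the post shares a shard with the two other "wordpress" posts.
shard_a = [post, others[0], others[1]]
# Placement B: the post shares a shard with the one post lacking the term.
shard_b = [post, others[2]]

score_a = score("wordpress", post, shard_a)
score_b = score("wordpress", post, shard_b)
print(score_a, score_b)  # same post, same query, different scores
```

The post and the query are identical in both cases. Only the neighbors on the shard change, and that alone is enough to move the score.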

The solution for testing purposes is to use only one shard. In the real world, this shouldn’t matter much, because the inconsistencies even out as indexes grow larger. However, this is still something to be aware of.
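
A single-shard test index can be requested through the index settings at creation time. The sketch below only builds the request body; the helper name is hypothetical, and setting replicas to zero is an assumption that suits a one-node test cluster rather than something ElasticPress is known to do.

```python
import json

def single_shard_settings():
    # number_of_shards and number_of_replicas are standard Elasticsearch
    # index settings; one shard means IDF is computed over every document.
    return {
        "settings": {
            "index": {
                "number_of_shards": 1,
                "number_of_replicas": 0,  # assumption: single-node test cluster
            }
        }
    }

# Body to send when creating the test index.
print(json.dumps(single_shard_settings(), indent=2))
```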

2. There is no right search algorithm for WordPress. Fine-tuning algorithms is an ongoing, collaborative process.

As of ElasticPress 1.1, the meat of our default search query looked like this:

{
  "query": {
    "bool": {
      "must": {
        "fuzzy_like_this": {
          "fields": [
            "post_title",
            "post_excerpt",
            "post_content"
          ],
          "like_text": "search phrase",
          "min_similarity": 0.75
        }
      }
    }
  }
}

fuzzy_like_this is great. It combines the fuzzy and more_like_this queries. fuzzy searches against a set of fuzzified terms (generated using the Levenshtein distance algorithm). more_like_this selects “interesting” terms based on a number of factors, such as document frequency, and checks each document against those terms.
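
For readers unfamiliar with Levenshtein distance, it counts the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another. The following is a textbook dynamic-programming sketch in Python, not the implementation Elasticsearch or Lucene uses internally.

```python
def levenshtein(a: str, b: str) -> int:
    # prev[j] holds the distance between the processed prefix of `a`
    # and the first j characters of `b`.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (free if characters match)
            ))
        prev = curr
    return prev[-1]

print(levenshtein("wordpress", "wordpres"))
```

A fuzzy query treats terms within a small distance of the search term as candidates, which is how a misspelled search can still find the intended posts.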

The problem we encountered was that, in certain established indexes, exact matches were not getting boosted to the very top of the results. This was due to the way the fuzzy_like_this algorithm works. In 1.2, we added an extra query to our search algorithm to boost exact matches:

{
  "query": {
    "bool": {
      "should": [
        {
          "multi_match": {
            "query": "search phrase",
            "boost": 2,
            "fields": ["post_title", "post_content", "post_excerpt"]
          }
        },
        {
          "fuzzy_like_this": {
            "fields": ["post_title", "post_excerpt", "post_content"],
            "like_text": "search phrase",
            "min_similarity": 0.75
          }
        }
      ]
    }
  }
}

The should clause tells Elasticsearch that at least one of the multi_match or fuzzy_like_this queries must match for a document to be returned. The boost of 2 then weights anything matched by multi_match twice as heavily, so exact matches rise toward the top.
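
If you want to experiment with the boost or the similarity threshold, it helps to generate the query body from parameters. Here is a small Python sketch that reproduces the structure above; the function name, constant, and parameter names are hypothetical, and this is not how ElasticPress itself (a PHP plugin) builds the query.

```python
import json

# Fields searched by both clauses, as in the query shown above.
SEARCH_FIELDS = ["post_title", "post_excerpt", "post_content"]

def build_search_query(phrase, exact_boost=2, min_similarity=0.75):
    # Two "should" clauses: a boosted multi_match for exact term matches
    # and a fuzzy_like_this clause for approximate matches.
    return {
        "query": {
            "bool": {
                "should": [
                    {
                        "multi_match": {
                            "query": phrase,
                            "boost": exact_boost,
                            "fields": list(SEARCH_FIELDS),
                        }
                    },
                    {
                        "fuzzy_like_this": {
                            "fields": list(SEARCH_FIELDS),
                            "like_text": phrase,
                            "min_similarity": min_similarity,
                        }
                    },
                ]
            }
        }
    }

print(json.dumps(build_search_query("search phrase"), indent=2))
```

Raising exact_boost pushes exact matches further above fuzzy ones, while lowering min_similarity widens the set of fuzzy candidates.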

This solved our immediate problem but is not the perfect algorithm. We expect to continually optimize this for WordPress over time. (Note that ElasticPress allows you to filter the search query entirely if you want to customize it.)

3. Disable indexing during imports.

By default, ElasticPress indexes a post when it is created. This is great until you try to import a few thousand posts and your Elasticsearch instance gets overloaded. This bit us pretty hard. As of newer versions, ElasticPress disables syncing during WordPress imports, big or small.
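
The general pattern is a guard around the per-post sync hook. The Python sketch below illustrates the idea only; it is not ElasticPress's PHP code, and the class, method, and attribute names are all hypothetical.

```python
from contextlib import contextmanager

class PostIndexer:
    def __init__(self):
        self.importing = False
        self.indexed = []  # stands in for requests sent to Elasticsearch

    def on_post_saved(self, post_id):
        # Skip the per-post sync while an import is running.
        if self.importing:
            return False
        self.indexed.append(post_id)
        return True

    @contextmanager
    def bulk_import(self):
        # Disable syncing for the duration of the import, then restore it
        # even if the import raises an exception.
        self.importing = True
        try:
            yield
        finally:
            self.importing = False

indexer = PostIndexer()
with indexer.bulk_import():
    for post_id in range(3):
        indexer.on_post_saved(post_id)  # ignored during the import
indexer.on_post_saved(99)               # indexed as usual
print(indexer.indexed)
```

Posts skipped this way are not in the index, so an approach like this presumably needs a separate indexing pass once the import finishes.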