Valuable Lessons Learned in ElasticPress

ElasticPress is a 10up WordPress plugin project that integrates Elasticsearch with WordPress. As we all know, search in WordPress is not a great experience. Why? MySQL is not a database optimized for search. Thus ElasticPress was born. Here are three lessons we learned while building it.

1. Search result relevancy scores on sites with low post-to-shard ratios can vary depending on the order of indexing.

We first noticed this in our integration testing suite. We were using three shards on a single primary node. Depending on the order in which posts were indexed, different relevancy scores were returned for the same search.

Elasticsearch calculates relevancy scores using term frequency/inverse document frequency (TF/IDF). Term frequency is the number of times a term appears in the queried field of the current document (or post); the more often it appears, the higher the score. Inverse document frequency measures how often the term appears in that field across all documents on the current shard; the more common the term, the less weight it carries. Notice I said shard, NOT index. The shard a post lives on is determined by a hash of its routing value (the document ID by default) modulo the number of primary shards, so the same post can end up next to different neighbors depending on how IDs are handed out during indexing. As a result, we can’t exactly predict relevancy scores for a search on an index spread across more than one shard. The Elasticsearch documentation has a great article on this.
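To make the shard effect concrete, here is a minimal Python sketch (not ElasticPress code). It assumes IDs are handed out in indexing order and that a post lands on shard `id % 3`; real Elasticsearch hashes the routing value first. The `idf` helper follows the classic Lucene-style formula, and the sample posts and function names are made up for illustration.

```python
import math

NUM_SHARDS = 3  # the test setup described above used three shards


def idf(num_docs, doc_freq):
    """Classic Lucene-style IDF: rarer terms get a higher weight."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def shard_idfs(posts_in_index_order, term):
    """Return {shard: IDF of `term`} computed from each shard's own documents.

    Assumption: IDs are assigned in indexing order (1, 2, 3, ...) and a post
    lands on shard `id % NUM_SHARDS`. Real Elasticsearch hashes the routing
    value first; plain modulo keeps the sketch easy to follow.
    """
    shards = {}
    for post_id, text in enumerate(posts_in_index_order, start=1):
        shards.setdefault(post_id % NUM_SHARDS, []).append(text)
    return {
        shard: idf(len(texts), sum(term in text.split() for text in texts))
        for shard, texts in shards.items()
    }


# Hypothetical posts; only the indexing order differs between the two runs.
posts = ["wordpress search", "search tips", "search relevancy",
         "php tips", "mysql tuning", "plugin basics"]
reordered = [posts[0], posts[3], posts[4], posts[1], posts[2], posts[5]]

# "wordpress search" is indexed first in both runs, so it lands on shard 1.
idf_a = shard_idfs(posts, "search")[1]
idf_b = shard_idfs(reordered, "search")[1]
print(idf_a, idf_b)
```

The post and the query are identical in both runs, yet the weight given to "search" on that post's shard changes because its shard neighbors changed. With thousands of posts per shard, the per-shard statistics converge and the gap shrinks.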

The solution for testing purposes is to use only one shard. In the real world this shouldn’t matter much, because the inconsistencies even out as indexes grow larger. However, it is still something to be aware of.

2. There is no right search algorithm for WordPress. Fine-tuning algorithms is an ongoing, collaborative process.

As of ElasticPress 1.1, the meat of our default search query looked like this:

{
  "query": {
    "bool": {
      "must": {
        "fuzzy_like_this": {
          "fields": [
            "post_title",
            "post_excerpt",
            "post_content"
          ],
          "like_text": "search phrase",
          "min_similarity": 0.75
        }
      }
    }
  }
}

fuzzy_like_this is great. It combines the fuzzy and more_like_this queries. fuzzy searches against a set of fuzzified terms (using the Levenshtein distance algorithm). more_like_this selects “interesting” terms based on a number of factors, such as document frequency, and checks each document against those terms.
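For readers who have not met Levenshtein distance before, here is a textbook Python sketch of it. This is the plain dynamic-programming version, written for illustration only. It is not Lucene's implementation and it does not show how min_similarity is derived from the distance.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character inserts, deletes, or substitutions
    needed to turn `a` into `b`."""
    # previous[j] holds the distance between the processed prefix of `a`
    # and the first j characters of `b`.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep when equal)
            ))
        previous = current
    return previous[-1]


print(levenshtein("wordpress", "wordpres"))  # 1
print(levenshtein("search", "serach"))       # 2
```

A small distance means the two terms are close enough that a typo in the search phrase can still find the intended post.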

The problem we encountered was that in certain established indexes exact matches were not getting boosted to the very top of results. This was due to the way the fuzzy_like_this algorithm works. We added an extra query to our search algorithm in 1.2 to boost exact matches:

{
  "query": {
    "bool": {
      "should": [
        {
          "multi_match": {
            "query": "search phrase",
            "boost": 2,
            "fields": ["post_title", "post_content", "post_excerpt"]
          }
        },
        {
          "fuzzy_like_this": {
            "fields": ["post_title", "post_excerpt", "post_content"],
            "like_text": "search phrase",
            "min_similarity": 0.75
          }
        }
      ]
    }
  }
}

The should clause tells Elasticsearch that at least one of the multi_match or fuzzy_like_this queries must match for a document to be returned. The boost of 2 then doubles the score contribution of anything matched by multi_match, which pushes exact matches toward the top.
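The effect of the boost can be sketched with a deliberately simplified scoring model in Python. It assumes a document's score is just the sum of its matching should clauses, each multiplied by its boost, and it ignores the other factors Lucene applies. The raw scores and the `score_1_1`/`score_1_2` helpers are hypothetical, chosen only to show the ranking flip.

```python
def bool_should_score(clauses):
    """Score a document against a bool query that has only `should` clauses.

    `clauses` is a list of (raw_score, boost) pairs; raw_score is None when
    that clause did not match. Returns None when no clause matched, meaning
    the document is not a hit.
    """
    matched = [score * boost for score, boost in clauses if score is not None]
    return sum(matched) if matched else None


# Hypothetical raw scores, chosen only to illustrate the ranking problem.
exact_post = {"multi_match": 1.0, "fuzzy_like_this": 0.5}
near_post = {"multi_match": None, "fuzzy_like_this": 0.75}


def score_1_1(post):
    # 1.1-style query: fuzzy_like_this on its own.
    return bool_should_score([(post["fuzzy_like_this"], 1)])


def score_1_2(post):
    # 1.2-style query: multi_match boosted x2, plus fuzzy_like_this.
    return bool_should_score([(post["multi_match"], 2),
                              (post["fuzzy_like_this"], 1)])


print(score_1_1(exact_post), score_1_1(near_post))  # 0.5 0.75
print(score_1_2(exact_post), score_1_2(near_post))  # 2.5 0.75
```

In the first line the near match outranks the exact match; with the boosted multi_match clause added, the exact match wins comfortably.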

This solved our immediate problem but is not the perfect algorithm. We expect to continually optimize this for WordPress over time. (Note that ElasticPress allows you to filter the search query entirely if you want to customize it.)

3. Disable indexing during imports.

By default, ElasticPress indexes a post when it is created. This is great until you try to import a few thousand posts and your Elasticsearch instance gets overloaded. This bit us pretty hard. In newer versions, ElasticPress disables syncing during WordPress imports, big or small.

A Performant Way to Feature Posts

In a past blog post I explained why featuring posts using a taxonomy term is much more performant than using a meta query. The comment I get from people is “that’s awesome, but using tags to feature a post is not a good user experience”. I agree. Attaching a “featured” tag to featured posts, while performant, is not a good experience: it leaves room for error on the admin side and shows the “featured” tag to visitors on the front end (if you are listing your tags).

Thankfully, there is a much better way to do this. We can create a small meta box with a “Featured Post” checkbox. This checkbox will add/remove a term in a hidden taxonomy from the post. Here is what the meta box will look like in WordPress 3.9:

Featured post meta box

I will take you through the code necessary to set this up. First we need to register a private taxonomy for internal use:

function tl_register_taxonomy() {
  $args = array(
    'hierarchical' => false,
    'show_ui' => false,
    'show_admin_column' => false,
    'query_var' => false,
    'rewrite' => false,
  );

  register_taxonomy( 'tl_post_options', array( 'post' ), $args );
}
add_action( 'init', 'tl_register_taxonomy' );

Now let’s write the code that actually associates the taxonomy term with featured posts. This will hook onto the “save_post” action.

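function tl_save_post( $post_id ) {
  // Bail on autosaves, revisions, and users who cannot edit this post.
  if ( ( defined( 'DOING_AUTOSAVE' ) && DOING_AUTOSAVE ) || ! current_user_can( 'edit_post', $post_id ) || 'revision' === get_post_type( $post_id ) ) {
    return;
  }

  // Only act when our meta box nonce is present and valid.
  if ( ! empty( $_POST['additional_options'] ) && wp_verify_nonce( $_POST['additional_options'], 'additional_options_action' ) ) {
    if ( ! empty( $_POST['tl_featured'] ) ) {
      // Create the "tl_featured" term the first time it is needed.
      $featured = term_exists( 'tl_featured', 'tl_post_options' );
      if ( empty( $featured ) ) {
        $featured = wp_insert_term( 'tl_featured', 'tl_post_options' );
      }
      // wp_insert_term() returns a WP_Error on failure; nothing to attach then.
      if ( is_wp_error( $featured ) ) {
        return;
      }
      wp_set_post_terms( $post_id, array( (int) $featured['term_id'] ), 'tl_post_options' );
    } else {
      // Unchecked: remove the post from the hidden taxonomy.
      wp_set_post_terms( $post_id, array(), 'tl_post_options' );
    }
  }
}
add_action( 'save_post', 'tl_save_post' );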

Next, we need to output a meta box with a checkbox. If this checkbox is checked, the post is marked as featured and the appropriate information is sent to the “save_post” hook on POST.

function tl_meta_box_additional_options( $post ) {
  wp_nonce_field( 'additional_options_action', 'additional_options' );
  $featured = has_term( 'tl_featured', 'tl_post_options', $post );
  echo 'Featured: <input type="checkbox" name="tl_featured" value="1" ' . ( ( $featured ) ? 'checked="checked"' : '' ) . '>';
}

Don’t forget we need to actually register our new meta box:

function tl_add_meta_boxes() {
  add_meta_box( 'tl_additional_options', 'Additional Options', 'tl_meta_box_additional_options', 'post', 'side' );
}
add_action( 'add_meta_boxes', 'tl_add_meta_boxes' );

Querying for posts on the front end is super easy! Here is an example query:

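// The taxonomy was registered with 'query_var' => false, so query it through tax_query.
$query = new WP_Query( array(
  'tax_query' => array(
    array( 'taxonomy' => 'tl_post_options', 'field' => 'slug', 'terms' => 'tl_featured' ),
  ),
  'post_status' => 'publish',
  'post_type' => 'post',
) );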