ElasticPress is a WordPress plugin from 10up that integrates Elasticsearch with WordPress. As we all know, search in WordPress is not a great experience. Why? Well, MySQL is not a database optimized for search. Thus ElasticPress was born.
1. Search result relevancy scores on sites with high post to shard ratios can vary depending on order of indexing.
We first noticed this in our integration testing suite. We were using three shards on one primary node. Depending on the order in which posts were indexed, different relevancy scores were returned for the same search.
Elasticsearch relevancy scores are calculated using term frequency / inverse document frequency (TF/IDF). Term frequency is the number of times a term appears in the queried field of the current document (or post). Inverse document frequency reflects how often the term appears in that field across all documents on the current shard: the more documents contain the term, the less weight it carries. Notice I said shard, NOT index. The shard a post lives on is determined by a hash of the document's ID and the number of primary shards. We can’t exactly predict relevancy scores for a search on an index spread across more than one shard. The Elasticsearch documentation has a great article on this.
The solution for testing purposes is to use only one shard. In the real world, this shouldn’t matter much, because term statistics even out across shards as an index grows and the inconsistencies become negligible. However, this is still something to be aware of.
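To make the per-shard effect concrete, here is a minimal Python sketch. It uses a simplified formula modeled on Lucene's classic TF/IDF similarity (field-length norms and query normalization are left out), and a toy `route` function that assigns posts to shards by position rather than by hashing an ID as Elasticsearch does. All function names and the sample posts are made up for illustration.

```python
import math

def idf(num_docs, doc_freq):
    # Simplified inverse document frequency, computed per shard.
    return 1.0 + math.log(num_docs / (doc_freq + 1))

def score(term, doc, shard_docs):
    # Term frequency: square root of the raw count in this document.
    tf = math.sqrt(doc.split().count(term))
    # Document frequency only counts documents on the same shard.
    doc_freq = sum(1 for d in shard_docs if term in d.split())
    return tf * idf(len(shard_docs), doc_freq)

def route(posts, num_shards):
    # Toy routing (assumption): position modulo shard count.
    shards = [[] for _ in range(num_shards)]
    for position, post in enumerate(posts):
        shards[position % num_shards].append(post)
    return shards

def score_of(target, term, posts, num_shards):
    # Score the target post using only the statistics of its own shard.
    for shard in route(posts, num_shards):
        if target in shard:
            return score(term, target, shard)
    raise ValueError("target post was not indexed")

target = "elasticsearch search tips"
order_a = [target, "search basics", "wordpress themes", "plugin news"]
order_b = [target, "wordpress themes", "search basics", "plugin news"]

# Same posts, same query, two shards: the score depends on indexing order.
two_shards_a = score_of(target, "search", order_a, num_shards=2)
two_shards_b = score_of(target, "search", order_b, num_shards=2)

# With a single shard the statistics cover every post, so order is irrelevant.
one_shard_a = score_of(target, "search", order_a, num_shards=1)
one_shard_b = score_of(target, "search", order_b, num_shards=1)

print(two_shards_a, two_shards_b, one_shard_a, one_shard_b)
```

In the two-shard runs the target post shares its shard with a different neighbour each time, so the document frequency of "search" changes and the score changes with it. With one shard both orderings see the same four posts and produce the same score.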
2. There is no right search algorithm for WordPress. Fine-tuning algorithms is an ongoing, collaborative process.
As of ElasticPress 1.1, the meat of our default search query looked like this:
{
  "query": {
    "bool": {
      "must": {
        "fuzzy_like_this": {
          "fields": [ "post_title", "post_excerpt", "post_content" ],
          "like_text": "search phrase",
          "min_similarity": 0.75
        }
      }
    }
  }
}
fuzzy_like_this is great. It combines fuzzy and more_like_this queries. fuzzy searches against a set of fuzzified terms (using the Levenshtein distance algorithm). more_like_this selects “interesting” terms based on a number of factors, such as document frequency, and checks each document against those terms.
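For readers unfamiliar with Levenshtein distance, the sketch below shows the idea in Python: it counts the single-character insertions, deletions, and substitutions needed to turn one string into another. The `fuzzy_matches` helper and its `max_edits` parameter are hypothetical names for this illustration. How Elasticsearch maps `min_similarity` onto edit distance is not modeled here.

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance, keeping one row at a time.
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(
                prev[j] + 1,         # delete char_a
                curr[j - 1] + 1,     # insert char_b
                prev[j - 1] + cost,  # substitute, or keep if equal
            ))
        prev = curr
    return prev[-1]

def fuzzy_matches(term, vocabulary, max_edits=1):
    # Hypothetical helper: keep vocabulary words within max_edits of the term.
    return [word for word in vocabulary if levenshtein(term, word) <= max_edits]

vocabulary = ["wordpress", "wordpass", "elasticpress"]
close_words = fuzzy_matches("wordpres", vocabulary, max_edits=1)
print(close_words)
```

A misspelled term like "wordpres" is one edit away from "wordpress", which is why a fuzzy query can still find posts containing the correctly spelled word.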
The problem we encountered was that in certain established indexes, exact matches were not getting boosted to the very top of results. This was due to the way the fuzzy_like_this algorithm works. We added an extra query to our search algorithm in 1.2 to boost exact matches:
{
  "query": {
    "bool": {
      "should": [
        {
          "multi_match": {
            "query": "search phrase",
            "boost": 2,
            "fields": ["post_title", "post_content", "post_excerpt"]
          }
        },
        {
          "fuzzy_like_this": {
            "fields": ["post_title", "post_excerpt", "post_content"],
            "like_text": "search phrase",
            "min_similarity": 0.75
          }
        }
      ]
    }
  }
}
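The should clause tells Elasticsearch that at least one of the multi_match or fuzzy_like_this queries must match for a document to be returned. Anything matched by multi_match then has that part of its score boosted by a factor of 2, so exact matches rise above purely fuzzy ones.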
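If you are assembling this request body in code, a small builder keeps the two clauses in sync. The following Python sketch only constructs the JSON shown above and does not contact a cluster. `build_search_query`, `SEARCH_FIELDS`, and the keyword arguments are names invented for this example, not part of ElasticPress.

```python
import json

# Fields searched by both clauses, taken from the query above.
SEARCH_FIELDS = ["post_title", "post_excerpt", "post_content"]

def build_search_query(phrase, exact_boost=2, min_similarity=0.75):
    # Hypothetical helper that mirrors the 1.2 query structure.
    return {
        "query": {
            "bool": {
                "should": [
                    {
                        # Exact-term matching, boosted so it outranks fuzzy hits.
                        "multi_match": {
                            "query": phrase,
                            "boost": exact_boost,
                            "fields": SEARCH_FIELDS,
                        }
                    },
                    {
                        # Fuzzy matching keeps misspellings and near matches.
                        "fuzzy_like_this": {
                            "fields": SEARCH_FIELDS,
                            "like_text": phrase,
                            "min_similarity": min_similarity,
                        }
                    },
                ]
            }
        }
    }

request_body = build_search_query("search phrase")
print(json.dumps(request_body, indent=2))
```

Exposing the boost and similarity as parameters makes it easy to experiment with different values while tuning.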
This solved our immediate problem but is not the perfect algorithm. We expect to continually optimize this for WordPress over time. (Note that ElasticPress allows you to filter the search query entirely if you want to customize it.)
3. Disable indexing during imports.
By default, ElasticPress indexes a post when it is created. This is great until you try to import a few thousand posts and your Elasticsearch instance gets overloaded. This bit us pretty hard. Newer versions of ElasticPress disable syncing during WordPress imports, big or small.
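The general pattern is easy to sketch. ElasticPress itself is a PHP plugin, so the Python below is not its implementation. It is only an illustration of pausing per-post syncing around a bulk import, and every class and method name in it is hypothetical.

```python
from contextlib import contextmanager

class PostIndexer:
    # Hypothetical stand-in for a plugin that indexes posts as they are saved.

    def __init__(self):
        self.sync_enabled = True
        self.indexed = []   # post IDs sent to the search index
        self.skipped = []   # post IDs saved while syncing was paused

    def on_post_saved(self, post_id):
        # Called for every saved post; does no indexing work while paused.
        if not self.sync_enabled:
            self.skipped.append(post_id)
            return
        self.indexed.append(post_id)

    @contextmanager
    def sync_paused(self):
        # Turn syncing off for the duration of the block, then restore it.
        previous = self.sync_enabled
        self.sync_enabled = False
        try:
            yield
        finally:
            self.sync_enabled = previous

indexer = PostIndexer()

# Simulated import: none of these saves trigger an indexing request.
with indexer.sync_paused():
    for post_id in range(3):
        indexer.on_post_saved(post_id)

# A normal save after the import is indexed as usual.
indexer.on_post_saved(99)
print(indexer.indexed, indexer.skipped)
```

Posts saved while syncing is paused are not in the search index, so they need a separate bulk indexing pass once the import finishes.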