Document Boosting in ElasticSearch

Like many developers raised on a steady diet of PHP/MySQL, I found that searching with LIKE '%term%' was second nature. It usually worked well enough for what I was using it for.

Then I started working on a project called FilmedInsert. FilmedInsert is a site best described as an IMDB for music videos: we collect credits (and other data) about music videos and are starting to accept submissions from beta users. We are currently at around 7,000 music videos, so we have a ways to go, but we have around 300,000 name entities in our database. Many come from data entered for credits, but we acquired the majority through the open source music database Discogs. (Note: Discogs has 2.5 million+ artists, but we pared that number down to the entries most likely to be used, which saves a lot of data management overhead.)

Obviously, MySQL LIKE isn’t going to cut it here, not even close. MySQL full-text search isn’t going to cut it either. We need fine-grained control. Enter ElasticSearch. It’s incredibly fast, incredibly flexible, and once you get the hang of it, a breeze to work with. We’re humming along with our 300,000-document database and we couldn’t be happier with the performance and ease of use.

When developers learn ElasticSearch, they start with the basics: documents, mappings, types, etc. There are a lot of great tutorials out there that cover those concepts well (my personal favorite starting point is the very clear Elastica documentation). But there is one feature that I don’t think gets enough attention: document boosting.

Boosting in ElasticSearch is the concept of giving one thing more weight than another. When setting up your mappings, you can boost one field over another, and when a search runs across all fields, a match in the field with the higher boost counts for more in the relevance score. This is a great feature, and a ton of the work is done inside ElasticSearch, but what about the things that ElasticSearch just can’t know?
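To make field boosting concrete, here is a rough sketch in Python of what a mapping with per-field boosts might look like. The type name `entity` and the fields `name`, `aliases`, and `bio` are made up for illustration and are not FilmedInsert's actual schema. The `string` type and the mapping-level `boost` setting reflect older ElasticSearch releases; newer ones lean toward boosting at query time, so check the docs for your version.

```python
def build_mapping(type_name, field_boosts):
    """Build a mapping dict in which each field carries its own boost.

    type_name and the field names are hypothetical examples.
    """
    properties = {
        field: {"type": "string", "boost": float(boost)}
        for field, boost in field_boosts.items()
    }
    return {type_name: {"properties": properties}}


# A match on the name should count for more than a match in the bio.
mapping = build_mapping("entity", {"name": 3.0, "aliases": 1.5, "bio": 1.0})
```

The resulting dict is what you would serialize to JSON and hand to your client library when creating the mapping.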

Let’s take a case from FilmedInsert. When we imported data from Discogs, we had some instances where names were very similar or exact duplicates. In a name database, this is inevitable. Let’s take the case of M.I.A., your favorite English/Sri Lankan recording artist. Mention the name M.I.A. to anyone, and there’s only one artist they are probably thinking of. (Hint: it’s the one who performed at the Super Bowl this year.)

The problem is, we have 7 instances of someone named M.I.A. in the database. They are all legitimate artists, but people are almost certainly looking for the M.I.A. we are all thinking of.

ElasticSearch doesn’t know this. How could it? Only we have the data points to know which M.I.A. out of the 7 is likely the real target of a search, and which M.I.A. we want is specific to our situation. Remember, FilmedInsert is a music video database, and M.I.A. has a few very popular music videos. Her video for “Bad Girls” just won several awards at the VMAs, and she works with top-tier directors like Romain Gavras. So how do we teach ElasticSearch about this logic? The most effective way I’ve come across is document boosting.

Document boosting is simple: when you add or update a document, include a numerical ‘_boost’ value with the document as well. ElasticSearch will take that into account when searching (along with all the other criteria you have put in place via mappings and filters). The beauty of this feature is that it is a numerical value that we decide on our own. We can do anything we want, and we can really tailor it to our situation.
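As a minimal sketch of that idea, the Python below attaches a ‘_boost’ value to a document before it is indexed. The helper name `with_boost`, the document fields, and the boost value of 8.0 are all assumptions for illustration. No request is sent here; the JSON payload is what you would pass to whatever client you use. Later ElasticSearch releases deprecated ‘_boost’ in favor of query-time scoring, so check the docs for your version.

```python
import json


def with_boost(doc, boost):
    """Return a copy of doc with a document-level _boost value attached."""
    if boost <= 0:
        raise ValueError("boost must be a positive number")
    boosted = dict(doc)  # copy so the caller's dict is left untouched
    boosted["_boost"] = float(boost)
    return boosted


# Hypothetical entity document; field names are examples only.
artist = {"name": "M.I.A.", "type": "artist"}
famous = with_boost(artist, 8.0)

# JSON body to send with an index or update request.
payload = json.dumps(famous, sort_keys=True)
```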

In the case of FilmedInsert, we look at a custom set of criteria in the database, weight them, do some other secret things, and then come up with a number. So in the case of M.I.A., she has a lot of music videos, those videos are popular, and her database connections are to other strong entities. This merits a high boost number. The M.I.A. that has no videos and no credits? Lower boost number. They will still appear, just much lower in the results, until the day we add more database info for them.
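I won't share our actual formula, but a toy version of the idea might look like the Python below. Everything in it is an assumption for illustration: the three signals (video count, view count, credit count), the weights, and the log scaling are stand-ins, not what FilmedInsert really uses. The log keeps one huge number from swamping everything else, and the baseline of 1.0 means an entity with no data gets a neutral boost.

```python
import math

# Hypothetical weights; tune these against your own data.
WEIGHTS = {"videos": 1.0, "views": 0.5, "credits": 0.25}


def compute_boost(videos, views, credits):
    """Return a boost of at least 1.0 that grows slowly with each signal."""
    score = (
        WEIGHTS["videos"] * math.log1p(videos)
        + WEIGHTS["views"] * math.log1p(views)
        + WEIGHTS["credits"] * math.log1p(credits)
    )
    return round(1.0 + score, 2)


# A well-connected artist versus a namesake with no videos or credits.
popular = compute_boost(videos=12, views=5_000_000, credits=40)
obscure = compute_boost(videos=0, views=0, credits=0)
```

The number that comes out is what you would store as the document's boost, so the well-known artist outranks the namesakes when the names match equally well.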

You can adapt this to your own database needs. Got an article that people bookmark often? Maybe you want to boost that in your search results a little bit. Have a large name database like we do at FilmedInsert? Give some thought to why some documents should rank closer to the top than others. It all depends on your users and the specific application.

We still have a long way to go with FilmedInsert search, and we tweak our methods nearly every day, but document boosting gave us a huge leg up in making sure that entity searches return accurate, relevant results.

 
