Why are Lucene/Elasticsearch prefix queries slower than term queries?

MatsLindh 2020-12-11 04:28:11

I'm not that familiar with the ES-specific details, so they might be doing something different from plain Solr, but #1 is usually not the case.

A prefix match will be more expensive than looking up a single term, but it's not that much more expensive. It can be compared to doing a range search, which you can perform yourself if you want to: field:[aa TO ab) is, in theory, comparable to field:aa*. Either way you are effectively retrieving all tokens that lie within that range, then resolving the document set that matches those tokens.

The fact that more than one token matches means that you can't simply take the list attached to a single token (a matching term) and retrieve those documents; you have to retrieve a possibly large set of matching tokens and then compute the document set for them. This is not a very expensive computation, but it is more expensive than a single match. The lookup can be done by finding the start and end positions of the matching tokens in the index, then retrieving all terms between those two positions and finding the set of matching document ids.

A query of foo* against an index with the following terms:

bar, baz, foo, foobar, spam
          ^----------^

will collect the lists of documents attached to foo and foobar, merge them and then retrieve the documents.
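To make the range idea concrete, here is a minimal Python sketch of that lookup. It is not how Lucene actually stores its term dictionary; it only assumes a sorted list of terms and a dictionary of postings, and the document ids, `prefix_range` and `prefix_query` are made up for illustration.

```python
from bisect import bisect_left

# Toy inverted index: a sorted term dictionary plus the doc ids per term.
# The doc ids are invented for this example.
TERMS = ["bar", "baz", "foo", "foobar", "spam"]
POSTINGS = {
    "bar": {1, 4},
    "baz": {2},
    "foo": {1, 3},
    "foobar": {3, 5},
    "spam": {6},
}


def prefix_range(terms, prefix):
    """Return (start, end) such that terms[start:end] all start with prefix."""
    if not prefix:
        return 0, len(terms)
    start = bisect_left(terms, prefix)
    # Smallest string that sorts after every string with this prefix:
    # bump the last character by one code point (ignores the edge case
    # of the maximum code point, which is fine for a sketch).
    upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    end = bisect_left(terms, upper)
    return start, end


def prefix_query(terms, postings, prefix):
    """Collect the matching terms and the union of their doc ids."""
    start, end = prefix_range(terms, prefix)
    matched = terms[start:end]
    docs = set()
    for term in matched:
        docs |= postings[term]
    return matched, sorted(docs)


matched_terms, doc_ids = prefix_query(TERMS, POSTINGS, "foo")
```

The two binary searches find the boundaries without touching the terms outside the range; the extra cost over a term query is the union of the postings for every term in between.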

Slower does not mean that it's catastrophic or not optimised in any way; just that it's more expensive than a direct match where the set of documents has already been determined. However, you probably have more than one term in your query already, so the same process (albeit slightly higher up in the hierarchy) happens there as well.

A postfix (suffix) match (your #2) - i.e. matching a wildcard at the beginning of the token - is expensive, since all tokens in the index usually have to be considered. The index has the terms sorted alphanumerically, so when you only want to look at the end of the string, you have to consider that each token could match, regardless of where it's located in the index - so you get a full index scan. However, if this is a use case you see happening often, you can use the reversed wildcard filter. This works by reversing the string and having tokens that match the terms in reverse order, so that foo is indexed as oof and a wildcard search for *foo gets turned into oof* instead.

A query of *ar against an index with the following terms:

bar, baz, foo, foobar, spam
?!   ?    ?    ?!      ?

will have to look at each term to decide if it ends with ar.
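As a rough illustration of the difference, the Python sketch below contrasts the full scan with the reversed-token trick. It is a toy model over a sorted list, not the actual filter implementation, and `suffix_scan` and `suffix_via_reversed` are names invented for this example.

```python
from bisect import bisect_left

TERMS = ["bar", "baz", "foo", "foobar", "spam"]


def suffix_scan(terms, suffix):
    """Leading-wildcard match the slow way: every term has to be tested."""
    return [t for t in terms if t.endswith(suffix)]


# Index-time work: store each term reversed and keep that list sorted too.
REVERSED_TERMS = sorted(t[::-1] for t in TERMS)


def suffix_via_reversed(reversed_terms, suffix):
    """Turn *ar into the prefix query ra* against the reversed terms.

    Assumes a non-empty suffix.
    """
    prefix = suffix[::-1]
    start = bisect_left(reversed_terms, prefix)
    # First string that sorts after everything starting with the prefix.
    upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    end = bisect_left(reversed_terms, upper)
    # Reverse the hits back so the caller sees the original terms.
    return sorted(t[::-1] for t in reversed_terms[start:end])


slow = suffix_scan(TERMS, "ar")
fast = suffix_via_reversed(REVERSED_TERMS, "ar")
```

Both calls return the same terms, but the second one only looks at the contiguous slice of reversed terms that start with ra, at the price of storing the extra reversed tokens.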

The reason for using an EdgeNGramFilter (your comment / #3) is that you move as much of the required processing as possible to indexing time (doing the work that you now do at query time; even if prefix queries aren't really expensive, they still have a cost). Additionally, wildcard queries do not support most analysis. So many people end up with wildcard queries against a set of tokens that have been stemmed or otherwise processed, and are then surprised when their wildcard queries don't generate a match. Only a small subset of filters can be applied to wildcard queries (such as the LowercaseFilter). Those filters are known as being "Multi term aware", since the terms they process can end up being expanded to multiple terms before the collection of documents happens.

Another reason is that using an EdgeNGramFilter will give you proper frequency scores for each prefix, giving you effective scoring for prefixed terms as well.
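A small Python sketch of the edge n-gram idea follows, assuming whitespace tokenization and lowercasing as the only analysis. The helper names (`edge_ngrams`, `build_index`, `lookup`) and the `min_gram` / `max_gram` parameters are chosen for this example and are not taken from any particular Solr or Elasticsearch configuration.

```python
from collections import defaultdict


def edge_ngrams(token, min_gram=1, max_gram=10):
    """Return the leading substrings of token, min_gram to max_gram chars long."""
    longest = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, longest + 1)]


def build_index(docs, min_gram=1, max_gram=10):
    """Index time: map every edge n-gram to the doc ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            for gram in edge_ngrams(token, min_gram, max_gram):
                index[gram].add(doc_id)
    return index


def lookup(index, prefix):
    """Query time: one dictionary lookup, just like a plain term query.

    The query text is lowercased but deliberately not n-grammed.
    """
    return sorted(index.get(prefix.lower(), set()))


# Made-up documents for illustration.
DOCS = {1: "foo bar", 2: "foobar", 3: "spam"}
INDEX = build_index(DOCS)

hits = lookup(INDEX, "foo")
```

The trade-off is a larger index: every token is stored once per prefix length, which is the cost of turning the prefix query into a single term lookup.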

John Magnus 2020-12-10 20:39:02

This makes a tremendous amount of sense, thank you! The article I referenced (medium.com/@mourjo_sen/…) mentioned "...prefix-like queries should never be used in production environments is that they are extremely slow. The reason for this is that the tokens in ES are not directly prefix-able. So Elasticsearch has to check every token at runtime to see if it starts with the required text." From what you've said, this doesn't seem to be true. You'd only analyze all tokens within the range. Is that right?

MatsLindh 2020-12-10 22:04:40

You don't analyze tokens; tokens are the result of the tokenizer and are then processed through the rest of the analysis chain before being stored. The input text is analyzed and filtered, and the resulting tokens are stored in the Lucene index. I'm not familiar enough with the details of Elasticsearch to say what the article is referencing, but from what the article says it talks about match_prefix_queries, which is a different feature from prefix queries with wildcards (which you can see from it expanding to a maximum of 25 values and performing analysis, neither of which is what a wildcard prefix query would do).
