Warm tip: This article is reproduced from serverfault.com, please click

Implement smart search by text field in database

发布于 2020-11-27 23:28:28

Suppose I have a table named movie with a field name, that is a VARCHAR field.

I want to implement a search bar in my website, so that when a user enters a string I can query movies with that name.

My first approach was quite naive:

select *
from movie
where name like '%user_string%';

The limitations are:

  1. Special characters. Say the user string is "Lets go", I want it to return the movie with the name "Let's go" even if the apostrophe is missing.
  2. Accents. Say the user string is "Pokemon", I want it to return the movie with the name "Pokémon" even if the accent is missing.

My idea was to create an additional normalized_name field, that is computed using the name field with all special characters and accents stripped. Then the query would become:

select *
from movie
where normalized_name like '%user_string%';

For instance: user searches for pokemon, the database query return a movie with the normalized_name = pokemon, which real name is Pokémon. Obviously the user string would be first normalized too - in order to allow searching by the movie real name too.

Now, is this a valid approach? What is the most widely used one - that may also make the search even better? Is there any literature on the matter?

Questioner
Matteo Silvestro
Viewed
0
Laurenz Albe 2020-11-30 14:56:23

Create a trigram index on a stripped version of the column:

  1. Create the necessary extensions and create an immutable unaccent called f_unaccent (see here for more information):

    CREATE EXTENSION pg_trgm;
    CREATE EXTENSION unaccent;
    
    CREATE OR REPLACE FUNCTION public.immutable_unaccent(regdictionary, text)
    RETURNS text LANGUAGE c IMMUTABLE PARALLEL SAFE STRICT AS
    '$libdir/unaccent', 'unaccent_dict';
    
    CREATE OR REPLACE FUNCTION public.f_unaccent(text)
    RETURNS text LANGUAGE sql IMMUTABLE PARALLEL SAFE STRICT AS
    $func$
    SELECT public.immutable_unaccent(regdictionary 'public.unaccent', $1)
    $func$;
    
  2. Create the trigram index on the column:

    CREATE INDEX ON movie USING gin (translate(f_unaccent(name), '''', '') gin_trgm_ops);
    
  3. Now perform the following query:

    SELECT * FROM movie
    WHERE translate(f_unaccent(name), '''', '') ILIKE translate(f_unaccent('user_string'), '''');