Text Search With Trigrams
Hey folks, consider the situation you are writing a mail to your manager and you misspelled the word “cancel” as “cancer”. Then your text editor comes to the rescue and provides suggestions that you should type cancel instead of cancer.
How does it find that spelling mistake? If you are wondering and want to know the answer then this blog post is for you.
Fuzzy Text Search
Fuzzy string search finds words similar to the given search term. It is useful in cases where you don’t know the exact spelling of the search term. To do fuzzy searching we can make use of trigrams. Trigrams, what are they? Trigrams are any three consecutive letters of a given string. For example, the trigrams of the word cancel are can, anc, once, and cel. We cannot form any trigrams more than 4 Because we need at least three characters to form a trigram.
We can measure the similarity of any two words by counting the number of trigrams they share.
How to integrate trigrams with PostgreSQL?
The extension pg_trgm extensions should be installed in Postgres to work with trigrams. By typing the below command in the Postgres client we can install the extension.
CREATE EXTENSION pg_trgm IF NOT EXISTS;
You can check the list of extensions by executing the following command.
SELECT * FROM pg_extensions;
As we have discussed earlier we need at least three characters to form a trigram. But in Postgres along with that, we have some additional rules. They are as follows:
- It ignores non-alpha-numeric characters
- Each word is assumed to have two spaces at the start of the string.
- Each word is assumed to have one space at end of the string.
- If there is a single space in between words that are converted into two spaces.
Let’s see some trigrams in action.
We can see the trigrams of a word by using the function show_trgm in Postgres. It takes a string as an argument and returns the trigrams of that string.
SELECT show_trgm('Cancel');
As you could see all the words are capitalized, and there are two spaces before the first letter of the word c and one space at the end of the word near ‘el'(see the 2nd trigram from right).
As we have now understood how to form trigrams.
Now pay close attention as we will be looking at the main concept behind fuzzy searching.
Measuring similarity between words using trigrams.
How will you say if two words are similar?
There are many ways right? We can tell by the spelling, the semantic meaning, or the way they sound. But how do we find similarities here? We will find the similarity of the worst by the number of trigrams they share. and some variation around this technique.
pg_trgm offers three functions to find similarities between strings. All the functions we will be discussing are just different from one another in minor ways.
similarity(string1, string2):
This function takes two strings and counts the total number of trigrams they share. Then it divides the number of shared trigrams by the total number of trigrams in both strings.
Let’s look at the similarity score between the words “Cancer” and “Cance”. Running the below command will return trigrams of both cancer and cancel and similarity score between them.
SELECT show_trgm('Cancel'), show_trgm('Cancer'), similarity('Cancel', 'Cancer');
As you can see above both words share the common trigrams { ” c”, ” ca”, anc, can,nce}.
In the below image you can see the similarity score has dropped drastically because though the number of common trigrams will stay the same but the number of total trigrams has increased.
word_similarity(string1, string2)
The word_similarity_method is useful for finding the records with partial search term.
Let’s take two strings dream and my dreams. The ordered set of trigrams for both strings would be
String | Ordered set of trigrams |
dream | {” d”, ” dr”, dre,rea,eam,”am “} |
my dreams | {” m”, ” my”, “my “, ” d”, ” dr”, dre, rea, eam, ams, “ms “} |
Here first we need to find the number of ordered shared trigrams in the second string. we have 5 shared ordered trigrams. To find the word similarity we need to find the number of commonly ordered trigrams by the total number of trigrams in the first string.
SELECT show_trgm('dream'), show_trgm('my dreams'),word_similarity('dream', 'my dreams');
strict_word_similarity(text1, text2)
The strict_word_similarity finds the highest similarity between words of two strings. The order of trigrams also plays an important role in this function as well.
It produces the similarity value by following the below steps:
- Calculate trigrams for both the first and second strings.
- Then find the similarity between trigrams of the first string and trigrams of each word in the second string. After this, it will return the highest matching similarity.
SELECT show_trgm('dream'), show_trgm('my dreams'),strict_word_similarity('dream', 'my dreams');
Now the fun part, searching through the database using trigrams.
pg_trgm extension offers operators which can be used in where clause to filter records. All the similarity threshold value defaults
Operator | Usage | Description |
% | text % text | return true if the similarity between two strings is greater than the similarity threshold(default is 0.3) |
<% | text <% word | returns true if word_similarity between two strings is greater than the word_similarity threshold(default is 0.6) |
<<% | text <<% | return true if strict_word_similarity between two strings is greater than the strict_word_similarity threshold(default is 0.5) |
For this demo, I have a table in my local system which has the surnames and forenames of several persons.
Let’s search for persons who have ‘lin’ in their surnames using word_similarity.
SELECT DISTINCT(surname) from persons WHERE surname%>'lin';
Let’s search for persons whose surnames are matching with the search term ‘kin’ using similarity.
SELECT DISTINCT(surname) from persons WHERE surname%'kin';
Let’s search for persons whose surnames are matching with search term ‘kin’ using strict_word_similarity.
SELECT DISTINCT(surname) from persons WHERE surname%>>'kin';
With this, I conclude this post. In the upcoming blog posts, we will look at how to optimize the search using trigram indexes and how to integrate this with a rails application.
Feel free to leave your thoughts and doubts in the comments section.
To learn more about pg_trgm: https://www.postgresql.org/docs/current/pgtrgm.html
Thanks for reading!!
To learn more about Engineering topics visit – https://engineering.rently.com/
Get to know about Rently at https://use.rently.com/
A programmer.