Have you ever wondered how some searches result in the actual information within a short amount of execution time? Most of the cases may use some other alternatives like ElasticSearch. Is it possible with just PostgreSQL Database? Yes, it is. For such a faster search, we have a more advanced searching technique in PostgreSQL called Full-text Search. For searching a database for some words which may occur in some column of some table which you are unsure, it is not recommended to use like queries. It might be possible to find out the result with the power of regular expressions, but it is slow. Here comes the importance of Full-text Search.
What is Full-Text Search?
Full-text search is a technique for searching documents or collections in a database. The text search engine can parse the documents and save the associations of keywords called lexeme with the parent document. These associations are used for searching the documents against query keywords.
Disadvantages of LIKE operator search
- The LIKE operator has no linguistic support. The regular expression can support LIKE operator searches up to a certain extent and still lack the mechanism to handle plural words and derived words. A single word can have many derived words that are tedious to find out using regular expressions.
- They have no ordering on search results which may lead the search worthless if the search matches thousands of records.
- They need to scan all the rows for finding the result makes the search pretty much slow.
How Full-Text Search can be implemented?
For a full-text search, documents must be preprocessed to a vector format in which the searching and ranking are performed. The original text will be retrieved only when the user is intended to fetch it. PostgreSQL has two main functions for implementing a full-text search.
- to_tsvector : parses the document to get a list of tokens, reduces tokens to lexemes and produces a list of lexemes with their position as a tsvector. ‘ts’ in the tsvector represents ‘text search’. A list of dictionaries will be consulted for each token and thus eliminates multiple normalized lexemes representing a token.
- to_tsquery : for querying the tsvector for certain words, phrases or complex queries separated by boolean operators & (AND), | (OR) or ! (NOT).
The to_tsvector will omit the stop words, articles and conjunctions which carry less meaning in a document. For eg:
SELECT to_tsvector('The quick brown fox jumped over the lazy dog') @@ to_tsquery('fox');
which returns true as output.
SELECT to_tsvector('The quick brown fox jumped over the lazy dog') @@ to_tsquery('foxes');
also, return true as output which is the great benefit of considering the semantic meaning of the lexemes.
Ranking the search result
If the search operation results in thousands of records, if the results are not ranked or sorted by relevance, the results will be ineffective. Ranking the search results based on some criteria is one of the greatest advantages of a full-text search. The ranking may depend on many factors. PostgreSQL provides two predefined ranking functions, one takes into account lexical, proximity, and structural information and another one on how important is the part of the document where they occur.
The ranking criteria needed for your application may differ from that of functions provided by the PostgreSQL. For your specific application needs, you can tailor the ranking functions by writing it own or combining the results with additional factors. The built-in functions for ranking are given below.
- ts_rank: ranks the tsvector based on the frequency of matched lexemes.
- ts_rank_cd: ranks the tsvector based on cover density which needs the positional information as to its input.
Performance Improvement on Full-Text Search
Creating ts_vector on run time for a number of records will not produce the result in lightning speed. For expecting the faster execution, we need to pre-create the vector for each new entry created to our table. For maintaining such vectors as persisted data, either we can save the vector in a column for the same table, or can be created as a materialized view for creating a vector from multiple tables.
A Materialized view is a cached set of data that allows us to query instead of querying the tables. Accessing the data through a materialized view is much faster than accessing the data through views or tables directly. The materialized views don’t execute the underlying query every time you access them which makes the materialized view faster than normal database views. To update the data, we need to refresh the materialized view.
Indexing is another powerful way to boost the execution speed to a great extent. Postgres provides two ways for indexes – GiST (Generalized Search Tree) and GIN (Generalized Inverted Index) indexes. If our database is read-intensive, we should use GIN index as it is slower to build but faster to query, while GiST index is faster to build but require additional reads.
Advantages of Full-Text Search
- For such a great search tool, there is no need to install additional software or libraries such as ElasticSearch.
- Faster execution – When compared to the LIKE operator, the Full-Text Search will return search results in an execution speed 10 or more times faster.
- The full-text search is performed on the same database where our business data is saved and there is no need to sync the data which might be needed if we use other search techniques like ElasticSearch.
- Able to rank the search results based on ranking functions and weights we assign to the vector. Thus it produces better results than simply returning the total matched list of records.
The full-text search in PostgreSQL is very fast and powerful. This advanced search also allows searching any phrases, plural words, tenses, and other related words. Coupling the full-text search with materialized views and indexes boost the lightning speed of execution. I hope you have a clear basic idea of what is full-text search, why it is used and how it can perform better.