Some Tips for Designing a Tokenizer for a Search Engine
I'm planning to do some optimizations to the full-text search engine of my blog site and enhance it with better Chinese and English language support, and some ideas have been flipping through my mind. There are ready-to-use services like Algolia, but I prefer to explore this realm on my own, and here I have tried to summarize some tips.
Generally speaking, a full-text search engine should take raw text through the following steps.
- Entering metadata: Every text has the same metadata fields, such as title, author, keywords and summary, but their values identify each text uniquely. This part is not difficult, and different programs have their own processing details.
- Removing redundant elements: Most texts are formatted with markup to express the roles of their components, and some of those components should not be indexed at all. For example, the <img/> element can be removed from the text with libraries such as JSDOM or Unified (see the JSDOM sketch after this list).
- Removing markup: The markup syntax itself, such as HTML tags and Markdown markers, can also be stripped with JSDOM or Unified.js before indexing.
- Tokenization: The most significant step during indexing, and the main topic of this article. It divides a sentence or paragraph into words according to grammar rules. In most cases, the extracted keywords should be closely related to the main ideas of the sentence or paragraph they belong to, so good tokenization directly improves the accuracy of search relevance. But tokenization alone is not enough; the tokens often need further processing, such as part-of-speech tagging, removing stopwords and extracting word stems.
- Indexing into a data structure: Map the word set to its corresponding article, store the mapping in a suitable data structure (an inverted index, hash table, trie, etc.), and rank the entries by weight according to the scoring algorithm (see the inverted-index sketch after this list).
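As a minimal sketch of the markup-removal step, assuming the posts are stored as HTML, JSDOM can reduce a document to its plain text (Unified works similarly through its own plugin pipeline):
import { JSDOM } from "jsdom";

// Strip all HTML markup (including <img/>) and keep only the readable text.
function htmlToPlainText(html: string): string {
  const dom = new JSDOM(html);
  return dom.window.document.body.textContent?.trim() ?? "";
}

console.log(htmlToPlainText('<p>Hello <img src="a.png"/> <b>world</b>!</p>'));
// => "Hello  world!" (the <img/> element contributes no text)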
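For the indexing step, an inverted index can be as simple as a map from each token to the set of article IDs that contain it. A rough sketch, with weight-based ranking omitted:
// Minimal inverted index: token -> set of article IDs containing that token.
const invertedIndex = new Map<string, Set<string>>();

function indexArticle(articleId: string, tokens: string[]): void {
  for (const token of tokens) {
    if (!invertedIndex.has(token)) invertedIndex.set(token, new Set());
    invertedIndex.get(token)!.add(articleId);
  }
}

function search(queryTokens: string[]): string[] {
  if (queryTokens.length === 0) return [];
  // Intersect the posting sets: keep only articles that contain every query token.
  let results = [...(invertedIndex.get(queryTokens[0]) ?? new Set<string>())];
  for (const token of queryTokens.slice(1)) {
    const postings = invertedIndex.get(token) ?? new Set<string>();
    results = results.filter((id) => postings.has(id));
  }
  return results;
}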
Getting Started with English
The English tokenizer is relatively easy to devise compared with Chinese and Japanese, because words in an English sentence are naturally separated by spaces. There is no need to analyze grammatical structure the way a Chinese sentence requires. We can build a simple tokenizer for a given English sentence in JavaScript.
console.log("A quick fox jumps over the lazy dog.".spilt(" "));
// Output: ["a","quick","fox","jumps","over","the","lazy","dog"]
But that's not enough. The sentence "I'm hungry" actually consists of three words: "I'm" is an abbreviation of "I" and "am", so it should be separated into "I", "am", "hungry" rather than "I'm" and "hungry". The same goes for "haven't", "didn't" and so on. This case is easy to handle: just expand the contractions before tokenizing.
function expandContractions(sentence) {
  const contractionsMap = {
    "I'm": "I am",
    "you're": "you are",
    "he's": "he is",
    "she's": "she is",
    "it's": "it is",
    "we're": "we are",
    "they're": "they are",
    // More rules here...
    "what's": "what is",
  };
  // Match any contraction from the map, case-insensitively.
  const regex = new RegExp(
    "\\b(" + Object.keys(contractionsMap).join("|") + ")\\b",
    "gi"
  );
  // Look the match up case-insensitively so "It's" expands just like "it's".
  return sentence.replace(regex, (match) => {
    const key = Object.keys(contractionsMap).find(
      (k) => k.toLowerCase() === match.toLowerCase()
    );
    return key ? contractionsMap[key] : match;
  });
}
const sentence =
"I'm learning JavaScript and it's fun! He didn't know you were coming.";
const expandedSentence = expandContractions(sentence);
console.log(expandedSentence.split(" "));
That works well. But another problem arises: English is an inflectional language whose corpus contains many function words, like "am", "where", "that" and "on". They make up a large proportion of the text but carry little actual meaning in context; we call them "stopwords". Despite their low practical significance, their high frequency of occurrence means they take up a significant share of the document index, thus reducing the relevance of search results.
In the TF-IDF statistical model especially, stopwords occur with such high frequency that their IDF approaches zero and they contribute little to the score; removing them therefore does not significantly affect the calculation results. These words can be safely excluded from the index.
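As a quick sketch of that reasoning, using the common formula idf(t) = log(N / df(t)): a term that appears in nearly every document gets an IDF close to zero, while a rarer term scores much higher (the document counts below are made-up numbers for illustration).
// N = total number of documents, df = number of documents containing the term.
const idf = (N: number, df: number): number => Math.log(N / df);

console.log(idf(1000, 995).toFixed(3)); // a stopword like "the": ~0.005, nearly useless for ranking
console.log(idf(1000, 12).toFixed(3)); // a content word like "tokenizer": ~4.423, highly discriminative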
There are quite a few open-source stopword libraries for English on npm and GitHub, so you can use one right away. One such open-source project is stopword, which contains stopword lists for many languages; let's take it as an example (its documentation can be found on npm).
npm install stopword
Then use this library to remove stopwords and get a leaner token list.
import { removeStopwords } from "stopword";
const oldString = "a really Interesting string with some words".split(" ");
const newString = removeStopwords(oldString);
// newString is now [ 'really', 'Interesting', 'string', 'words' ]
This prevents unnecessary stopwords from taking up a large portion of the index. However, this is still not enough.
Now consider the following two sentences:
She decided to paint the wall blue.
She made a decision to paint the wall blue.
In these two sentences, "decide" is a verb while "decision" is a noun, yet they express the same thing through two different grammatical structures. Examples abound in English, such as "fond" and "fondness", "disagree" and "disagreement", etc.
But if you tokenize naively with no such optimization, the index will treat "decide" and "decision" as two unrelated words, and it will not return "... decision to paint ..." when you search for "... decide to paint ...". So the next task is to make the index treat "decision" and "decide" as one word, which improves search relevance.
Indo-European languages have their own rules of word formation, and English is no exception. Generally speaking, words are composed of stems and affixes: the stem expresses the original meaning, and a base word plus a suffix becomes a derived word. There are suffixes such as -ing, -tion, -able, -ism, etc., which express part of speech and auxiliary meaning.
So, what do the words "decision" and "decide" have in common? They share the same stem, "decid", and so do "decisive", "decided", "deciding", "decides", "decider", etc. Words can be treated as the same term as long as they share the same stem.
There is a package named stemmer that extracts the stem from an English word based on linguistic rules.
npm install stemmer
To extract the stem from a given word:
import { stemmer } from "stemmer";
stemmer("considerations"); // => 'consider'
stemmer("detestable"); // => 'detest'
stemmer("vileness"); // => 'vile'
The stemming problem is now solved: all the different forms of a word can be reduced to a single stem, which improves the indexing.
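Putting the pieces together so far, a rough English tokenizer could expand contractions, split on spaces, drop stopwords and then stem every token. A minimal sketch, reusing the expandContractions function from above and handling punctuation very naively:
import { removeStopwords } from "stopword";
import { stemmer } from "stemmer";

function tokenizeEnglish(sentence: string): string[] {
  const words = expandContractions(sentence)
    .toLowerCase()
    .replace(/[^a-z\s']/g, " ") // crude punctuation removal
    .split(/\s+/)
    .filter(Boolean);
  // Drop stopwords, then reduce every remaining word to its stem.
  return removeStopwords(words).map(stemmer);
}

console.log(tokenizeEnglish("She made a decision to paint the wall blue."));
// Something like [ 'made', 'decis', 'paint', 'wall', 'blue' ];
// the exact output depends on the stopword list and stemmer version.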
How else can we optimize search relevance further?
Notice these two sentences:
She purchases some discounted goods after work.
She buys some discounted goods after work.
Here the words "purchase" and "buy" are synonyms of each other: the two sentences use different words but express the same meaning.
If the index only contains the sentence "She purchases some discounted goods after work", then searching for "buy some goods" would fail to retrieve any relevant results.
So we can optimize with a synonym thesaurus.
Let's design the tokenizer so that it outputs not only the words that appear in a sentence but also their synonyms. For example, the sentence
She purchases some discounted goods after work.
will output a result like this:
She, her, herself
purchases, buys, acquires, gets, obtains, procures
some, a few, several, certain, various
discounted, reduced, sale, cheaper, marked down, cut-price
goods, merchandise, products, items, commodities, wares
after, following, post-, subsequent to, behind
work, job, employment, labor, task, occupation
Then if I search for "She procures a few commodities after labor.", the original sentence will also be matched and suggested; this enlarges the search range and the result space.
NOTE: Strictly speaking, there are no two absolutely equivalent words in English. Any pair of synonyms carries blurred, subtle nuances that differ with context.
So we might as well add synonyms to the result during tokenization. There are libraries on npm that let you query synonyms for a given English word; a common one is natural.
npm i natural
Then we can output the tokenization together with synonyms:
import natural from "natural";
import { stemmer } from "stemmer";

const wordnet = new natural.WordNet();

async function tokenize(sentence: string): Promise<string[]> {
  const tokenizer = new natural.WordTokenizer();
  const tokens = tokenizer.tokenize(sentence);
  const resultsSet = new Set<string>();
  // Index the stem of every token that actually appears in the sentence.
  tokens.forEach((token) => resultsSet.add(stemmer(token)));
  // Then look each token up in WordNet and add the stems of its synonyms.
  const lookupPromises = tokens.map(
    (token) =>
      new Promise<void>((resolve) => {
        wordnet.lookup(token, (definitions) => {
          definitions.forEach((definition) => {
            definition.synonyms.forEach((synonym) => {
              const stemmedSynonym = stemmer(synonym);
              resultsSet.add(stemmedSynonym);
            });
          });
          resolve();
        });
      })
  );
  await Promise.all(lookupPromises);
  return Array.from(resultsSet);
}
This code not only tokenizes a sentence into words but also adds their synonyms to the result, which greatly widens the range of relevant matches.
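A call would then look something like this (assuming the WordNet data files are available locally, for example via the wordnet-db package; the exact synonyms returned depend on the WordNet version):
tokenize("She made a decision to paint the wall blue.").then((tokens) => {
  // Expect the stems of the original words plus stems of their WordNet
  // synonyms, e.g. "decis" alongside something like "determin".
  console.log(tokens);
});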
Supporting Chinese
A Chinese sentence has no spaces to separate its words, unlike English, so tokenizing Chinese can only rely on grammatical rules and linguistic analysis, which costs more than English. There are Chinese tokenization models based on artificial intelligence, and most of them are paid. Fortunately there is a powerful, free tokenizer named Jieba maintained by the open-source community, and it satisfies most of our needs for personal blog indexing. There is also a corresponding Node.js binding, @node-rs/jieba.
Second, Chinese is not an inflectional language but an isolating one, which means Chinese words have no tense, gender or person changes. In other words, we do not need steps like stem extraction when processing a word. The remaining processing is the same as for English: removing stopwords and adding synonyms.
However, most blogs written in Chinese, especially technical ones, are mixed with some English terms and sentences. So the sporadic English words inside a Chinese sentence should be handled as well.
This is actually easy to do. Jieba can separate not only Chinese words but also the English words mixed into a sentence. We then distinguish Chinese tokens from English tokens in the result and process each group accordingly: the English words go through the steps described above, and the Chinese words get their own treatment.
import { Jieba } from "@node-rs/jieba";
import { removeStopwords, zho } from "stopword";

export function ChineseLanguageTokenizer(str: string): string[] {
  // Keep only Chinese characters, Latin letters and whitespace.
  const NonChineseAndLatinRegex =
    /[^\u4e00-\u9fff\u0041-\u005a\u0061-\u007a\u00c0-\u017f\u0180-\u024f\s]/g;
  const jieba = new Jieba();
  const cleanedStr = str.replace(NonChineseAndLatinRegex, " ");
  const tokens = jieba.cutForSearch(cleanedStr, true);
  // English tokens: drop English stopwords, then add synonyms and extract stems.
  // (addSynonyms and extractStem stand for the helpers built in the English section.)
  const englishWords = removeStopwords(
    tokens.filter((word) => /^[a-zA-Z]+$/.test(word))
  );
  const processedEnglishWords = englishWords.map(addSynonyms).map(extractStem);
  // Chinese tokens: drop Chinese stopwords, then add synonyms.
  const chineseWords = removeStopwords(
    tokens.filter((word) => /^[\u4e00-\u9fff]+$/.test(word)),
    zho
  ).map(addSynonyms);
  const result = [...chineseWords, ...processedEnglishWords];
  return result;
}
In this way we get a tokenizer that supports mixed Chinese and English text: it works on all-English text, all-Chinese text, and Chinese text sprinkled with English terms.
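For example, feeding it a mixed sentence might look like this (the actual tokens depend on the Jieba dictionary and on how the synonym and stemming helpers are implemented):
console.log(
  ChineseLanguageTokenizer("我最近在用 TypeScript 优化博客的全文搜索引擎")
);
// Expect Chinese tokens such as "优化", "博客" and "搜索引擎" alongside the
// processed English token for "TypeScript"; the exact output will vary.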
Japanese
Compared with English and Chinese, it is far more troublesome to devise a Japanese tokenizer because of the features of the Japanese writing system: Japanese text has two written forms, Kanji (漢字) and Kana (假名), so many Japanese words can be written in either Kanji or Kana.
However, Japanese phonology is very simple, and usually one kana corresponds to one syllable, so homophones are extremely common. Therefore, in written Japanese, many words need to be written in Chinese characters to distinguish them.
For example, these words have the same Kana form (かん) but different meanings.
- 缶 (かん) - means a can or jar.
- 観 (かん) - means "watching" and "observing".
- 漢 (かん) - means "Han nationality" or "Chinese Character".
- 幹 (かん) - means "cadre" or "mainstay".
This homograph situation sometimes causes confusion. For example, consider the Japanese sentence 「おはしをください」. In this sentence, 「はし」 is a homograph with three unrelated meanings.
- 箸(はし): Chopstick.
- 橋(はし): Bridge.
- 端(はし): Edge, end of something.
From the perspective of grammar and common sense, this sentence is completely valid. However, without a specific context, it is impossible to determine what the speaker is requesting.
- If the context is a restaurant, then「はし」could be chopsticks, and the sentence would mean "Please give me chopsticks".
- If the context is asking for directions, then「はし」could be bridge, and the sentence would mean "Please tell me where the bridge is" or "Please take me to the bridge."
- If the context is a physical location, such as the edge of a table, then「はし」could be edge, and the sentence would mean "Please pass something to the edge".
Only with a specific context can the true meaning of 「はし」 in this sentence be determined. Judging from the grammar and literal meaning alone, all three readings are valid and unrelated.
In my opinion, tokenization by itself is not enough for Japanese. Japanese should be tokenized into "ideas" rather than "words".
Take the sentence above as an example. If the context is a restaurant:
- お: An honorific prefix that expresses respect or politeness, attached to the following noun.
- はし: A noun meaning "chopsticks" in this context.
- を:A particle that marks the object.
- ください: The imperative form of the verb くださる, used to express a request, meaning "please (give me)".
And it outputs a set of ideas, "chopsticks" and "please", as the tokenization result. Fortunately, Kanji in Japanese serves exactly this function of marking ideas, so we can treat kana spellings as the same word whenever they share the same Kanji form.
So I take these steps for Japanese tokenization:
- Separate the sentence into components and word groups according to grammar rules.
- Remove functional words and stopwords.
- Revert every word to its basic form.
- Convert Kana into ideographic Kanji as much as possible, and output the resulting set of ideas.
The handling of synonyms and of Japanese-English mixed writing can follow the steps described above; they are similar by and large.
The common Japanese tokenizers in the community are MeCab, Sudachi and Kuromoji. I have sifted through them, but they feel outmoded and have lacked maintenance for years...
Just make do with them if you do not have a better alternative... (By the way, I want to complain about the technical atmosphere in Japan... it is hard to put my comments on the Japanese open-source community into words...)
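For what it's worth, a minimal sketch of the "tokenize into ideas" approach with the kuromoji.js binding might look like the following (assuming its bundled IPADIC dictionary sits under node_modules/kuromoji/dict; the part-of-speech filter and the resulting basic forms depend on the dictionary):
import kuromoji from "kuromoji";

kuromoji
  .builder({ dicPath: "node_modules/kuromoji/dict" })
  .build((err, tokenizer) => {
    if (err) throw err;
    const tokens = tokenizer.tokenize("おはしをください");
    // Keep content words (nouns and verbs), drop particles and prefixes,
    // and revert each token to its basic form.
    const ideas = tokens
      .filter((t) => t.pos === "名詞" || t.pos === "動詞")
      .map((t) =>
        t.basic_form && t.basic_form !== "*" ? t.basic_form : t.surface_form
      );
    console.log(ideas); // e.g. [ "はし", "くださる" ], depending on the dictionary
  });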
Epilogue
Of course, the tokenizer designed above is a very simple implementation, but it covers all the essential steps and is sufficient for a personal blog's search engine. More advanced full-text search features, such as semantic inference, belong to the research realm of NLP and artificial intelligence; they demand a lot of computing power and training data, and for a small site they are really unnecessary from a cost perspective.