In natural language, the frequency of a word is inversely proportional to its rank in the frequency table. Symbol sequences usually have high entropy and a long-tail distribution.
NLP
How to Distinguish Natural Language and Formal Language
The Zipf Statistical Theory for Symbol Text.
28 Aug, 2025
NLP
Some Tips for Designing a Tokenizer for a Search Engine
Taking Chinese, English, and Japanese as Examples
I've recently been planning some optimizations to the full-text search engine for my blog site, including better Chinese and English language support, and a few ideas have been running through my mind.
18 Jan, 2025
PROJECT-PRACTICE NLP