How to Distinguish Natural Language from Formal Language
Zipf's Law was proposed by the American linguist George Kingsley Zipf during the 1930s–1940s. It was initially formulated to describe the distribution of word frequencies in natural language: in a corpus, the frequency of a word is inversely proportional to its rank in the frequency table. Character or symbol sequences generally exhibit high entropy and often follow a long-tail distribution.
In other words:
- The most frequent word occurs the most times.
- The second most frequent word occurs roughly half as often as the first.
- The third most frequent word occurs approximately one-third as often as the first.
- And so on.
Note: Here, “word” refers to its base form (lemma or stem). This distinction is important for languages with rich inflection, such as Indo-European languages (e.g., English, German, Russian), where a single word can have multiple morphological variants. Zipf's Law measures the frequency of base tokens; without normalization, each morphological form would be treated as a separate token, which “splits” the frequency of high-frequency words and affects the slope calculation.
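As a small illustrative sketch (the lemma table below is hypothetical and purely for demonstration, not a real NLP resource), here is what such normalization looks like before counting:

// Hypothetical lemma table: maps inflected forms to a base form so that
// all variants are counted as a single token.
const lemmaTable: Record<string, string> = {
  runs: 'run',
  running: 'run',
  ran: 'run',
};
function lemmatize(word: string): string {
  return lemmaTable[word] ?? word;
}
// Without this step, one lemma's frequency mass would be split across
// four separate entries in the frequency table.
console.log(['run', 'runs', 'running', 'ran'].map(lemmatize)); // all 'run'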
Zipf's Law is a typical power-law distribution. Mathematically, it can be expressed as:

$$f(r) \propto \frac{1}{r^s}$$

Where:
- $f(r)$: frequency of the word with rank $r$
- $r$: rank in descending order of frequency (1 = most frequent)
- $s$: exponent, usually close to 1 in natural language

When $s = 1$, the distribution is the classical Zipf distribution.
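As a minimal numeric sketch (my own example, treating the top word's count as the constant of proportionality), the expected counts for ranks 1–4 under $s = 1$ reproduce the halving/thirding pattern listed above:

// Expected frequency at a given rank under an ideal Zipf distribution,
// with f1 = count of the most frequent word.
function expectedFrequency(f1: number, rank: number, s = 1): number {
  return f1 / Math.pow(rank, s);
}
console.log([1, 2, 3, 4].map(r => expectedFrequency(600, r))); // [600, 300, 200, 150]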
For a text containing $N$ words, the probability of the word with rank $r$ is:

$$P(r) = \frac{1/r^s}{\sum_{k=1}^{V} 1/k^s}$$

Where:
- $V$: vocabulary size (number of unique words)
- The denominator is a normalization factor that ensures the probabilities sum to 1
This implies that higher-ranked words have higher probabilities, and the decay of probability follows a power law.
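Here is a short sketch of that probability (the function name is mine; the denominator is the generalized harmonic number, computed with a simple loop):

// Probability of the rank-r word over a vocabulary of V unique words,
// normalized by the generalized harmonic number so probabilities sum to 1.
function zipfProbability(r: number, V: number, s = 1): number {
  let H = 0;
  for (let k = 1; k <= V; k++) H += 1 / Math.pow(k, s);
  return (1 / Math.pow(r, s)) / H;
}
// With V = 1000 and s = 1, the top-ranked word alone gets about 13% of all tokens:
console.log(zipfProbability(1, 1000).toFixed(3)); // ≈ 0.134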
For visualization in charts, write the proportionality with a constant $C$, i.e. $f(r) = C / r^s$; taking logarithms gives the log-log form:

$$\log f(r) = \log C - s \log r$$

From this form, we can observe:
- On a log-log plot, frequency vs. rank is approximately linear
- Slope $= -s$
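A quick numeric check (illustrative only): for ideal Zipf data, any pair of points on the log-log plot yields exactly this slope:

// For ideal data f(r) = C / r^s, the slope between consecutive log-log
// points is exactly -s; real corpora only approximate this.
const C = 1000;
const sExp = 1;
const pts = [1, 2, 4, 8].map(r => ({ x: Math.log(r), y: Math.log(C / r ** sExp) }));
for (let i = 1; i < pts.length; i++) {
  const slope = (pts[i].y - pts[i - 1].y) / (pts[i].x - pts[i - 1].x);
  console.log(slope.toFixed(2)); // -1.00 each time
}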
Zipf proposed the Principle of Least Effort:
- Speakers: prefer using a limited set of high-frequency words to minimize effort.
- Listeners: prefer a rich vocabulary to reduce ambiguity.
- Resulting balance: a few high-frequency words + many low-frequency words → power-law distribution.
In essence, language emerges as a balance between speaker and listener efficiency.
Due to the above factors, natural language exhibits:
- Few high-frequency words, many low-frequency words
- Long-tail effect: most words occur only once or a few times
- Scalability: the larger the text, the more apparent the pattern
Zipf's Law informs multiple areas in Natural Language Processing (NLP):
- TF-IDF weighting
- Search engine ranking
- Word frequency analysis
- Stopword selection
- Corpus construction
- Text compression
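As a quick illustration of the stopword-selection point above (my own sketch, not a standard recipe): because top-ranked words dominate token counts, the k most frequent words form a cheap first-pass stopword candidate list:

// Return the k most frequent words; per Zipf's Law, these few words
// account for a disproportionate share of all tokens.
function stopwordCandidates(freq: Record<string, number>, k: number): string[] {
  return Object.entries(freq)
    .sort((a, b) => b[1] - a[1])
    .slice(0, k)
    .map(([word]) => word);
}
console.log(stopwordCandidates({ the: 120, of: 70, cat: 3, sat: 2 }, 2)); // ['the', 'of']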
Fundamentally, Zipf's Law reflects a general power-law phenomenon, not limited to language. A few caveats apply:
- Short texts: the law is less apparent in small corpora, as it is a macro-level statistical phenomenon.
- Formal symbols: some formal languages or programming code may exhibit similar distributions, but with more regular patterns; additional features are needed for interpretation.
- Exponent $s$: may vary around 0.9–1.1 depending on language and corpus, not necessarily exactly 1.
Zipf's Law reveals the power-law distribution of word frequency in natural language: few words are extremely frequent, many words are rare, forming a “long tail.” On a log-log plot, this relationship approximates a straight line. It serves as a foundational statistical principle in natural language, information retrieval, and social sciences.
Based on Zipf's Law as described in this blog post, we can create a TypeScript program that roughly determines whether a text is natural language or formal language. The key idea is:
- Count word frequencies to get the number of times each word appears.
- Estimate the Zipf exponent $s$, or at least determine whether the word frequency distribution has a long tail.
- If the word frequency distribution in a text approximates a power-law distribution with a significant long tail → natural language
- If most words have similar frequencies or a very regular distribution → formal language/code
Here’s a TypeScript implementation that combines the Zipf slope with a long-tail ratio to make the distinction between natural language and formal language/code more robust, even for relatively short texts:
// enhanced-language-detector.ts
type WordFreq = Record<string, number>;
/**
* Tokenize the text: split by non-word characters
*/
function tokenize(text: string): string[] {
return text
.toLowerCase()
.replace(/[^a-zA-Z0-9_]+/g, ' ')
.split(/\s+/)
.filter(Boolean);
}
/**
* Count word frequencies
*/
function countFrequencies(words: string[]): WordFreq {
const freq: WordFreq = {};
for (const word of words) {
freq[word] = (freq[word] || 0) + 1;
}
return freq;
}
/**
* Sort frequencies in descending order
*/
function sortFrequencies(freq: WordFreq): number[] {
return Object.values(freq).sort((a, b) => b - a);
}
/**
* Estimate the Zipf slope using log-log regression
*/
function estimateZipfSlope(freqs: number[]): number {
const n = freqs.length;
if (n < 2) return 0; // degenerate case: a single unique word gives no usable slope
const ranks = freqs.map((_, i) => i + 1);
const logF = freqs.map(f => Math.log(f));
const logR = ranks.map(r => Math.log(r));
const avgLogF = logF.reduce((a, b) => a + b, 0) / n;
const avgLogR = logR.reduce((a, b) => a + b, 0) / n;
let numerator = 0;
let denominator = 0;
for (let i = 0; i < n; i++) {
numerator += (logR[i] - avgLogR) * (logF[i] - avgLogF);
denominator += (logR[i] - avgLogR) ** 2;
}
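// The raw regression slope is negative for Zipf-like data; negating it
// reports the exponent s as a positive number.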
return -numerator / denominator;
}
/**
* Calculate long-tail ratio: proportion of words that appear only once
*/
function longTailRatio(freqs: number[]): number {
const singletons = freqs.filter(f => f === 1).length;
return singletons / freqs.length;
}
/**
* Detect if text is natural language or formal language/code
*/
function detectLanguageType(text: string): string {
const words = tokenize(text);
if (words.length < 20) return 'Text too short to decide';
const freqMap = countFrequencies(words);
const freqs = sortFrequencies(freqMap);
const slope = estimateZipfSlope(freqs);
const tailRatio = longTailRatio(freqs);
// Heuristic rules (rough thresholds; theory predicts slope ≈ 1, but
// short samples are noisy, so a wider 0.7-1.3 band is accepted):
// Natural language: slope in 0.7-1.3 and long-tail ratio > 0.4
if (slope > 0.7 && slope < 1.3 && tailRatio > 0.4) {
return `Probably natural language (Zipf slope ≈ ${slope.toFixed(2)}, long-tail ratio ≈ ${(
tailRatio * 100
).toFixed(1)}%)`;
} else {
return `Probably formal language or code (Zipf slope ≈ ${slope.toFixed(2)}, long-tail ratio ≈ ${(
tailRatio * 100
).toFixed(1)}%)`;
}
}
// --------- Test ---------
// Note: both samples are deliberately just past the 20-word minimum;
// slope estimates on texts this short are still fairly noisy.
const naturalText = `
The quick brown fox jumps over the lazy dog and the dog barks at
the fox while the fox runs into the woods and the dog follows the
fox into the woods.
`;
const formalText = `
int main() {
  int a = 0;
  int b = 0;
  int c = 0;
  a = a + 1;
  b = b + 1;
  c = c + 1;
  a = a + b + c;
  b = a + b + c;
  c = a + b + c;
  return a + b + c;
}
`;
console.log(detectLanguageType(naturalText));
// → Probably natural language (slope ≈ 0.80, long-tail ratio ≈ 62.5%)
console.log(detectLanguageType(formalText));
// → Probably formal language or code (slope in range, but long-tail ratio ≈ 25%)