How to Distinguish Natural Language and Formal Language

    Zipf's statistical law for symbolic text.
    28 Aug, 2025

    Zipf's Law was proposed by the American linguist George Kingsley Zipf during the 1930s–1940s. It was initially formulated to describe the distribution of word frequencies in natural language: in a corpus, the frequency of a word is inversely proportional to its rank in the frequency table. Character or symbol sequences generally exhibit high entropy and often follow a long-tail distribution.

    In other words:

    • The most frequent word occurs the most times.
    • The second most frequent word occurs roughly half as often as the first.
    • The third most frequent word occurs approximately one-third as often as the first.
    • And so on.
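
    For example, if the most frequent word appears 1,000 times in a sample, Zipf's Law predicts roughly 500 occurrences for the second-ranked word and about 333 for the third.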

    Note: Here, “word” refers to its base form (lemma or stem). This distinction is important for languages with rich inflection, such as Indo-European languages (e.g., English, German, Russian), where a single word can have multiple morphological variants. Zipf's Law measures the frequency of base tokens; without normalization, each morphological form would be treated as a separate token, which “splits” the frequency of high-frequency words and affects the slope calculation.
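
    To see the splitting effect concretely, here is a minimal TypeScript sketch; the tiny lemma table is invented purely for illustration:

    // Toy example: counting surface forms vs. lemmas
    const tokens = ['run', 'runs', 'running', 'ran', 'run', 'walks'];
    const lemmaOf: Record<string, string> = {
      run: 'run', runs: 'run', running: 'run', ran: 'run', walks: 'walk',
    };

    function count(items: string[]): Record<string, number> {
      const freq: Record<string, number> = {};
      for (const item of items) freq[item] = (freq[item] || 0) + 1;
      return freq;
    }

    console.log(count(tokens));
    // { run: 2, runs: 1, running: 1, ran: 1, walks: 1 }
    // The lemma "run" is split across four surface forms.

    console.log(count(tokens.map(t => lemmaOf[t] ?? t)));
    // { run: 5, walk: 1 }
    // Normalization restores the true high-frequency word.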

    Zipf's Law is a typical power-law distribution. Mathematically, it can be expressed as:

    f(r) \propto \frac{1}{r^s}

    Where:

    • f(r): frequency of the word with rank r
    • r: rank in descending frequency order (1 = most frequent)
    • s: exponent, usually close to 1 in natural language

    When s = 1, the distribution is the classical Zipf distribution.

    For a text containing N words, the probability of the word with rank r is:

    P(r) = \frac{\frac{1}{r^s}}{\sum_{k=1}^{V} \frac{1}{k^s}}

    Where:

    • V: vocabulary size (number of unique words)
    • The denominator is a normalization factor to ensure the probabilities sum to 1

    This implies that higher-ranked words have higher probabilities, and the decay of probability follows a power law.
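
    As a quick sanity check, the normalized probability is easy to compute directly. A minimal TypeScript sketch (the function name is mine):

    /**
     * P(r) = (1/r^s) / sum_{k=1..V} (1/k^s)
     */
    function zipfProbability(r: number, s: number, V: number): number {
      let norm = 0;
      for (let k = 1; k <= V; k++) norm += 1 / Math.pow(k, s);
      return 1 / Math.pow(r, s) / norm;
    }

    // With s = 1 and a 1,000-word vocabulary, probabilities decay as 1/r:
    console.log(zipfProbability(1, 1, 1000).toFixed(4)); // 0.1336
    console.log(zipfProbability(2, 1, 1000).toFixed(4)); // 0.0668
    console.log(zipfProbability(3, 1, 1000).toFixed(4)); // 0.0445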

    For visualization, taking the logarithm of both sides turns the power law into a straight-line relationship (the constant C absorbs the proportionality factor):

    \log f(r) = -s \log r + C

    From this form, we can observe:

    • On a log-log plot, frequency f(r) vs. rank r is approximately linear
    • Slope = -s
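
    A quick way to convince yourself: generate ideal Zipf frequencies for a known exponent and check that a least-squares fit over the log-log points recovers it. A throwaway sketch on synthetic data (the same regression reappears in the detector below):

    // Ideal Zipf frequencies with s = 1: f(r) = 1000 / r for ranks 1..100
    const points = Array.from({ length: 100 }, (_, i) => {
      const r = i + 1;
      return { x: Math.log(r), y: Math.log(1000 / r) };
    });

    // Least-squares slope of y on x; for this synthetic data it is exactly -1
    const mx = points.reduce((sum, p) => sum + p.x, 0) / points.length;
    const my = points.reduce((sum, p) => sum + p.y, 0) / points.length;
    const slope =
      points.reduce((sum, p) => sum + (p.x - mx) * (p.y - my), 0) /
      points.reduce((sum, p) => sum + (p.x - mx) ** 2, 0);
    console.log(slope.toFixed(2)); // -1.00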

    Zipf proposed the Principle of Least Effort:

    1. Speakers: prefer using a limited set of high-frequency words to minimize effort.
    2. Listeners: prefer a rich vocabulary to reduce ambiguity.
    3. Resulting balance: a few high-frequency words + many low-frequency words → power-law distribution.

    In essence, language emerges as a balance between speaker and listener efficiency.

    Due to the above factors, natural language exhibits:

    • Few high-frequency words, many low-frequency words
    • Long-tail effect: most words occur only once or a few times
    • Scalability: the larger the text, the more apparent the pattern

    Zipf's Law informs multiple areas in Natural Language Processing (NLP):

    • TF-IDF weighting
    • Search engine ranking
    • Word frequency analysis
    • Stopword selection
    • Corpus construction
    • Text compression
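
    One of these connections is worth seeing in code: under an ideal Zipf distribution, a handful of top-ranked words covers a large share of all tokens, which is why stopword lists can stay short. A rough sketch under the s = 1 assumption (helper names are mine):

    // Fraction of all tokens covered by the top-k ranks under ideal Zipf (s = 1):
    // coverage(k) = H(k) / H(V), where H(n) is the n-th harmonic number
    function topKCoverage(k: number, V: number): number {
      const harmonic = (n: number) => {
        let h = 0;
        for (let i = 1; i <= n; i++) h += 1 / i;
        return h;
      };
      return harmonic(k) / harmonic(V);
    }

    // With a 50,000-word vocabulary, the top 100 words cover almost half of all tokens:
    console.log((topKCoverage(100, 50_000) * 100).toFixed(1) + '%'); // 45.5%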

    Fundamentally, Zipf's Law reflects a general power-law phenomenon, not limited to language. A few caveats apply when using it in practice:

    1. Short texts: the law is less apparent in small corpora, as it is a macro-level statistical phenomenon.
    2. Formal symbols: some formal languages or programming code may exhibit superficially similar distributions, but with more regular patterns; frequency statistics alone are not always enough, and additional features may be needed to tell them apart.
    3. Exponent ss: may vary around 0.9–1.1 depending on language and corpus, not necessarily exactly 1.

    Zipf's Law reveals the power-law distribution of word frequency in natural language: few words are extremely frequent, many words are rare, forming a “long tail.” On a log-log plot, this relationship approximates a straight line. It serves as a foundational statistical principle in natural language, information retrieval, and social sciences.

    Based on the knowledge of Zipf's law from this blog post, we can create a TypeScript program that roughly determines whether a text is a natural language or a formal language. The key idea is:

    • Count how many times each word appears.
    • Estimate the Zipf exponent s, or at least determine whether the frequency distribution has a long tail.
    • If the word frequency distribution approximates a power law with a significant long tail → natural language
    • If most words have similar frequencies or the distribution is very regular → formal language/code

    Here is a TypeScript version that combines the Zipf slope with a long-tail ratio to make the distinction between natural language and formal language/code more robust, even for short texts:

    // enhanced-language-detector.ts
    
    type WordFreq = Record<string, number>;
    
    /**
     * Tokenize the text: split by non-word characters
     */
    function tokenize(text: string): string[] {
      return text
        .toLowerCase()
        .replace(/[^a-zA-Z0-9_]+/g, ' ')
        .split(/\s+/)
        .filter(Boolean);
    }
    
    /**
     * Count word frequencies
     */
    function countFrequencies(words: string[]): WordFreq {
      const freq: WordFreq = {};
      for (const word of words) {
        freq[word] = (freq[word] || 0) + 1;
      }
      return freq;
    }
    
    /**
     * Sort frequencies in descending order
     */
    function sortFrequencies(freq: WordFreq): number[] {
      return Object.values(freq).sort((a, b) => b - a);
    }
    
    /**
     * Estimate the Zipf slope using log-log regression
     */
    function estimateZipfSlope(freqs: number[]): number {
      const n = freqs.length;
      const ranks = freqs.map((_, i) => i + 1);
      const logF = freqs.map(f => Math.log(f));
      const logR = ranks.map(r => Math.log(r));
    
      const avgLogF = logF.reduce((a, b) => a + b, 0) / n;
      const avgLogR = logR.reduce((a, b) => a + b, 0) / n;
    
      let numerator = 0;
      let denominator = 0;
      for (let i = 0; i < n; i++) {
        numerator += (logR[i] - avgLogR) * (logF[i] - avgLogF);
        denominator += (logR[i] - avgLogR) ** 2;
      }
    
      // Guard against the degenerate case of a single unique word
      return denominator === 0 ? 0 : -numerator / denominator;
    }
    
    /**
     * Calculate long-tail ratio: proportion of words that appear only once
     */
    function longTailRatio(freqs: number[]): number {
      const singletons = freqs.filter(f => f === 1).length;
      return singletons / freqs.length;
    }
    
    /**
     * Detect if text is natural language or formal language/code
     */
    function detectLanguageType(text: string): string {
      const words = tokenize(text);
      if (words.length < 20) return 'Text too short to decide';
    
      const freqMap = countFrequencies(words);
      const freqs = sortFrequencies(freqMap);
    
      const slope = estimateZipfSlope(freqs);
      const tailRatio = longTailRatio(freqs);
    
      // Heuristic rules (rough thresholds):
      // - A strong long tail (most words occur only once) is typical of natural
      //   language, and it is the more reliable signal for short texts, where
      //   the fitted slope stays close to flat.
      // - On longer texts, a slope near 1 (s ≈ 0.9–1.1) combined with a
      //   moderate long tail also points to natural language.
      const looksNatural =
        tailRatio > 0.5 || (slope > 0.7 && slope < 1.3 && tailRatio > 0.4);
      if (looksNatural) {
        return `Probably natural language (Zipf slope ≈ ${slope.toFixed(2)}, long-tail ratio ≈ ${(
          tailRatio * 100
        ).toFixed(1)}%)`;
      } else {
        return `Probably formal language or code (Zipf slope ≈ ${slope.toFixed(2)}, long-tail ratio ≈ ${(
          tailRatio * 100
        ).toFixed(1)}%)`;
      }
    }
    
    // --------- Test ---------
    const naturalText = `
    Zipf's law was proposed by George Kingsley Zipf during the 1930s–1940s.
    It describes word frequency distributions in natural language.
    `;
    
    // Long enough to clear the 20-token minimum in detectLanguageType
    const formalText = `
    int main() {
      int a = 0;
      int b = 1;
      int c = a + b;
      int d = b + c;
      int e = c + d;
      int f = d + e;
      return f;
    }
    `;
    
    console.log(detectLanguageType(naturalText)); // Probably natural language (high long-tail ratio)
    console.log(detectLanguageType(formalText));  // Probably formal language or code
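
    A note on the heuristic: with only a few dozen tokens, nearly every word is a singleton and the fitted slope stays close to flat, so the long-tail ratio carries most of the signal for short inputs; the slope check only becomes meaningful on longer corpora. The thresholds used here are rough starting points, in line with the s ≈ 0.9–1.1 range noted earlier, and should be tuned on real data.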