How to Distinguish Natural Language and Formal Language

    Zipf's statistical law for symbolic text.
    28 Aug, 2025

    Zipf's Law was proposed by the American linguist George Kingsley Zipf during the 1930s–1940s. It was initially formulated to describe the distribution of word frequencies in natural language: in a corpus, the frequency of a word is inversely proportional to its rank in the frequency table. Character or symbol sequences generally exhibit high entropy and often follow a long-tail distribution.

    In other words:

    • The most frequent word occurs the most times.
    • The second most frequent word occurs roughly half as often as the first.
    • The third most frequent word occurs approximately one-third as often as the first.
    • And so on.
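
    For example, if the most frequent word appears 1,000 times in a sample, Zipf's Law predicts roughly 500 occurrences for the second-ranked word and about 333 for the third.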

    Note: Here, “word” refers to its base form (lemma or stem). This distinction is important for languages with rich inflection, such as Indo-European languages (e.g., English, German, Russian), where a single word can have multiple morphological variants. Zipf's Law measures the frequency of base tokens; without normalization, each morphological form would be treated as a separate token, which “splits” the frequency of high-frequency words and affects the slope calculation.
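
    To see the splitting effect concretely, here is a minimal TypeScript sketch; the tiny lemma table is invented purely for illustration:

    // Toy example: counting surface forms vs. lemmas
    const tokens = ['run', 'runs', 'running', 'ran', 'run', 'walks'];
    const lemmaOf: Record<string, string> = {
      run: 'run', runs: 'run', running: 'run', ran: 'run', walks: 'walk',
    };

    function count(items: string[]): Record<string, number> {
      const freq: Record<string, number> = {};
      for (const item of items) freq[item] = (freq[item] || 0) + 1;
      return freq;
    }

    console.log(count(tokens));
    // { run: 2, runs: 1, running: 1, ran: 1, walks: 1 }
    // The lemma "run" is split across four surface forms.

    console.log(count(tokens.map(t => lemmaOf[t] ?? t)));
    // { run: 5, walk: 1 }
    // Normalization restores the true high-frequency word.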

    Zipf's Law is a typical power-law distribution. Mathematically, it can be expressed as:

    f(r) \propto \frac{1}{r^s}

    Where:

    • f(r): frequency of the word with rank r
    • r: rank in descending frequency order (1 = most frequent)
    • s: exponent, usually close to 1 in natural language

    When s = 1, the distribution is the classical Zipf distribution.

    For a text containing N words, the probability of the word with rank r is:

    P(r) = \frac{\frac{1}{r^s}}{\sum_{k=1}^{V} \frac{1}{k^s}}

    Where:

    • V: vocabulary size (number of unique words)
    • The denominator is a normalization factor to ensure the probabilities sum to 1

    This implies that higher-ranked words have higher probabilities, and the decay of probability follows a power law.
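
    As a quick sanity check, the normalized probability is easy to compute directly. A minimal TypeScript sketch (the function name is mine):

    /**
     * P(r) = (1/r^s) / sum_{k=1..V} (1/k^s)
     */
    function zipfProbability(r: number, s: number, V: number): number {
      let norm = 0;
      for (let k = 1; k <= V; k++) norm += 1 / Math.pow(k, s);
      return 1 / Math.pow(r, s) / norm;
    }

    // With s = 1 and a 1,000-word vocabulary, probabilities decay as 1/r:
    console.log(zipfProbability(1, 1, 1000).toFixed(4)); // 0.1336
    console.log(zipfProbability(2, 1, 1000).toFixed(4)); // 0.0668
    console.log(zipfProbability(3, 1, 1000).toFixed(4)); // 0.0445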

    For visualization, taking the logarithm of both sides turns the power law into a straight-line relationship (the constant C absorbs the proportionality factor):

    \log f(r) = -s \log r + C

    From this form, we can observe:

    • On a log-log plot, frequency f(r) vs. rank r is approximately linear
    • Slope = -s
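
    A quick way to convince yourself: generate ideal Zipf frequencies for a known exponent and check that a least-squares fit over the log-log points recovers it. A throwaway sketch on synthetic data (the same regression reappears in the detector below):

    // Ideal Zipf frequencies with s = 1: f(r) = 1000 / r for ranks 1..100
    const points = Array.from({ length: 100 }, (_, i) => {
      const r = i + 1;
      return { x: Math.log(r), y: Math.log(1000 / r) };
    });

    // Least-squares slope of y on x; for this synthetic data it is exactly -1
    const mx = points.reduce((sum, p) => sum + p.x, 0) / points.length;
    const my = points.reduce((sum, p) => sum + p.y, 0) / points.length;
    const slope =
      points.reduce((sum, p) => sum + (p.x - mx) * (p.y - my), 0) /
      points.reduce((sum, p) => sum + (p.x - mx) ** 2, 0);
    console.log(slope.toFixed(2)); // -1.00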

    Zipf proposed the Principle of Least Effort:

    1. Speakers: prefer using a limited set of high-frequency words to minimize effort.
    2. Listeners: prefer a rich vocabulary to reduce ambiguity.
    3. Resulting balance: a few high-frequency words + many low-frequency words → power-law distribution.

    In essence, language emerges as a balance between speaker and listener efficiency.

    Due to the above factors, natural language exhibits:

    • Few high-frequency words, many low-frequency words
    • Long-tail effect: most words occur only once or a few times
    • Scalability: the larger the text, the more apparent the pattern

    Zipf's Law informs multiple areas in Natural Language Processing (NLP):

    • TF-IDF weighting
    • Search engine ranking
    • Word frequency analysis
    • Stopword selection
    • Corpus construction
    • Text compression
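
    One of these connections is worth seeing in code: under an ideal Zipf distribution, a handful of top-ranked words covers a large share of all tokens, which is why stopword lists can stay short. A rough sketch under the s = 1 assumption (helper names are mine):

    // Fraction of all tokens covered by the top-k ranks under ideal Zipf (s = 1):
    // coverage(k) = H(k) / H(V), where H(n) is the n-th harmonic number
    function topKCoverage(k: number, V: number): number {
      const harmonic = (n: number) => {
        let h = 0;
        for (let i = 1; i <= n; i++) h += 1 / i;
        return h;
      };
      return harmonic(k) / harmonic(V);
    }

    // With a 50,000-word vocabulary, the top 100 words cover almost half of all tokens:
    console.log((topKCoverage(100, 50_000) * 100).toFixed(1) + '%'); // 45.5%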

    Fundamentally, Zipf's Law reflects a general power-law phenomenon, not limited to language. A few caveats apply when using it in practice:

    1. Short texts: the law is less apparent in small corpora, as it is a macro-level statistical phenomenon.
    2. Formal symbols: some formal languages or programming code may exhibit superficially similar distributions, but with more regular patterns; frequency statistics alone are not always enough, and additional features may be needed to tell them apart.
    3. Exponent ss: may vary around 0.9–1.1 depending on language and corpus, not necessarily exactly 1.

    Zipf's Law reveals the power-law distribution of word frequency in natural language: few words are extremely frequent, many words are rare, forming a “long tail.” On a log-log plot, this relationship approximates a straight line. It serves as a foundational statistical principle in natural language, information retrieval, and social sciences.

    Based on the knowledge of Zipf's law from this blog post, we can create a TypeScript program that roughly determines whether a text is a natural language or a formal language. The key idea is:

    • Count how many times each word appears.
    • Estimate the Zipf exponent s, or at least determine whether the frequency distribution has a long tail.
    • If the word frequency distribution approximates a power law with a significant long tail → natural language
    • If most words have similar frequencies or the distribution is very regular → formal language/code

    Here is a TypeScript version that combines the Zipf slope with a long-tail ratio to make the distinction between natural language and formal language/code more robust, even for short texts:

    // enhanced-language-detector.ts
    
    type WordFreq = Record<string, number>;
    
    /**
     * Tokenize the text: split by non-word characters
     */
    function tokenize(text: string): string[] {
      return text
        .toLowerCase()
        .replace(/[^a-zA-Z0-9_]+/g, ' ')
        .split(/\s+/)
        .filter(Boolean);
    }
    
    /**
     * Count word frequencies
     */
    function countFrequencies(words: string[]): WordFreq {
      const freq: WordFreq = {};
      for (const word of words) {
        freq[word] = (freq[word] || 0) + 1;
      }
      return freq;
    }
    
    /**
     * Sort frequencies in descending order
     */
    function sortFrequencies(freq: WordFreq): number[] {
      return Object.values(freq).sort((a, b) => b - a);
    }
    
    /**
     * Estimate the Zipf slope using log-log regression
     */
    function estimateZipfSlope(freqs: number[]): number {
      const n = freqs.length;
      const ranks = freqs.map((_, i) => i + 1);
      const logF = freqs.map(f => Math.log(f));
      const logR = ranks.map(r => Math.log(r));
    
      const avgLogF = logF.reduce((a, b) => a + b, 0) / n;
      const avgLogR = logR.reduce((a, b) => a + b, 0) / n;
    
      let numerator = 0;
      let denominator = 0;
      for (let i = 0; i < n; i++) {
        numerator += (logR[i] - avgLogR) * (logF[i] - avgLogF);
        denominator += (logR[i] - avgLogR) ** 2;
      }
    
      // Guard against the degenerate case of a single unique word
      return denominator === 0 ? 0 : -numerator / denominator;
    }
    
    /**
     * Calculate long-tail ratio: proportion of words that appear only once
     */
    function longTailRatio(freqs: number[]): number {
      const singletons = freqs.filter(f => f === 1).length;
      return singletons / freqs.length;
    }
    
    /**
     * Detect if text is natural language or formal language/code
     */
    function detectLanguageType(text: string): string {
      const words = tokenize(text);
      if (words.length < 20) return 'Text too short to decide';
    
      const freqMap = countFrequencies(words);
      const freqs = sortFrequencies(freqMap);
    
      const slope = estimateZipfSlope(freqs);
      const tailRatio = longTailRatio(freqs);
    
      // Heuristic rules (rough thresholds):
      // - A strong long tail (most words occur only once) is typical of natural
      //   language, and it is the more reliable signal for short texts, where
      //   the fitted slope stays close to flat.
      // - On longer texts, a slope near 1 (s ≈ 0.9–1.1) combined with a
      //   moderate long tail also points to natural language.
      const looksNatural =
        tailRatio > 0.5 || (slope > 0.7 && slope < 1.3 && tailRatio > 0.4);
      if (looksNatural) {
        return `Probably natural language (Zipf slope ≈ ${slope.toFixed(2)}, long-tail ratio ≈ ${(
          tailRatio * 100
        ).toFixed(1)}%)`;
      } else {
        return `Probably formal language or code (Zipf slope ≈ ${slope.toFixed(2)}, long-tail ratio ≈ ${(
          tailRatio * 100
        ).toFixed(1)}%)`;
      }
    }
    
    // --------- Test ---------
    const naturalText = `
    Zipf's law was proposed by George Kingsley Zipf during the 1930s–1940s.
    It describes word frequency distributions in natural language.
    `;
    
    // Long enough to clear the 20-token minimum in detectLanguageType
    const formalText = `
    int main() {
      int a = 0;
      int b = 1;
      int c = a + b;
      int d = b + c;
      int e = c + d;
      int f = d + e;
      return f;
    }
    `;
    
    console.log(detectLanguageType(naturalText)); // Probably natural language (high long-tail ratio)
    console.log(detectLanguageType(formalText));  // Probably formal language or code
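
    A note on the heuristic: with only a few dozen tokens, nearly every word is a singleton and the fitted slope stays close to flat, so the long-tail ratio carries most of the signal for short inputs; the slope check only becomes meaningful on longer corpora. The thresholds used here are rough starting points, in line with the s ≈ 0.9–1.1 range noted earlier, and should be tuned on real data.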