Artificial Wasteland · The Verification Venue · a combine portal

One Lumpy Language

SEAM · LANGUAGE · THREE INSTRUMENTS, ONE DISTRIBUTION · IC = 2^(−H₂)

English is not uniform. Three layers of this place each prove it with a different number — and not one of them says the numbers are the same number, measured differently.

Companion film · 2:30. The whole spine, animated: the lump, the identity IC = 2^(−H₂), the Rényi descent with Shannon and the index of coincidence as two dots on one falling line, the flatten-the-language morph that drives every instrument to its uniform limit at once, and Zipf one floor up. Every number on screen is recomputed live in your browser from the same committed distribution the verifier uses (research/lumpiness-of-english/verify.mjs, 21/21); the music is a declared sonification of the Rényi descent and its flatten. Reproducible from a fresh checkout: bash research/lumpiness-of-english/film/build.sh.

Pick a letter at random from an English book and it is almost never q and almost always e. That single dull fact — that the language puts uneven weight on its symbols — is the hinge of three separate stories already told in this place.

Claude Shannon weighed the unevenness in bits, and got the entropy of English. Cryptanalysts weighed it as a collision probability — the index of coincidence — and used it to break the cipher called unbreakable for three centuries. George Zipf weighed it one floor up, at the level of whole words, and found a law. Information theory, cryptanalysis, quantitative linguistics — three fields, three famous numbers.

What none of the three pieces says aloud is the thing that makes them one piece: they are the same frequency distribution, read at different orders. The index of coincidence is not a cousin of the entropy. It is the entropy: the order-2 member of the same family Shannon's bits are the order-1 member of. Below, the identity is computed live, the language is flattened by hand to watch every instrument reach its uniform limit at once, and the whole spine is checked against the three layers it gathers.

I · The same letters, two orders

Live · single-letter distribution, recomputed in your browser from each book's committed letter counts
uniform · 1/26
H₀ · Hartley (α→0): log₂(letters present)
H₁ · Shannon (α=1): −Σ pᵢ log₂ pᵢ, in bits
H₂ · collision (α=2): −log₂ Σ pᵢ²
H∞ · min (α→∞): −log₂ max pᵢ

H₁ is the entropy stratum's bits. H₂ is just the index of coincidence wearing a logarithm: IC = Σ pᵢ², and H₂ = −log₂ IC, so IC = 2^(−H₂) exactly. Shannon's "4.1 bits" and the cipher-breaker's "0.066" are one distribution, sampled at α = 1 and α = 2. Every order α > 0 sits below the ceiling log₂26 = 4.70, and that gap, in any currency, is the lumpiness.
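A minimal sketch of that identity, assuming nothing more than an array of symbol counts. The function names and the toy counts are illustrative, not taken from verify.mjs or from any committed corpus.

```javascript
// Turn raw counts into probabilities that sum to 1.
function normalize(counts) {
  const total = counts.reduce((a, b) => a + b, 0);
  return counts.map((c) => c / total);
}

// Shannon entropy H1 in bits; symbols with zero probability contribute nothing.
function shannonEntropy(p) {
  return -p.reduce((s, x) => (x > 0 ? s + x * Math.log2(x) : s), 0);
}

// Index of coincidence as the collision probability: the sum of p_i squared.
function indexOfCoincidence(p) {
  return p.reduce((s, x) => s + x * x, 0);
}

// Collision entropy H2 = -log2(IC).
function collisionEntropy(p) {
  return -Math.log2(indexOfCoincidence(p));
}

// Toy 8-symbol alphabet (assumed counts, for illustration only).
const toyCounts = [12, 9, 8, 7, 4, 3, 2, 1];
const p = normalize(toyCounts);
const ic = indexOfCoincidence(p);
const h1 = shannonEntropy(p);
const h2 = collisionEntropy(p);

// identityGap should be zero up to floating-point rounding.
console.log({ ic, h1, h2, identityGap: Math.abs(ic - 2 ** -h2) });
```

The same four functions run unchanged on a 26-entry array of letter counts; only the ceiling, log₂ of the alphabet size, changes.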

II · Flatten the language

If the three numbers really are one fact, then erasing the fact should move all of them together. Drag English toward a coin-flat alphabet, pₜ = (1−t)·p + t·(1/26), and watch the histogram level out while every entropy climbs to the ceiling and the index of coincidence falls to its floor, in lockstep. At t = 1 the index of coincidence lands on 1/26 = 0.0385, the cipher's "random" baseline, and all the entropies meet at 4.70. One knob; one fact; four readings.

Live · interpolate the chosen book's letters toward uniform
t = 0.00
index of coincidence → 0.0385 uniform
Shannon H₁ · bits → 4.70 ceiling
collision H₂ · bits = −log₂ IC
redundancy (order 1) = 1 − H₁ / log₂26
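The flattening knob can be sketched in a few lines. This is an illustration under stated assumptions: a toy 4-letter distribution stands in for the 26-letter one, and flattenToward and readings are hypothetical names, not the page's own code.

```javascript
// Blend a distribution toward uniform: p_t = (1 - t) * p + t * (1 / n).
function flattenToward(p, t) {
  const u = 1 / p.length;
  return p.map((x) => (1 - t) * x + t * u);
}

// The four readings of one distribution.
function readings(p) {
  const ic = p.reduce((s, x) => s + x * x, 0);
  const h1 = -p.reduce((s, x) => (x > 0 ? s + x * Math.log2(x) : s), 0);
  const ceiling = Math.log2(p.length);
  return { ic, h1, h2: -Math.log2(ic), redundancy: 1 - h1 / ceiling };
}

// Toy 4-letter "language" (assumed values); the page uses 26 letters.
const lumpy = [0.55, 0.25, 0.15, 0.05];
for (const t of [0, 0.25, 0.5, 0.75, 1]) {
  console.log(t, readings(flattenToward(lumpy, t)));
}
```

As t rises the index of coincidence only falls and the entropies only rise; at t = 1 the toy alphabet reads IC = 1/4 and H₁ = H₂ = 2 bits, its own version of 0.0385 and 4.70.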

III · The dial between them

α = 1 and α = 2 are two stops on a continuous dial: the Rényi entropy Hα = (1/(1−α))·log₂ Σ pᵢ^α, with α = 1 read as the limit α→1, which is Shannon's entropy. It is provably non-increasing as you turn α up: a higher order weights the heavy letters more heavily, so it reports more lumpiness and a lower number. Shannon's entropy and the index of coincidence (as H₂ = −log₂ IC) are simply where this one curve crosses α = 1 and α = 2. For a perfectly flat language the curve would be a flat line at the ceiling; the bend below is, again, the lumpiness.

Live · Hα of the chosen book across the Rényi family

The curve only ever falls. It touches the ceiling log₂26 only in the limit α→0, and it would be flat at the ceiling only if every letter were equally likely. Shannon (α=1) and the index of coincidence (α=2) are marked: two readings of one monotone descent.
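A sketch of the dial itself, assuming a probability array as input. The name renyiEntropy and the toy distribution are illustrative; α = 1 and α = ∞ are handled as the limiting cases, because the general formula divides by 1 − α.

```javascript
// Renyi entropy of order alpha, in bits.
function renyiEntropy(p, alpha) {
  const support = p.filter((x) => x > 0);
  // Limit alpha -> infinity: min-entropy, set by the heaviest symbol alone.
  if (alpha === Infinity) return -Math.log2(Math.max(...support));
  // Limit alpha -> 1: Shannon entropy.
  if (Math.abs(alpha - 1) < 1e-9) {
    return -support.reduce((s, x) => s + x * Math.log2(x), 0);
  }
  // General case: (1 / (1 - alpha)) * log2(sum of p_i^alpha).
  // At alpha = 0 this is log2(number of symbols present), the Hartley entropy.
  const sum = support.reduce((s, x) => s + Math.pow(x, alpha), 0);
  return Math.log2(sum) / (1 - alpha);
}

// Toy distribution (assumed values, not a book's letter counts).
const toy = [0.5, 0.25, 0.125, 0.125];

// Sweep the dial and confirm the curve never rises.
let previous = Infinity;
let monotone = true;
for (let i = 0; i <= 40; i++) {
  const h = renyiEntropy(toy, i * 0.25);
  if (h > previous + 1e-12) monotone = false;
  previous = h;
}
console.log({
  h0: renyiEntropy(toy, 0),
  h1: renyiEntropy(toy, 1),
  h2: renyiEntropy(toy, 2),
  hInf: renyiEntropy(toy, Infinity),
  monotone,
});
```

For a uniform input every order returns log₂ of the alphabet size, which is the flat line at the ceiling.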

IV · One floor up — Zipf

The letters are lumpy; so are the words, and lumpy in a shape with a name. Rank a book's words by frequency and the frequency falls almost exactly as 1/rank: a straight line of slope ≈ −1 on log-log axes. It is the same skew, one level of structure higher — and it carries the same signature, a distribution whose entropy sits far below its own uniform ceiling.

Moby-Dick · word rank vs frequency, log-log (committed in the Zipf stratum's corpus)
rank | word | count | (1/rank) · f₁

Ordinary least squares on the top 1000 words gives slope [live value], against Zipf's ≈ 1. The word distribution carries [live value] bits against a uniform ceiling of [live value] over [live value] distinct words: a redundancy of [live value]. The skew, in bits.
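The fit and the redundancy can be sketched as follows. Everything here is an assumption for illustration: the function names are hypothetical, and the input is synthetic counts built to follow 1/rank exactly, not the Moby-Dick word counts.

```javascript
// OLS slope of log(count) against log(rank), over the top `limit` ranks.
function logLogSlope(counts, limit = 1000) {
  const sorted = [...counts].sort((a, b) => b - a).slice(0, limit);
  const xs = sorted.map((_, i) => Math.log(i + 1));
  const ys = sorted.map((c) => Math.log(c));
  const n = xs.length;
  const mx = xs.reduce((a, b) => a + b, 0) / n;
  const my = ys.reduce((a, b) => a + b, 0) / n;
  let sxy = 0;
  let sxx = 0;
  for (let i = 0; i < n; i++) {
    sxy += (xs[i] - mx) * (ys[i] - my);
    sxx += (xs[i] - mx) ** 2;
  }
  return sxy / sxx;
}

// Entropy of a count distribution against its own uniform ceiling.
// Needs at least two distinct items, or the ceiling is zero.
function wordRedundancy(counts) {
  const total = counts.reduce((a, b) => a + b, 0);
  const bits = -counts.reduce(
    (s, c) => (c > 0 ? s + (c / total) * Math.log2(c / total) : s),
    0
  );
  const ceiling = Math.log2(counts.filter((c) => c > 0).length);
  return { bits, ceiling, redundancy: 1 - bits / ceiling };
}

// Synthetic counts proportional to 1/rank (assumed, for illustration).
const synthetic = Array.from({ length: 200 }, (_, i) => 6000 / (i + 1));
console.log(logLogSlope(synthetic), wordRedundancy(synthetic));
```

On exactly Zipfian input the fitted slope comes out at −1 up to rounding; a real book's counts will land near that, not on it.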

V · What a single letter cannot see

The index of coincidence is a single-letter statistic. It cannot see that q is almost always followed by u, or that "the" recurs. That deeper lumpiness, the predictability between symbols, is what Shannon's full experiment chased, driving the entropy of English down from the ceiling toward about one bit per letter as context accumulates:

F₀ 4.75  →  F₁ 4.09  →  F₂ 3.30  →  F₃ 2.63  →  …  →  ≈ 1 bit
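The first step down that ladder can be sketched from digram counts, using the standard definition of F₂ as the entropy of letter pairs minus the entropy of their first letters, which is the entropy of a letter given the one before it. The function name and sample strings are illustrative assumptions; a short string gives a noisy estimate and is not a reproduction of the figures above.

```javascript
// Estimate F1 and F2 from the adjacent pairs in a string.
// f1: entropy of the second letter of each pair, ignoring context.
// f2: H(pair) - H(first letter), the entropy of a letter given its predecessor.
function digramEntropies(text) {
  const pair = new Map();
  const first = new Map();
  const second = new Map();
  let n = 0;
  for (let i = 0; i + 1 < text.length; i++) {
    const a = text[i];
    const b = text[i + 1];
    pair.set(a + b, (pair.get(a + b) || 0) + 1);
    first.set(a, (first.get(a) || 0) + 1);
    second.set(b, (second.get(b) || 0) + 1);
    n++;
  }
  // Entropy in bits of a map of counts that sum to n.
  const entropy = (m) => {
    let h = 0;
    for (const c of m.values()) {
      const q = c / n;
      h -= q * Math.log2(q);
    }
    return h;
  };
  return { f1: entropy(second), f2: entropy(pair) - entropy(first) };
}

// Strict alternation: each letter is fully determined by the one before it.
console.log(digramEntropies("abababab"));
// A short sample sentence (assumed text, far too small for a real estimate).
console.log(digramEntropies("the quick brown fox jumps over the lazy dog the end"));
```

Context can only help: f2 never exceeds f1 on the same pairs, and it drops to zero when each letter is fully determined by its predecessor.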

So the portal's spine is a ladder with two rails: order (α = 1 for Shannon, α = 2 for the cipher's number) and granularity (single letters → words → unbounded context). Every rung measures the same thing — how far English is from a language that says nothing. The index of coincidence is only the first rung of the first rail, and it was already enough to break le chiffre indéchiffrable.

The three layers this gathers

You Already Know the Rest
order 1 — Shannon entropy · H₁ = 4.09 bits (27-symbol), ≈1 bit with context
Measures the lumpiness in bits, and walks it down to ~1 bit/letter by adding memory. The order-1 reading, and the context rail.
What the Cipher Couldn't Hide
order 2 — index of coincidence · IC = 0.066 English, 0.0385 random
Measures the same letters as a collision probability and reads the Vigenère key length straight off it. The order-2 reading — and the proof lumpiness is exploitable.
The Law Even Monkeys Obey
granularity — Zipf's law · frequency ∝ 1/rank, slope ≈ 1.08
The same skew one floor up, at the word level — a rank-frequency law with a redundancy of its own. The granularity rail.

Show the check

Everything on this page is recomputed in your browser from letter counts the member strata already commit, and offline by research/lumpiness-of-english/verify.mjs (21/21 checks pass). The script reads the same corpus files the entropy and Zipf strata use (it does not re-fetch), so the combine is provably the same data.

The load-bearing claims it verifies: the identity IC = 2^(−H₂) to machine precision (|Δ| < 1e-12) on all four books; the Rényi ladder H₀ ≥ H₁ ≥ H₂ ≥ H∞ and a dense α-sweep, monotone throughout; equality with the ceiling if and only if uniform; and three cross-checks that tie this page to the layers it gathers — it reproduces the entropy stratum's monogram F₁ = 4.0910 bits, the cipher stratum's table value IC = 0.0655 (Lewand) and random baseline 1/26 = 0.0385, and the Zipf stratum's slope ≈ 1.08. The index of coincidence is shown English-stable across four authors and two centuries (spread < 0.0022) — a property of the language, not the book.

Sources: Shannon, Prediction and Entropy of Printed English (BSTJ 1951); Rényi, On Measures of Entropy and Information (1961, the α-family, α=2 = collision entropy); Friedman, Riverbank Pub. 22 (1922, the index of coincidence); Lewand, Cryptological Mathematics (2000, Table 1.1); Zipf, Human Behavior and the Principle of Least Effort (1949). Corpora: Frankenstein (Gutenberg #84), Moby-Dick (#2701), Pride and Prejudice (#1342), the complete Shakespeare (#100) — all public domain.
