English is not uniform. Three layers of this place each prove it with a different number — and not one of them says the numbers are the same number, measured differently.
The music is a declared sonification of the Rényi descent and its flatten. Reproducible from a fresh checkout: bash research/lumpiness-of-english/film/build.sh.
Pick a letter at random from an English book and it is almost never q and most often e. That single dull fact, that the language puts uneven weight on its symbols, is the hinge of three separate stories already told in this place.
Claude Shannon weighed the unevenness in bits, and got the entropy of English. Cryptanalysts weighed it as a collision probability — the index of coincidence — and used it to break the cipher called unbreakable for three centuries. George Zipf weighed it one floor up, at the level of whole words, and found a law. Information theory, cryptanalysis, quantitative linguistics — three fields, three famous numbers.
What none of the three pieces says aloud is the thing that makes them one piece: they are the same frequency distribution, read at different orders. The index of coincidence is not a cousin of the entropy. It is the entropy — the order-2 member of the same family Shannon's bits are the order-1 member of. Below, the identity is computed live, the language is flattened by hand to watch every instrument fall to zero at once, and the whole spine is checked against the three layers it gathers.
H₁ is the entropy stratum's bits. H₂ is just the index of coincidence wearing a logarithm: IC = Σ pᵢ², and H₂ = −log₂ IC, so IC = 2^(−H₂) exactly. Shannon's "4.1 bits" and the cipher-breaker's "0.066" are one distribution, sampled at α = 1 and α = 2. Every order sits below the ceiling log₂26 = 4.70; that gap, in any currency, is the lumpiness.
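The identity can be checked on any text. What follows is a minimal sketch, not the page's own verifier: the helper names and the `SAMPLE` string are made up for illustration, and it assumes a 26-letter alphabet with case and punctuation ignored. The sample is far too short to reproduce the figures quoted here; it only shows that IC and 2^(−H₂) agree to rounding error.

```python
import math
from collections import Counter

def letter_probs(text):
    """Relative frequencies of the letters a-z, ignoring case and everything else."""
    counts = Counter(c for c in text.lower() if "a" <= c <= "z")
    total = sum(counts.values())
    return {letter: n / total for letter, n in counts.items()}

def shannon_entropy(p):
    """H1 in bits: -sum p_i * log2(p_i)."""
    return -sum(x * math.log2(x) for x in p if x > 0)

def index_of_coincidence(p):
    """IC as defined on this page: the sum of squared probabilities."""
    return sum(x * x for x in p)

# Hypothetical sample text (an assumption, not one of the corpora).
SAMPLE = (
    "Pick a letter at random from an English book and it is almost never q. "
    "The language puts uneven weight on its symbols."
)

probs = list(letter_probs(SAMPLE).values())
h1 = shannon_entropy(probs)
ic = index_of_coincidence(probs)
h2 = -math.log2(ic)  # collision entropy, the order-2 reading

print(f"H1 = {h1:.4f} bits, H2 = {h2:.4f} bits, IC = {ic:.4f}")
print(f"|IC - 2^(-H2)| = {abs(ic - 2 ** -h2):.2e}")
print(f"ceiling log2(26) = {math.log2(26):.4f}")
```

Run on a whole book instead of a sentence, the same three functions give the H₁, H₂ and IC that the text compares.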
If the three numbers really are one fact, then erasing the fact should move all of them together. Drag English toward a coin-flat alphabet, pₜ = (1−t)·p + t·(1/26), and watch the histogram level out while every entropy climbs to its ceiling and the index of coincidence sinks to its floor, in lockstep. At t = 1 the index of coincidence lands on 1/26 = 0.0385, the cipher's "random" baseline, and all the entropies meet at 4.70. One knob; one fact; four readings.
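The flattening knob is a one-line mixture, so it is easy to sketch. The code below is illustrative only: `flatten` and the 1/k stand-in distribution are assumptions made for this example, not measured English frequencies and not the page's own code.

```python
import math

N = 26

def flatten(p, t):
    """Mix p with the uniform distribution: (1 - t) * p + t * (1 / len(p))."""
    n = len(p)
    return [(1 - t) * x + t / n for x in p]

def shannon_entropy(p):
    return -sum(x * math.log2(x) for x in p if x > 0)

def index_of_coincidence(p):
    return sum(x * x for x in p)

# Stand-in skewed distribution (weights 1/k), NOT measured English frequencies.
weights = [1 / k for k in range(1, N + 1)]
total = sum(weights)
skewed = [w / total for w in weights]

steps = [i / 10 for i in range(11)]
readings = []
for t in steps:
    q = flatten(skewed, t)
    ic = index_of_coincidence(q)
    readings.append((t, shannon_entropy(q), -math.log2(ic), ic))

for t, h1, h2, ic in readings:
    print(f"t={t:.1f}  H1={h1:.4f}  H2={h2:.4f}  IC={ic:.4f}")
```

As t rises, the two entropy columns increase and the IC column decreases; at t = 1 they reach log₂26 and 1/26 respectively.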
α = 1 and α = 2 are two stops on a continuous dial: the Rényi entropy H_α = (1/(1−α))·log₂ Σ pᵢ^α, with α = 1 read as the limit α → 1, which is Shannon's entropy. It is provably non-increasing as you turn α up: a higher order weights the heavy letters more heavily, so it reports more lumpiness and a lower number. Shannon's entropy and the index of coincidence are simply where this one curve crosses α = 1 and α = 2. For a perfectly flat language the curve would be a flat line at the ceiling; the bend below is, again, the lumpiness.
The curve only ever falls. It touches the ceiling log₂26 only in the limit α→0, and it would be flat at the ceiling only if every letter were equally likely. Shannon (α=1) and the index of coincidence (α=2) are marked: two readings of one monotone descent.
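A small sketch of the dial itself, under stated assumptions: `renyi_entropy` is a name chosen for this example, the α = 1 and α = ∞ cases are handled as the usual limits (Shannon entropy and min-entropy), and the skewed distribution is a 1/k stand-in rather than real letter frequencies.

```python
import math

def renyi_entropy(p, alpha):
    """Renyi entropy of order alpha in bits, for alpha >= 0."""
    p = [x for x in p if x > 0]
    if alpha == 1:            # limit alpha -> 1: Shannon entropy
        return -sum(x * math.log2(x) for x in p)
    if math.isinf(alpha):     # limit alpha -> infinity: min-entropy
        return -math.log2(max(p))
    return math.log2(sum(x ** alpha for x in p)) / (1 - alpha)

# Stand-in skewed distribution (weights 1/k), NOT measured English frequencies.
weights = [1 / k for k in range(1, 27)]
total = sum(weights)
skewed = [w / total for w in weights]
uniform = [1 / 26] * 26

alphas = [0, 0.5, 1, 2, 4, 8, math.inf]
curve = [renyi_entropy(skewed, a) for a in alphas]
flat = [renyi_entropy(uniform, a) for a in alphas]

for a, lumpy, level in zip(alphas, curve, flat):
    print(f"alpha={a!s:>4}  skewed={lumpy:.4f}  uniform={level:.4f}")
```

The skewed column never rises from one row to the next, while the uniform column stays at log₂26 for every order, which is the contrast the curve is drawn to show.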
The letters are lumpy; so are the words, and lumpy in a shape with a name. Rank a book's words by frequency and the frequency falls almost exactly as 1/rank: a straight line of slope ≈ −1 on log-log axes. It is the same skew, one level of structure higher — and it carries the same signature, a distribution whose entropy sits far below its own uniform ceiling.
| rank | word | count | 1/rank · f₁ |
|---|---|---|---|
Ordinary least squares on the top 1000 words gives the fitted slope, to be read against Zipf's ≈ 1. The word distribution's entropy in bits, its uniform ceiling (log₂ of the number of distinct words), and the redundancy between them are computed live alongside it. The skew, in bits.
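The fit and the word-level entropy can be sketched in a few lines. Everything here is an assumption made for illustration: the counts are synthetic (an ideal 1/rank law, not taken from any of the books), the helper names are invented, and redundancy is taken to mean 1 − H/log₂V, which may not be the exact definition the page uses.

```python
import math

def ols_slope(xs, ys):
    """Ordinary least squares slope of ys against xs."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

def entropy_bits(counts):
    """Entropy in bits of the distribution given by raw counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

# Synthetic word counts following an ideal 1/rank law; NOT real corpus data.
counts = [round(100_000 / rank) for rank in range(1, 1001)]

top = sorted(counts, reverse=True)[:1000]
log_rank = [math.log10(r) for r in range(1, len(top) + 1)]
log_freq = [math.log10(c) for c in top]
slope = ols_slope(log_rank, log_freq)

h = entropy_bits(counts)
ceiling = math.log2(len(counts))   # uniform ceiling over the distinct words
redundancy = 1 - h / ceiling       # assumed definition of redundancy

print(f"slope = {slope:.3f}")
print(f"H = {h:.2f} bits, ceiling = {ceiling:.2f} bits, redundancy = {redundancy:.1%}")
```

With real word counts in place of the synthetic list, the same code gives the slope, entropy, ceiling and redundancy that the paragraph describes.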
The index of coincidence is a single-letter statistic. It cannot see that q is almost always followed by u, or that "the" recurs. That deeper lumpiness, the predictability between symbols, is what Shannon's full experiment chased, driving the entropy of English down from the ceiling toward about one bit per letter as context accumulates:
F₀ 4.75 → F₁ 4.09 → F₂ 3.30 → F₃ 2.63 → … → ≈ 1 bit
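The first rungs of this ladder can be estimated from block counts, using Shannon's definition F_N = H(N-letter blocks) − H((N−1)-letter blocks). The sketch below rests on stated assumptions: a 26-letter alphabet without the space (the alphabet behind the figures above may differ), invented function names, and a hypothetical sample so short that the higher rungs are biased far downward. It illustrates the descent and does not reproduce the values above.

```python
import math
from collections import Counter

def block_entropy(letters, n):
    """Entropy in bits of the empirical distribution of n-letter blocks."""
    if n == 0:
        return 0.0
    grams = Counter(letters[i:i + n] for i in range(len(letters) - n + 1))
    total = sum(grams.values())
    return -sum((c / total) * math.log2(c / total) for c in grams.values())

def shannon_f(letters, n):
    """F_n = H(n-letter blocks) - H((n-1)-letter blocks); F_0 = log2(26)."""
    if n == 0:
        return math.log2(26)
    return block_entropy(letters, n) - block_entropy(letters, n - 1)

# Hypothetical sample (an assumption); a real estimate needs a whole corpus.
SAMPLE = (
    "The index of coincidence is a single letter statistic. It cannot see "
    "that q is almost always followed by u, or that the recurs. That deeper "
    "lumpiness, the predictability between symbols, is what the full "
    "experiment chased, driving the entropy of English down from the ceiling "
    "toward about one bit per letter as context accumulates."
)

letters = "".join(c for c in SAMPLE.lower() if "a" <= c <= "z")
ladder = [shannon_f(letters, n) for n in range(4)]

for n, f in enumerate(ladder):
    print(f"F_{n} = {f:.2f} bits")
```

Each rung conditions on one more letter of context, so the estimates fall as N grows; on a short sample they fall too fast, because most long blocks occur only once.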
So the portal's spine is a ladder with two rails: order (α = 1 for Shannon, α = 2 for the cipher's number) and granularity (single letters → words → unbounded context). Every rung measures the same thing — how far English is from a language that says nothing. The index of coincidence is only the first rung of the first rail, and it was already enough to break le chiffre indéchiffrable.
Everything on this page is recomputed in your browser from letter counts the member strata already commit, and offline by research/lumpiness-of-english/verify.mjs (21/21 checks pass). The script reads the same corpus files the entropy and Zipf strata use (it does not re-fetch), so the combination provably rests on the same data.
The load-bearing claims it verifies: the identity IC = 2^(−H₂) to machine precision (|Δ| < 1e-12) on all four books; the Rényi ladder H₀ ≥ H₁ ≥ H₂ ≥ H∞ and a dense α-sweep, monotone throughout; equality with the ceiling if and only if uniform; and three cross-checks that tie this page to the layers it gathers — it reproduces the entropy stratum's monogram F₁ = 4.0910 bits, the cipher stratum's table value IC = 0.0655 (Lewand) and random baseline 1/26 = 0.0385, and the Zipf stratum's slope ≈ 1.08. The index of coincidence is shown English-stable across four authors and two centuries (spread < 0.0022) — a property of the language, not the book.
Sources: Shannon, Prediction and Entropy of Printed English (BSTJ 1951); Rényi, On Measures of Entropy and Information (1961, the α-family, α=2 = collision entropy); Friedman, Riverbank Pub. 22 (1922, the index of coincidence); Lewand, Cryptological Mathematics (2000, Table 1.1); Zipf, Human Behavior and the Principle of Least Effort (1949). Corpora: Frankenstein (Gutenberg #84), Moby-Dick (#2701), Pride and Prejudice (#1342), the complete Shakespeare (#100) — all public domain.