
The Tanks That Counted Themselves

N̂ = m + (m−k)/k  ·  MLE bias = −(N−k)/(k+1)  ·  edge = (k+2)/3  ·  Ruggles & Brodie 1947

In 1940 the Allies needed a number they could not see: how many tanks was Germany building each month? Their spies guessed four to eight times too high. A handful of statisticians read the serial numbers off the captured tanks — and were right.

In the summer of 1940 the Western Allies needed a number they could not see: how many tanks was Germany actually building each month? Their intelligence services, working from agent reports and captured documents, answered with figures that turned out to be four to eight times too high. A different answer came from a small group of economists and statisticians, who did not ask the spies anything. They read the serial numbers stamped on the tanks the army captured and destroyed — gearboxes, chassis, engines, the moulds that cast the road wheels — and from those numbers estimated the whole. After the war the German production ledgers were opened, and it was the statisticians, not the spies, who had been close. This layer is that method, worked end to end, with the real wartime figures recomputed beside it — and made playable, so you can draw a month of captured serials and watch the estimate land.

I · The question you cannot ask directly

You are an Allied analyst in 1941. Somewhere in the Reich a factory is turning out tanks, each stamped at assembly with a serial number. You will never see the factory, the ledger, or most of the tanks. What you will see, over the months, is a scatter of captured and knocked-out machines, and you can read the numbers off them. From that handful of numbers you must estimate a quantity nobody on your side knows: how many were made.

Strip it to the model that makes it solvable. Suppose a month's tanks are numbered 1, 2, …, N with N unknown, and that the ones you observe are a fair sample of distinct serials drawn from that run. You see k of them; the largest is m. Estimate N.

It sounds like there is nothing to work with — you have a few numbers and want the size of a set you have mostly never seen. But the numbers are not arbitrary. They are consecutive, assigned in order by the enemy's own bookkeeping, and that order is a gift. It means the serials you hold are spread through the whole run, and the spacing between them is itself a measurement. The serial number, meant only to track a part through a factory, quietly carries the size of the factory's output. The tanks count themselves.
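The idealized model is small enough to write down directly. This is a minimal sketch, assuming a single run numbered 1…N and a uniformly random capture of k distinct serials; the function name `draw_month` and the seed are illustrative choices, not anything taken from the page's own instruments.

```python
import random

def draw_month(n_built, k_caught, rng):
    """Sample k distinct serials uniformly from the run 1..n_built."""
    return sorted(rng.sample(range(1, n_built + 1), k_caught))

rng = random.Random(1940)   # arbitrary seed, used only for repeatability
serials = draw_month(270, 15, rng)

k = len(serials)            # how many you caught
m = max(serials)            # the largest serial you saw
print(serials)
print("k =", k, " m =", m)
```

Everything the analyst gets to use is in the two numbers `k` and `m`; the true run size passed to `draw_month` is exactly the thing the analyst never sees.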

Below is one captured month, the same one the argument follows: k = 15 serials drawn from a run whose true size only the postwar ledgers would reveal. Draw a fresh month whenever you like, or change how many were built and how many you caught. Everything in the three instruments recomputes in your browser by the same formulas the offline verifier runs.

Instrument I · read the serials
built N caught k
largest seen m (the MLE) max + average gap (MVUE) twice the mean true N
What you are watching. The grey ticks are the serials you captured; the dark one is the largest, m. The biggest number you have seen is your crudest guess — and it can never exceed the truth, so it always leans low. The oxide marker adds one average gap past m to reach for the line you can't see; on a typical month it lands within a few percent of the teal truth. The blue marker — twice the average of all your serials — is also honest on average but jumps around far more, as the next instrument makes visible.

II · The obvious guess is wrong, and wrong in a knowable way

The first instinct is to say: I have seen number 243, so there were at least 243; my best single guess is 243. This is, in fact, the maximum-likelihood estimate — the value of N under which your exact sample is most probable is N = m, because any larger N only spreads the probability thinner. It is also obviously an underestimate, and not by a little. You did not happen to capture the very last tank off the line; the true maximum is almost certainly above the largest you saw. The MLE can never exceed what you have observed, and the truth almost always does.
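The claim that the likelihood peaks at N = m can be checked by brute force. A short sketch, assuming the uniform model above, in which any particular set of k distinct serials has probability 1/C(N, k) whenever N is at least as large as the biggest serial seen; the helper name `likelihood` and the search window are illustrative.

```python
from math import comb

def likelihood(n, m, k):
    """P(this exact set of k serials | run size n): zero if n < m, else 1/C(n, k)."""
    return 0.0 if n < m else 1 / comb(n, k)

m, k = 243, 15
candidates = range(m - 5, m + 50)   # a window of run sizes around the observed maximum
best = max(candidates, key=lambda n: likelihood(n, m, k))
print(best)   # 243: the likelihood is zero below m and shrinks for every n above it
```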

The good news is that this bias is not a vague worry — it is an exact, computable quantity. Over all possible samples of size k from 1…N, the expected value of the sample maximum is

E[max] = k · (N + 1) / (k + 1).

For N = 270, k = 15 that is 254.06 — so on average the largest serial you see falls about sixteen short of the truth. The bias is exactly

E[max] − N = −(N − k) / (k + 1) = −15.94 here,

and a Monte-Carlo run of 400,000 simulated months lands the average observed maximum at 254.07, dead on the formula. The biggest number you've seen is a biased estimator, and we know the size and the sign of the lie: it can never overshoot, and on average it falls short by exactly (N − k)/(k + 1). But an error you can write down is an error you can subtract off. That is the whole move.
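The expectation does not need simulation at all; it can be summed exactly. The sketch below assumes the standard order-statistics fact that, for k distinct serials drawn uniformly from 1…N, the maximum equals m with probability C(m−1, k−1)/C(N, k). The helper names are illustrative.

```python
from fractions import Fraction
from math import comb

def max_pmf(n, k):
    """Exact P(max = m) for k distinct serials drawn uniformly from 1..n."""
    total = comb(n, k)
    return {m: Fraction(comb(m - 1, k - 1), total) for m in range(k, n + 1)}

def expected_max(n, k):
    return sum(m * p for m, p in max_pmf(n, k).items())

N, k = 270, 15
e_max = expected_max(N, k)
bias = e_max - N
print(float(e_max))   # 254.0625
print(float(bias))    # -15.9375
```

Using `Fraction` keeps the arithmetic exact, so the sum can be compared with the closed forms k(N + 1)/(k + 1) and −(N − k)/(k + 1) without any rounding tolerance.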

III · The one correction — max plus the average gap

Reverse the bias formula. On average the observed maximum is k/(k + 1) times N + 1, so scaling it back up by (k + 1)/k and subtracting 1 gives an estimator with the bias removed:

N̂ = m · (k + 1) / k − 1.

There is a second way to write the very same expression that says, in plain English, what it is doing:

N̂ = m + (m − k) / k = (the largest serial) + (the average gap between serials).

You have seen k serials, the biggest being m; that means m − k of the numbers at or below m are ones you didn't see, spread as gaps among the k you did — an average gap of (m − k)/k. The line you can see ends at m; the estimator adds one more average gap to reach for the line you can't. On the worked month — m = 243, average gap (243 − 15)/15 = 15.2 — that is N̂ = 243 + 15.2 = 258.2 against a truth of 270. And crucially it is unbiased: average it over all possible samples and it lands exactly on N, with no systematic lean. The 400,000-month simulation confirms it: mean estimate 270.00 against a true 270. The correction is not a fudge; it is the exact inverse of a known bias.
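Both halves of that paragraph, the worked figure and the exact unbiasedness, can be reproduced in a few lines. A sketch under the same uniform-sampling assumption, reusing the fact that the maximum equals m with probability C(m−1, k−1)/C(N, k); `mvue` and `mvue_mean` are illustrative names.

```python
from fractions import Fraction
from math import comb

def mvue(m, k):
    """Largest serial plus the average gap: m + (m - k)/k."""
    return m + Fraction(m - k, k)

def mvue_mean(n, k):
    """Exact average of the estimator over every possible sample maximum."""
    total = comb(n, k)
    return sum(mvue(m, k) * Fraction(comb(m - 1, k - 1), total)
               for m in range(k, n + 1))

print(float(mvue(243, 15)))   # 258.2
print(mvue_mean(270, 15))     # 270
```

The second line is an exact rational sum, not an average over simulated months, which is why it lands on 270 with nothing after the decimal point.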

This is the estimator the Allied analysts used, and it is the minimum-variance unbiased estimator of N — the provably best one of its kind. That last claim is the subject of the next section, because "unbiased" alone is a much weaker virtue than it sounds, and the real reason this estimator won is hiding inside the word minimum-variance.

IV · Why this estimator, and not a cleverer-looking one

Unbiasedness is cheap. Here is a different unbiased estimator that looks more sophisticated — it uses all the data, not just the maximum. The sample mean of a run 1…N has expectation (N + 1)/2, so Ñ = 2 · (sample mean) − 1 is also exactly unbiased. It "feels" better: surely using all fifteen numbers beats leaning on one of them? On the worked month it gives 2 × 116.6 − 1 = 232.2 — further from the truth than the maximum-based estimate, and that is no accident.
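That both estimators are exactly unbiased can be seen by enumerating every possible sample of a run small enough to list. The sketch below uses a toy run of N = 10 with k = 3 (an assumption made purely so the 120 samples can be enumerated; it is not the worked month), and then repeats the worked-month arithmetic from the reported sample mean of 116.6.

```python
from fractions import Fraction
from itertools import combinations

def twice_mean(sample):
    return 2 * Fraction(sum(sample), len(sample)) - 1

def max_plus_gap(sample):
    m, k = max(sample), len(sample)
    return m + Fraction(m - k, k)

def exact_mean(estimator, n, k):
    """Average an estimator over every k-subset of 1..n."""
    values = [estimator(s) for s in combinations(range(1, n + 1), k)]
    return sum(values) / len(values)

N_TOY, K_TOY = 10, 3   # toy sizes, chosen only to keep the enumeration tiny
print(exact_mean(twice_mean, N_TOY, K_TOY))     # 10
print(exact_mean(max_plus_gap, N_TOY, K_TOY))   # 10
print(float(2 * Fraction(1166, 10) - 1))        # 232.2, the worked month
```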

The thing that separates a good estimator from a merely-unbiased one is its variance — how much it jumps around from sample to sample. The maximum-based MVUE has variance (N − k)(N + 1) / [k(k + 2)], which for the worked numbers is exactly 271, a standard deviation of 16.5. The "twice the mean" estimator has, for sampling without replacement, the variance (N + 1)(N − k) / (3k) — here 1535.7, a standard deviation of 39.2. Divide one variance by the other and almost everything cancels:

Var(twice the mean) / Var(N̂) = (k + 2) / 3.

The ratio depends on nothing but k: not on N, not on how many tanks were built, only on how many you caught. At k = 15 it is exactly 17/3 = 5.667, so the estimator that throws away everything but the maximum has under a fifth (3/17, about 0.18) of the mean-squared error of the one that averages all the evidence. (The Monte-Carlo run lands on 5.65, the closed form's sampling shadow.) Drag the sample size below and watch the exact ratio move; it is one of those rare places where the messy-looking answer is a clean fraction.
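The two closed-form variances and their ratio are easy to evaluate exactly. A minimal sketch using the formulas quoted above; the function names are illustrative.

```python
from fractions import Fraction

def var_max_plus_gap(n, k):
    """Variance of m + (m - k)/k: (N - k)(N + 1) / [k(k + 2)]."""
    return Fraction((n - k) * (n + 1), k * (k + 2))

def var_twice_mean(n, k):
    """Variance of 2*mean - 1 when sampling without replacement: (N + 1)(N - k) / (3k)."""
    return Fraction((n + 1) * (n - k), 3 * k)

N, k = 270, 15
v_gap, v_mean = var_max_plus_gap(N, k), var_twice_mean(N, k)
print(v_gap)            # 271
print(float(v_mean))    # about 1535.7
print(v_mean / v_gap)   # 17/3

# The ratio is (k + 2)/3 whatever N is; only the catch size moves it.
for kk in (2, 5, 15, 50):
    print(kk, var_twice_mean(N, kk) / var_max_plus_gap(N, kk))
```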

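Instrument II · a thousand months: bias and variance
[Interactive figure. Controls: built N, caught k. Curves: the maximum, max + gap, and twice the mean, against the true N.]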
How to read it. Each curve is where one estimator's guesses pile up over the months you've run. The dark maximum curve sits visibly left of the teal truth line — biased low by exactly (N−k)/(k+1). The oxide max + gap and the blue twice the mean are both centred on the truth — both unbiased — but the blue one is a low, wide smear and the oxide one a tall, tight spike. Their mean-squared errors stand in the ratio (k+2)/3, recomputed exact above and confirmed by the run. One number — the largest serial, corrected — beats the average of all of them, and beats it badly.

The reason is structural. For a uniform run, the sample maximum is a sufficient statistic: it carries every drop of information the sample holds about N, and the other serials add nothing once you know the largest. The mean, by contrast, throws away exactly the information that matters — the upper edge — and pays for it in variance. This is the quiet lesson the analysts were living: against a uniform run, the right move is not to average your evidence but to take its extreme and correct it. The biggest serial you have seen is almost the whole story; the average gap finishes it.
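Sufficiency has a concrete, checkable meaning here: once you know the maximum, the probability of the rest of your sample no longer depends on N. The sketch below, again assuming the uniform model, divides the probability of a particular sample by the probability of its maximum for several candidate run sizes; the helper names and the candidate values are illustrative.

```python
from fractions import Fraction
from math import comb

def prob_sample(n, k):
    """Probability of any one particular set of k serials from a run of n."""
    return Fraction(1, comb(n, k))

def prob_max(m, n, k):
    """Probability that the largest of the k serials is m."""
    return Fraction(comb(m - 1, k - 1), comb(n, k))

m, k = 243, 15
conditionals = {n: prob_sample(n, k) / prob_max(m, n, k)
                for n in (243, 270, 400, 1000)}

# Every candidate N gives the same conditional probability, 1 / C(m - 1, k - 1).
print(len(set(conditionals.values())))   # 1
```

Because N cancels out of the conditional probability, the serials below the maximum cannot tell one run size from another; all the information about N sits in m and k.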

V · The Bayesian reads it the same way, and hands you an interval

A frequentist returns a single best estimate. A Bayesian returns the whole shape of what the data permit — and on this problem the two agree, which is itself a small reassurance that neither is doing something strange.

Put a flat prior over every N ≥ m (every production figure at least as large as the biggest tank you saw is, a priori, equally plausible). The likelihood of having drawn your particular sample given N is 1 / C(N, k) — one over the number of ways to choose k serials from N — so the posterior is proportional to 1/C(N, k), peaking at N = m and decaying as N grows past it. The posterior mean has a clean closed form,

E[N | m, k] = (m − 1)(k − 1) / (k − 2),

which for the worked month gives 260.6. The frequentist's 258.2 and the Bayesian's 260.6 sit within a percent of each other, both a little under the true 270. The Bayesian's bonus is the part a point estimate cannot give: a 95% credible interval, read straight off the posterior — for the worked month, [243, 313], with the truth sitting comfortably inside. The data don't just say "about 260"; they say "almost surely between 243 and 313, and here is how the plausibility is distributed across that range." From fifteen serial numbers.
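The posterior can be tabulated directly. This sketch assumes the flat prior described above and uses the normalized form P(N = n | m, k) = ((k − 1)/k) · C(m−1, k−1)/C(n, k) for n ≥ m; the interval convention (the smallest n at which the cumulative probability reaches each tail cut-off) is an assumption of this sketch and may differ in detail from the page's own instrument.

```python
from fractions import Fraction
from math import comb

def posterior_pmf(n, m, k):
    """P(N = n | m, k) under a flat prior on N >= m (needs k >= 2)."""
    if n < m:
        return Fraction(0)
    return Fraction(k - 1, k) * Fraction(comb(m - 1, k - 1), comb(n, k))

def posterior_mean(m, k):
    """Closed form quoted in the text (needs k >= 3)."""
    return Fraction((m - 1) * (k - 1), k - 2)

def central_interval(m, k, mass=Fraction(95, 100)):
    """Walk upward from m, recording where the CDF crosses each tail cut-off."""
    lo_cut, hi_cut = (1 - mass) / 2, 1 - (1 - mass) / 2
    cdf, n, lo = Fraction(0), m, None
    while True:
        cdf += posterior_pmf(n, m, k)
        if lo is None and cdf >= lo_cut:
            lo = n
        if cdf >= hi_cut:
            return lo, n
        n += 1

print(round(float(posterior_mean(243, 15)), 1))   # 260.6
print(central_interval(243, 15))                  # the text reports [243, 313]
```

The lower end sits at m itself because the single value N = 243 already carries 14/243, nearly 6%, of the posterior mass.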

Instrument III · the posterior over N
This posterior is the current month in Instrument I. Draw a new month up top and this reshapes. The curve is the plausibility of each possible production figure N given your serials; it can't drop below the largest tank you saw (left edge = m) and falls off like N^(−k) above it. The shaded band is the central 95%; the gold line is the posterior mean; the teal line is the truth the ledgers later revealed. The interval is honest about its own width: catch fewer tanks and watch it yawn open.

VI · The month the spies were wrong

None of this would be more than a tidy exercise if the German records had stayed sealed. They did not. After 1945 the Allies recovered the production ledgers of the Reich's war economy, and could finally mark their wartime guesses against the truth. The serial-number statisticians — their method written up in 1947 by Richard Ruggles and Henry Brodie in the Journal of the American Statistical Association — turned out to have been startlingly close, where conventional intelligence had been wildly, consistently high. These three months are cited data; only the error columns are recomputed here, live:

The wartime record · Ruggles & Brodie 1947
Month | Statistical | Intelligence | German records | Stat. error | Intel. overshoot
Read the last two columns. Across these months the statistical estimate missed by about a fifth; the intelligence estimate overshot by a multiple. The most-cited single case is sharper still. The road wheels of Panther tanks were cast in numbered moulds, so from the wheels of just two tanks the method could work from moulds, to wheels per mould, to tanks, and it estimated 270 Panthers built in February 1944. The German records gave 276: a miss of 2.2%, from the wheels of two tanks.
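The error columns are plain percentage arithmetic. A small sketch, using only figures quoted in this piece (the Panther estimate of 270 against 276, and the June 1940 statistical estimate of 169 against 122); the function name is illustrative, and the sign convention (positive means too high) is an assumption of the sketch.

```python
def signed_error_pct(estimate, actual):
    """Percentage by which an estimate misses the recorded figure (positive = too high)."""
    return 100 * (estimate - actual) / actual

print(round(signed_error_pct(270, 276), 1))   # -2.2, Panthers, February 1944
print(round(signed_error_pct(169, 122), 1))   # 38.5, statistical estimate, June 1940
```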

VII · The seam — where the honesty lives

This is a clean, almost magical-feeling result, which is exactly the kind that needs its assumptions read aloud rather than smoothed away. Five places the method strains, named:

  1. The clean model is an idealization the analysts had to earn. "Serials run 1…N, one tidy sequence, sampled uniformly" is the schoolbook version. Reality was messier: German production authorities handed manufacturers blocks of serial numbers (often in hundreds), filled them incompletely, and stamped several independent number-series on each tank — chassis, engine, gearbox, the road-wheel moulds — each with its own range. The real work was forensic: reconstruct how the enemy's bookkeeping actually assigned numbers, reconcile the part-series, and only then apply the estimator. The formula is the easy half; figuring out what counts as N was the hard half.
  2. The method is not magic, and one row shows it. June 1940's statistical estimate (169) was 38.5% too high against the true 122: a real, sizeable miss of 47 tanks, left in the table rather than dropped. The honest claim is not "the statistics were always nearly exact" but "the statistics were systematically far closer than intelligence, in every month, often within a few tens of tanks." That is a strong claim and a true one; the stronger claim would be false.
  3. Unbiased is not the prize; low variance is. Section IV's point bears repeating because it is the one most retellings skip: the maximum-based estimator did not win for being unbiased, since "twice the mean" is unbiased too. It won because its variance is smaller by a factor of exactly (k+2)/3, and that is because the sample maximum is sufficient and the mean is not. Selling the method as "the magic formula" misses the actual statistical content.
  4. The whole thing depends on the enemy's serial numbers being a gift. Consecutive, centrally assigned numbering is what makes the largest serial informative. An adversary who randomized serials, or restarted numbering per factory without a recoverable scheme, or simply stopped stamping sequential numbers, would blunt the method badly. The technique reads the enemy's administrative order; against disorder it has far less to read.
  5. The history here is cited, not recomputed. The 1947 production table and the Panther road-wheel figures are reported from the literature; the page stores them and computes only the error percentages. What it does recompute from scratch — the bias of the maximum, the unbiasedness and variance of the MVUE, its exact (k+2)/3 edge over the mean estimator, the Bayesian posterior and its interval — runs on synthetic months whose true N it controls, so no claim about the model rests on getting a historical transcription right. The two kinds of fact are kept apart on purpose.

And the line under all of it: the question "how many were built?" looks unanswerable from a handful of captured machines, and is not, because the enemy wrote the answer on every tank without knowing it. The serial number is administrative exhaust — a clerk's tracking mark — and it turned out to carry, in the spacing between the few you could catch, the size of the whole. The deepest moat of a real-world statistic is that it reads a signal the adversary did not know he was sending.

The check
Run it yourself: node research/the-tanks-that-counted-themselves/verify.mjs — exact order-statistics arithmetic, a 400,000-trial Monte Carlo, the closed-form variances, the Bayesian posterior summed to convergence, and the recomputed wartime error columns. 23/23 checks pass. Every estimate, histogram, posterior and percentage on this page is recomputed live by the same formulas.
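For readers without the repository to hand, the Monte-Carlo part of the check can be approximated independently. The sketch below is not the verifier itself: it is a standalone Python approximation with a much smaller trial count (20,000, an arbitrary choice for speed) and an arbitrary seed, so its figures will only agree with the closed forms to within sampling noise.

```python
import random
from statistics import fmean

def simulate(n_built, k_caught, trials, seed=0):
    """Draw many months; return the maxima and both unbiased estimates."""
    rng = random.Random(seed)
    population = range(1, n_built + 1)
    maxima, gap_est, mean_est = [], [], []
    for _ in range(trials):
        sample = rng.sample(population, k_caught)
        m = max(sample)
        maxima.append(m)
        gap_est.append(m + (m - k_caught) / k_caught)
        mean_est.append(2 * sum(sample) / k_caught - 1)
    return maxima, gap_est, mean_est

def variance(xs):
    mu = fmean(xs)
    return fmean((x - mu) ** 2 for x in xs)

N, k, TRIALS = 270, 15, 20_000
maxima, gap_est, mean_est = simulate(N, k, TRIALS)
ratio = variance(mean_est) / variance(gap_est)

print("mean of max        ", round(fmean(maxima), 2))    # closed form: 254.06
print("mean of max + gap  ", round(fmean(gap_est), 2))   # closed form: 270
print("mean of twice mean ", round(fmean(mean_est), 2))  # closed form: 270
print("variance ratio     ", round(ratio, 2))            # closed form: 5.667
```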

Sources

Ruggles, R. & Brodie, H. (1947). An Empirical Approach to Economic Intelligence in World War II. Journal of the American Statistical Association 42(237):72–91 — the wartime serial-number method and the production-comparison table.
The estimator. The minimum-variance unbiased estimator N̂ = m(k+1)/k − 1, its variance (N−k)(N+1)/[k(k+2)], and the Bayesian posterior mean (m−1)(k−1)/(k−2) are standard results in the theory of order statistics and point estimation; see any mathematical-statistics text under "German tank problem" / "estimating the maximum of a discrete uniform distribution." The closed-form efficiency (k+2)/3 over the twice-the-mean estimator, and the without-replacement variance (N+1)(N−k)/(3k) it rests on, are derived and checked in this stratum's verifier.
The Panther road-wheel example (≈270 estimated vs. 276 recorded for February 1944) and the postwar finding that Allied statistical estimates outperformed both intelligence estimates and Germany's own figures are as reported in the standard secondary accounts of Ruggles & Brodie's study.