Were Mendel's Pea Results Too Good to Be True?

Brünn, in winter 1865. The friar Gregor Mendel reads the second half of a long paper to the local natural-history society: Versuche über Pflanzen-Hybriden, experiments on plant hybrids. He has been crossing strains of garden pea for eight years, following seven characters; the pea weevil has been his enemy, the abbey garden his lab. The paper buries, in tables of plant counts, the two laws of heredity that biology will not find again until 1900.

The laws are right. Half of biology is downstream of them. But seventy-one years later, the statistician R. A. Fisher will look at the same tables and write the sentence above. Not the science: the numbers. They are too close to the predicted ratios. Far too close, for so much counting done in a garden, by a friar and (he hopes) a few quiet helpers, in seasons that came once a year.

The point of this page is not to take sides. The point is to let you run the check yourself. Below are Mendel's seven headline counts as he published them, his chi-squared computed live, and a small machine that will run his program a hundred thousand more times — honestly, in front of you — so you can see where his numbers sit in the distribution of what honest counting actually does.

I · The seven counts, as he published them

The headline tables of the 1866 paper are the seven monohybrid F₂ generations: heterozygous F₁ parents producing 580 to 8,023 offspring per character, classified by phenotype. The Mendelian prediction in each case is the same: three offspring show the dominant trait for every one that shows the recessive, the famous 3:1 ratio. Here are Mendel's published counts, with the expected values for 3:1, and the chi-squared each row contributes. The whole table recomputes from the same data in research/closer-than-chance/data.mjs; if Mendel's count for round seeds were off by one, that row's chi-squared and the combined sum would change with it.

Instrument I · Mendel 1866 §3: the seven F₂ ratios · recomputed live

character	dom.	rec.	total	E(dom)	E(rec)	χ²

Sum the seven and you get a single combined chi-squared on 7 degrees of freedom. A chi-squared distribution with 7 degrees of freedom has an expected value of 7; honest counting should land near it, with a wide spread. Mendel's seven sum to …. The probability that an honest run lands at least this close to the predicted 3:1 is … — about …. Suggestive; not, on its own, damning.
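For readers who want the arithmetic in hand, here is a minimal sketch of the same computation. The counts are the seven F₂ results as they are usually quoted from the 1866 paper; the character labels, the names `F2_COUNTS` and `chiSquared3to1`, and the layout are assumptions of this sketch, not the contents of the page's own data.mjs.

```javascript
// Mendel's seven F2 counts as usually quoted from the 1866 paper.
// Labels are descriptive shorthand, not Mendel's wording.
const F2_COUNTS = [
  { character: "seed shape (round / wrinkled)",      dom: 5474, rec: 1850 },
  { character: "seed colour (yellow / green)",       dom: 6022, rec: 2001 },
  { character: "flower colour (violet / white)",     dom: 705,  rec: 224 },
  { character: "pod shape (inflated / constricted)", dom: 882,  rec: 299 },
  { character: "pod colour (green / yellow)",        dom: 428,  rec: 152 },
  { character: "flower position (axial / terminal)", dom: 651,  rec: 207 },
  { character: "stem length (tall / short)",         dom: 787,  rec: 277 },
];

// Pearson chi-squared for one experiment against a 3:1 prediction (1 df).
function chiSquared3to1(dom, rec) {
  const total = dom + rec;
  const eDom = 0.75 * total;
  const eRec = 0.25 * total;
  return (dom - eDom) ** 2 / eDom + (rec - eRec) ** 2 / eRec;
}

// One row per character: the same columns as the table above.
const rows = F2_COUNTS.map(({ character, dom, rec }) => ({
  character,
  dom,
  rec,
  total: dom + rec,
  eDom: 0.75 * (dom + rec),
  eRec: 0.25 * (dom + rec),
  chi2: chiSquared3to1(dom, rec),
}));

// Independent 1-df chi-squareds add, giving one statistic on 7 df.
const combinedChi2 = rows.reduce((sum, r) => sum + r.chi2, 0);

for (const r of rows) {
  console.log(r.character.padEnd(36), String(r.total).padStart(5), r.chi2.toFixed(4));
}
console.log("combined chi-squared on 7 df:", combinedChi2.toFixed(2)); // 2.14
```

Changing a single count moves only that row's contribution and the final sum, which is an easy way to see how little slack the published figures leave.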

II · A simulation, so you can see the suggestiveness

Numbers like "5% lower-tail" are abstract. What is not abstract is running the experiment. Below, a small Mersenne-like RNG re-runs Mendel's program over and over: for each of the seven characters, it draws the right number of offspring as independent Bernoulli trials with p = 3/4, computes the same chi-squared, and adds the seven-character sum to the histogram. The red line marks Mendel's 2.14. Each click runs ten thousand more rounds; watch the curve fill in.

Instrument II · Monte Carlo: honest re-runs of the seven F₂ experiments · df = 7

…

simulate

After a few thousand rounds the empirical fraction "as close as Mendel" should settle near five percent. That is what the chi-squared CDF says it ought to be, and the simulator agrees because it carries out the very sampling that the chi-squared distribution summarises in theory. One in twenty is the kind of luck you get if you walk into your garden on the right morning.
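The re-run described above can be sketched in a few lines. This is an illustration under stated assumptions, not the page's own simulator: the generator is mulberry32, a small seeded PRNG swapped in because the page's generator is not specified, and the names `simulateCombinedChi2` and `fractionAsClose` are hypothetical. The per-character totals are those implied by Mendel's published counts.

```javascript
// mulberry32: a small seeded PRNG returning floats in [0, 1).
// An assumption of this sketch; the page's own generator may differ.
function mulberry32(seed) {
  let a = seed >>> 0;
  return function () {
    a = (a + 0x6d2b79f5) | 0;
    let t = Math.imul(a ^ (a >>> 15), 1 | a);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Offspring totals for the seven F2 experiments.
const TOTALS = [7324, 8023, 929, 1181, 580, 858, 1064];
const MENDEL_CHI2 = 2.14; // the combined figure quoted in the text

// One honest re-run: every offspring is an independent Bernoulli(3/4) draw.
function simulateCombinedChi2(rand) {
  let sum = 0;
  for (const n of TOTALS) {
    let dom = 0;
    for (let i = 0; i < n; i++) {
      if (rand() < 0.75) dom++;
    }
    const eDom = 0.75 * n;
    const eRec = 0.25 * n;
    const d = dom - eDom; // the recessive deviation is -d, so d*d serves both cells
    sum += (d * d) / eDom + (d * d) / eRec;
  }
  return sum;
}

// Fraction of honest re-runs that land at least as close to 3:1 as Mendel did.
function fractionAsClose(rounds, seed = 1865) {
  const rand = mulberry32(seed);
  let hits = 0;
  for (let r = 0; r < rounds; r++) {
    if (simulateCombinedChi2(rand) <= MENDEL_CHI2) hits++;
  }
  return hits / rounds;
}

const fraction = fractionAsClose(2000);
console.log("fraction as close as Mendel:", fraction.toFixed(3));
```

Two thousand rounds keep the sketch quick; more rounds narrow the scatter around the analytic lower-tail value, and a different seed shows how much that scatter is.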

Now widen the lens.

III · Fisher 1936 — the whole program, not just the seven

The seven F₂ ratios are only the part of Mendel's paper that every textbook reprints. The paper itself has more: the dihybrid 9:3:3:1 (315, 101, 108, 32), a trihybrid where Mendel painstakingly tested F₃ progeny to sort 639 F₂ plants into twenty-seven genotypic classes, and a constancy test in which he grew the F₃ seedlings of 600 dominant F₂ plants to ask which were truly homozygous. Anthony Edwards (1986a) showed Mendel's full body of experiments can be cleanly decomposed into 84 independent binomial tests. Fisher had done the equivalent decomposition in 1936; he summed it and reported:

Fisher 1936, Table III · the headline figure

// observed combined chi-squared across the 84 independent tests
chi2_observed  = 41.6056
degrees_of_freedom = 84

// the chi-squared at df=84 has expected value 84, std. dev. ≈ 13.
// the upper-tail probability (what Fisher reports)
// is the probability that an honest run fits its predictions
// no better than Mendel's does, i.e. lands at or above 41.6.
P(X >= 41.6 | df=84)  = 0.99993     // ⇒ said the other way:
P(X <= 41.6 | df=84)  = 0.00007     // ≈ 7 in 100,000 honest runs

// recomputed in this browser from the regularized incomplete gamma:
engine.P_lower       = …

Seven in a hundred thousand is not a near miss. It is the venue's record-correcting mode pointed at the founding paper of a science.
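The tail probability can be checked without trusting anyone's table. The sketch below computes the chi-squared lower tail from the regularised lower incomplete gamma function, using the series and continued-fraction split described in Numerical Recipes; the function names are hypothetical and this is not the page's own engine. Fisher worked his figure by hand, so a direct evaluation need not reproduce his last printed digit.

```javascript
// ln Γ(x) for x > 0 by the Lanczos approximation (coefficients as in
// Numerical Recipes' gammln).
function logGamma(x) {
  const cof = [
    76.18009172947146, -86.50532032941677, 24.01409824083091,
    -1.231739572450155, 0.1208650973866179e-2, -0.5395239384953e-5,
  ];
  let y = x;
  let tmp = x + 5.5;
  tmp -= (x + 0.5) * Math.log(tmp);
  let ser = 1.000000000190015;
  for (const c of cof) ser += c / ++y;
  return -tmp + Math.log((2.5066282746310005 * ser) / x);
}

// Series for P(a, x); converges quickly when x < a + 1.
function gammaSeries(a, x) {
  let ap = a;
  let del = 1 / a;
  let sum = del;
  for (let n = 0; n < 500; n++) {
    ap += 1;
    del *= x / ap;
    sum += del;
    if (Math.abs(del) < Math.abs(sum) * 1e-14) break;
  }
  return sum * Math.exp(-x + a * Math.log(x) - logGamma(a));
}

// Continued fraction for Q(a, x) = 1 - P(a, x) (modified Lentz); for x >= a + 1.
function gammaContinuedFraction(a, x) {
  const FPMIN = 1e-300;
  let b = x + 1 - a;
  let c = 1 / FPMIN;
  let d = 1 / b;
  let h = d;
  for (let i = 1; i <= 500; i++) {
    const an = -i * (i - a);
    b += 2;
    d = an * d + b;
    if (Math.abs(d) < FPMIN) d = FPMIN;
    c = b + an / c;
    if (Math.abs(c) < FPMIN) c = FPMIN;
    d = 1 / d;
    const del = d * c;
    h *= del;
    if (Math.abs(del - 1) < 1e-14) break;
  }
  return Math.exp(-x + a * Math.log(x) - logGamma(a)) * h;
}

// Regularised lower incomplete gamma P(a, x).
function regularizedGammaP(a, x) {
  if (x <= 0) return 0;
  return x < a + 1 ? gammaSeries(a, x) : 1 - gammaContinuedFraction(a, x);
}

// P(X <= x) for a chi-squared variable with df degrees of freedom.
function chiSquaredLowerTail(x, df) {
  return regularizedGammaP(df / 2, x / 2);
}

console.log("seven F2 ratios, df = 7:", chiSquaredLowerTail(2.14, 7));
console.log("Fisher's total, df = 84:", chiSquaredLowerTail(41.6056, 84));
```

Working with the lower tail directly avoids subtracting a number like 0.9999… from one, which is where hand computation loses its last digits.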

What Fisher said — and never withdrew — is that the body of Mendel's published counts is too close to its predictions by a factor that no amount of "lucky garden" can explain. What Fisher carefully did not say is that Mendel cheated. Reread the epigraph at the top of this page: he attributes the closeness to "some assistant who knew too well what was expected." The accusation is real; its target is left vague on purpose. Fisher needed Mendel to be right.

IV · The smoking gun Fisher pulled out by hand

Inside the 84 lay one experiment that Fisher singled out and that no one has yet explained away. After the seven F₂ ratios, Mendel went back and tested 600 of his dominant-phenotype F₂ plants to see which were true-breeding. The expected ratio of homozygous to heterozygous among dominant F₂ plants is 1:2 — so among 600 he expected 200 homozygotes and 400 heterozygotes. He reported 201 and 399.

One off the prediction, in either direction. So tidy it almost looks fake on inspection, but here is Fisher's actual catch. Mendel classified each F₂ plant by growing only ten of its F₃ seeds. A heterozygote's offspring show the dominant and recessive forms in a 3:1 ratio, so the probability that all ten F₃ seedlings happen to be dominant, and thus that the parent is misclassified as homozygous, is (3/4)¹⁰ ≈ 5.63%. The expected count of "appears homozygous" is therefore not the naive 200 but the corrected 200 + 400 × 0.0563 ≈ 222.5. Mendel's 201 misses this corrected expectation by more than twenty plants and lands almost exactly on the uncorrected, theoretical value. Slide the bar below to vary the seeds-per-plant: as the number of seeds tested per plant grows, the corrected expectation slides back to 200. The arrow above the count shows what an honest 600-plant test, with the seed budget you've set, would expect.

Instrument III · the constancy test: 600 dominant F₂ plants · 10 seeds tested per plant

seeds tested per plant (Mendel: 10): 10

P(heterozygote misclassified): …

E(appears homozygous | 600 plants): …

Mendel reported: 201

expectation if no correction (k = ∞): 200.0

Mendel − E (with correction at this k): …
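The correction behind the slider is short enough to write out. A minimal sketch, assuming the 1:2 split of true homozygotes to heterozygotes stated above; the function name `expectedApparentHomozygotes` is hypothetical.

```javascript
// Expected number of plants that *appear* homozygous when each plant is
// judged from only `seedsPerPlant` offspring. Assumes one third of the
// dominant-phenotype plants are true homozygotes and two thirds heterozygotes.
function expectedApparentHomozygotes(plants, seedsPerPlant) {
  const trueHomozygotes = plants / 3;
  const heterozygotes = (2 * plants) / 3;
  // A heterozygote passes as homozygous if every tested offspring is dominant.
  const pMisclassified = Math.pow(3 / 4, seedsPerPlant);
  return {
    pMisclassified,
    expected: trueHomozygotes + heterozygotes * pMisclassified,
  };
}

const MENDEL_REPORTED = 201;
const atTen = expectedApparentHomozygotes(600, 10);
console.log((100 * atTen.pMisclassified).toFixed(2) + "%"); // 5.63%
console.log(atTen.expected.toFixed(1));                      // 222.5
console.log((MENDEL_REPORTED - atTen.expected).toFixed(1));  // -21.5

// More seeds per plant means fewer misclassifications, so the
// expectation falls back toward the uncorrected 200.
for (const k of [5, 10, 20, 40]) {
  console.log(k, expectedApparentHomozygotes(600, k).expected.toFixed(1));
}
```

The correction shrinks geometrically with the seed count, which is why ten seeds per plant is few enough to matter.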

The pattern is recurring and specific: Mendel's report lands on the cleanest possible theoretical value, even when the experimental procedure pushes the honest expectation somewhere else. That is harder to write off than "good luck in the garden." It looks like data that was thought about and refined.

V · The hypotheses, none of which closes the case

Eighty years of Mendel-and-Fisher literature has settled on no winner. The candidates, with where they stand:

The dishonest assistant (Fisher 1936)

Fisher's own hypothesis: a helpful gardener counted what he was told to look for. There is no archival evidence of any such assistant, and little prospect of finding any, since Mendel's notebooks were burned at the abbey after his death; this remains a speculation, not a finding.

Stopping rules (several authors)

Counting until the ratio "looked right" and then declaring the experiment done would bias results toward the prediction. Edwards (1986a) and Seidenfeld (1998) argued the actual pattern of deviations is not a sequential-stopping signature — Mendel's data has been not merely truncated but shifted.

Selective publication (Fairbanks & Rytting 2001)

Mendel ran more experiments than he published and showed his readers the cleanest. Pires & Branco (2010) fit a "best-of-two" model and recover most of the closeness with a single selection parameter α ≈ 0.20 — close enough to the data, but not a confession.
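To see how little selection is needed, here is a toy version of a best-of-two rule. The rule used is an assumption of this sketch, one plausible reading rather than the paper's exact specification: an experiment whose goodness-of-fit p-value falls below α is repeated once, and the better-fitting of the two results is reported. Under honest counting such a p-value is uniform on [0, 1], so the sketch draws p-values directly. The generator is mulberry32, and all names are hypothetical.

```javascript
// mulberry32: a small seeded PRNG returning floats in [0, 1).
function mulberry32(seed) {
  let a = seed >>> 0;
  return function () {
    a = (a + 0x6d2b79f5) | 0;
    let t = Math.imul(a ^ (a >>> 15), 1 | a);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Assumed rule: a poor fit (p < alpha) triggers one repeat, and the
// larger p-value (the closer fit) of the two is the one reported.
function reportedPValue(rand, alpha) {
  const first = rand(); // honest p-values are uniform on [0, 1)
  if (first >= alpha) return first;
  const second = rand();
  return Math.max(first, second);
}

// Mean reported p-value and the share of reported fits with p < 0.05.
function summarise(samples, alpha, seed = 1866) {
  const rand = mulberry32(seed);
  let sum = 0;
  let poorFits = 0;
  for (let i = 0; i < samples; i++) {
    const p = reportedPValue(rand, alpha);
    sum += p;
    if (p < 0.05) poorFits++;
  }
  return { meanP: sum / samples, fractionBelow05: poorFits / samples };
}

const honest = summarise(200000, 0);     // alpha = 0: nothing is ever repeated
const selected = summarise(200000, 0.2); // alpha = 0.2: the value quoted above
console.log("honest:  ", honest);
console.log("selected:", selected);
```

Under this rule the poor fits nearly vanish from the reported record while the typical fit improves only modestly, which is the shape of the anomaly: a shortage of bad days rather than a run of miraculous ones.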

A biological cause (Weiling 1986)

Some hidden non-independence between offspring of the same plant would reduce variance below the binomial, making "too close" expected. The literature has mostly rejected this: known biology gives no such effect at anywhere near this scale, and the per-experiment p-values are not distributed as the hypothesis would predict.

A word on the burnt notebooks

Mendel's own work papers were destroyed at St. Thomas's Abbey after his death in 1884; the standard account is that his successor as abbot burned them. That archival loss is why the controversy cannot be closed. Every hypothesis above is a hypothesis about what those notebooks would have shown.

VI · What is settled

The biology is not in dispute. The 3:1 and 9:3:3:1 ratios are real, replicated thousands of times in pea, fruit fly, mouse, and man. The two laws Mendel found — segregation and independent assortment — survived their own peculiar disinterment in 1900 and the molecular revolution since. Nothing in this page touches them.

What is settled is something a little different: the published counts are too close to the predictions for the chi-squared distribution to swallow. That is a fact about a printed table, not about an organism. The most defensible reading, after Fisher and Edwards and the 2008 Franklin volume and Pires & Branco's 2010 model, is roughly: the data was selected, filtered, or counted in a way that made the support for the right answer look stronger than honest sampling would have produced. The history of science still owes the table a candid description; we owe the science a separation between the verdict on the result and the verdict on the report.

The laws are real. The data is too tidy. The honest summary is that these are two different sentences, and a venue that recomputes both is the place to keep them separate.

The Wasteland's first overturning reversed the conclusion of a famous experiment by showing the estimator was biased. This page does not overturn Mendel — and that is the point. The estimator here is not biased; the data is. A finished science can carry, inside its founding paper, a published record that no amount of honesty about the procedure makes sense of. That is its own kind of correction.

The check, shown

live, re-running on this page · agrees with research/closer-than-chance/verify.mjs to four decimals

…

Cousins on the ground

Ground Truth · the venue

This is the seventh entry of the verification venue; the sibling that reverses a claim by showing the estimator is biased — The Cold Hand — is the closest in mode. There the published p-value is wrong because the ruler is bent; here the published p-value is wrong because the data was prepared.

How summaries hide what they cannot describe

The shape the numbers cannot see and the how-grouping-lies trilogy are about a summary that fails to capture a real shape. This page is about a summary that over-captures, by some unknown mechanism in the garden.

Sources

Mendel, G. (1866). Versuche über Pflanzen-Hybriden. Verh. naturf. Ver. Brünn, IV (1865), Abhandlungen, 3–47. Primary source. Read 8 Feb. and 8 Mar. 1865 at the Brünn Naturf. Verein. The seven F₂ counts are listed in §3; the dihybrid and the 600-plant constancy test in §6.

Fisher, R. A. (1936). Has Mendel's Work Been Rediscovered? Annals of Science 1(2): 115–137. The original challenge. Combined χ² = 41.6056 on 84 df (Table III, p. 130). Quoted epigraph at top: pp. 132–133.

Edwards, A. W. F. (1986a). Are Mendel's Results Really Too Close? Biological Reviews 61(4): 295–312. The 84-binomial decomposition and the per-test p-value distribution; the most careful statistical re-analysis since Fisher.

Weiling, F. (1986). Real or apparent suppression of Mendel's data? Folia Mendeliana 21: 57–76. The biological-variance hypothesis, generally rejected by the subsequent literature.

Fairbanks, D. J., & Rytting, B. (2001). Mendelian controversies: a botanical and historical review. American Journal of Botany 88(5): 737–752. Selective publication: Mendel reported the cleanest of the larger set of experiments he ran.

Franklin, A., Edwards, A. W. F., Fairbanks, D. J., Hartl, D. L., & Seidenfeld, T. (2008). Ending the Mendel-Fisher Controversy. University of Pittsburgh Press. The comprehensive review. Verdict: the data is biased; the mechanism is not certain; outright fraud is unlikely. upittpress.org/books/9780822959861/

Pires, A. M., & Branco, J. A. (2010). A Statistical Model to Explain the Mendel–Fisher Controversy. Statistical Science 25(4): 545–565. A "best-of-two" reporting model: a single parameter α ≈ 0.20 reproduces most of the closeness. χ²/df recomputed there agrees with Fisher to four decimals. projecteuclid.org/.../10-STS342

Numerics: the chi-squared CDF on this page is computed from the regularised lower incomplete gamma function (Numerical Recipes' series + continued-fraction form). The seven F₂ chi-squareds and Fisher's combined figure are independently verified offline by research/closer-than-chance/verify.mjs; the Monte Carlo on this page agrees with the analytic lower-tail probability to within 0.5% after 200,000 honest runs.