§1
Background
Sports forecasting reports outcomes as point predictions: a single win probability, a single projected margin, a single ranking. For a regular-season game between two teams that have each played seventy games, point predictions are usually fine. For a playoff series where the same two teams meet at most seven times, a single number hides the thing we actually want to know: how much evidence stands behind it.
Bayesian inference is the natural fit. The Beta family of distributions is closed under Bernoulli sampling, so the per-game win probability has a closed-form posterior after every game in a series. The same posterior gives a closed-form joint distribution over the eventual winner and the series length, and a credible interval against which the moneyline-implied probability can be checked. Each of these objects is a probability statement about an unknown parameter, not a long-run frequency claim about a procedure.
The four modules of this project are each one chapter of Wackerly, Mendenhall and Scheaffer (7th ed.) applied to a best-of-seven series. Module 1 is maximum-likelihood estimation in a paired-comparison model. Module 2 is conjugate Bayesian updating. Module 3 is the joint distribution of two random variables, plus its marginals and conditionals. Module 4 is hypothesis testing, multiple-testing correction, and a worked decision-theoretic rule.
§2
Thesis and goal
The question this project answers is:
For an NBA playoff series, what does a Bayesian model with a sensible prior say about the win probability, the eventual winner, and the series length? Where does it disagree with the sportsbook, and how confident is it in those disagreements?
The goal is uncertainty quantification, not point prediction. Every number the app shows is paired with an interval whose interpretation is fixed by the derivation. A 95% equal-tailed credible interval on p is the set of values such that the posterior assigns 2.5% probability below the lower endpoint and 2.5% above the upper endpoint; a 95% Wald confidence interval on a Bradley-Terry log-strength is the asymptotic interval whose long-run coverage is 95% under repeated sampling of seasons of the same length. The two are not the same object and the app keeps them distinct.
The market-vs-model section is the most direct test of the framework. If the de-vigged moneyline-implied probability sits outside the 95% credible interval, the model rejects the hypothesis that the market price is a plausible value of p. The rejection is a statement about the model and its prior, not a claim of a profit edge.
§3
Project design
The four modules build on each other. Module 1 fits a schedule-adjusted estimate of each team's per-game strength from the regular-season results. Module 2 takes the two strengths in a matchup, turns them into a Beta prior on the per-game win probability p, and updates that prior game by game as a series unfolds. Module 3 reuses the same posterior to compute the joint distribution of the eventual winner and the series length in closed form. Module 4 takes the Bayesian credible interval that comes out of Module 2 and uses it to test the sportsbook's de-vigged moneyline against the model.
The data spine is the public NBA stats endpoints, one row per regular-season game, feeding a small local database that the four pages of the app read from. Moneylines on the watchlist are entered by the user. Nothing user-specific is stored.
The four pages of the app correspond one-to-one with the four modules: Team strength, Series tracker, Joint, and Watchlist. The next four sections describe what each one computes.
§4 · Module 1
Bradley-Terry team strength
The Bradley-Terry model is a paired-comparison model that assigns each team i a positive strength θi. The probability that team i beats team j in any single game between them is
P(i beats j | θ) = θi / (θi + θj).
The reparameterisation βi = ln θi − ln θ1 anchors the model at team 1 and turns the score equations into a logistic regression on game-level indicators. The data are the season's pairwise win counts wij.
MLE via MM iteration
The likelihood has no closed-form maximiser, so the MLE is fit by the minorisation-maximisation iteration of Hunter (2004). The update is monotone in the log-likelihood and converges from any positive starting vector under the connectedness condition of Ford (1957) on the win graph. To keep that condition satisfied unconditionally, every pairwise win count is augmented by ε = 0.5 before the fit, which is the pseudo-count regulariser discussed in Hunter §6 to §7.
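As a sketch of that iteration, assuming the season's results have already been aggregated into a square wins matrix (wins[i, j] = games i won against j), the update is a few lines of NumPy; the function name bradley_terry_mm and the array layout are illustrative, not the app's actual interface.

```python
import numpy as np

def bradley_terry_mm(wins, eps=0.5, tol=1e-10, max_iter=10_000):
    """Hunter's (2004) MM iteration for Bradley-Terry strengths.

    wins[i, j] = times team i beat team j; eps is the pseudo-count
    that keeps the win graph connected (Ford, 1957) so the MLE
    exists and the iteration converges from any positive start.
    """
    w = wins.astype(float) + eps        # regularised pairwise wins
    np.fill_diagonal(w, 0.0)
    n = w + w.T                         # games played between i and j
    total_wins = w.sum(axis=1)          # W_i, pseudo-counts included
    theta = np.ones(len(w))
    for _ in range(max_iter):
        # MM update: theta_i <- W_i / sum_j n_ij / (theta_i + theta_j)
        denom = (n / (theta[:, None] + theta[None, :])).sum(axis=1)
        new = total_wins / denom
        new /= new[0]                   # anchor team 1 so beta_1 = 0
        if np.max(np.abs(new - theta)) < tol:
            return new
        theta = new
    return theta                        # beta_hat = np.log(theta)
```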
Wald confidence intervals
On the log-strength scale the model is a logistic regression, so the inverse Fisher information matrix supplies asymptotic standard errors on each β̂i. The 95% Wald interval is
β̂i ± z0.975 · SE(β̂i).
The intervals on the team-strength page are these intervals, painted on a bad-to-good gradient track. A team whose interval sits entirely to the right of the league mean is one whose schedule-adjusted strength the data has localised; a team whose interval straddles the mean is one whose strength the season has not yet pinned down.
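A companion sketch for the intervals, under the same illustrative layout: the Fisher information of the logistic parametrisation is assembled from the fitted strengths and the games-played matrix, and the anchored team is dropped before inversion (its β is fixed at zero, so it gets a degenerate interval here).

```python
import numpy as np
from scipy.stats import norm

def bt_wald_intervals(theta, n_games, level=0.95):
    """Wald intervals on beta_i = ln theta_i from inverse Fisher info.

    theta is anchored so theta[0] = 1; n_games[i, j] = games played
    between teams i and j over the season.
    """
    p = theta[:, None] / (theta[:, None] + theta[None, :])
    w = n_games * p * (1.0 - p)          # per-pair information weights
    info = np.diag(w.sum(axis=1)) - w    # Fisher information in beta
    cov = np.linalg.inv(info[1:, 1:])    # drop the anchored team
    se = np.concatenate([[0.0], np.sqrt(np.diag(cov))])
    z = norm.ppf(0.5 + level / 2.0)
    beta = np.log(theta)
    return beta - z * se, beta + z * se
```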
§5 · Module 2
Beta-Binomial series tracker
For a series between teams A and B, let p denote the probability that A wins any single game. Conditional on p, the games are modelled as i.i.d. Bernoulli trials. The conjugate prior on p is a Beta distribution.
Prior from Module 1
The prior p ~ Beta(α0, β0) is centred at the Bradley-Terry win probability:
α0 + β0 = prior strength, α0 / (α0 + β0) = θ̂A / (θ̂A + θ̂B).
The effective sample size α0 + β0 is exposed as a numeric input on the series tracker and the watchlist. A small value puts most of the weight on the games actually observed in the playoff series; a large value keeps the posterior close to the prior unless the series goes long. The default is ten prior games, which makes the posterior after a typical six-game series roughly a 60/40 mixture of prior and data.
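Constructing that prior is two lines once the Module 1 strengths are in hand; series_prior and its arguments are illustrative names for a sketch of the mapping.

```python
def series_prior(theta_a, theta_b, prior_strength=10.0):
    """Beta prior centred at the Bradley-Terry head-to-head probability,
    with prior_strength as the effective sample size alpha0 + beta0."""
    p0 = theta_a / (theta_a + theta_b)   # Bradley-Terry win probability
    return prior_strength * p0, prior_strength * (1.0 - p0)
```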
Conjugate posterior
After k wins by A in the first n games of the series, the posterior is closed form:
p | data ~ Beta(α0 + k, β0 + n − k).
The posterior mean is a precision-weighted average of the prior mean and the in-series MLE k/n:
E[p | data] = (α0 + k) / (α0 + β0 + n).
With every game added on the series page, exactly one of the two shape parameters increases by one and the density curve is redrawn from the new pair. The plot's vertical line marks the posterior mean; the shaded band beneath the curve marks the 95% credible interval.
Credible interval
The 95% equal-tailed credible interval is the pair (L, U) such that
P(p < L | data) = 0.025, P(p > U | data) = 0.025.
Both endpoints are quantiles of Beta(α0 + k, β0 + n − k), and they are evaluated by the regularised incomplete-beta inverse. The credible interval is a probability statement about p: the posterior assigns 95% probability to values inside the interval. It is not a long-run coverage claim about the procedure.
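A sketch of the update and the interval together, using scipy's Beta quantile function (the regularised incomplete-beta inverse the text refers to); posterior_summary is an illustrative name.

```python
from scipy.stats import beta

def posterior_summary(alpha0, beta0, k, n, level=0.95):
    """Conjugate update after k wins by A in n games, plus the
    equal-tailed credible interval from the Beta quantiles."""
    a, b = alpha0 + k, beta0 + (n - k)   # closed-form posterior
    mean = a / (a + b)
    lo = beta.ppf((1 - level) / 2, a, b)
    hi = beta.ppf(1 - (1 - level) / 2, a, b)
    return mean, (lo, hi)
```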
§6 · Module 3
Joint distribution of (winner, length)
Once the posterior on p is in hand, the eventual winner W ∈ {A, B} and the series length N ∈ {4, 5, 6, 7} have a joint distribution induced by the same model. Conditional on p, the length-given-winner probability is negative-binomial.
Closed-form joint
Let r = 4 be the wins needed to take the series. Conditional on p, the probability that A wins the series in exactly n games is
P(W=A, N=n | p) = C(n−1, r−1) · p^r (1−p)^(n−r).
Marginalising against the Beta posterior on p turns each cell into a ratio of Beta functions, which gives an eight-cell joint with no numerical integration:
P(W=A, N=n | data) = C(n−1, r−1) · B(α+r, β+n−r) / B(α, β).
The joint page renders this as a 2×4 heatmap with marginals attached on the top and right. Summing across rows recovers the marginal winner probability; summing across columns recovers the marginal length distribution. The eight cells sum to one as a numerical check on the implementation.
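A sketch of the eight-cell computation, using betaln for numerical stability; the dict-of-rows layout is illustrative, and the sum-to-one check above is a single assert at the end.

```python
import numpy as np
from math import comb
from scipy.special import betaln

def joint_winner_length(a, b, r=4):
    """P(W, N | data) under the Beta(a, b) posterior on p: the
    negative-binomial kernel marginalised into a Beta-function ratio."""
    lengths = range(r, 2 * r)            # N in {4, 5, 6, 7}
    joint = {}
    for winner, (aa, bb) in (("A", (a, b)), ("B", (b, a))):
        joint[winner] = np.array([
            comb(n - 1, r - 1)
            * np.exp(betaln(aa + r, bb + n - r) - betaln(aa, bb))
            for n in lengths
        ])
    # the eight cells must sum to one for any posterior (a, b)
    assert abs(joint["A"].sum() + joint["B"].sum() - 1.0) < 1e-9
    return joint
```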
Conditioning on length
The condition-on-length toggle slices the joint along a single column. Conditional on N=7, the winner probability is
P(W=A | N=7, data) = P(W=A, N=7 | data) / P(N=7 | data).
The qualitative behaviour the slice exhibits: even when A is the favourite, conditioning on a seven-game series pulls the winner probability toward 1/2. A series that goes the distance is one where the per-game evidence has been close to even, and the conditional reflects that.
§7 · Module 4
Hypothesis testing the market
For each upcoming game on a slate, the sportsbook publishes a moneyline pair for teams A and B. The implied probabilities of the two outcomes sum to more than one, the difference being the bookmaker's overround or vig. A de-vigging procedure converts the moneyline pair into a fair probability p̃mkt that the model can compare against.
De-vigging the moneyline
The watchlist toggle exposes two methods. The default is proportional de-vigging: divide each implied probability by their sum. The alternative is Shin (1993), which solves for the level of insider trade that the bookmaker is hedging against and removes it. Shin's correction has a closed form for two outcomes:
p̃shin(i) = (√(z² + 4(1−z) qᵢ²/S) − z) / (2(1−z)),
where qᵢ are the raw implied probabilities, S = qA + qB is the booksum, and z is the Shin mixing parameter that solves the consistency equation in Shin (1993). On a two-outcome market the parameter has a closed form, so no root-finding is needed. When the overround is small, Shin and proportional de-vigging agree to within a fraction of a percent; when it is large or asymmetric, Shin pushes the favourite further from 1/2 than proportional de-vigging does.
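A sketch of both methods for a two-outcome market. The closed-form z below is one algebraic rearrangement of Shin's consistency condition p̃shin(A) + p̃shin(B) = 1 (the paper states the condition rather than this exact expression), so it is worth checking against a root-finder before relying on it.

```python
import math

def devig_proportional(q_a, q_b):
    """Divide each raw implied probability by the booksum."""
    s = q_a + q_b
    return q_a / s, q_b / s

def devig_shin(q_a, q_b):
    """Shin (1993) de-vig for two outcomes, with z in closed form."""
    s = q_a + q_b                        # booksum, > 1 when there is vig
    d2 = (q_a - q_b) ** 2
    z = (s - 1.0) * (s - d2) / (s * (1.0 - d2))
    def fair(q):
        return (math.sqrt(z * z + 4.0 * (1.0 - z) * q * q / s) - z) / (2.0 * (1.0 - z))
    return fair(q_a), fair(q_b)
```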
Credible-region rule
The primary decision rule on the watchlist is Bayesian: a game is flagged when the de-vigged market probability falls outside the model's 95% credible interval on p. Formally,
reject ⇔ p̃mkt ∉ [L, U].
This is the Bayesian dual of a hypothesis test: the null is "the market price is a plausible value of p", and the credible region is the set of values the data does not rule out. The test inherits its interpretation from the posterior, not from the long-run behaviour of a procedure. The watchlist also reports a posterior tail probability at p̃mkt, the two-sided posterior mass on the wrong side of the market price, which is the smallest α at which the rule still flags the row.
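A sketch of the rule and the tail probability together; market_flag is an illustrative name, and (a, b) are the posterior shape parameters from Module 2.

```python
from scipy.stats import beta

def market_flag(a, b, p_mkt, level=0.95):
    """Flag when the de-vigged market probability falls outside the
    equal-tailed credible interval; also return the two-sided
    posterior tail mass, the smallest alpha that still flags."""
    lo = beta.ppf((1 - level) / 2, a, b)
    hi = beta.ppf(1 - (1 - level) / 2, a, b)
    cdf = beta.cdf(p_mkt, a, b)
    tail = 2.0 * min(cdf, 1.0 - cdf)
    return not (lo <= p_mkt <= hi), tail
```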
Frequentist contrasts
For symmetry, the same data also feeds two frequentist tests of the null p = p̃mkt: a large-sample two-sided Z test on the Bernoulli MLE k/n with the null-variance form
Z = (k/n − p̃mkt) / √(p̃mkt(1 − p̃mkt) / n),
and the binomial likelihood-ratio statistic G = −2 ln λ, referred to chi-square(1) under Wilks' theorem. For the small n a single playoff series supplies, both tests have very little power and rarely reject. Showing them next to the credible-region decision is part of the point: the project frames Bayesian and frequentist inference as parallel readings of the same evidence, not as competitors.
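A sketch of both statistics from k wins in n games; the boundary cases k = 0 and k = n are guarded so the log-likelihood ratio stays finite.

```python
import math
from scipy.stats import chi2, norm

def frequentist_tests(k, n, p_mkt):
    """Two-sided Z test (null-variance form) and binomial LRT of
    H0: p = p_mkt; returns (statistic, p-value) for each."""
    p_hat = k / n
    z = (p_hat - p_mkt) / math.sqrt(p_mkt * (1.0 - p_mkt) / n)
    p_z = 2.0 * (1.0 - norm.cdf(abs(z)))
    xlogy = lambda x, y: 0.0 if x == 0 else x * math.log(y)
    g = 2.0 * (xlogy(k, p_hat / p_mkt)
               + xlogy(n - k, (1.0 - p_hat) / (1.0 - p_mkt)))
    p_g = 1.0 - chi2.cdf(g, df=1)
    return (z, p_z), (g, p_g)
```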
Multiple-testing correction
With several series in flight on a slate, the family-wise false-rejection rate compounds. The watchlist exposes a three-way switch: uncorrected (per-game α), Bonferroni, or Benjamini-Hochberg. Bonferroni controls the family-wise error rate by tightening every individual test to level α/m, where m is the slate size; Benjamini-Hochberg controls the false-discovery rate at level α by ordering the per-game tail probabilities and rejecting every test at or below the largest rank k whose ordered p-value is at most k·α/m. Benjamini-Hochberg is the project default because Bonferroni on a typical slate suppresses so many real disagreements that the page becomes uninformative; Bonferroni stays available as a sensitivity check.
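A sketch of the step-up procedure on a slate's tail probabilities; benjamini_hochberg is an illustrative name.

```python
import numpy as np

def benjamini_hochberg(p_values, alpha=0.05):
    """BH step-up: reject every test at or below the largest rank k
    whose ordered p-value is <= k * alpha / m."""
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    order = np.argsort(p)
    below = p[order] <= alpha * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k_max = np.nonzero(below)[0].max()   # largest qualifying rank
        reject[order[: k_max + 1]] = True    # reject all up to that rank
    return reject
```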
Kelly fraction
For a binary bet at quoted decimal odds d with model probability p, the Kelly fraction is the maximiser of E[ln W]. The closed form is
f* = (b p − q) / b, b = d − 1, q = 1 − p.
The fraction is positive when p > 1/d and zero otherwise. Because f* is sensitive to errors in p when p is close to 1/d, the watchlist reports three Kelly variants on every row: the full fraction f* with the posterior mean substituted for p, half-Kelly f*/2, and a lower-credible-interval Kelly that substitutes the lower endpoint of the posterior interval for p. The lower-CI variant is the most conservative of the three and reaches zero when the credible interval still contains 1/d. All three are shown for context; none is a betting recommendation.
Kelly uses the quoted decimal odds, not the de-vigged probability. The de-vigged probability is the right object for comparing the model with the market because it removes the bookmaker's overround; the quoted odds are the right object for sizing a bet because they describe the actual payout.
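A sketch of the three variants on one row; the clip to zero encodes "no bet" whenever the substituted probability does not clear 1/d.

```python
def kelly_variants(p_mean, p_lower, decimal_odds):
    """Full, half, and lower-CI Kelly at quoted decimal odds d.
    Positive only when the substituted p exceeds 1/d."""
    b = decimal_odds - 1.0               # net payout per unit staked
    f = lambda p: max(0.0, (b * p - (1.0 - p)) / b)
    full = f(p_mean)
    return full, full / 2.0, f(p_lower)  # full, half, lower-CI
```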
§8
What the app shows you
The four pages each render one of the four modules end to end on live data.
- Team strength. The Bradley-Terry MLE for the current season, ranked by log-strength β̂i, with 95% Wald intervals painted on a bad-to-good gradient. A naive win-percentage ranking sits beside it; a rank shift of three or more between the two flags a team whose record is most distorted by schedule strength.
- Series tracker. For any pair of teams, the prior is set by Module 1 and the posterior on p updates after every game added to the series. The page renders the density curve, the posterior mean, the 95% credible interval, and the predictive probabilities for the eventual winner and series length.
- Joint. The 2×4 heatmap of P(W, N | data) with both marginals and the condition-on-length slice. Cells sum to one; the marginals on the edges are the row and column sums.
- Watchlist. The slate-level table where the model and the market are compared. Each row carries the model's posterior mean, the de-vigged market probability, the credible interval, the reject/no-reject decision under the chosen correction, and the Kelly fraction. Sortable by any column.
§9
Limitations and future directions
Every assumption of the model is a place where the model can be wrong. The ones below are the assumptions that matter the most.
Within-series IID
Module 2 treats the games of a series as i.i.d. Bernoulli given p. Real series are not i.i.d.: there is a home-and-away pattern, scheduling rest, in-series momentum, adjustments between coaches, and minor injuries that accumulate across games. The Beta-Binomial machinery absorbs all of these into a single per-game probability and reports its uncertainty; it does not separate them.
Home-court advantage
The Bradley-Terry fit in Module 1 includes a multiplicative home factor, but the series tracker does not propagate it to the per-game posterior. The series posterior is the same regardless of which team has home court in any given game. Adding a home-aware likelihood to Module 2 is the single largest model improvement that is still in scope of the original textbook chapters.
No injury data
Lineups can swing a series posterior more than a single game's evidence. The model does not see them. A practical extension is to widen the prior or shift its mean when a top-eight rotation player is ruled out, but the project does not currently ingest an injury report.
Prior sensitivity
The prior is informed by the Bradley-Terry MLE, which is itself sensitive to the pseudo-count ε and to the connectedness of the win graph. With the default α0 + β0 = 10, the posterior after a seven-game series is roughly 40% data and 60% prior; raising the prior strength to 30 pushes the mixture heavily toward the prior. The control on the series page exists so that this sensitivity is visible rather than hidden.
Time stationarity
The Bradley-Terry fit assumes a single team strength per season. Real strengths drift inside a season as players return from injury, get traded, or change roles. A future iteration could fit a state-space variant, at the cost of losing the closed-form MLE.
The market is not just opinion
Sportsbook prices reflect public action, balance-of-book considerations, and the bookmaker's loss-limiting behaviour against suspected sharp action, on top of the market's forecast. A rejection of p̃mkt ∈ [L, U] is a rejection of the joint hypothesis "the price is a plausible value of p", not just "the market forecast is wrong". The watchlist's framing is "where does the model disagree with the market", not "where can we beat the market".
Future directions
- A home-aware likelihood for Module 2, lifting the home factor from Module 1 into the per-game probability.
- An empirical reliability diagnostic that compares Shin and proportional de-vigging on a real season's slate of moneylines and graded outcomes.
- A Module 5 chi-square goodness-of-fit on the historical distribution of best-of-seven series lengths against the model's predicted marginal P(N).
- An injury-aware prior that widens or shifts the Beta before the series begins, given a publicly reported rotation absence.
- Caching of the per-season Bradley-Terry fit so that the watchlist endpoint does not re-run the MM iteration on every request.
- An automated odds feed for the watchlist. The ingest path that snapshots moneylines from The Odds API is already wired up on the backend; what is missing is a read route that surfaces the latest snapshot and a slate-editor button that pulls those numbers into the form, so a user can pick up consensus prices instead of typing them in by hand.
§10
Sources
The textbook used throughout the derivations is Wackerly, Mendenhall and Scheaffer, Mathematical Statistics with Applications, 7th edition (2008). The works listed below are the ones each module's bibliography cites and that the implementation actually uses.
- Wackerly, D. D., Mendenhall, W. and Scheaffer, R. L. (2008). Mathematical Statistics with Applications, 7th edition. Thomson Brooks/Cole. Source for every closed-form result the four modules invoke: maximum likelihood and the Wald interval (§9.7 to §9.8), the Beta distribution and conjugate Bayesian updating (§4.7, §16.2 to §16.5), negative-binomial structure (§3.6), large-sample tests and their CI duality (§10.2 to §10.6), the likelihood-ratio test and Wilks' theorem (§10.11), and the Kelly criterion's expected-log argument (§5.5).
- Hunter, D. R. (2004). "MM algorithms for generalized Bradley-Terry models." The Annals of Statistics, 32(1), 384-406. Source for the cyclic minorisation-maximisation iteration in Module 1, the multiplicative home-field parameter, and the pseudo-count regulariser.
- Ford, L. R. (1957). "Solution of a ranking problem from binary comparisons." The American Mathematical Monthly, 64(8), 28-33. Source of the win-graph connectedness condition under which the Bradley-Terry MLE exists and is unique; Hunter's convergence proof in Module 1 takes this as Assumption 1.
- Lange, K. (1995). "A gradient algorithm locally equivalent to the EM algorithm." Journal of the Royal Statistical Society, Series B, 57(2), 425-437. Source of the Liapounov-style global-convergence theorem cited by Hunter's proof in Module 1.
- Shin, H. S. (1993). "Measuring the incidence of insider trading in a market for state-contingent claims." The Economic Journal, 103(420), 1141-1153. Source of the Shin de-vigging method exposed as the alternative to proportional de-vigging on the watchlist.
- Kelly, J. L. (1956). "A new interpretation of information rate." Bell System Technical Journal, 35(4), 917-926. Original derivation of the optimal log-growth fraction reported on the watchlist.
- Benjamini, Y. and Hochberg, Y. (1995). "Controlling the false discovery rate: a practical and powerful approach to multiple testing." Journal of the Royal Statistical Society, Series B, 57(1), 289-300. Source of the false-discovery-rate procedure the watchlist uses as its default slate-level correction.
Data sources: the public NBA stats endpoints, accessed through the nba_api Python wrapper. Regular-season game results, one row per game.
§11
Remarks
Dr. McCarty, I have had you for Calculus III, Abstract Algebra, Complex Variables, Probability, and now this class. I have not been your best student in any of them, and that is something I think about. I built this project because I wanted you to see that I do care about mathematics and computer science, and that your classes are a big part of the reason I do. The way you work through a concept in lecture is something I want to keep doing in my own work, whether I end up on models like the ones in this app or on something completely different.
Wherever I land after graduation, I would like to stay in touch. If I build something I think you would find interesting, or come across a derivation that reminds me of one of your lectures, I would be glad to send it your way. I hope this project is up to your standards. Thank you for everything you have taught me, and for being one of the few professors I will actually miss.