The short version
- 84.9% on our highest-confidence picks (top-conviction subset, n=52, 2025-26 holdout).
- 75.6% across 2026 fights (n=90, model frozen 2026-01-01 — no 2026 data in training).
- 71.59% on consensus picks (accuracy-weighted four-model ensemble, n=428, 2025-26 holdout).
- Independently checked for data leakage three times — two separate Opus reviews plus an independent re-verification. Walk-forward validated. Zero future-data leakage on the production path.
Each of those figures holds for the sample it was measured on and has been verified against a dated internal artifact (last verified 2026-05-31). The rest of this page gives the full context: how every number is measured, the sample size behind it, and the honest caveats that make the strong ones trustworthy.
The model: four voices, one prediction
FightIQ's production prediction is a consensus of four signals, blended by a logistic-regression stacker:
- V14 BLEND: the original dual ensemble (Logistic Regression + XGBoost + LightGBM + CatBoost), run in both market-aware and market-blind paths. Our longest-running voice.
- V15 + feature packs: a newer model adding rolling-window form (last 3/5/10 fights), Glicko-style ratings, cross-organisation Elo, and fighter rankings. The single most accurate model on its own.
- Paperclip: a structurally different model with a separate codebase, independent feature engineering, and a different data pipeline, though it ultimately draws on the same public UFCStats fight record. V14 and V15 share a dataset and a feature philosophy, so their errors are heavily correlated; Paperclip makes different mistakes, so it carries the most independent signal in the stack and the stacker weights it most heavily.
- Market closing odds: the betting market is a strong predictor in its own right. We treat it as a feature, not gospel.
A logistic-regression meta-model learns how much to trust each voice on each fight, and outputs one probability plus a confidence tier.
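The blending step can be sketched as follows. This is a minimal illustration of how a logistic-regression stacker combines four probabilities at prediction time, not FightIQ's production code: the weights, the bias, the choice of log-odds as inputs, and the function names are all hypothetical.

```python
import math

def logit(p: float) -> float:
    """Convert a probability to log-odds, clipping to avoid infinities."""
    p = min(max(p, 1e-6), 1 - 1e-6)
    return math.log(p / (1 - p))

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights: illustrative values only, not the trained stacker.
# Paperclip gets the largest weight to mirror the description above.
WEIGHTS = {"v14": 0.20, "v15": 0.25, "paperclip": 0.35, "market": 0.30}
BIAS = 0.0

def stack(probs: dict[str, float]) -> float:
    """Blend each voice's win probability for fighter A into one probability."""
    z = BIAS + sum(WEIGHTS[name] * logit(p) for name, p in probs.items())
    return sigmoid(z)

blended = stack({"v14": 0.62, "v15": 0.66, "paperclip": 0.71, "market": 0.68})
```

In a real stacker the weights would be fitted on out-of-sample predictions from each voice, so the meta-model never sees a base model's answers on fights that model was trained on.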
Why four and not one?
V14 and V15 are close cousins — same dataset, same feature philosophy — so they tend to be right and wrong together. Paperclip and the market are the diversity that makes the ensemble more than the sum of its parts. In our own testing, blending the voices produced a substantially larger accuracy gain than adding any individual new feature did — the lift lives in the ensemble, not in the next clever feature. The blend is the product.
On V14: we briefly planned to retire V14 in May 2026, then reversed. It stays in the stack as insurance — if Paperclip ever stops being maintained or its data drifts, V14 is our backup independent voice. Keeping it costs us almost nothing and removes a single-point-of-failure risk.
Confidence tiers
Not every pick is equal. Each prediction lands in a tier:
| Tier | What it means | How to read it |
|---|---|---|
| LOCK | Highest conviction | Strongest historical accuracy — but see the honest footnote below |
| HIGH | Strong conviction | High hit rate at more volume than LOCK |
| MED | Solid, worth tracking | Decent edge, more variance |
| LOW / TOSS-UP | Coin flip | We tell you when we don't know |
On the 84.9% number: that figure is our highest-confidence subset (n=52) — the picks where the model's own probability is strongest. It's related to, but not the same thing as, the market-favorite-defined LOCK tier below; we keep the two distinct rather than blur them.
Honest footnote on LOCK: the LOCK tier fires when our pick aligns with an extreme market favorite (−300 or shorter at close). Its high hit rate is therefore partly genuine model signal and partly mechanical agreement with a market that is itself usually right about heavy favorites. We flag this rather than let the number stand unqualified.
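To make the −300 threshold concrete, here is a short sketch of the standard conversion from American odds to implied probability. The conversion formula is standard; the `is_lock` helper and its arguments are hypothetical names used only for illustration.

```python
def implied_probability(american_odds: int) -> float:
    """Convert American odds to the book's implied win probability (vig included)."""
    if american_odds < 0:
        return -american_odds / (-american_odds + 100)
    return 100 / (american_odds + 100)

LOCK_THRESHOLD = -300  # from the text: -300 or shorter at close

def is_lock(pick_is_favorite: bool, closing_odds: int) -> bool:
    """Hypothetical check: our pick is the favorite, priced at -300 or shorter."""
    return pick_is_favorite and closing_odds <= LOCK_THRESHOLD

print(implied_probability(-300))  # 0.75
```

A −300 favorite is already priced at a 75% implied chance before any model is consulted, which is why part of the LOCK hit rate is mechanical.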
How we validate (and why that matters)
A prediction model is only as honest as its test. Four things keep ours straight:
Walk-forward validation, not a random split. This is the single most important thing to understand, and the first thing any skeptic should check. We do not shuffle all fights into one pool and randomly hold some out: that would let the model learn from fights that happened after the one it is predicting, which inflates accuracy and says nothing about real-world performance. Instead we move forward in time: to predict a fight, the model is trained only on fights that occurred strictly before it, which is how it is actually used.
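A minimal sketch of that rule, assuming each fight is a dict with a `date` field (the field name and the per-fight granularity are assumptions; a production pipeline might retrain per card instead):

```python
from datetime import date

def walk_forward_splits(fights):
    """Yield (training_fights, fight_to_predict) with training strictly earlier."""
    ordered = sorted(fights, key=lambda f: f["date"])
    for fight in ordered:
        # Strict inequality: fights on the same date never train each other.
        train = [f for f in ordered if f["date"] < fight["date"]]
        if train:
            yield train, fight

fights = [
    {"id": 1, "date": date(2024, 1, 13)},
    {"id": 2, "date": date(2024, 1, 13)},  # same card as fight 1
    {"id": 3, "date": date(2024, 2, 17)},
]
splits = list(walk_forward_splits(fights))
```

Here only fight 3 gets a training set, made up of fights 1 and 2. The first card has nothing earlier to learn from, so it is skipped.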
Career stats updated after feature extraction. A fighter's record is frozen as it stood before each fight, so their future results never leak backward into a past prediction. An audit removed six career-aggregate features that had been leaking in exactly this way.
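The ordering matters more than the statistics themselves. A sketch of the snapshot-then-update pattern, using a made-up win/loss record as the only career stat (the tuple layout and field names are assumptions):

```python
from collections import defaultdict

def build_features(fights):
    """Extract pre-fight records, then update them with the result.

    `fights` must be in chronological order; each is (fighter_a, fighter_b, winner).
    """
    record = defaultdict(lambda: {"wins": 0, "losses": 0})
    rows = []
    for a, b, winner in fights:
        # 1) Snapshot both records as they stood BEFORE this fight.
        rows.append({
            "a": a, "b": b,
            "a_wins": record[a]["wins"], "a_losses": record[a]["losses"],
            "b_wins": record[b]["wins"], "b_losses": record[b]["losses"],
        })
        # 2) Only now fold the result into the running records.
        loser = b if winner == a else a
        record[winner]["wins"] += 1
        record[loser]["losses"] += 1
    return rows

rows = build_features([("Ann", "Bea", "Ann"), ("Ann", "Cy", "Cy")])
```

Swapping steps 1 and 2 would put the result of each fight into its own feature row, which is the kind of backward leak described above.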
A defined holdout window, chosen for a reason. Our headline accuracy and ROI numbers are measured on the 2025-26 holdout: recent fights that the evaluation treats as unseen. We deliberately don't lead with all-time numbers. The COVID-era cards (2020-2021, with empty arenas, disrupted camps, and short-notice replacements) are statistically unrepresentative of how the sport behaves now, and including them muddies the read. Recent, representative fights are the honest test.
Independent leak audits. The pipeline has been checked for data leakage three separate times — two independent Opus reviews plus a separate re-verification. All came back clean on the production path. One Opus review found a leak in an unused legacy backtest file — which we don't ship from — and we isolated it. The re-verification ran a shuffled-label baseline: with the outcome labels randomised, an honest model should score ~50% by chance — ours scored 50.9%, against 65%+ on the real labels. That ~14-point gap is a decisive no-leak signal.
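The shuffled-label check can be illustrated on synthetic data. Everything below is a stand-in: the one-feature nearest-centroid model, the generated data, and the 3,000/1,000 split are assumptions chosen to keep the sketch self-contained, not a description of the actual re-verification.

```python
import random

def fit_centroids(xs, ys):
    """Tiny stand-in model: the mean feature value of each class."""
    means = {}
    for label in (0, 1):
        vals = [x for x, y in zip(xs, ys) if y == label]
        means[label] = sum(vals) / len(vals)
    return means

def predict(means, x):
    """Predict the class whose centroid is closest to x."""
    return min(means, key=lambda label: abs(x - means[label]))

def holdout_accuracy(xs, ys, split):
    """Train on the first `split` rows, score on the rest."""
    means = fit_centroids(xs[:split], ys[:split])
    hits = sum(predict(means, x) == y for x, y in zip(xs[split:], ys[split:]))
    return hits / (len(xs) - split)

rng = random.Random(0)
ys = [rng.randint(0, 1) for _ in range(4000)]
xs = [rng.gauss(1.0 if y else -1.0, 1.0) for y in ys]  # synthetic signal

shuffled = ys[:]
rng.shuffle(shuffled)  # break any real link between features and outcomes

real_acc = holdout_accuracy(xs, ys, split=3000)
null_acc = holdout_accuracy(xs, shuffled, split=3000)
```

With real labels the toy model scores well above chance; with shuffled labels it should land near 50%. A shuffled-label score far above chance would suggest the evaluation itself is letting label information through.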
The numbers — all of them, with sample sizes
We cite n on everything. A number without its sample size is marketing, not evidence.
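One way to see why n matters is to put an interval around a hit rate. The sketch below uses the standard Wilson score interval at roughly 95% confidence; applying it to the headline figures is our illustration, and it assumes each pick is an independent trial, which is a simplification.

```python
import math

def wilson_interval(p_hat: float, n: int, z: float = 1.96):
    """Approximate 95% Wilson score interval for a proportion."""
    denom = 1 + z * z / n
    centre = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half

small = wilson_interval(0.849, 52)     # highest-confidence subset
large = wilson_interval(0.7159, 428)   # consensus picks
```

The interval for the n=52 figure comes out roughly twice as wide as the one for n=428, which is the practical meaning of "small sample".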
Accuracy (2025-26 holdout unless noted)
| Metric | Value | n | Notes |
|---|---|---|---|
| Highest-confidence subset | 84.9% | 52 | Our top-conviction picks; small sample — see caveat below |
| 2026 fights | 75.6% | 90 | V14 model, frozen 2026-01-01; no 2026 data in training |
| UNANIMOUS picks (all three models agree) | 76.21% | 311 | High-confidence cohort (ROI sample n=307) |
| Consensus (accuracy-weighted ensemble) | 71.59% | 428 | The all-picks ensemble number |
| Stacking ensemble (4-input) | 70.66% | 1,060 | Production stack, 2023-26 full holdout (2025-26 subset: 69.86%, n=428) |
| V15 (best single model) | 69.05% | 428 | Strongest standalone voice |
These are backtest/holdout figures. The 2026-only number sits a touch above the full-holdout consensus because 2026 has been a higher-favorite stretch — chalkier cards are easier to call. Across the complete 2025-26 holdout the four-model consensus is the top performer; that's exactly why the blend, not any single model, is the product. How the flagged picks have performed live is covered below, under "The live trial."
Return on investment (real-market closing odds, 1u flat unless noted)
| Metric | Value | n | Basis |
|---|---|---|---|
| High-confidence subset | +14.29% | 52 | Book median closing odds |
| ¼-Kelly stacking ensemble | $1,000 → $29,070 (+2,807%) | 706 bets | 2023-26 — with a 59% peak drawdown |
How the high-confidence picks have performed live is covered below, under "The live trial."
On the Kelly number: a 59% peak drawdown means that, at the worst point, a ¼-Kelly bankroll was down 59% from its high before recovering. The growth figure is real; so is the stomach you'd need to sit through the drawdown. We state both in the same breath, always.
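For readers who want the mechanics, here is a sketch of both quantities: the fractional Kelly stake and the peak-to-trough drawdown. The Kelly formula is the standard one for a binary bet; the bankroll path is made up to land on a 59% drawdown and is not FightIQ's actual equity curve.

```python
def kelly_fraction(p_win: float, decimal_odds: float, multiplier: float = 0.25) -> float:
    """Fraction of bankroll to stake; multiplier=0.25 gives quarter-Kelly."""
    b = decimal_odds - 1.0              # net profit per unit staked
    full = (b * p_win - (1.0 - p_win)) / b
    return max(0.0, full) * multiplier  # never stake on a negative edge

def max_drawdown(equity: list[float]) -> float:
    """Largest peak-to-trough decline, as a fraction of the running peak."""
    peak, worst = equity[0], 0.0
    for value in equity:
        peak = max(peak, value)
        worst = max(worst, (peak - value) / peak)
    return worst

curve = [1000, 1400, 574, 900, 2000]  # made-up bankroll path
print(max_drawdown(curve))             # 0.59: down 59% from the 1400 peak
```

Note that the drawdown is measured from the running high, not from the starting bankroll, so a curve can finish well up and still have passed through a deep trough.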
On ROI basis: all ROI figures are computed against actual sportsbook closing prices — vig included — not hypothetical fair odds. Fair-odds returns look much better and mean much less; we don't headline them.
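Flat-stake ROI against closing prices reduces to a short calculation. In this sketch the bets are invented for illustration and the function names are hypothetical; the payout rule for American odds is standard.

```python
def profit_per_unit(american_odds: int, won: bool) -> float:
    """Profit on a 1-unit stake at American odds; a loss costs the full unit."""
    if not won:
        return -1.0
    return 100 / -american_odds if american_odds < 0 else american_odds / 100

def flat_roi(bets) -> float:
    """ROI for 1u flat staking: total profit divided by total units staked."""
    return sum(profit_per_unit(odds, won) for odds, won in bets) / len(bets)

# Made-up bets: (closing odds on our pick, did it win?)
bets = [(-200, True), (-200, True), (150, False), (-400, True)]
roi = flat_roi(bets)
```

Three wins out of four yields only +6.25% here, because short-priced favorites pay little. That is why accuracy and ROI are reported separately.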
The live trial — calling cards before they happen
Backtests are necessary but not sufficient, so since April 2026 we've called every card before it happens. FightIQ has never said "bet every fight" — the tier system exists to tell you which picks to trust, and the high-confidence and unanimous-consensus picks are the product.
On those flagged picks, the live record is strong. Across 8 live cards (Apr 11 – Jun 7 2026), the model has gone 26 for 31 (83.9%) on high-confidence calls. The full picture, with every stratum stated: 62.0% on all picks (44 of 71), covering every fight the system has called this year, including the toss-up and low-confidence tiers we explicitly tell readers to skip. On the stack-era subset (cards run with the production 4-model stack), the figure is 66.1% (39 of 59).
We lead with the high-confidence number because that's the tier the product actually points readers at; the all-picks number is on this page because we don't hide it. The tier system exists precisely so the lower-confidence picks can be filtered out — the live record matters most on the picks the model itself flags.
The honest caveat: it is still a young live sample, and UFC is high-variance, so even the high-confidence tier loses sometimes. But that tier is where the model's real signal shows, and it is the track record we actually stand behind.
How good can a UFC model actually get?
There's a hard ceiling on UFC prediction: around 65–70% on the winner for anyone working from public fight data, and the betting market itself sits right at that line. FightIQ operates at that frontier: our consensus hits 71.59% on the 2025-26 holdout (n=428), and our highest-confidence picks reach 84.9% (n=52). The difference between us and a confident-sounding competitor isn't a bigger number; it's that ours is walk-forward validated, leak-audited three times, and published with its sample size. When a site claims 90%+ on real fights, that's almost always future-data leaking in. We show you a real number at the ceiling instead of a fantasy above it.
We're sharpest on picking winners — that's the product. We also model the method (KO / submission / decision) and whether a fight goes the distance; those beat chance comfortably (method ~53% vs 33% random; distance ~61% vs 50%) but we treat them as supporting signals, not headline claims.
Why you can trust these numbers
Most prediction sites quietly inflate their numbers and never tell you when they were wrong. We do the opposite — and it's the whole point of the brand.
When we found a bug in how career submission stats were counted, we fixed it across every model and rebuilt every prediction. When an internal audit caught a data-join error that had flattered our returns, we corrected 23,575 rows and revised our numbers down — the figures on this page are the honest, post-correction ones. We publish our validation method, our sample sizes, and the corrections we've made to our own data, because a track record you can actually check is worth more than one you can't.