Benchmark · reproducible
We measured what happens when an LLM does your financial math.
We asked four production OpenAI models to compute IRR, XIRR, NPV, a DCF valuation, and a private-equity distribution waterfall — with no calculator, just like an agent that “does the math itself.” The reference answers come from the same deterministic engine behind our live finance endpoints. The gaps are not rounding noise.
How we measured it
- Ground truth = our shipped engine. Every “correct” value is computed by the deterministic code behind our live CashflowLens and WaterfallLens endpoints — and matches the textbook / Excel result.
- No tools, no calculator. Each model gets the problem as plain text and must answer with the number — the same situation as an agent computing inline instead of calling an API.
- Pinned model snapshots, temperature 0 where supported. GPT-4o mini, GPT-4o, GPT-4.1, GPT-5.1. A miss counts only when it exceeds a generous per-problem tolerance: we are catching material errors, not the last basis point.
- Fully reproducible. The exact prompts are below; the harness lives at scripts/benchmark/run-benchmark.mjs. Generated June 22, 2026.
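A grading step consistent with these rules might look like the sketch below. It is not the harness code: `parseAnswer`, `isMiss`, and the tolerance values in the usage lines are hypothetical, and it assumes each reply ends with the `ANSWER: <number>` line that the prompts request.

```typescript
// Hypothetical grading helpers; the real harness may differ.

/** Extract the number from a final "ANSWER: <number>" line, or null if absent. */
function parseAnswer(reply: string): number | null {
  const lines = reply.trim().split(/\r?\n/);
  const last = (lines[lines.length - 1] ?? "").trim();
  const match = /^ANSWER:\s*(-?\d+(?:\.\d+)?)$/.exec(last);
  return match ? Number(match[1]) : null;
}

/** A reply is a miss when it has no parsable answer or falls outside the tolerance. */
function isMiss(reply: string, exact: number, tolerance: number): boolean {
  const value = parseAnswer(reply);
  return value === null || Math.abs(value - exact) > tolerance;
}

// Usage with made-up tolerances (the published per-problem tolerances are not shown here).
const closeEnough = isMiss("brief working\nANSWER: 37.40", 37.34, 0.1); // false
const material = isMiss("brief working\nANSWER: 40.00", 37.34, 0.1);    // true
```

Treating an unparsable reply as a miss is a design choice made for this sketch, not something the page states.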
Example prompt (verbatim)
Compute the XIRR (annualised internal rate of return on irregularly dated cash flows) for:
2008-01-01: -10000
2008-03-01: +2750
2008-10-30: +4250
2009-02-15: +3250
2009-04-01: +2750
Use actual day counts (a 365-day year). Answer as the annual rate as a percentage (e.g. 37.34 means 37.34%).
Do not use a calculator, code interpreter, or any tool — compute it yourself. You may show brief working, but the LAST line of your reply must be exactly:
ANSWER: <number>
Give only the bare number on that line (no units, no commas, no % sign, no $).
The results
The GPT-4o-class models most agents call by default were wrong on 58% of attempts. Each cell shows the model’s answer against the exact value.
| Problem | Exact (our engine) | GPT-4o mini | GPT-4o | GPT-4.1 | GPT-5.1 |
|---|---|---|---|---|---|
| Periodic IRR, 3 equal inflows (IRR · easy) | 23.38% | 50.00% wrong | 278.73% wrong | 23.40% correct | 23.40% correct |
| XIRR on irregular dates, Excel canonical (XIRR · hard) | 37.34% | 15.00% wrong | 40.00% wrong | 2612.00% wrong | 37.40% correct |
| NPV at 10% (NPV · easy) | $243.43 | $244.66 wrong | $243.43 correct | $243.33 correct | $243.05 correct |
| DCF enterprise value, Gordon-growth terminal (DCF · medium) | $1,207 | $1,207 correct | $1,207 correct | $1,207 correct | $1,208 correct |
| DCF implied share price, exit multiple + net-debt bridge (DCF · hard) | $8.00 | $8.00 correct | $8.00 correct | $8.00 correct | $8.00 correct |
| LP/GP distribution waterfall — GP carry dollars (Waterfall · hard) | $1,600,000 | $3,600,000 wrong | $1,412,268 wrong | $1,600,000 correct | $1,600,000 correct |
| Correct / 6 | | 2/6 | 3/6 | 5/6 | 6/6 |
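The 58% headline can be re-derived from the "Correct / 6" row. The sketch below assumes that "GPT-4o-class" means GPT-4o mini and GPT-4o together, which the page does not spell out; under that reading the tally is 7 misses in 12 attempts. The names `correctOutOfSix` and `missRate` are illustrative.

```typescript
// Correct counts per model, copied from the "Correct / 6" row of the table.
const correctOutOfSix: Record<string, number> = {
  "GPT-4o mini": 2,
  "GPT-4o": 3,
  "GPT-4.1": 5,
  "GPT-5.1": 6,
};

/** Share of wrong answers across the given models, six problems each. */
function missRate(models: string[]): number {
  const attempts = models.length * 6;
  const correct = models.reduce((sum, m) => sum + (correctOutOfSix[m] ?? 0), 0);
  return (attempts - correct) / attempts;
}

// Assumption: the GPT-4o class is these two models.
const gpt4oClass = missRate(["GPT-4o mini", "GPT-4o"]); // 7 misses / 12 attempts
console.log(`${Math.round(gpt4oClass * 100)}%`); // prints "58%"
```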
Anatomy of a confident wrong answer
The failures are not “I’m not sure.” The models show clean, plausible working and hand back a precise, wrong number. Verbatim tails:
GPT-4o · Periodic IRR, 3 equal inflows
Subtracting the initial investment:
\[
1036.98 - 1000 = 36.98
\]
The NPV is positive, so the IRR is slightly less than 21%. Let's try \( r = 21.5\% \) or \( 0.215 \):
\[
-1000 + \frac{500}{1.215} + \frac{500}{1.215^2} + \frac{500}{1.215^3}
\]
Calculating each term:
\[
\frac{500}{1.215} \approx 411.52
\]
\[
\frac{500}{1.215^2} \approx 338.73
\]
\[
\frac{500}{1.215^3} \approx 278.73
GPT-4.1 · XIRR on irregular dates (Excel canonical)
04/365 ≈ 0.833
- 412/365 ≈ 1.129
- 457/365 ≈ 1.253
Now, (1+r):
- 1.37
Now, compute each term:
- (1.37)^0.164 ≈ e^{0.164 * ln(1.37)} ≈ e^{0.164 * 0.3147} ≈ e^{0.0516} ≈ 1.0529
- (1.37)^0.833 ≈ e^{0.833 * 0.3147} ≈ e^{0.2622} ≈ 1.2998
- (1.37)^1.129 ≈ e^{1.129 * 0.3147} ≈ e^{0.3556} ≈ 1.4269
- (1.37)^1.253 ≈ e^{1.253 * 0.3147} ≈ e^{0.3946} ≈ 1.4840
Now, each cash flow:
- 2750 / 1.0529 ≈ 2612
-
GPT-4o · LP/GP distribution waterfall — GP carry dollars
+ 4,693,300 + 938,660 = 15,631,960 \]
Remaining distributable cash:
\[ 18,000,000 - 15,631,960 = 2,368,040 \]
Finally, we calculate the carried interest:
The remainder splits 80% to LPs and 20% to the GP.
GP's share of the carried interest:
\[ 0.20 \times 2,368,040 = 473,608 \]
Total dollars the GP receives (catch-up plus carried interest):
\[ 938,660 + 473,608 = 1,412,268 \]
ANSWER: 1412268
“But the best model got them all right.”
True. GPT-5.1 passed 6 of 6 within tolerance. That is the honest result, and the right way to read it is this: the answer is not “LLMs are dumb.” It is that you cannot tell, at call time, whether you got the GPT-5.1 answer or the 279% one. And even the model that passed:
- Wasn’t exact. On Periodic IRR, 3 equal inflows it returned 23.40% versus an exact 23.38%. Close enough to pass, not close enough to book.
- Was slow. It took up to 27.0s on a single problem and burned 4,696 tokens across the six — every call, every time.
- Gave no determinism and no audit trail. Re-run it and the number can drift. There is no per-rule evidence, no guarantee the same inputs return the same result.
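For contrast, the exact 23.38% in the first bullet takes only a few lines of deterministic code. The sketch below is a textbook bisection, not Hermes Plant's engine; the cash flows (1,000 out, then three inflows of 500) are read off the GPT-4o working quoted earlier, and the search bracket of 0% to 100% is an assumption that suits this one problem.

```typescript
/** Net present value of period-indexed cash flows (index 0 is today). */
function npv(rate: number, flows: number[]): number {
  return flows.reduce((sum, cf, t) => sum + cf / Math.pow(1 + rate, t), 0);
}

/** IRR by bisection; assumes NPV is positive at `lo` and negative at `hi`. */
function irr(flows: number[], lo = 0, hi = 1): number {
  for (let i = 0; i < 100; i++) {
    const mid = (lo + hi) / 2;
    // For an outflow followed by inflows, NPV falls as the rate rises.
    if (npv(mid, flows) > 0) {
      lo = mid;
    } else {
      hi = mid;
    }
  }
  return (lo + hi) / 2;
}

const irrFlows = [-1000, 500, 500, 500];
console.log((irr(irrFlows) * 100).toFixed(2)); // prints "23.38"
```

The same inputs give the same output on every run, which is the property the bullets above say the model call lacks.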
Give the agent a deterministic tool instead
The same problems, computed by a unit-tested function that always returns the same audited result — with the evidence behind every number — for a few cents a call over x402. No data feed, no API key, no guessing.
POST https://hermesplant.com/api/agent-services/cashflowlens/analyze
{ "cashflows": [
{ "amount": -10000, "date": "2008-01-01" },
{ "amount": 2750, "date": "2008-03-01" },
{ "amount": 4250, "date": "2008-10-30" },
{ "amount": 3250, "date": "2009-02-15" },
{ "amount": 2750, "date": "2009-04-01" } ] }
→ 200 OK
{ "summary": { "xirr": 0.373363, ... },
"findings": [ { "rule": "mode", "evidence": "mode=dated ..." } ] }
The exact XIRR the models missed (37.34%), returned deterministically, with evidence, every time.
Frequently asked questions
- Do large language models get IRR and XIRR wrong?
- Often, yes. In this benchmark the GPT-4o-class models that agents most commonly call in a loop were wrong on 58% of attempts when asked to compute IRR, XIRR, NPV, DCF and a private-equity waterfall without a calculator. Errors were not rounding noise — they included a 279% IRR, a 2,612% XIRR, and a $2,000,000 miss on a fund's carried interest. The model states each wrong number with full confidence and no error flag.
- How is the 'correct' answer determined?
- The reference column is computed by the same deterministic engine that powers Hermes Plant's live, paid finance endpoints (CashflowLens and WaterfallLens). Each reference value matches the standard textbook / Excel result — for example the XIRR case reproduces Excel's canonical 37.34%.
- Isn't this just because the model couldn't use a calculator?
- That is exactly the point. An AI agent that 'does the math itself' is computing without a calculator on every call. The fix is to give the agent a deterministic tool to call. The frontier model that did pass every problem here was still slower, burned reasoning tokens per call, and was not bit-for-bit exact — none of which a $0.20 deterministic API call has to trade off.
- Can I reproduce these results?
- Yes. The exact prompts are published below and in the dataset, the model snapshots are pinned (e.g. gpt-4o-2024-08-06), and the harness (scripts/benchmark/run-benchmark.mjs) transpiles the live engine for ground truth and replays the calls. Results were generated June 22, 2026; LLM outputs can vary slightly between runs even at temperature 0.
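The XIRR reference value cited in the answers above can be checked independently. The sketch below is a plain bisection over the dated cash flows from the example prompt, using actual days over a 365-day year as that prompt specifies. It is not the CashflowLens implementation, and the names `DatedFlow`, `xnpv`, and `xirr` as well as the search bracket are assumptions made for illustration.

```typescript
interface DatedFlow {
  amount: number;
  date: string; // YYYY-MM-DD, parsed as UTC
}

const MS_PER_DAY = 86400000;

/** Present value of dated flows at an annual rate, discounting by days / 365 from the first date. */
function xnpv(rate: number, flows: DatedFlow[]): number {
  const [first] = flows;
  if (first === undefined) return 0;
  const t0 = Date.parse(first.date);
  return flows.reduce((sum, f) => {
    const years = (Date.parse(f.date) - t0) / MS_PER_DAY / 365;
    return sum + f.amount / Math.pow(1 + rate, years);
  }, 0);
}

/** XIRR by bisection; assumes XNPV is positive at `lo` and negative at `hi`. */
function xirr(flows: DatedFlow[], lo = -0.9, hi = 10): number {
  for (let i = 0; i < 200; i++) {
    const mid = (lo + hi) / 2;
    if (xnpv(mid, flows) > 0) {
      lo = mid;
    } else {
      hi = mid;
    }
  }
  return (lo + hi) / 2;
}

const excelExample: DatedFlow[] = [
  { amount: -10000, date: "2008-01-01" },
  { amount: 2750, date: "2008-03-01" },
  { amount: 4250, date: "2008-10-30" },
  { amount: 3250, date: "2009-02-15" },
  { amount: 2750, date: "2009-04-01" },
];
console.log((xirr(excelExample) * 100).toFixed(2)); // prints "37.34"
```

Bisection is used here instead of Newton's method because it cannot diverge once the bracket is valid, at the cost of more iterations.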
Ground truth computed by Hermes Plant's shipped finance engines (src/lib/cashflowlens-service.ts, waterfall-service.ts). LLM columns are real OpenAI chat-completions calls under a no-tools condition (no code interpreter), temperature 0 where supported. Reproduce with scripts/benchmark/run-benchmark.mjs. Model snapshots: gpt-4o-mini-2024-07-18, gpt-4o-2024-08-06, gpt-4.1-2025-04-14, gpt-5.1-2025-11-13.
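As a rough illustration of the no-tools condition described above, a request body for the chat-completions API could be assembled as follows. This is a sketch rather than the harness code: `buildRequest` and its `supportsTemperature` flag are hypothetical, and the page does not say which snapshots accept a temperature setting. Leaving out any `tools` field is what keeps the model from calling a calculator or code interpreter.

```typescript
interface ChatRequest {
  model: string;
  messages: { role: "user"; content: string }[];
  temperature?: number;
}

/** Build a chat-completions body with no tools; temperature is pinned to 0 only where supported. */
function buildRequest(model: string, prompt: string, supportsTemperature = true): ChatRequest {
  const body: ChatRequest = {
    model,
    messages: [{ role: "user", content: prompt }],
  };
  if (supportsTemperature) {
    body.temperature = 0;
  }
  return body;
}

// Usage: one pinned snapshot from the list above and a shortened prompt.
const request = buildRequest("gpt-4o-2024-08-06", "Compute the XIRR ... ANSWER: <number>");
console.log(JSON.stringify(request));
```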