Hermes Plant

Finance & quant APIs for AI agents

Benchmark · reproducible

We measured what happens when an LLM does your financial math.

We asked four production OpenAI models to compute IRR, XIRR, NPV, a DCF valuation, and a private-equity distribution waterfall — with no calculator, just like an agent that “does the math itself.” The reference answers come from the same deterministic engine behind our live finance endpoints. The gaps are not rounding noise.

  • 279%: GPT-4o reported this as the IRR. Correct: 23.38%.
  • 2612%: GPT-4.1 reported this as the XIRR. Correct: 37.34%.
  • $2,000,000: GPT-4o mini overstated the GP's carry by this amount on a single fund waterfall.

How we measured it

  • Ground truth = our shipped engine. Every “correct” value is computed by the deterministic code behind our live CashflowLens and WaterfallLens endpoints — and matches the textbook / Excel result.
  • No tools, no calculator. Each model gets the problem as plain text and must answer with the number — the same situation as an agent computing inline instead of calling an API.
  • Pinned model snapshots, temperature 0 where supported. GPT-4o mini, GPT-4o, GPT-4.1, GPT-5.1. A miss counts only when it exceeds a generous per-problem tolerance: we are catching material errors, not the last basis point.
  • Fully reproducible. The exact prompts are below; the harness lives at scripts/benchmark/run-benchmark.mjs. Generated June 22, 2026.
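
As a rough illustration of the scoring rule described above, the sketch below pulls the number from the final `ANSWER:` line of a reply and marks it correct only when it falls within a tolerance. It is not the code in scripts/benchmark/run-benchmark.mjs. The function names and the 0.25 percentage-point tolerance are hypothetical, chosen only for the example; the real per-problem tolerances are not stated on this page.

```javascript
// Hypothetical helper: extract the number from the final "ANSWER: <number>" line.
// Returns null when the last line does not follow the required format.
function parseAnswer(reply) {
  const lines = reply.trim().split(/\r?\n/);
  const last = lines[lines.length - 1].trim();
  const match = /^ANSWER:\s*(-?\d+(?:\.\d+)?)$/.exec(last);
  return match ? Number(match[1]) : null;
}

// A miss is counted only when the absolute error exceeds the tolerance.
// The tolerance value passed in is an assumption, not the benchmark's setting.
function isWithinTolerance(reported, exact, tolerance) {
  return reported !== null && Math.abs(reported - exact) <= tolerance;
}

const reply = "Brief working...\nANSWER: 37.40";
const reported = parseAnswer(reply);
console.log(reported, isWithinTolerance(reported, 37.34, 0.25));
```

Under this assumed tolerance, 37.40 against an exact 37.34 would pass while 2612 would not.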

Example prompt (verbatim)

Compute the XIRR (annualised internal rate of return on irregularly dated cash flows) for:
  2008-01-01: -10000
  2008-03-01: +2750
  2008-10-30: +4250
  2009-02-15: +3250
  2009-04-01: +2750

Use actual day counts (a 365-day year). Answer as the annual rate as a percentage (e.g. 37.34 means 37.34%).

Do not use a calculator, code interpreter, or any tool — compute it yourself. You may show brief working, but the LAST line of your reply must be exactly:
ANSWER: <number>
Give only the bare number on that line (no units, no commas, no % sign, no $).
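
For readers who want to check the reference value for this prompt themselves, here is a minimal sketch of an XIRR calculation using actual day counts over a 365-day year. It is not the CashflowLens engine. It uses plain bisection, and the function names and search bounds are assumptions made for the example.

```javascript
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Net present value of dated cash flows, discounting each flow by
// (days since the first flow) / 365 years.
function xnpv(rate, flows) {
  const t0 = Date.parse(flows[0].date); // ISO date-only strings parse as UTC
  return flows.reduce((sum, flow) => {
    const years = (Date.parse(flow.date) - t0) / MS_PER_DAY / 365;
    return sum + flow.amount / Math.pow(1 + rate, years);
  }, 0);
}

// Bisection search for the rate where xnpv crosses zero.
// Assumes xnpv changes sign exactly once between lo and hi.
function xirr(flows, lo = -0.99, hi = 10) {
  let fLo = xnpv(lo, flows);
  for (let i = 0; i < 200; i++) {
    const mid = (lo + hi) / 2;
    const fMid = xnpv(mid, flows);
    if (fLo * fMid <= 0) {
      hi = mid;
    } else {
      lo = mid;
      fLo = fMid;
    }
  }
  return (lo + hi) / 2;
}

const promptFlows = [
  { date: "2008-01-01", amount: -10000 },
  { date: "2008-03-01", amount: 2750 },
  { date: "2008-10-30", amount: 4250 },
  { date: "2009-02-15", amount: 3250 },
  { date: "2009-04-01", amount: 2750 },
];
console.log((xirr(promptFlows) * 100).toFixed(2)); // prints "37.34"
```

Bisection is slower than Newton's method but cannot diverge once the bounds bracket the root, which makes it a reasonable choice for a checking script.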

The results

The GPT-4o-class models most agents call by default were wrong on 58% of attempts. Each cell shows the model’s answer against the exact value.

| Problem | Type · difficulty | Exact (our engine) | GPT-4o mini | GPT-4o | GPT-4.1 | GPT-5.1 |
|---|---|---|---|---|---|---|
| Periodic IRR, 3 equal inflows | IRR · easy | 23.38% | 50.00% (wrong) | 278.73% (wrong) | 23.40% (correct) | 23.40% (correct) |
| XIRR on irregular dates (Excel canonical) | XIRR · hard | 37.34% | 15.00% (wrong) | 40.00% (wrong) | 2612.00% (wrong) | 37.40% (correct) |
| NPV at 10% | NPV · easy | $243.43 | $244.66 (wrong) | $243.43 (correct) | $243.33 (correct) | $243.05 (correct) |
| DCF enterprise value, Gordon-growth terminal | DCF · medium | $1,207 | $1,207 (correct) | $1,207 (correct) | $1,207 (correct) | $1,208 (correct) |
| DCF implied share price (exit multiple + net-debt bridge) | DCF · hard | $8.00 | $8.00 (correct) | $8.00 (correct) | $8.00 (correct) | $8.00 (correct) |
| LP/GP distribution waterfall, GP carry dollars | Waterfall · hard | $1,600,000 | $3,600,000 (wrong) | $1,412,268 (wrong) | $1,600,000 (correct) | $1,600,000 (correct) |
| Correct / 6 | | | 2/6 | 3/6 | 5/6 | 6/6 |
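
The per-model totals and the 58% figure follow directly from the correct/wrong marks in the table. The short sketch below redoes that arithmetic. The marks are transcribed from the table in row order, and the variable and function names are made up for the example.

```javascript
// true = correct, false = wrong, in the table's row order:
// IRR, XIRR, NPV, DCF enterprise value, DCF share price, waterfall.
const marks = {
  "GPT-4o mini": [false, false, false, true, true, false],
  "GPT-4o": [false, false, true, true, true, false],
  "GPT-4.1": [true, false, true, true, true, true],
  "GPT-5.1": [true, true, true, true, true, true],
};

// Number of correct answers for one model.
function correctCount(model) {
  return marks[model].filter(Boolean).length;
}

// Share of wrong answers pooled across a group of models.
function missRate(models) {
  const pooled = models.flatMap((model) => marks[model]);
  return 1 - pooled.filter(Boolean).length / pooled.length;
}

for (const model of Object.keys(marks)) {
  console.log(`${model}: ${correctCount(model)}/6`);
}
console.log(Math.round(missRate(["GPT-4o mini", "GPT-4o"]) * 100) + "%"); // prints "58%"
```

GPT-4o mini and GPT-4o together missed 7 of 12 attempts, which rounds to the 58% quoted above.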

Anatomy of a confident wrong answer

The failures are not “I’m not sure.” The models show clean, plausible working and hand back a precise, wrong number. Verbatim tails:

GPT-4o · Periodic IRR, 3 equal inflows

reported 278.73% · exact 23.38%
Subtracting the initial investment:

\[ 
1036.98 - 1000 = 36.98 
\]

The NPV is positive, so the IRR is slightly less than 21%. Let's try \( r = 21.5\% \) or \( 0.215 \):

\[ 
-1000 + \frac{500}{1.215} + \frac{500}{1.215^2} + \frac{500}{1.215^3} 
\]

Calculating each term:

\[ 
\frac{500}{1.215} \approx 411.52 
\]

\[ 
\frac{500}{1.215^2} \approx 338.73 
\]

\[ 
\frac{500}{1.215^3} \approx 278.73
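
For comparison, the same search takes a few lines when done deterministically. The sketch below uses the cash flows visible in the quoted working (an outflow of 1000 followed by three inflows of 500) and finds the rate where NPV crosses zero by bisection. It illustrates the technique and is not the engine's implementation; the function names and search bounds are assumptions. A positive NPV at a trial rate means the IRR lies above that rate, not below it, and the number the model reported, 278.73, is the third discounted term rather than a rate.

```javascript
// NPV of periodic cash flows; cashflows[0] occurs at t = 0.
function npv(rate, cashflows) {
  return cashflows.reduce((sum, cf, t) => sum + cf / Math.pow(1 + rate, t), 0);
}

// Bisection for the IRR. Assumes NPV is positive at lo and negative at hi,
// which holds for one outflow followed by inflows.
function irr(cashflows, lo = 0, hi = 1) {
  for (let i = 0; i < 100; i++) {
    const mid = (lo + hi) / 2;
    if (npv(mid, cashflows) > 0) {
      lo = mid; // NPV still positive: the IRR is higher than mid
    } else {
      hi = mid;
    }
  }
  return (lo + hi) / 2;
}

const periodicFlows = [-1000, 500, 500, 500];
console.log(npv(0.215, periodicFlows) > 0); // prints true: 21.5% is below the IRR
console.log((irr(periodicFlows) * 100).toFixed(2)); // prints "23.38"
```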

GPT-4.1 · XIRR on irregular dates (Excel canonical)

reported 2612.00% · exact 37.34%
04/365 ≈ 0.833
- 412/365 ≈ 1.129
- 457/365 ≈ 1.253

Now, (1+r):

- 1.37

Now, compute each term:

- (1.37)^0.164 ≈ e^{0.164 * ln(1.37)} ≈ e^{0.164 * 0.3147} ≈ e^{0.0516} ≈ 1.0529
- (1.37)^0.833 ≈ e^{0.833 * 0.3147} ≈ e^{0.2622} ≈ 1.2998
- (1.37)^1.129 ≈ e^{1.129 * 0.3147} ≈ e^{0.3556} ≈ 1.4269
- (1.37)^1.253 ≈ e^{1.253 * 0.3147} ≈ e^{0.3946} ≈ 1.4840

Now, each cash flow:

- 2750 / 1.0529 ≈ 2612
-

GPT-4o · LP/GP distribution waterfall — GP carry dollars

reported $1,412,268 · exact $1,600,000
+ 4,693,300 + 938,660 = 15,631,960 \]

Remaining distributable cash:
\[ 18,000,000 - 15,631,960 = 2,368,040 \]

Finally, we calculate the carried interest:

The remainder splits 80% to LPs and 20% to the GP.

GP's share of the carried interest:
\[ 0.20 \times 2,368,040 = 473,608 \]

Total dollars the GP receives (catch-up plus carried interest):
\[ 938,660 + 473,608 = 1,412,268 \]

ANSWER: 1412268
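
The waterfall problem statement is not reproduced on this page, so the sketch below rests on assumptions and should not be read as the WaterfallLens logic. It takes $18,000,000 distributable and a preferred return of $4,693,300 from the quoted working, infers $10,000,000 of contributed capital from the quoted sum (15,631,960 − 4,693,300 − 938,660), and assumes a 100% GP catch-up to a 20% carry followed by an 80/20 split. Under those assumptions the GP total comes to the exact $1,600,000. The quoted working instead sizes the catch-up at 20% of the preferred return (938,660), which under a full catch-up would be 25% of it.

```javascript
// Hypothetical four-tier waterfall: return of capital, preferred return,
// 100% GP catch-up, then a residual split at the carry rate.
function gpCarry({ distributable, capital, preferredReturn, carryRate }) {
  let remaining = distributable;

  // Tier 1: return contributed capital to the LPs.
  remaining -= Math.min(remaining, capital);

  // Tier 2: pay the LPs' preferred return.
  const pref = Math.min(remaining, preferredReturn);
  remaining -= pref;

  // Tier 3: the GP takes 100% until it holds carryRate of (pref + catch-up).
  const catchUpTarget = (pref * carryRate) / (1 - carryRate);
  const catchUp = Math.min(remaining, catchUpTarget);
  remaining -= catchUp;

  // Tier 4: the GP takes carryRate of whatever is left.
  return catchUp + remaining * carryRate;
}

const carry = gpCarry({
  distributable: 18000000,
  capital: 10000000, // inferred from the quoted working, not stated on the page
  preferredReturn: 4693300, // figure taken from the quoted working
  carryRate: 0.2,
});
console.log(Math.round(carry)); // prints 1600000
```

With a full catch-up that completes, the GP ends up with 20% of total profit (18,000,000 − 10,000,000), so the result does not depend on the exact preferred-return figure.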

“But the best model got them all right.”

True: GPT-5.1 passed 6 of 6 within tolerance. That is the honest result, and it is the right way to read this: the answer is not “LLMs are dumb.” It is that you cannot tell, at call time, whether you got the GPT-5.1 answer or the 279% one. And even the model that passed:

  • Wasn’t exact. On Periodic IRR, 3 equal inflows it returned 23.40% versus an exact 23.38%. Close enough to pass, not close enough to book.
  • Was slow. It took up to 27.0s on a single problem and burned 4,696 tokens across the six — every call, every time.
  • Gave no determinism and no audit trail. Re-run it and the number can drift. There is no per-rule evidence, no guarantee the same inputs return the same result.

Give the agent a deterministic tool instead

The same problems, computed by a unit-tested function that always returns the same audited result — with the evidence behind every number — for a few cents a call over x402. No data feed, no API key, no guessing.

POST https://hermesplant.com/api/agent-services/cashflowlens/analyze

{ "cashflows": [
    { "amount": -10000, "date": "2008-01-01" },
    { "amount": 2750,  "date": "2008-03-01" },
    { "amount": 4250,  "date": "2008-10-30" },
    { "amount": 3250,  "date": "2009-02-15" },
    { "amount": 2750,  "date": "2009-04-01" } ] }

→ 200 OK
{ "summary": { "xirr": 0.373363, ... },
  "findings": [ { "rule": "mode", "evidence": "mode=dated ..." } ] }

The exact XIRR the models missed — 37.34% — returned deterministically, with evidence, every time.
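
An agent-side call might look like the sketch below. It uses the endpoint URL and the request and response shapes shown above; everything else is an assumption. In particular, the `Content-Type` header and the helper names are hypothetical, and the x402 payment step is not shown, so a real call would need to handle the payment flow before it receives a 200.

```javascript
const ENDPOINT = "https://hermesplant.com/api/agent-services/cashflowlens/analyze";

// Build the fetch options for a CashflowLens request (header is an assumption).
function buildRequest(cashflows) {
  return {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ cashflows }),
  };
}

// Call the endpoint; fetchImpl can be swapped for a payment-aware wrapper.
async function analyzeCashflows(cashflows, fetchImpl = globalThis.fetch) {
  const response = await fetchImpl(ENDPOINT, buildRequest(cashflows));
  if (!response.ok) {
    // Payment handling (x402) is outside the scope of this sketch.
    throw new Error(`CashflowLens request failed with status ${response.status}`);
  }
  return response.json();
}

// Convert the fractional xirr in the response summary to a percentage string.
function toPercent(fraction) {
  return (fraction * 100).toFixed(2);
}

console.log(toPercent(0.373363)); // prints "37.34"
```

Passing the fetch implementation as a parameter keeps the payment logic separate from the request shape, and lets the function be exercised with a stub in tests.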

Frequently asked questions

Do large language models get IRR and XIRR wrong?
Often, yes. In this benchmark the GPT-4o-class models that agents most commonly call in a loop were wrong on 58% of attempts when asked to compute IRR, XIRR, NPV, DCF and a private-equity waterfall without a calculator. Errors were not rounding noise — they included a 279% IRR, a 2,612% XIRR, and a $2,000,000 miss on a fund's carried interest. The model states each wrong number with full confidence and no error flag.
How is the 'correct' answer determined?
The reference column is computed by the same deterministic engine that powers Hermes Plant's live, paid finance endpoints (CashflowLens and WaterfallLens). Each reference value matches the standard textbook / Excel result — for example the XIRR case reproduces Excel's canonical 37.34%.
Isn't this just because the model couldn't use a calculator?
That is exactly the point. An AI agent that 'does the math itself' is computing without a calculator on every call. The fix is to give the agent a deterministic tool to call. The frontier model that did pass every problem here was still slower, burned reasoning tokens per call, and was not bit-for-bit exact — none of which a $0.20 deterministic API call has to trade off.
Can I reproduce these results?
Yes. The exact prompts are published on this page and in the dataset, the model snapshots are pinned (e.g. gpt-4o-2024-08-06), and the harness (scripts/benchmark/run-benchmark.mjs) transpiles the live engine for ground truth and replays the calls. Results were generated June 22, 2026; LLM outputs can vary slightly between runs even at temperature 0.

Ground truth computed by Hermes Plant's shipped finance engines (src/lib/cashflowlens-service.ts, waterfall-service.ts). LLM columns are real OpenAI chat-completions calls under a no-tools condition (no code interpreter), temperature 0 where supported. Reproduce with scripts/benchmark/run-benchmark.mjs. Model snapshots: gpt-4o-mini-2024-07-18, gpt-4o-2024-08-06, gpt-4.1-2025-04-14, gpt-5.1-2025-11-13.