How we tested ourselves

We graded ourselves, and published the whole table.

Before selling a single report, we ran the Verdict engine blind against three published investigations: two forensic short reports and one investigative feature. Then we compared the output to the published work, finding by finding. This page is the comparison, including what we missed.

This is an internal comparison, measured against publicly available published reports. No third party graded it. We're publishing the full finding-by-finding table so you can check our work.

The protocol

Run blind. Then compare.

For each test, the engine researched the company with no access to the reference investigation, produced a full report, and only then was the output scored against the published work: which findings it matched, which it added, and which it missed. Every report is scored on the same six axes. This is the rubric behind any score you see from us.

Axis	What it measures	Pass bar
Verdict direction	Did the call match the reference's direction? Matching or more conservative passes; more optimistic fails. Over-caution is acceptable, over-optimism is not.	match / more cautious
Red-flag recall	Of the reference's key red flags, the share the engine surfaced independently.	≥ 70%
Non-obvious adds	Material findings the engine surfaced that the reference did not.	≥ 2
Hallucinations	Factual claims contradicted by their own cited sources, via the machine-checked claim table.	0 load-bearing
Theme completeness	Narrative threads present in the engine's own sources that made it into the report.	no whole-thread miss
Consistency	Dates, entities, and numbers agree across the report, enforced by a deterministic linter.	0 errors

The comparison

Three benchmarks. Finding by finding.

The three references are excellent professional work; that is exactly why we measured against them. Where a reference makes allegations about a company, they remain that publication's allegations. The test measures one thing only: whether Verdict, run blind, independently surfaced the same signals.

Case 01 FTAI Aviation vs. Muddy Waters Research, short report, Jan 2025 · run blind Jun 2026

A forensic short thesis on a NASDAQ-listed aerospace company, built on revenue-recognition and depreciation mechanics. Verdict, run blind, reached the same cautious direction with two named disqualifiers, and independently found a second published short thesis and a securities class action the brief never mentioned to it.

Matched independently

The per-unit economics gap at the center of the thesis: booked revenue per unit far above what customers were actually charged
Peer-benchmarked anomalies: revenue and EBITDA per employee multiples out of line with named comparables
The channel-stuffing and depreciation mechanisms the thesis alleged
A cautious verdict forced by live, unresolved allegations

Added beyond the reference

A sized valuation-downside range
The GAAP free-cash-flow vs. adjusted-EBITDA contradiction
A litigation-exposure band incl. the class action
That TAM and share figures were computed from the disputed revenue itself, so they cannot corroborate it

Missed or left open

AR, inventory, and DSO tests were flagged as needed but not pulled
Live short-interest data remained an estimate
Some mechanisms corroborated structurally, not line-item proven

Coverage 9.5 Accuracy 9.0 Sharpness 9.5 First single-pass version: B−. The shipped multi-pass engine is what closed the gap. Self-scored · rubric above

Case 02 Amprius Technologies vs. Manatee Research, short report, May 2026 · run blind Jul 2026

A forensic short thesis on a NYSE-listed battery maker. Verdict, run blind, independently surfaced all ten of the red flags the published thesis rested on, several reconstructed directly from SEC filings rather than taken from anyone's allegations.

Matched independently · 10 of 10

A related-party structure, reconstructed from board-interlock disclosures in the filings
The bill-and-hold revenue pattern, with receivables exceeding quarterly revenue in the company's own reported numbers
Customer-equipment seizure, executive litigation, insider selling, and dilution the thesis flagged

Added beyond the reference

An export-control and CFIUS collision risk
A full valuation workup against named peers
Cash-runway math and the competitive scale threat
Offsetting strengths the short thesis had no reason to include

Missed or left open

Four secondary specifics: a court-ruling detail, a small disclosed government investment, an image-provenance catch, one reconciliation line
The blind run's verdict was more cautious than the reference's conviction, because a primary-filings access gap made the engine refuse to echo what it could not verify. The gap is fixed; the re-run pulled the filings directly and hardened the call.

Coverage 9.0 Accuracy 9.0 Sharpness 8.0 Epistemic honesty 10 Self-scored · rubric above

Case 03 Mercor vs. a Forbes investigation, Apr 2026 · run blind Jul 2026

An investigative feature on a private, multi-billion-dollar AI staffing company: the hardest test, because private companies have no filings to lean on. Verdict, run blind, surfaced seven of the nine threads the investigation reported, and the two it missed are named below rather than smoothed over.

Matched independently · 7 of 9

The contractor-fraud reporting and the security-infiltration reporting the investigation covered
A reported data breach and a major customer fallout
Customer concentration, the gross-vs-net revenue framing, and the valuation timeline

Added beyond the reference

A take-rate sensitivity table with the exact trigger that flips the thesis
A coupled kill-condition the reporting had not connected
A trade-secret lawsuit and a margin tautology in the company's own framing

Missed, and admitted

The culture and attrition thread, even though it sat in the engine's own sources
The cofounder-discord thread
One date error survived the critique pass and was caught only in cross-reference

Coverage 9.0 Accuracy 8.0 Sharpness 9.0 Calibration 9.0 Self-scored · rubric above

What the misses changed

Every miss above became a pipeline stage.

The point of testing before selling was not the grade. It was that each documented failure forced a named, permanent fix in the engine that every client report now runs through.

Whole-thread miss (Mercor culture thread, present in its own sources)

FixA theme-mining pre-pass now inventories every narrative thread in the collected sources before writing begins, and the report is checked against that inventory.

A date error surviving critique (one wrong year in the Mercor run)

FixA deterministic consistency linter now blocks delivery until every date, entity, and number agrees across the report. It was mutation-tested against this exact bug.

A bundled critique going soft (fact-checking and sharpness competing in one pass)

FixThe critique was split into two independent passes: one attacks factual grounding, one attacks analytical sharpness.

Primary-source access gap (blocked filings softened the Amprius verdict)

FixDirect primary-filings access was repaired and validated by re-running the case: the engine pulled the actual filings and confirmed the load-bearing signals from source.

Read this before trusting it

The limits of this test.

The grader is us. Scores are self-issued against the rubric above. Treat them as a disclosed self-assessment, not an audit. No third party has graded Verdict yet; the first one to do so replaces this caveat.
Three cases is three cases. The set grows with every backtest a client runs against a deal they already closed, and the rubric stays fixed so results stay comparable.
The references stay theirs. Where a reference alleges wrongdoing, those are the publication's allegations, and where a company has disputed them, nothing here takes a side. We measured signal recall, not courtroom truth.
Comparison is not affiliation. None of the referenced publications are connected to Verdict in any way, and none have endorsed it.

Now grade us yourself.

The full Anysphere sample is public: the same document every client receives. Or run the free backtest against a deal you already closed and compare it to what your team found.

Read the sample report Run the free backtest

One flat fixed quote per engagement · free launch reports: 1 of 3 claimed