How this is done honestly

Methodology

This is a research project, and the rigour is the point. Most AI demonstrations show a number; this shows the method behind it, including where it fails. Each section is written plainly first, with the formal detail one click away.

Design Science Research

This is a research project, not just an engineering one. It follows an established method for building something useful and proving rigorously that it works.

The technical depth

The work is framed as Design Science Research, mapped onto the six steps of Peffers et al. (2007) and the three cycles of Hevner (relevance, design, rigour). Banking interviews form the relevance cycle; the reconstructions and the build form the design cycle; the statistics, pre-registration and threats analysis form the rigour cycle. The primary contribution is positioned as an exaptation (the evaluation methodology), with the system as an improvement contribution that instantiates it.

Hevner et al. (2004); Peffers et al. (2007); Gregor and Hevner (2013)

Pre-registration

The hypotheses and the way success is measured are written down and dated before any experiment is run, so the results cannot be quietly reshaped to look good.

The technical depth

A frozen, dated pre-registration commits the hypotheses, the operationalised metrics, the analysis plan, and the stopping rules before the first FinanceBench run. This guards against hypothesising after the results are known (HARKing) and against selective reporting.
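One way to make a frozen document checkable is to record a cryptographic digest of it at freeze time. The following is a minimal sketch, not the project's actual mechanism: the function names `freeze_preregistration` and `verify_preregistration` and the record layout are hypothetical. A local hash only shows that the text has not changed since the record was made. Proving the date itself would need an external timestamp, such as a public registry entry.

```python
import hashlib
from datetime import date


def freeze_preregistration(text: str, frozen_on: date) -> dict:
    """Return a record holding the freeze date and a SHA-256 digest of the text."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return {"frozen_on": frozen_on.isoformat(), "sha256": digest}


def verify_preregistration(text: str, record: dict) -> bool:
    """Check that the text still matches the digest stored at freeze time."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return digest == record["sha256"]


# Illustrative content only; the real pre-registration text is not shown here.
plan = "H1: ...\nMetric: ...\nAnalysis plan: ...\nStopping rule: ..."
record = freeze_preregistration(plan, date(2025, 1, 1))
unchanged = verify_preregistration(plan, record)
edited = verify_preregistration(plan + " (revised)", record)
```

Any later edit to the plan, however small, changes the digest, so a reshaped hypothesis would fail verification.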

Threats to validity

Three specific things can make AI-on-finance results look better than they are. Each is addressed head-on.

The technical depth

Data contamination: benchmark filings probably sit in the model's training data, so a memorisation test perturbs the numbers in each filing and checks whether accuracy survives. Where possible, the evaluation prefers filings that post-date the model's training cutoff.

Generation variance: results are reported across multiple seeds with confidence intervals, using paired tests (McNemar's test for paired right/wrong outcomes, a paired bootstrap or permutation test for graded scores) with multiple-comparison correction.

Judge validity: any automated judge is validated against a hand-labelled stratified sample, reporting both its agreement with the human labels (Cohen's kappa or Krippendorff's alpha) and its error rate.
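Two of the statistics named here are small enough to sketch directly. The code below is an illustrative, standard-library version of an exact two-sided McNemar test on paired right/wrong outcomes, and of Cohen's kappa between an automated judge and human labels. The function names and the toy data are assumptions made for the example, not the project's code or results.

```python
from math import comb


def mcnemar_exact(baseline_correct, system_correct):
    """Exact two-sided McNemar test; returns (b, c, p_value).

    b counts items the baseline got right and the system got wrong,
    c counts the reverse. Items where both agree do not affect the test.
    """
    pairs = list(zip(baseline_correct, system_correct))
    b = sum(1 for x, y in pairs if x and not y)
    c = sum(1 for x, y in pairs if not x and y)
    n = b + c
    if n == 0:
        return b, c, 1.0
    # Under the null, the discordant pairs split as Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two raters on the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    categories = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(k) / n) * (labels_b.count(k) / n) for k in categories
    )
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Toy paired outcomes: 1 item only the baseline gets right, 9 only the system.
baseline = [True] * 1 + [False] * 9 + [True] * 5
system = [False] * 1 + [True] * 9 + [True] * 5
b, c, p_value = mcnemar_exact(baseline, system)

# Toy judge-versus-human labels (1 = answer marked correct).
judge = [1, 1, 1, 0, 0, 0, 1, 0]
human = [1, 1, 0, 0, 0, 0, 1, 1]
kappa = cohens_kappa(judge, human)
```

When several system variants are compared, each resulting p-value would still need the multiple-comparison correction mentioned above, for example Holm's procedure.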

Calibrated abstention

When the system says it is unsure and declines to answer, that judgement is itself measured, so 'I don't know' can be trusted.

The technical depth

Abstention quality is measured with Expected Calibration Error, Brier score, and reliability diagrams. Conformal prediction is the stronger form: the system answers only when it can do so under a statistical coverage guarantee that keeps the error rate within a chosen bound.
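The two scalar calibration metrics can be sketched briefly. This is a minimal version that assumes each answer carries a confidence in [0, 1] and a right/wrong label, and that Expected Calibration Error uses equal-width bins. The bin count and the toy inputs are illustrative choices, not the project's settings.

```python
def brier_score(confidences, correct):
    """Mean squared gap between stated confidence and the 0/1 outcome."""
    return sum(
        (p - float(y)) ** 2 for p, y in zip(confidences, correct)
    ) / len(confidences)


def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average of |mean confidence - accuracy| over equal-width bins."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(confidences, correct):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the top bin
        bins[index].append((p, y))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(p for p, _ in members) / len(members)
        accuracy = sum(1 for _, y in members if y) / len(members)
        ece += (len(members) / n) * abs(mean_conf - accuracy)
    return ece


# Toy data: two confident answers, both right; two less confident, one right.
confidences = [0.9, 0.9, 0.6, 0.6]
correct = [True, True, True, False]
brier = brier_score(confidences, correct)
ece = expected_calibration_error(confidences, correct)
```

A reliability diagram plots the same per-bin mean confidence against per-bin accuracy, so a perfectly calibrated system sits on the diagonal.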

Reproducibility under model non-stationarity

Because the underlying AI models keep changing, the core results are designed so a stranger can reproduce them from a clean copy of the code.

The technical depth

Model versions and dates are pinned, seeds fixed, and the environment captured. Core claims run on open-weight frozen checkpoints, with closed API models used only as a comparator, so the headline findings do not rest on a moving target.
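A run manifest is one simple way to record what was pinned. The sketch below is an assumption about how that could look, not the project's tooling: `build_run_manifest`, its field names, and the placeholder model identifier and checksum are all hypothetical. It fixes the standard-library seed, captures basic environment details, and hashes the result so two runs can be compared.

```python
import hashlib
import json
import platform
import random


def build_run_manifest(model_id: str, checkpoint_sha256: str, seed: int) -> dict:
    """Fix the seed and record the pinned model plus basic environment details."""
    random.seed(seed)  # a real pipeline would also seed any ML framework in use
    manifest = {
        "model_id": model_id,
        "checkpoint_sha256": checkpoint_sha256,
        "seed": seed,
        "python_version": platform.python_version(),
        "platform": platform.platform(),
    }
    # Hash the sorted JSON so identical settings always give the same digest.
    payload = json.dumps(manifest, sort_keys=True).encode("utf-8")
    manifest["manifest_sha256"] = hashlib.sha256(payload).hexdigest()
    return manifest


# Placeholder values for illustration only.
first = build_run_manifest("example-open-weight-model", "0" * 64, seed=1234)
first_draw = random.random()
second = build_run_manifest("example-open-weight-model", "0" * 64, seed=1234)
second_draw = random.random()
```

Storing the manifest next to each result file lets a reader check that a reproduction used the same checkpoint, seed, and environment before comparing numbers.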

The banking interviews

Conversations with banking leaders define the real problem, treated as proper qualitative data rather than anecdote.

The technical depth

A semi-structured protocol with a stated sampling rationale feeds a thematic analysis with an explicit coding scheme and member checking, positioning the whole study as a mixed-methods DSR design: qualitative problem definition and quantitative artefact evaluation.