In the open

Build log

One entry per repository interrogated and per thing built. Each leads with what it is and why it matters, then opens into the technical detail. Everything links to a public reconstruction.

Original build, Week 1

The evaluation harness comes first

Before improving anything, you need an honest way to measure it. This entry builds the scoreboard the whole project is judged on.

The technical depth

The pain
Most AI projects report a single benchmark number. That number is one sample from one run, so it is easy to fool yourself with.

The ignored property
Standard reporting ignores that a system can fail at finding the right evidence, at reasoning over it, or at knowing when to stop. Those are different failures and need measuring separately.

The rebuild

The harness computes retrieval metrics (recall@k, nDCG, MRR) separately from answer quality, runs over a labelled set drawn from real residual cases, and is built to support multiple seeds and confidence intervals later in the rigorous-evaluation stage.
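
The three retrieval metrics named above can be sketched in a few lines. This is a minimal illustration, not the harness itself: the function names, the binary (relevant or not) labels, and the toy document ids are all assumptions made for the example, and it assumes each ranked list contains no duplicate ids.

```python
import math
from typing import Sequence, Set


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in ranked[:k] if doc_id in relevant)
    return hits / len(relevant)


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: Sequence[Sequence[str]],
                         labels: Sequence[Set[str]]) -> float:
    """MRR: the reciprocal rank averaged over all queries."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, l) for r, l in zip(runs, labels)) / len(runs)


def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """nDCG@k with binary gains (assumed here; graded labels would change the gain)."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked[:k], start=1)
        if doc_id in relevant
    )
    # Ideal ordering puts every relevant document at the top of the list.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


# Toy query: two relevant documents, retrieved at ranks 2 and 4.
ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
print(recall_at_k(ranked, relevant, k=3))
print(reciprocal_rank(ranked, relevant))
print(ndcg_at_k(ranked, relevant, k=4))
```

Keeping these functions free of any answer-quality logic mirrors the separation described above: a retrieval score can drop while answer quality holds, or the reverse, and the two stay visible as different failures.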

View the reconstruction repository


Reconstruction, Week 2

Rebuilding ColBERT's late interaction from scratch

A study of why keeping detail at the word level beats squashing a passage into a single point, rebuilt by hand to understand it rather than to use it.

The technical depth

The pain
Single-vector embeddings collapse a whole passage into one point and lose fine-grained matches.

The ignored property
They ignore that relevance often hinges on a few specific terms; pooling averages them away.

The rebuild

A minimal MaxSim scorer keeps a vector per token and scores by summing, for each query token, its best match across the passage. The rebuild is then validated against a maintained implementation.
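
The scoring rule described above can be sketched without any model at all. This is an illustrative version under stated assumptions, not the code in the reconstruction repository: the token vectors are hand-written toy values rather than encoder outputs, similarity is taken as cosine (dot product of L2-normalised vectors), and the names `normalize` and `maxsim_score` are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def normalize(v: Vector) -> List[float]:
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_tokens: Sequence[Vector],
                 passage_tokens: Sequence[Vector]) -> float:
    """Sum, over query tokens, of the best similarity to any passage token."""
    if not query_tokens or not passage_tokens:
        return 0.0
    q = [normalize(v) for v in query_tokens]
    p = [normalize(v) for v in passage_tokens]
    # For each query token keep only its single best match in the passage.
    return sum(max(dot(qv, pv) for pv in p) for qv in q)


# Toy 2-d "embeddings": two query tokens pointing along different axes.
query = [[1.0, 0.0], [0.0, 1.0]]

# Passage A keeps one vector per token, each matching one query token exactly.
passage_a = [[1.0, 0.0], [0.0, 2.0]]

# Passage B stands in for a pooled representation: one vector between the two.
passage_b = [[1.0, 1.0]]

print(maxsim_score(query, passage_a))
print(maxsim_score(query, passage_b))
```

In this toy case the per-token passage scores higher than the single blended vector, which is the effect the entry describes: pooling averages the specific matches away, while MaxSim lets each query token find its own best counterpart.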

View the reconstruction repository
