Eval Suite
Gemini Scribe carries an agentic eval suite that measures how well a given LLM behaves as the system under test — not just whether it answers questions, but whether it picks the right tools, doesn't break the vault, and stays within budget. This page documents what the suite measures and how to read the results.
For the operator-facing harness reference (running sweeps, blessing baselines, adding new tasks), see evals/README.md in the repository.
Why the suite exists
A chat-quality eval (does the model write good prose?) misses most of what matters for an Obsidian agent. The harness was designed around a different question: given a vault and a user request, does the agent reliably do the right thing? "Right" here means: read the correct files, call the right tools in the right order, modify the vault only when asked, and produce a response that actually satisfies the user's intent.
The suite is intentionally a gradient, not a saturated yes/no benchmark. Tasks span four difficulty tiers so that the suite separates model classes (frontier Gemini tiers vs. open Ollama models, capability ceilings vs. cost-efficient defaults) rather than saturating at 100% on every model.
Task catalog and difficulty tiers
The suite currently has 54 tasks across four tiers:
| Tier | Intent |
|---|---|
| T1 | Easy — single tool call, tiny corpus. Regression canary; every model should pass. |
| T2 | Moderate — 2–3 tool calls, light distractors. |
| T3 | Hard — multi-hop reasoning, many distractor files, careful tool sequencing. |
| T4 | Hardest — long context, ambiguity resolution, refusal-vs-fabrication tradeoffs. |
Each task is a JSON definition in evals/tasks/ with a fixture (the synthetic notes seeded into the vault before the agent runs), a user message, expected/forbidden tool sets, and a rubric describing what counts as a correct outcome.
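As a rough illustration of the pieces listed above, a task definition could be typed along the following lines. Every field name, the task id, and the tool names here are hypothetical; the real schema lives in evals/tasks/ and may be organised differently.

```typescript
// Hypothetical shape only: the actual JSON schema in evals/tasks/ may differ.
interface EvalTask {
  id: string;
  tier: "T1" | "T2" | "T3" | "T4";
  // Synthetic notes seeded into the vault before the agent runs (path -> content).
  fixture: Record<string, string>;
  userMessage: string;
  expectedTools: string[];
  forbiddenTools: string[];
  // Plain-language description of what counts as a correct outcome.
  rubric: string;
}

// An invented T1-style task: one note, one read, no writes allowed.
const exampleTask: EvalTask = {
  id: "example-read-single-note",
  tier: "T1",
  fixture: { "Projects/Alpha.md": "Status: on track" },
  userMessage: "What is the status of project Alpha?",
  expectedTools: ["read_file"], // hypothetical tool name
  forbiddenTools: ["write_file", "delete_file"], // hypothetical tool names
  rubric: "The response states that project Alpha is on track.",
};
```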
Scoring: pass vs solve
The harness scores two things per task run:
- pass: the run completed without harness errors and within the timeout. Effectively a liveness check.
- solve: the run passed and satisfied the full rubric: all required tools were called, no forbidden tools fired, output matchers held, and the state-based vault assertions held.
solve is the headline number — it's the one that says "the agent actually did the job."
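The relationship between the two scores can be sketched as follows. This is a minimal illustration, not the harness's code: the RunOutcome and ToolRubric types and their field names are assumptions, and the matcher and vault-assertion results are taken as precomputed booleans.

```typescript
// Hypothetical record of one task run; field names are illustrative.
interface RunOutcome {
  harnessError: boolean;
  timedOut: boolean;
  toolsCalled: string[];
  matchersHeld: boolean;
  vaultAssertionsHeld: boolean;
}

interface ToolRubric {
  requiredTools: string[];
  forbiddenTools: string[];
}

// pass: liveness only (no harness error, finished within the timeout).
function didPass(run: RunOutcome): boolean {
  return !run.harnessError && !run.timedOut;
}

// solve: pass plus every rubric criterion holding at once.
function didSolve(run: RunOutcome, rubric: ToolRubric): boolean {
  const called = new Set(run.toolsCalled);
  const allRequired = rubric.requiredTools.every((t) => called.has(t));
  const noForbidden = rubric.forbiddenTools.every((t) => !called.has(t));
  return (
    didPass(run) &&
    allRequired &&
    noForbidden &&
    run.matchersHeld &&
    run.vaultAssertionsHeld
  );
}
```

Under this sketch a run that answers correctly but also fires a forbidden tool still counts as a pass and never as a solve.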
Output matchers
Three matcher types check the agent's final response:
- contains: substring match (case-sensitive by default, supports any-of arrays for wikilink-vs-title style variation).
- regex: JavaScript regex with explicit flags. Inline (?i)-style flags are not supported by JS regex; pass flags: "i" separately.
- judge: LLM-as-judge for prose-heavy rubrics where literal substrings would be too brittle. See Judge model below.
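A minimal sketch of how the two literal matchers might be evaluated, assuming a hypothetical Matcher shape (the type, value, pattern and flags field names are illustrative, not the harness's schema). The only real API used is the standard `new RegExp(pattern, flags)` constructor.

```typescript
// Hypothetical matcher shapes; the harness's own JSON fields may differ.
type Matcher =
  | { type: "contains"; value: string | string[] }
  | { type: "regex"; pattern: string; flags?: string };

function matches(response: string, m: Matcher): boolean {
  if (m.type === "contains") {
    // An array means any-of: one hit is enough.
    const options = Array.isArray(m.value) ? m.value : [m.value];
    return options.some((s) => response.includes(s));
  }
  // Flags are passed separately, as the RegExp constructor expects.
  return new RegExp(m.pattern, m.flags ?? "").test(response);
}

const response = "See [[Project Alpha]] for details.";

// Accept either the wikilink or the bare title.
const anyOf = matches(response, {
  type: "contains",
  value: ["[[Project Alpha]]", "Project Alpha"],
});

// Case-insensitive only because flags: "i" is supplied.
const withFlag = matches(response, { type: "regex", pattern: "project alpha", flags: "i" });
const withoutFlag = matches(response, { type: "regex", pattern: "project alpha" });
```

Here anyOf and withFlag are true, while withoutFlag is false because the pattern is lowercase and no flag was given.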
Vault assertions (state-based scoring)
fileExists / fileContains / fileMatches / fileLacks / fileUnchanged / frontmatterEquals checks run against the post-task vault state. This is how write/edit/delete tasks are scored: not by what the agent said it did, but by what's actually on disk after the run.
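To make the state-based idea concrete, here is a small sketch that treats the vault as an in-memory map from path to file content, snapshotted before and after the run. The function names mirror four of the assertion names above, but the signatures and the exact semantics (for example, how a missing file is treated) are assumptions.

```typescript
// A vault snapshot: note path -> file content. Purely illustrative.
type Vault = Map<string, string>;

const fileExists = (v: Vault, path: string): boolean => v.has(path);

const fileContains = (v: Vault, path: string, text: string): boolean =>
  (v.get(path) ?? "").includes(text);

// Assumption: a missing file does not satisfy fileLacks.
const fileLacks = (v: Vault, path: string, text: string): boolean =>
  v.has(path) && !fileContains(v, path, text);

// Compares content before and after the run.
const fileUnchanged = (before: Vault, after: Vault, path: string): boolean =>
  before.get(path) === after.get(path);

const before: Vault = new Map([
  ["Inbox.md", "- buy milk"],
  ["Journal.md", "Day 1"],
]);
const after: Vault = new Map([
  ["Inbox.md", "- buy milk\n- call Sam"],
  ["Journal.md", "Day 1"],
]);

// The agent was asked to append to Inbox.md and leave everything else alone.
const appended = fileContains(after, "Inbox.md", "call Sam");
const journalUntouched = fileUnchanged(before, after, "Journal.md");
```

Both checks hold here regardless of what the agent claimed in its response, which is the point of scoring against disk state.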
Tool-call budget
A task can declare a toolCallBudget. Exceeding it makes solve false even if every other criterion held. This catches "read every file in the vault" behaviour in cases where a better-chosen tool would have answered in one call.
Reliability: pass^k and solve^k
Each task runs k times (typically k=3 for development, k=5 for publication-grade baselines). Two metrics fall out:
- pass^k / solve^k: the τ-bench reliability signal (arXiv 2406.12045). A task counts as passed/solved at k only when all k runs passed/solved. This is the conservative number: LLM nondeterminism on a single run can't inflate it.
- Mean rates: the proportion of all task × run cells that passed/solved. A useful signal, but noisier.
Tasks that land strictly between 0 and k solves (at least one solve, but not all k) are flagged as flaky. A small number of flaky tasks isn't a regression (it's a property of the LLM and the task), but the trend matters: a previously stable task drifting flaky is a real signal.
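The metrics in this section can be sketched over a table of per-run booleans. The Results type, the function names, and the task ids below are illustrative assumptions; only the definitions (all k runs must succeed, mean over task × run cells, flaky when strictly between 0 and k) come from the text above.

```typescript
// task id -> k booleans, true when that run solved the task.
type Results = Record<string, boolean[]>;

// A task counts at k only when every one of its k runs succeeded.
const allRuns = (runs: boolean[]): boolean =>
  runs.length > 0 && runs.every(Boolean);

// Flaky: at least one success, but not all k.
const isFlaky = (runs: boolean[]): boolean => {
  const wins = runs.filter(Boolean).length;
  return wins > 0 && wins < runs.length;
};

function summarize(results: Results) {
  const ids = Object.keys(results);
  const cells = ids.reduce<boolean[]>((acc, id) => acc.concat(results[id]), []);
  return {
    solveHatK: ids.filter((id) => allRuns(results[id])).length / ids.length,
    meanSolveRate: cells.filter(Boolean).length / cells.length,
    flaky: ids.filter((id) => isFlaky(results[id])),
  };
}

// Three invented tasks at k = 3.
const summary = summarize({
  "task-a": [true, true, true],
  "task-b": [true, false, true],
  "task-c": [false, false, false],
});
// solveHatK = 1/3, meanSolveRate = 5/9, flaky = ["task-b"]
```

In this toy example the mean rate (about 56%) is well above solve^3 (about 33%) because task-b's two lucky runs count toward the mean but not toward the reliability number.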
Judge model
Prose-heavy rubrics use an LLM-as-judge instead of literal matchers. The judge is a separate model from the system under test:
- It always uses Gemini, even when the system under test is Ollama, so the judge doesn't drift across model-swap experiments.
- The current standardized judge is gemini-3.5-flash, pinned (no -latest or -preview suffix). It was selected against a hand-labelled gold set of 90 prose-judge tuples (a one-time human calibration committed in the repo): 94.4% agreement with human ground truth, vs. 92.2% for gemini-2.5-flash (the previous default) and 93.3% for gemini-3.1-flash-lite. The accuracy ceiling under measurement was gemini-3.1-pro-preview at 95.6%, but a -preview id would have made every blessed score subject to silent re-rating if Google rotated the underlying weights.
- The judge runs with temperature: 0 and a strict YES/NO contract.
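A strict YES/NO contract and the calibration figure can be illustrated with two small helpers. Both are hypothetical sketches: the function names are invented, and whether the real harness tolerates surrounding whitespace or lowercase verdicts is an assumption.

```typescript
// Turn the judge's raw reply into a verdict, rejecting anything off-contract.
// Assumption: whitespace and case differences are tolerated; nothing else is.
function parseVerdict(raw: string): boolean {
  const verdict = raw.trim().toUpperCase();
  if (verdict === "YES") return true;
  if (verdict === "NO") return false;
  throw new Error(`Judge broke the YES/NO contract: ${JSON.stringify(raw)}`);
}

// Fraction of tuples where the judge's verdict matches the human label.
function agreement(judge: boolean[], gold: boolean[]): number {
  if (judge.length !== gold.length || gold.length === 0) {
    throw new Error("label arrays must be non-empty and of equal length");
  }
  const hits = judge.filter((v, i) => v === gold[i]).length;
  return hits / gold.length;
}
```

On a 90-tuple gold set each disagreement moves the rate by about 1.1 points, so 94.4% corresponds to 85 matching verdicts out of 90, and the gap to the 95.6% ceiling is a single tuple.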
Bias caveats
The judge is blind to the model id: the prompt carries only the user request, the agent's response, and the criterion, never the name of the model that produced the response. Blindness removes explicit identity bias, but does not eliminate latent stylistic-familiarity bias. Concretely: a Gemini judge grading a Gemini-family system under test (the case for every row in the table below except gemma4:e4b) is exposed to same-family stylistic preference. The current judge is the same vendor as most of the system-under-test set; results across vendor lines (Gemini-judged Ollama vs Gemini-judged Gemini) should be read with that caveat.
A cross-vendor judge is the cleanest fix and is straightforward to revisit if a future calibration round shows persistent same-family bias.
Baselines
A baseline is a blessed, committed result for a (provider, model) pair. It's not auto-promoted: the operator explicitly runs npm run eval:bless after inspecting a clean run, and commits the resulting evals/baselines/<provider>-<sanitized-model>.json. Subsequent runs auto-compare against the matching baseline and flag regressions in pass^k or solve^k.
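The regression check described above amounts to comparing two pairs of numbers. The sketch below is an assumption about its general shape, not the harness's logic: the Scores type, the function name, and the optional tolerance parameter are all hypothetical.

```typescript
// Reliability scores as fractions in [0, 1]; field names are illustrative.
interface Scores {
  passHatK: number;
  solveHatK: number;
}

// Return the names of the metrics that dropped below the blessed baseline.
// The tolerance parameter is hypothetical; 0 means any drop is flagged.
function findRegressions(baseline: Scores, current: Scores, tolerance = 0): string[] {
  const flagged: string[] = [];
  if (current.passHatK < baseline.passHatK - tolerance) flagged.push("pass^k");
  if (current.solveHatK < baseline.solveHatK - tolerance) flagged.push("solve^k");
  return flagged;
}

// Invented numbers: liveness held, but two fewer of 54 tasks were solved at k.
const flagged = findRegressions(
  { passHatK: 1.0, solveHatK: 40 / 54 },
  { passHatK: 1.0, solveHatK: 38 / 54 },
);
// flagged is ["solve^k"]
```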
Baselines pin to specific model ids — never -latest or -preview. Those tags can rotate underneath us silently, which destroys the regression signal.
Published results
The table lists every model that has been blessed against the current 54-task suite. Rows are sorted by solve^k (the headline reliability number) descending. The Commit column links to the SHA the harness was built from when the sweep ran; the Date column is the sweep's ISO timestamp (UTC).
| Model | Provider | k | Tasks | pass^k | solve^k | T1 | T2 | T3 | T4 | Date (UTC) | Commit |
|---|---|---|---|---|---|---|---|---|---|---|---|
| gemini-3.1-flash-lite | gemini | 5 | 54 | 100.0% | 74.1% | 3/3 | 11/13 | 19/29 | 7/9 | 2026-05-24 | 36e3495 |
| gemini-2.5-flash | gemini | 5 | 54 | 98.1% | 57.4% | 3/3 | 10/13 | 13/29 | 5/9 | 2026-05-24 | 4683f4f |
| gemma4:e4b | ollama | 5 | 54 | 90.7% | 14.8% | 3/3 | 2/13 | 2/29 | 1/9 | 2026-05-24 | 8ece524 |
Reading the table
A few patterns worth calling out from the current rows:
- The "lite" label is misleading. gemini-3.1-flash-lite (74.1% solve^5) materially beats gemini-2.5-flash (57.4%) on the same 54-task suite and the same judge, a ~17pp gap. The newer flash-lite is the more capable agentic model despite the name.
- The T1 → T4 gradient is doing real work. Compare gemma4:e4b (100% on T1, then 15% / 7% / 11% through T2–T4) against gemini-3.1-flash-lite (100% / 85% / 66% / 78%). T1 is a regression canary that every plausible agent should clear; T2–T4 is where the open-model tier and the frontier tier separate.
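The per-tier percentages quoted above follow directly from the solved/total counts in the table. A short sketch of that arithmetic, using the published counts for two rows (the helper names are invented for illustration):

```typescript
// [solved, total] per tier, T1 through T4, copied from the table.
const tierCounts: Record<string, Array<[number, number]>> = {
  "gemini-3.1-flash-lite": [[3, 3], [11, 13], [19, 29], [7, 9]],
  "gemma4:e4b": [[3, 3], [2, 13], [2, 29], [1, 9]],
};

// Per-tier solve rate, rounded to a whole percent.
const tierRates = (rows: Array<[number, number]>): number[] =>
  rows.map(([solved, total]) => Math.round((100 * solved) / total));

// Overall solve^k: total solved over total tasks, to one decimal place.
const overallRate = (rows: Array<[number, number]>): string => {
  const solved = rows.reduce((sum, [s]) => sum + s, 0);
  const total = rows.reduce((sum, [, t]) => sum + t, 0);
  return ((100 * solved) / total).toFixed(1);
};

const liteTiers = tierRates(tierCounts["gemini-3.1-flash-lite"]); // [100, 85, 66, 78]
const gemmaTiers = tierRates(tierCounts["gemma4:e4b"]); // [100, 15, 7, 11]
const liteOverall = overallRate(tierCounts["gemini-3.1-flash-lite"]); // "74.1" (40 of 54)
const gemmaOverall = overallRate(tierCounts["gemma4:e4b"]); // "14.8" (8 of 54)
```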
Adding a model to the table
The table is generated from evals/baselines/*.json at docs-build time, so publishing a new model is the same operation as blessing a baseline. From the repo:
# Pin the model id — do not use -latest or -preview ids
npm run eval -- --model=<id> --repeat=5 # or --provider=ollama --model=<id>
npm run eval:bless # promotes the most recent result
git add evals/baselines/<provider>-<sanitized-model>.json
git commit -m "chore(evals): bless baseline for <id>"
The docs build picks up the new file automatically; no template edits are needed.