Building a Health Monitor Dashboard for WordPress Docs

I’ve always believed that a great developer experience starts with great documentation. When documentation is clear, accurate, and up to date, it helps developers succeed and builds trust in the product. But when it fails to guide users properly — or worse, contains outdated or inaccurate information — both the developer experience and confidence in the product quickly deteriorate.

The Block Editor Handbook has more than 200 pages of documentation that developers rely on every day. APIs change, parameters get renamed, code examples stop working — and the docs drift further from the code with every Gutenberg release. Some content is automatically generated from the code, but other content is manually created and maintained. Nobody knows which pages are accurate and which are silently outdated.

That’s the problem Jonathan Bossenger and I have been working on for the last few weeks. The tool we’re building, the Block Editor Docs Health Monitor, compares each handbook page against the actual Gutenberg and WordPress source code, asks an LLM to flag the mismatches, and publishes a static dashboard a doc author can open to see exactly what’s wrong, page by page, ranked by severity, with the line of code that proves the finding.

Let me walk you through what it does, how it works, and some of the lessons we learned along the way.

Table of Contents

  1. Radical Speed Month: a month to ship something real at Automattic
  2. The vision
  3. How the Docs Health Monitor works
    1. Three adapters, one pipeline
    2. Mapping docs to code
    3. The two-pass validator
      1. Pre-pass — absence detection.
      2. Pass 1 — candidate generation.
      3. Pass 2 — targeted verification.
    4. Scoring the health of docs
    5. The Dashboard
  4. Not just the Block Editor
  5. Lessons learned
  6. On cost
  7. Wrapping up

Radical Speed Month: a month to ship something real at Automattic

A few weeks ago, Matt Mullenweg — Automattic’s CEO — kicked off a one-month, company-wide experiment called Radical Speed Month (hashtag #radicalspeedmonth). The format is simple: pair up with one other person, pick a project that matters, ship it by May 22. No committee, no approval layers, no quarterly planning rituals in between. The premise is that small teams move faster than large ones, and the month is there to find out how much.

For me, that meant teaming up with Jonathan Bossenger. Jonathan is a DevRel at Automattic, a long-time WordPress contributor, and one of the best teachers and communicators I know in the WordPress ecosystem. He also holds the record for the most videos published on WordPress.tv. Working with him is always a pleasure, and I was glad to have another chance to collaborate.

The project we picked: a docs-drift detector for the WordPress developer handbooks. Compare what the docs say against the code they describe, then put the gaps on a public dashboard so handbook editors, plugin authors, and learners can see where the documentation has fallen behind.

Week one was all definition: architecture, contracts, components, scope. The temptation to start coding immediately was real, and we ignored it. A solid spec is what lets you hand work to an autonomous agent and trust the result — and that turned out to matter more than we expected.

The vision

The final goal of the project is to have a single dashboard that audits every WordPress documentation site — the Block Editor Handbook, the REST API Handbook, the Plugin Handbook, the Theme Handbook, and the rest — refreshed weekly. Doc authors open it on a Monday, see what drifted over the weekend, and have a punch list ready before standup.

How feasible that is depends on the budget. At today’s frontier-model pricing, a weekly run across every WordPress handbook may not be affordable without strong sponsorship.

But the last word about cost hasn’t been written. We’re actively working to bring the per-run bill down — building support for open-weight models that can run locally on a developer’s own laptop or through lower-cost hosted providers — so a weekly all-handbooks run can fit within a normal project budget. Frontier API prices keep dropping and open-weight models keep catching up on their own, too: we don’t need to solve cost in one stroke; we just need to outlast it. More on this in the On cost section below.

💡 The dashboard is the destination, not the gate. The tool is already useful right now for individuals: clone the repo, scope it to the docs you care about, run it with your own API key — that’s the validation step the project is in today. The “single dashboard for all of WordPress” version comes once the weekly economics work for everyone.

How the Docs Health Monitor works

At a glance, the pipeline is just three steps: fetch the docs, fetch the code, ask an LLM to compare them. The interesting part is what sits between those steps to make the output trustworthy.

There’s detailed information about several aspects of the project in the repo’s docs folder.

Three adapters, one pipeline

Everything that touches an external system is behind an interface: DocSource, CodeSource, DocCodeMapper. The pipeline never imports a concrete implementation directly. It reads a config file, looks up the adapter type, and wires the pieces together at runtime:

{
  "site": "gutenberg-block-api",
  "docSource": {
    "type": "manifest-url",
    "url": "https://raw.githubusercontent.com/WordPress/gutenberg/trunk/docs/manifest.json",
    "parent": "reference-guides/block-api"
  },
  "codeSources": [
    { "name": "gutenberg", "type": "git-clone", "repo": "WordPress/gutenberg", "ref": "trunk" },
    { "name": "wordpress-develop", "type": "git-clone", "repo": "WordPress/wordpress-develop", "ref": "6.9" }
  ],
  "mapper": { "type": "manual-map", "path": "mappings/gutenberg-block-api.json" },
  "validator": { "type": "claude" }
}

That config decides everything. Swap manifest-url for a future fs-walk adapter and you point the same pipeline at a folder of markdown files. Swap the mapper, swap the validator — the rest doesn’t move.
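
To make the contract concrete, here’s a minimal TypeScript sketch of how those three interfaces and the runtime lookup might fit together. Only the interface names and the manifest-url adapter type come from the project; the signatures and the stub implementation are my illustration.

// A sketch of the adapter contract. The interface names (DocSource,
// CodeSource, DocCodeMapper) and the "manifest-url" type are the project's;
// the signatures are assumptions.
interface DocPage {
  slug: string;      // e.g. "reference-guides/block-api/block-metadata"
  markdown: string;  // raw page content
}

interface DocSource {
  fetchDocs(): Promise<DocPage[]>;
}

interface CodeSource {
  name: string;                 // e.g. "gutenberg"
  checkout(): Promise<string>;  // clones the pinned ref, returns a local path
}

interface DocCodeMapper {
  filesFor(docSlug: string): string[];  // source files relevant to one doc
}

// One concrete adapter, registered under its config "type" key. The pipeline
// resolves adapters through a registry like this and never imports a class
// directly.
class ManifestUrlDocSource implements DocSource {
  constructor(private url: string, private parent: string) {}
  async fetchDocs(): Promise<DocPage[]> {
    const manifest = await (await fetch(this.url)).json();
    // ...walk `manifest` down to this.parent and fetch each page (elided)
    return [];
  }
}

const docSourceRegistry: Record<string, (cfg: any) => DocSource> = {
  "manifest-url": (cfg) => new ManifestUrlDocSource(cfg.url, cfg.parent),
};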

One non-obvious detail in the code-source layer: each repo pins to a specific ref, deliberately. wordpress-develop points at the current WP release branch, not trunk. The handbook’s readers are running shipped WordPress, not yesterday’s commit. If you compare docs against trunk, you flag every unreleased refactor as drift and miss every API that hasn’t shipped yet. The ref has to match what users actually run.

Mapping docs to code

The validator needs to know, for any given doc, which source files are relevant. The doc explaining block.json metadata maps to the block registry implementation, the schema files, the PHP-side resolver. The doc on dynamic blocks maps to a different set of files entirely.

That mapping is the single biggest factor in the quality of the analysis. Get it wrong and the LLM either guesses or hallucinates. Get it right and most of the work is already done.

In the initial version of the project, the mappings were hand-written. They now have a first-pass automated builder that:

  1. scans each doc for symbols
  2. runs AST search across the codebase
  3. scores candidate files by symbol specificity and centrality
  4. ranks the mapping with semantic understanding using an AI model
  5. emits a tiered list (primary / secondary / context) for every doc.

A human still reviews the result before committing, but the starting point is good enough that most entries pass through untouched.
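
For a sense of shape, a generated mapping entry might look something like the snippet below. The tier names are the pipeline’s; the exact file layout and paths are my illustration, not the repo’s actual schema.

{
  "reference-guides/block-api/block-metadata": {
    "primary": [
      { "repo": "gutenberg", "path": "packages/blocks/src/api/registration.js" }
    ],
    "secondary": [
      { "repo": "wordpress-develop", "path": "src/wp-includes/blocks.php" }
    ],
    "context": [
      { "repo": "gutenberg", "path": "schemas/json/block.json" }
    ]
  }
}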

The two-pass validator

The validator is the core of the tool, and the part I spent the most time tuning. It runs in two passes per doc, with a deterministic pre-pass up front and a deterministic gate in between.

⚠️ Without a deterministic gate, LLMs invent quotes, paraphrase the doc back at you as if it were a contradiction, and flag style preferences as bugs. Early runs sat at 50–60% precision before the verbatim check was added. I realized that for this project, plumbing matters more than the prompt.

Pre-pass — absence detection.

Before any LLM call, the pipeline extracts every backtick-wrapped identifier from the doc and checks whether each one appears anywhere in the assembled source. The ones that don’t appear get injected into the Pass 1 prompt as “potentially removed APIs.” This exists because the model can otherwise miss removed APIs entirely — there’s no code to quote as counter-evidence when something simply isn’t there, so the prompt has to seed the suspicion.
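
The check is simple enough to sketch in a few lines of TypeScript (a minimal illustration, assuming the doc and the assembled source are already strings in memory):

// Pre-pass sketch: collect backtick-wrapped identifiers from the doc and
// keep the ones that never appear in the assembled source. Those become the
// "potentially removed APIs" hint injected into the Pass 1 prompt.
function detectAbsentSymbols(docMarkdown: string, assembledSource: string): string[] {
  const identifiers = new Set<string>();
  // Matches inline code spans like `useBlockProps` or `render_block`.
  for (const match of docMarkdown.matchAll(/`([A-Za-z_$][\w$]*)`/g)) {
    identifiers.add(match[1]);
  }
  return [...identifiers].filter((id) => !assembledSource.includes(id));
}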

Pass 1 — candidate generation.

A single Claude call receives the full doc plus its tiered, token-budgeted code context (exported symbols, hooks, defaults, schemas, source tiers, and the pre-pass hint).

For example, here you can see the information collected and compiled for the Metadata in block.json doc (using its related mapping info) that is sent to Pass 1.

The call forces a report_findings tool call with a fixed JSON shape — no free prose path exists. Each candidate must include a verbatim codeSays quote from a named source file and a verbatim docSays quote from the doc.

Immediately after the call, a deterministic gate verifies both quotes appear literally in their claimed locations (after normalising whitespace and Markdown links). Anything that fails is dropped silently and counted. The gate is what makes the difference between claim and finding — it catches the fabrications the prompt missed, and the model knows it exists, so it’s incentivised to either find a real quote or skip the finding.
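
In essence, the gate is a literal substring check after normalisation. A sketch of the idea (the real normalisation rules live in the repo):

// Gate sketch: a candidate survives only if both of its quotes appear
// literally in their claimed locations after normalisation.
function normalise(text: string): string {
  return text
    .replace(/\[([^\]]*)\]\([^)]*\)/g, "$1")  // strip Markdown links, keep text
    .replace(/\s+/g, " ")                     // collapse whitespace
    .trim();
}

interface Candidate {
  docSays: string;    // verbatim quote claimed to come from the doc
  codeSays: string;   // verbatim quote claimed to come from a source file
  sourceFile: string; // where codeSays is claimed to live
}

function passesGate(c: Candidate, docText: string, sourceText: string): boolean {
  return (
    normalise(docText).includes(normalise(c.docSays)) &&
    normalise(sourceText).includes(normalise(c.codeSays))
  );
}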

Pass 2 — targeted verification.

Each surviving candidate gets its own bounded agentic loop. The model sees only the candidate JSON; it then drives its own context-gathering by calling fetch_code(repo, path, startLine, endLine) to inspect whatever region it wants, and commits a verdict via report_findings. Most verifications finish in 2–3 round-trips; the loop is capped at 10 turns.
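
For a rough sense of the control flow, here’s a sketch of that loop. The tool names (fetch_code, report_findings) and the 10-turn cap are the project’s; the model client and repo reader are hypothetical stand-ins.

// Pass 2 control-flow sketch. callModel and readRegion stand in for the
// real Claude client and the local checkout reader.
type ToolCall =
  | { name: "fetch_code"; args: { repo: string; path: string; startLine: number; endLine: number } }
  | { name: "report_findings"; args: { verdict: "confirmed" | "rejected" } };

declare function callModel(messages: unknown[]): Promise<{ toolCall?: ToolCall }>;
declare function readRegion(repo: string, path: string, start: number, end: number): string;

const MAX_TURNS = 10;

async function verifyCandidate(candidateJson: unknown) {
  const messages: unknown[] = [candidateJson];  // the model sees only the candidate
  for (let turn = 0; turn < MAX_TURNS; turn++) {
    const { toolCall } = await callModel(messages);
    if (!toolCall) break;  // no tool call: treat as a failed verification
    if (toolCall.name === "fetch_code") {
      const { repo, path, startLine, endLine } = toolCall.args;
      messages.push(readRegion(repo, path, startLine, endLine));
      continue;  // the model drives its own context gathering
    }
    return toolCall.args.verdict;  // report_findings: verdict committed
  }
  return null;  // cap reached without a verdict
}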

See the diagram in Excalidraw.

Scoring the health of docs

Each doc gets a health score:

score = 100 − (critical × 15 + major × 7 + minor × 2)
// clamped to 0..100
//   ≥ 85  → healthy
//   60–84 → needs attention
//   < 60  → critical

The corpus as a whole also gets one Overall Health number, displayed as the headline on the dashboard. It’s the rounded arithmetic mean of every doc’s score in the run, with one carve-out: docs that couldn’t be analyzed — either no code was mapped to them, or the validator failed mid-flight — carry a null score and are excluded from the average. They neither help nor hurt the corpus number. An empty corpus defaults to 100.
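
Both rules are small enough to state directly in code; here’s a sketch that mirrors the description above:

// Per-doc score and corpus-level Overall Health, as described above.
interface FindingCounts { critical: number; major: number; minor: number }

function docScore({ critical, major, minor }: FindingCounts): number {
  const raw = 100 - (critical * 15 + major * 7 + minor * 2);
  return Math.min(100, Math.max(0, raw));  // clamp to 0..100
}

function overallHealth(scores: Array<number | null>): number {
  const analyzed = scores.filter((s): s is number => s !== null);  // nulls excluded
  if (analyzed.length === 0) return 100;  // empty corpus defaults to 100
  return Math.round(analyzed.reduce((a, b) => a + b, 0) / analyzed.length);
}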

The Dashboard

The dashboard is static HTML with Tailwind via CDN. No framework, no build step. A run produces an index.html corpus overview (Overall Health, totals, foldered tree), one page per folder, and one per analyzed doc listing its findings with severity, suggested fix, and verbatim quotes from both sides. Today it’s a local-only flow: npm run analyze writes the dashboard to a directory of your choice.

Next is publishing. The gh-pages branch will serve as both static host and append-only data store, with each run landing at data/runs/<runId>/results.json indexed by data/history.json and driven by a weekly cron. That unlocks the historical UI — a trend line of Overall Health, “new this run” badges, per-doc sparklines. Snapshot first, trends later. A trend line on top of a noisy validator just animates the noise.
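
The shape of that index is still being settled; something like this hypothetical history.json is the idea (the runId and field names are illustrative):

{
  "runs": [
    {
      "runId": "2025-05-19T06-00",
      "overallHealth": 82,
      "results": "data/runs/2025-05-19T06-00/results.json"
    }
  ]
}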

Not just the Block Editor

The whole reason the adapter pattern is in there is that this engine isn’t meant to be Gutenberg-specific. The Block Editor Handbook is the first test case — large enough to be a real workout, small enough to be tractable in a month — but nothing about the pipeline assumes Gutenberg.

Any docs-and-code pair works as long as you can write the four config fields:

  • Where the docs live. A manifest, a folder of markdown, a REST endpoint — that’s the DocSource.
  • Where the code lives. One or more git repos pinned to specific refs — that’s the CodeSource list.
  • Which code each doc relates to. A mapping, hand-written or auto-generated.
  • A small site-specific prompt extension. One file that calibrates the generic validator to the conventions of this corpus.

That last one matters: the validator’s system prompt is layered. The base prompt holds rules that apply to any docs/code pair (the impact filter, severity rubric, evidence requirements, drift definitions). Language packs sit on top — JS/TS conventions, PHP conventions — and apply to any corpus written in that ecosystem. Then a one-file site extension carries the genuinely corpus-specific knowledge: a single-symbol carve-out, a schema-authority declaration, the list of true and false positives earned from real reviews of this docs set.
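
Mechanically, that layering is just concatenation in a fixed order. A sketch, with hypothetical file names:

import { readFileSync } from "node:fs";

// Layered system prompt: base rules, then the ecosystem's language pack,
// then the one-file site extension. The layering is the project's design;
// these file names are hypothetical.
function buildSystemPrompt(languagePack: string, site: string): string {
  return [
    readFileSync("prompts/base.md", "utf8"),                  // any docs/code pair
    readFileSync(`prompts/lang/${languagePack}.md`, "utf8"),  // e.g. "js-ts"
    readFileSync(`prompts/sites/${site}.md`, "utf8"),         // corpus-specific
  ].join("\n\n");
}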

💡 In practice this means pointing the same tool at, say, the REST API Handbook, the Plugin Handbook, or your own internal SDK docs is mostly a matter of writing a config file, auto-generating the mappings, and adding a custom prompt extension. The pipeline doesn’t change.

Lessons learned

Precision matters more than recall. IMO, doc authors stop trusting the tool after two false positives. A missed finding is forgivable. A cried-wolf finding kills adoption. Early runs sat at 50–60% precision and felt unusable. Once it crossed 80%, the same tool started feeling like a colleague rather than a heckler. So the dial to tune is the one that drops marginal cases, not the one that catches them.

The biggest wins were deterministic, not prompt-side. What pushed precision from the low 60s to 83% wasn’t a smarter prompt. It was a stack of small structural guards: a verbatim quote check after every LLM call, structured extractors that hand the model parsed signatures instead of raw source, cross-section verification, an anti-rephrase rule, and a missing-symbol pre-pass. On the run that hit 83% (15 true positives, 3 false positives), the verbatim check alone dropped 9 hallucinated quotes. Without it, precision would have collapsed. When you can verify a model’s claim with code, do. Don’t try to talk it out of lying with prompt rules.

On cost

Take a docs site with around 100 pages — a fair reference for a single product handbook, a developer guide, or a mid-sized API reference. At the early benchmarks (~$5 per 10 docs), a full-corpus run over those 100 pages would have an estimated cost of around $50 using Claude Sonnet 4.6 on both Pass 1 and Pass 2, before prompt caching kicks in. Cadence is then a knob: weekly when the budget allows, monthly when it doesn’t. Drift has a long enough half-life that even a monthly audit catches problems before they cement into the docs.

⚠️ Don’t point this at a 1,000-page corpus on default settings unless you mean to — the bill scales linearly with doc count. Start by scoping a folder or subtree, run weekly on that, expand once you trust the output.

The simplest lever today is scope. The tool runs locally with your own API key, and the config points it at a folder, subtree, or doc-pattern of your choice. A team scoping the weekly run to 20 of those 100 pages — the high-traffic ones, the ones with the most code samples, the ones authors actually own — pays for 20 pages of cost. That makes it even more useful right now without committing to the full bill.
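
Scoping reuses the config mechanism shown earlier: narrowing the docSource’s parent field shrinks the doc set, and the bill with it. The subtree below is illustrative, not a path from the actual manifest:

{
  "docSource": {
    "type": "manifest-url",
    "url": "https://raw.githubusercontent.com/WordPress/gutenberg/trunk/docs/manifest.json",
    "parent": "reference-guides/block-api/block-supports"
  }
}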

The lever we’re most interested in is open models. Pass 2 was originally designed as an agentic loop — Claude requesting specific code regions via a fetch_code tool. It works on frontier models, but it may not be structurally necessary: the code is already cloned locally, and Pass 1 hands Pass 2 a candidate with a precise file + line range.

There’s active work on PR #51 to remove the tool-call dependency end-to-end: Pass 1 emits plain JSON in place of a tool call, and the Pass 2 fetch_code loop is dropped — the existing verbatim gate and confidence threshold carry the hallucination check on their own. Once it lands, the validator can run against anything that speaks an OpenAI-compatible API.

That opens two paths: small open-weights models on a developer laptop via Ollama (Llama, Mistral, Qwen at 7B–13B), and larger open-weights checkpoints — too big to fit on a laptop — accessed through OpenRouter at a fraction of frontier API pricing. If a laptop model hits comparable quality, the $50-per-run figure drops to ~zero. If only a hosted open model does, it drops to a small fraction of today’s bill — still a big win.

Wrapping up

That’s where it sits today: a working engine that pulls the Block Editor Handbook and the Gutenberg source, compares them with a two-pass validator, and produces an evidence-backed health report at around 83% precision. The same engine, with a new config and a short site-specific prompt, should point at the next handbook with no pipeline changes.

The code is open at github.com/juanma-wp/wp-docs-health-monitor. If you’ve thought about this kind of drift problem on your own docs, or you’d like to point it at something other than the Block Editor Handbook, the discussions are open and we’d love to hear what your config would look like.
