Audit tool: architecture & flow – White Tree Digital Docs

The Free Website Audit at /free-website-audit is the studio's highest-intent lead path: a visitor drops in a URL, the Worker runs a real multi-signal audit, and the page shows just enough of the verdict that the rest is worth an email. Submitting that email is the only thing that releases the full report.

This page is the map: how a request flows from the visitor's URL to a HubSpot-emailed report, the three decoupled layers that produce the score and findings, and why the whole thing runs server-side. The per-check logic, the score formula, and the teaser copy rules are documented in Audit tool: checks, scoring & teaser; the HubSpot submit mechanics live in HubSpot & lead capture.

The mental model

The audit is not a report card — it is a conversion surface. The page's only job is to get the visitor to submit the email gate; the real value is delivered in the report HubSpot emails after the submit. Everything below serves that: the on-page teaser shows a single 0–10 score and exactly three consequence-framed findings (each with a locked "N issues" count behind a padlock), and nothing else. The fixes — what each issue is and how to resolve it — are the freebie, and the freebie is gated.

Two design moves make this work and keep it maintainable:

The audit runs in a Cloudflare Worker, not the browser. A single POST /api/audit fetches everything, runs all 26 checks, and returns only the teaser shape. The gated detail never crosses the wire, so a technically literate visitor inspecting the network response sees the symptom and a locked count — never the answer.
Authoring is per-check, never per-site. You never enumerate "all possible site outcomes." Each check is authored once with copy for its own states; the site-level output is just whichever checks fired. This kills the combinatorial problem and makes adding a signal cheap.

This replaced a thin PSI wrapper

The first version (see prompts/website-audit-tool-spec.md) was fully client-side: it called PageSpeed Insights from the browser with a PUBLIC_PSI_API_KEY and rendered the four Lighthouse category grades on screen. That read as "just PSI" — a literate visitor recognizes the Lighthouse four-category grid on sight and self-serves it. The rebuild (prompts/claude-code-prompt-audit-signals.md) moved the audit server-side, made the PSI key a runtime secret, switched the on-page output to a single 0–10 score plus three findings, and renamed the HubSpot properties to audit_*. The old spec still lives in the repo as historical context — don't build to it.

The end-to-end flow

visitor enters URL
  → POST /api/audit  (src/pages/api/audit.ts — a CF Worker route, prerender=false)
      · same-origin Origin check + per-IP rate limit (5 / 10 min, per-isolate)
      · fetchSiteData(url, PSI_API_KEY): PSI + HTML/headers + http→https probe
        + robots.txt + sitemap.xml — all fetched ONCE, in parallel
      · runChecks(site): 26 pure checks over SiteData
      · buildTeaser(results, host): score + select top 3 findings
  → returns ONLY the AuditTeaser JSON — full RanCheck[] stays in the Worker
  → email gate (the single CTA on the results screen)
  → submitToHubspot: Forms v3 API, email + audit_* fields
  → a HubSpot workflow emails the full report

The browser side is a single React island, WebsiteAudit.tsx, hydrated with client:load on the otherwise-static Astro host page free-website-audit.astro. The host page is thin: it fetches siteSettings for the layout chrome and renders <WebsiteAudit client:load />. All the logic (the state machine, the fetch, the progress simulation, the email gate, the HubSpot submit) lives in the island and its sibling modules under src/components/audit/.

The island is a four-state machine: idle → auditing → (results | error). Submitting the URL form normalizes the input (normalizeUrl, which strips to the origin homepage and rejects IPs/localhost/scheme-less garbage) and moves to auditing; the single POST /api/audit round-trip resolves to either results (with the teaser) or error. Validation failures and errors return to idle with inputs preserved.
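The four states can be sketched as a discriminated union with a pure transition function. This is an illustration only: AuditState, AuditEvent, transition, and the validation message are hypothetical names, not the island's real identifiers, and the sketch assumes that a validation failure stays in idle while a retry from error goes back to idle, both keeping the visitor's input.

```typescript
// Hypothetical sketch of the idle → auditing → (results | error) machine.
type Teaser = { score: number }; // stand-in for the real teaser shape

type AuditState =
  | { kind: 'idle'; input: string; message?: string }
  | { kind: 'auditing'; input: string }
  | { kind: 'results'; input: string; teaser: Teaser }
  | { kind: 'error'; input: string; message: string };

type AuditEvent =
  | { type: 'submit'; input: string; valid: boolean } // valid = normalizeUrl accepted it
  | { type: 'resolved'; teaser: Teaser }              // POST /api/audit succeeded
  | { type: 'failed'; message: string }               // POST /api/audit failed
  | { type: 'retry' };

function transition(state: AuditState, event: AuditEvent): AuditState {
  switch (event.type) {
    case 'submit':
      if (state.kind !== 'idle') return state;
      return event.valid
        ? { kind: 'auditing', input: event.input }
        : { kind: 'idle', input: event.input, message: 'Enter a valid website URL.' };
    case 'resolved':
      return state.kind === 'auditing'
        ? { kind: 'results', input: state.input, teaser: event.teaser }
        : state;
    case 'failed':
      return state.kind === 'auditing'
        ? { kind: 'error', input: state.input, message: event.message }
        : state;
    case 'retry':
      // Back to idle with the typed URL preserved.
      return state.kind === 'error' ? { kind: 'idle', input: state.input } : state;
  }
}
```

Events that do not apply to the current state are ignored, so a late network response cannot move the island out of a state it has already left.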

Why the audit is one server round-trip

The entire audit is one fetch to /api/audit. PSI dominates the wall-clock time (a desktop Lighthouse run is typically 20–40 s), and the Worker waits on it plus the HTML/probe fetches before responding. The browser never sees the individual sources — it sends a URL and gets back a teaser.

This is deliberately the inverse of the original client-side design, and it's what lets the gate exist: if the browser fetched PSI and parsed it, every signal would be in the page's memory and reverse-engineerable. Moving it server-side means the only bytes that reach the client are the teaser.

The three decoupled layers

The audit's core is three layers, each independent of the others. You author checks; score and selection are mechanical consequences.

Layer 1 — Checks (raw signals)

A Check is a pure function over SiteData — it fetches nothing and must not throw. SiteData is fetched once and holds every raw signal: the PSI response, the parsed HTML, the response headers, the http→https probe result, and robots/sitemap reachability. Each check derives a status (pass | warn | fail | unknown) from data that's already there.

export interface Check {
  id: string;
  group: CheckGroup; // speed | tracking | security | visibility | stack
  /** 1-10. Drives BOTH score weight and teaser ranking — one knob, no separate severity. */
  impact: number;
  /** Line items this expands into in the email report (drives the locked count + rollup). */
  subIssues: number;
  reportOnly?: boolean; // runs + feeds the report, but never occupies a teaser slot
  /** Pure; derives status from already-fetched SiteData. Must not fetch or throw. */
  run(site: SiteData): CheckResult;
  /** 'pass' copy is optional — used only in the email report, never on the page. */
  copy: Partial<Record<Status, CheckCopy>>;
}

There are 26 checks today, registered in registry.ts by spreading five group files (speed 7, tracking 4, security 3, visibility 9, stack 3). The registry is the single source of truth for which checks exist; order is irrelevant because score and selection both sort by impact. Adding a signal means adding it to its group file — nothing else changes.
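As an illustration of what "adding it to its group file" might look like, here is a minimal sketch of one check and a registry spread. The hsts check, its copy text, and the shapes of SiteData, CheckResult, and CheckCopy are assumptions made for this example; only the Check interface mirrors the one shown above.

```typescript
type Status = 'pass' | 'warn' | 'fail' | 'unknown';
type CheckGroup = 'speed' | 'tracking' | 'security' | 'visibility' | 'stack';

// Assumed minimal shapes, for illustration only.
interface CheckCopy { headline: string; line: string }
interface CheckResult { status: Status }
interface SiteData { headers: Record<string, string> | null } // null = HTML fetch failed

interface Check {
  id: string;
  group: CheckGroup;
  impact: number;
  subIssues: number;
  reportOnly?: boolean;
  run(site: SiteData): CheckResult;
  copy: Partial<Record<Status, CheckCopy>>;
}

// Hypothetical check: derives its status purely from already-fetched headers.
const hsts: Check = {
  id: 'hsts',
  group: 'security',
  impact: 6,
  subIssues: 1,
  run(site) {
    if (site.headers === null) return { status: 'unknown' }; // source missing: no verdict
    return { status: 'strict-transport-security' in site.headers ? 'pass' : 'fail' };
  },
  copy: {
    fail: { headline: 'Browsers are not told to stay on HTTPS', line: 'Example copy only.' },
  },
};

// A group file exports an array; the registry spreads the group arrays.
const securityChecks: Check[] = [hsts];
const registry: Check[] = [...securityChecks];
```

The check never fetches and never throws: a missing source becomes unknown, which keeps it out of both the score and the teaser.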

Layer 2 — Score (0–10)

The score is impact-weighted over only the checks that actually ran. pass = 1, warn = 0.5, fail = 0; unknown is excluded from both numerator and denominator.

const ran = results.filter((r) => r.result.status !== 'unknown');
// score = 10 · Σ(impact · points[status]) / Σ(impact)  over ran, rounded to 1 decimal

The score carries the severity signal on its own — a site full of failures lands near 2 without the page having to list twenty cards.
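The formula in the comment above can be written out as a small function. This is a sketch, not the contents of score.ts: the RanCheck shape (impact on r.check, status on r.result) and the null return for "nothing ran" are assumptions.

```typescript
type Status = 'pass' | 'warn' | 'fail' | 'unknown';

// Assumed shape: only the two fields the formula needs.
interface RanCheck {
  check: { impact: number };
  result: { status: Status };
}

const POINTS: Record<Exclude<Status, 'unknown'>, number> = { pass: 1, warn: 0.5, fail: 0 };

/** 10 · Σ(impact · points[status]) / Σ(impact) over checks that ran, to 1 decimal. */
function score(results: RanCheck[]): number | null {
  let weighted = 0;
  let total = 0;
  for (const r of results) {
    const status = r.result.status;
    if (status === 'unknown') continue; // excluded from numerator AND denominator
    weighted += r.check.impact * POINTS[status];
    total += r.check.impact;
  }
  if (total === 0) return null; // assumption: no check ran, so there is no score
  return Math.round((10 * weighted / total) * 10) / 10;
}

const example: RanCheck[] = [
  { check: { impact: 8 }, result: { status: 'pass' } },     // contributes 8
  { check: { impact: 6 }, result: { status: 'warn' } },     // contributes 3
  { check: { impact: 10 }, result: { status: 'fail' } },    // contributes 0
  { check: { impact: 9 }, result: { status: 'unknown' } },  // ignored entirely
];
const exampleScore = score(example); // 10 · 11 / 24 → 4.6
```

Because unknown checks drop out of the denominator as well, a failed data source shrinks the sample rather than dragging the score down.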

Layer 3 — Selection (the on-page teaser)

Selection decides which findings surface. Fails and warns are sorted fail-before-warn, then highest-impact-first, and the top 3 become the teaser — always exactly 3 on the page. reportOnly checks (like platform) never occupy a slot. Scale is communicated by the score and a rollup count (+N more in the full report), never by adding cards: a clean site shows 8.9 · 2 minor things, a disaster shows 2.1 · 3 worst findings · +20 more — same layout, neither underselling nor manufacturing problems.

Pass results never appear on the page. The page is asymmetric on purpose: it is a pain surface. Passes are reassurance and live only in the email report.
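The ordering rule can be sketched as a filter, a two-key sort, and a slice. The names here are hypothetical, the RanCheck shape is assumed, and the rollup is assumed to count the remaining findings; the fallback that select.ts applies when fewer than three findings exist is not modeled.

```typescript
type Status = 'pass' | 'warn' | 'fail' | 'unknown';

// Assumed shape: only the fields selection needs.
interface RanCheck {
  check: { id: string; impact: number; reportOnly?: boolean };
  result: { status: Status };
}

function selectTeaser(results: RanCheck[], slots = 3): { top: RanCheck[]; more: number } {
  // Only fails and warns surface; passes and unknowns never reach the page.
  const surfaced = results.filter(
    (r) => (r.result.status === 'fail' || r.result.status === 'warn') && !r.check.reportOnly,
  );
  const rank = (s: Status): number => (s === 'fail' ? 0 : 1);
  // Fail before warn, then highest impact first.
  const ordered = surfaced.slice().sort(
    (a, b) => rank(a.result.status) - rank(b.result.status) || b.check.impact - a.check.impact,
  );
  return { top: ordered.slice(0, slots), more: Math.max(0, ordered.length - slots) };
}

const picked = selectTeaser([
  { check: { id: 'a', impact: 9 }, result: { status: 'warn' } },
  { check: { id: 'b', impact: 3 }, result: { status: 'fail' } },
  { check: { id: 'c', impact: 7 }, result: { status: 'fail' } },
  { check: { id: 'd', impact: 10 }, result: { status: 'pass' } },
  { check: { id: 'e', impact: 10, reportOnly: true }, result: { status: 'fail' } },
  { check: { id: 'f', impact: 5 }, result: { status: 'warn' } },
]);
// picked.top ids are c, b, a and picked.more is 1
```

Note that a low-impact fail (b) outranks a high-impact warn (a): status is the primary key and impact only breaks ties within a status.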

The gate lives in the type, not in a filter

TeaserFinding — the only shape that crosses the wire — deliberately carries no id, status, value, fix, or pass copy. It's {group, headline, line, subIssues}. Because the gate is structural (the wire type simply has no field for the gated detail), there's no if (production) branch to forget. The full RanCheck[] stays inside the Worker. /api/audit returns json(teaser, 200) — the comment in the route literally reads "ONLY the teaser shape leaves the Worker." The detail behind the lock is Phase 2 (the email report); it isn't built into the teaser, so it can't leak.

This decoupling is the payoff of authoring per-check: adding a 27th signal is a single new Check in a group file. The score reweights itself, selection re-ranks, and the teaser keeps showing three. No site-level wiring.
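A structural gate of this kind can be sketched as an explicit field-by-field mapping from the full result to the wire type. The TeaserFinding fields come from the description above; the RanCheck and CheckCopy shapes, the fix field, and the function name toTeaserFinding are assumptions for illustration.

```typescript
type Status = 'pass' | 'warn' | 'fail' | 'unknown';
type CheckGroup = 'speed' | 'tracking' | 'security' | 'visibility' | 'stack';

// Assumed full shapes that stay inside the Worker.
interface CheckCopy { headline: string; line: string; fix: string }
interface RanCheck {
  check: {
    id: string;
    group: CheckGroup;
    subIssues: number;
    copy: Partial<Record<Status, CheckCopy>>;
  };
  result: { status: Status; value?: string };
}

// The only shape that crosses the wire: no id, status, value, or fix.
interface TeaserFinding {
  group: CheckGroup;
  headline: string;
  line: string;
  subIssues: number;
}

function toTeaserFinding(r: RanCheck): TeaserFinding | null {
  const copy = r.check.copy[r.result.status];
  if (!copy) return null; // no copy authored for this status
  // Fields are picked one by one; nothing is spread, so nothing extra rides along.
  return {
    group: r.check.group,
    headline: copy.headline,
    line: copy.line,
    subIssues: r.check.subIssues,
  };
}
```

Picking fields explicitly matters at runtime as well as at compile time: spreading the source object would satisfy the type checker while still serializing the extra properties.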

SiteData — fetched once, server-side

fetchSiteData(url, psiKey) runs everything in parallel and never throws — every source degrades to null/false on failure, and its dependent checks return unknown (dropped from the score, never surfaced):

const [psi, html, redirect, robots, sitemap] = await Promise.all([
  fetchPsi(url, psiKey),       // PSI v5, desktop, 4 categories, 45s timeout
  fetchHtml(url),              // raw HTML + lowercased headers, 8s timeout
  probeHttpToHttps(url),       // http→https redirect probe, 4s timeout
  reachable(robotsUrl),        // /robots.txt
  reachable(sitemapUrl),       // /sitemap.xml
]);

A few decisions are load-bearing:

HTML is parsed with portable regex/string scans, not HTMLRewriter. The exact same parser runs in the Worker, in miniflare dev, and in plain Node for fixture generation, so there is no runtime-specific dependency and no dev-parity risk. parseHtml does shallow, tolerant presence/counting extraction; it never builds a DOM.
A real Chrome User-Agent is sent. Many sites serve a bot-challenge page (or strip markup) to non-browser agents, which would corrupt every HTML check — the audit wants the same HTML a visitor's browser receives.
requestHosts is the verifiable tracking source. It's the distinct hostnames of every request the rendered page made, pulled from PSI's network-requests audit. This catches tags injected by GTM or a CMS that never appear in the static HTML — a static-only fetch can't see those, so the rendered request list is authoritative. null means PSI didn't run.
Probe semantics: the difference between null and false matters. For the reachability and redirect probes, null means the probe itself broke; false means a definite negative (a 404, or the site served over plain http). Checks read this distinction so a broken probe degrades to unknown rather than over-failing a working site.
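Two of the decisions above lend themselves to a short sketch: the tri-state probe reading and the derivation of requestHosts. Both functions are hypothetical illustrations, not the code in siteData.ts. The sketch assumes the PSI network-requests items have already been extracted into an array of objects with a url field, and that data: URLs and unparseable entries should simply be skipped.

```typescript
type Status = 'pass' | 'warn' | 'fail' | 'unknown';

/** null = the probe itself broke; false = definite negative; true = definite positive. */
function statusFromProbe(probe: boolean | null): Status {
  if (probe === null) return 'unknown'; // broken probe: no verdict, dropped from the score
  return probe ? 'pass' : 'fail';
}

// Assumed item shape: one entry per request the rendered page made.
interface NetworkRequestItem { url: string }

/** Distinct hostnames of the page's requests, or null when PSI did not run. */
function requestHosts(items: NetworkRequestItem[] | null): string[] | null {
  if (items === null) return null;
  const hosts = new Set<string>();
  for (const item of items) {
    try {
      const host = new URL(item.url).hostname;
      if (host !== '') hosts.add(host); // data: URLs parse but have an empty hostname
    } catch {
      // Unparseable entry: skip it rather than fail the whole audit.
    }
  }
  return Array.from(hosts).sort();
}

const hostsExample = requestHosts([
  { url: 'https://example.com/' },
  { url: 'https://example.com/app.js' },
  { url: 'https://www.googletagmanager.com/gtm.js?id=GTM-XXXX' },
  { url: 'data:image/png;base64,AAAA' },
  { url: 'not a url' },
]);
// hostsExample is ['example.com', 'www.googletagmanager.com']
```

A tracking check would then read the host list for a vendor's domain and, when the list is null, return unknown instead of claiming the tag is missing.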

If neither PSI nor HTML loaded, the route returns 502 unreachable — an unreachable site is the tool's failure, not an audit finding. Never report "we couldn't reach your site" as a low score; that reads as the tool being broken.

The Worker route guards

POST /api/audit (src/pages/api/audit.ts) sets export const prerender = false, so it stays an on-demand Worker route even in the static public build. It applies three guards before it spends a PSI quota unit:

1. Same-origin Origin check. The widget always posts same-origin, so a missing or foreign Origin header is a script, not a visitor. It's spoofable by a determined attacker but stops the naive curl loop that burns the quota. (Skipped in dev.)

2. Per-IP rate limit. A sliding window keyed on cf-connecting-ip: 5 audits per 10 minutes. Each audit triggers a quota'd PSI run that typically takes 20–40 s, so without a throttle a trivial script exhausts the quota in minutes and the lead magnet starts erroring for real prospects.

const RATE_WINDOW_MS = 10 * 60_000;
const RATE_MAX_PER_WINDOW = 5;

The in-code rate limit is per-isolate, not a guarantee

The hits Map lives in Worker isolate memory, so the limit is a bar-raiser, not a hard backstop — Cloudflare can spin up multiple isolates, each with its own counter. The real backstop is a Cloudflare WAF rate-limiting rule on POST /api/audit, created at launch (see _LAUNCH-RUNBOOK.md). Don't treat the in-code throttle as the only line of defense.

3. The PSI key is a runtime secret. The route reads env.PSI_API_KEY from cloudflare:workers (sourced from .dev.vars locally and an encrypted CF Secret in production), falling back to import.meta.env.PSI_API_KEY. A missing key returns 500 config. See Environment variables for the full secret inventory.

The PSI key restriction must be "None"

Because the key is now used server-side from the Worker (no Referer header is sent), its Google Cloud restriction must be set to None — a referrer restriction left over from the old client-side design would make every PSI call fail. Never put the key in .env; it's a .dev.vars / CF Secret value. Document the var name (PSI_API_KEY), never the value.
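The first two guards can be sketched as pure functions. This is a minimal illustration under stated assumptions: the constants and the hits Map come from the text above, but the function names, the choice to store an array of timestamps per IP, and the injected now parameter are hypothetical.

```typescript
const RATE_WINDOW_MS = 10 * 60_000;
const RATE_MAX_PER_WINDOW = 5;

// Lives in isolate memory, so each isolate counts separately (a bar-raiser, not a backstop).
const hits = new Map<string, number[]>();

/** Sliding window: allow at most RATE_MAX_PER_WINDOW hits per IP in the trailing window. */
function allowAudit(ip: string, now: number): boolean {
  const recent = (hits.get(ip) ?? []).filter((t) => now - t < RATE_WINDOW_MS);
  const allowed = recent.length < RATE_MAX_PER_WINDOW;
  if (allowed) recent.push(now);
  hits.set(ip, recent); // also prunes expired timestamps
  return allowed;
}

/** A missing or foreign Origin header is treated as a script, not a visitor. */
function isSameOrigin(originHeader: string | null, requestUrl: string): boolean {
  if (!originHeader) return false;
  try {
    return new URL(originHeader).origin === new URL(requestUrl).origin;
  } catch {
    return false; // malformed header
  }
}
```

In a route handler these would run before any PSI call, for example with the IP read from the cf-connecting-ip header and now taken from Date.now(); rejected requests never touch the quota.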

The browser side: honest progress and the email gate

PSI gives no progress signal, so the progress bar is an honest, monotonic simulation while the single round-trip is in flight (PSI dominates it). It runs in three bands — ramp to 25% over ~1 s, ease toward a 74% cap while waiting (it holds at the cap rather than sitting on a frozen integer or hitting 100% prematurely), then a fast 74→100% finish once the teaser resolves. If PSI resolves faster than the simulation, it jumps ahead. prefers-reduced-motion swaps the animation for a simple poll loop and stage-label updates. The bar never claims a time estimate and never reaches 100% before the data is back.
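The three bands can be sketched as two pure functions of elapsed time. Only the 25% ramp, the roughly 1 s ramp duration, and the 74% cap come from the description above; the exponential easing curve, the time constant, the finish duration, and the function names are assumptions chosen for illustration.

```typescript
const RAMP_MS = 1_000;      // band 1: ramp duration (about 1 s)
const RAMP_TO = 25;         // band 1: ramp target, in percent
const CAP = 74;             // band 2: never exceeded while waiting
const EASE_TAU_MS = 12_000; // hypothetical easing time constant
const FINISH_MS = 400;      // hypothetical duration of the final sprint

/** Progress while the round-trip is in flight: monotonic and never above CAP. */
function simulatedProgress(elapsedMs: number): number {
  if (elapsedMs <= 0) return 0;
  if (elapsedMs < RAMP_MS) return (elapsedMs / RAMP_MS) * RAMP_TO;
  const waited = elapsedMs - RAMP_MS;
  // Exponential approach: fast at first, then slowing as it nears the cap.
  return RAMP_TO + (CAP - RAMP_TO) * (1 - Math.exp(-waited / EASE_TAU_MS));
}

/** Progress after the teaser resolves: from wherever the bar is up to 100. */
function finishProgress(from: number, sinceResolvedMs: number): number {
  const t = Math.min(1, Math.max(0, sinceResolvedMs / FINISH_MS));
  return from + (100 - from) * t;
}
```

Because the finish band starts from the bar's current value rather than from the cap, a response that arrives early still animates forward without any backward jump.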

When the teaser arrives, the results screen renders the verdict (the colored score over 10), the three finding cards (each with its locked subIssues count), the optional +N more rollup, and the single CTA: the email gate. There is no second path — everything points to the form.

The email gate is the conversion point. On submit it runs a disposable-email check, then submitToHubspot. Two browser-side notes:

Both audit forms set suppressHydrationWarning

HubSpot's collected-forms script stamps data-hs-cf-bound on every <form> before React hydrates. Without suppressHydrationWarning on both the URL form and the email form, that benign attribute mismatch logs a hydration error on every page load. It's a footgun specific to HubSpot tracking being present site-wide.

The disposable-email blocklist is a 2.4 MB JSON loaded as a lazy async chunk (disposable.ts), never a static import — a static import would inline the whole list into the client:load island and ship it to every visitor on first paint. It's warmed with preloadDisposableList() the moment the email gate appears (the visitor is reading their teaser, so the list is usually resident before they finish typing), and the check fails open: a chunk-load failure must never block a real lead. This blocklist intentionally overrides the original spec's "no email checking" non-goal — it blocks throwaway providers only, never free providers like gmail/yahoo/icloud.
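The load-once, fail-open behavior can be sketched with an injected loader, which keeps the example runnable without a bundler. The factory name createDisposableChecker and the Loader type are hypothetical; in the real module the loader would presumably be a dynamic import of the JSON chunk, which is not shown here.

```typescript
// Assumption: the loader resolves to a set of lowercased disposable domains.
type Loader = () => Promise<ReadonlySet<string>>;

function createDisposableChecker(load: Loader) {
  let pending: Promise<ReadonlySet<string> | null> | null = null;

  /** Start the load once; a failure resolves to null instead of rejecting. */
  function preload(): Promise<ReadonlySet<string> | null> {
    if (pending === null) pending = load().catch(() => null);
    return pending;
  }

  async function isDisposable(email: string): Promise<boolean> {
    const list = await preload();
    if (list === null) return false; // fail open: a load failure never blocks a lead
    const domain = email.slice(email.lastIndexOf('@') + 1).trim().toLowerCase();
    return list.has(domain);
  }

  return { preload, isDisposable };
}

// Usage: warm the list when the gate appears, check on submit.
const checker = createDisposableChecker(() =>
  Promise.resolve(new Set(['mailinator.com'])),
);
void checker.preload();
```

In this sketch a failed load is cached, so the check stays open for the rest of the session rather than retrying the chunk on every keystroke or submit.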

The HubSpot submit fires async on the results screen and submits only teaser-level data (email, audited_url, audit_score, audit_total_issues, audit_finding_1..3); a HubSpot workflow emails the report. The submit mechanics — Forms v3, the NA2-region 404 self-heal, the retry policy, and the required custom properties — are documented in HubSpot & lead capture.
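As a rough sketch of a teaser-only payload, the function below builds the name/value field list that a Forms v3 submission body carries under its fields key. The property names are the ones listed above; the AuditTeaser shape, the function name buildHubspotFields, and the choice to send each finding's headline (with an empty string for a missing slot) are assumptions.

```typescript
// Assumed shapes, for illustration only.
interface TeaserFinding { group: string; headline: string; line: string; subIssues: number }
interface AuditTeaser { score: number; totalIssues: number; findings: TeaserFinding[] }

interface HubspotField { name: string; value: string }

function buildHubspotFields(email: string, auditedUrl: string, teaser: AuditTeaser): HubspotField[] {
  const fields: HubspotField[] = [
    { name: 'email', value: email },
    { name: 'audited_url', value: auditedUrl },
    { name: 'audit_score', value: String(teaser.score) },
    { name: 'audit_total_issues', value: String(teaser.totalIssues) },
  ];
  // audit_finding_1..3: only what the page already showed, never gated detail.
  for (let i = 0; i < 3; i++) {
    fields.push({ name: `audit_finding_${i + 1}`, value: teaser.findings[i]?.headline ?? '' });
  }
  return fields;
}
```

Since the function only ever sees the teaser, the browser has nothing gated to send: the payload is limited by what the Worker returned in the first place.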

Why this is the highest-intent lead path

A visitor who pastes their own URL into an audit tool is telling you they care about their site right now. That's a warmer signal than anyone reading a service page, and the funnel is built to capture it: the teaser proves the tool found real, plural, named problems (the locked counts), and the only way to learn what they are is to hand over an email. The curiosity gap — count visible, fixes gated — is the entire mechanism.

Operationally, that makes the tool's inbound links matter. Its only inbound traffic is whatever Studio content points at it (nav items, CTAs), so the project guide's rule is to keep at least one nav item on it. If nothing links to the highest-intent page, the funnel has no top.

Where this lives

Concern	File
Build spec (original, client-side — historical)	`prompts/website-audit-tool-spec.md`
Rebuild plan (server-side signal system)	`prompts/claude-code-prompt-audit-signals.md`
Funnel summary in the project guide	`website/CLAUDE.md` — "Free Website Audit funnel"
Astro host page (`client:load` island)	`website/src/pages/free-website-audit.astro`
React island: state machine, progress, gate	`website/src/components/audit/WebsiteAudit.tsx`
Worker route: guards, rate limit, teaser response	`website/src/pages/api/audit.ts`
Fetch-everything-once `SiteData` + HTML parser	`website/src/components/audit/siteData.ts`
Check registry (spreads 5 group files)	`website/src/components/audit/registry.ts`
Check core types (`Check`, `CheckResult`, `RanCheck`)	`website/src/components/audit/check.ts`
Pure orchestration (`runChecks`)	`website/src/components/audit/audit.ts`
Score formula	`website/src/components/audit/score.ts`
Teaser selection (top 3 + rollup + fallback)	`website/src/components/audit/select.ts`
Wire shape + `buildTeaser`	`website/src/components/audit/teaser.ts`
HubSpot submit (teaser-only payload)	`website/src/components/audit/hubspot.ts`
Disposable-email lazy chunk (fails open)	`website/src/components/audit/disposable.ts`

For the per-check internals, the score/teaser copy rules, and the fixtures, continue to Audit tool: checks, scoring & teaser. For the HubSpot property contract and lead capture, see HubSpot & lead capture.