WebGPU Bench

Methodology

How Benchmarks Work

  1. build.sh compiles llama.cpp to WebAssembly with WebGPU support via Emscripten + emdawnwebgpu, producing two WASM variants: JSPI (Chrome) and Asyncify (Firefox, Safari).
  2. runner.js launches Playwright browsers and navigates to harness.html.
  3. harness.js detects JSPI support and loads the correct WASM variant.
  4. The GGUF model is downloaded from HuggingFace directly in the browser.
  5. Inference runs via WebGPU (or CPU fallback) using llama.cpp's C API with greedy sampling for deterministic output.
  6. Performance metrics are collected via llama_perf_context() and returned to Playwright.
  7. A fresh browser instance is launched for each variant so that WASM memory does not accumulate across runs (avoiding out-of-memory failures).
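Step 3 hinges on a runtime feature check: browsers with JS Promise Integration expose `WebAssembly.Suspending`. A minimal sketch of such a check (the function name `detectBuildVariant` is illustrative, not necessarily what harness.js actually calls it):

```javascript
// Pick the WASM build variant based on JSPI availability.
// Browsers with JS Promise Integration (JSPI) expose WebAssembly.Suspending;
// others fall back to the Asyncify build.
// NOTE: detectBuildVariant is an illustrative name, not the real harness.js API.
function detectBuildVariant() {
  const hasJspi =
    typeof WebAssembly !== "undefined" &&
    typeof WebAssembly.Suspending === "function";
  return hasJspi ? "jspi" : "asyncify";
}
```

In a browser, the result selects which of the two compiled WASM bundles to fetch and instantiate.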

Dashboard Columns

Machine: Machine slug identifying the hardware (e.g. apple-m3-16gb-darwin)
Model: Model name (e.g. Llama-3.2-1B-Instruct)
Quant: Quantization variant (e.g. Q4_K_M, Q8_0)
Size (MB): Model file size in megabytes
Browser: Browser used for the benchmark (chromium, firefox, webkit)
Status: PASS if inference completed successfully, FAIL otherwise
Build: jspi or asyncify — which WASM variant was used. Chrome supports JSPI; Firefox and Safari use Asyncify.
WebGPU: Whether WebGPU was available in the browser. If not, inference falls back to CPU.
Decode tok/s: Token generation speed (tokens/sec) — the main performance metric
Prefill tok/s: Prompt processing speed (tokens/sec)
n_eval: Number of tokens generated during decode
t_eval (ms): Total decode time in milliseconds
n_p_eval: Number of prompt tokens processed during prefill
t_p_eval (ms): Total prefill time in milliseconds
Wall (s): Total wall-clock time for the benchmark run in seconds (includes model download, load, and inference)
CPU Match: Consistency with the CPU baseline — the percentage of token positions where WebGPU and CPU agree on the top-1 token. Only present when benchmarks are run with --consistency. See Consistency Measurement below.
Error: Error message and category (OOM, WASM Abort, Timeout, etc.) when status is FAIL
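The two throughput columns are simple derivations from the raw counters: Decode tok/s is n_eval over t_eval, and Prefill tok/s is n_p_eval over t_p_eval, with milliseconds converted to seconds. A minimal sketch (the function name is illustrative):

```javascript
// Convert a token count and a duration in milliseconds into tokens/sec.
// Used for both Decode tok/s (n_eval, t_eval) and Prefill tok/s
// (n_p_eval, t_p_eval). Guards against a zero duration.
function tokensPerSecond(nTokens, tMillis) {
  return tMillis > 0 ? (nTokens / tMillis) * 1000 : 0;
}

// e.g. 128 generated tokens in 4000 ms of decode time -> 32 tok/s
```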

Error Categories

OOM (matches "out of memory", "memory allocation"): model too large for the available WASM memory
WASM Abort (matches "wasm", "abort", "unreachable"): WASM execution error, often from unsupported operations
Timeout (matches "timeout", "timed out"): benchmark exceeded its time limit (model download or inference)
Download Failed (matches "download", "fetch", "404", "network"): model file not found or network error
Other (everything else): uncategorized errors
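Categorization like this is typically a first-match-wins scan over the pattern table; the sketch below assumes that ordering (the patterns come from the table above, but the matching order and function name are assumptions):

```javascript
// Map a raw error message to one of the dashboard's error categories.
// First matching category wins; anything unmatched falls through to "Other".
// NOTE: ordering and the name `categorize` are illustrative assumptions.
const ERROR_CATEGORIES = [
  ["OOM", [/out of memory/i, /memory allocation/i]],
  ["WASM Abort", [/wasm/i, /abort/i, /unreachable/i]],
  ["Timeout", [/timeout/i, /timed out/i]],
  ["Download Failed", [/download/i, /fetch/i, /404/, /network/i]],
];

function categorize(message) {
  for (const [name, patterns] of ERROR_CATEGORIES) {
    if (patterns.some((re) => re.test(message))) return name;
  }
  return "Other";
}
```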

Consistency Measurement

The --consistency flag measures how faithfully the WebGPU backend reproduces the CPU computation for each quantization type.

How it works

For each variant, two runs are performed:

  1. CPU baseline (n_gpu_layers=0): greedy-decodes 128 tokens and records the token ID sequence. Cached to results/cpu_baselines.json. When testing multiple browsers, the baseline is collected once on the first browser and shared across all browsers (CPU output is identical regardless of JSPI vs Asyncify). When testing a single browser, the baseline runs in that same browser.
  2. WebGPU run (n_gpu_layers=999): performs a forced-decoding pass — feeds the CPU's token sequence one token at a time and checks whether the WebGPU backend independently predicts the same top-1 token at each position.
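The forced-decoding pass in step 2 can be sketched as follows; `decodeOne` stands in for a binding that feeds a single token to the WebGPU context and returns the argmax of the next-token logits (it is a placeholder, not a real llama.cpp API name):

```javascript
// Feed the CPU baseline tokens one at a time and record, per position,
// whether the WebGPU backend's top-1 prediction matches the next
// baseline token. `decodeOne` is a hypothetical single-token decode call.
function forcedDecode(decodeOne, cpuTokens) {
  const agree = [];
  for (let i = 0; i < cpuTokens.length - 1; i++) {
    const predicted = decodeOne(cpuTokens[i]); // feed baseline token i
    agree.push(predicted === cpuTokens[i + 1]); // top-1 match at position i+1?
  }
  return agree;
}
```

Because the input at every step is the CPU's token (not WebGPU's own prediction), the KV cache context stays identical to the baseline run throughout.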

Why forced decoding

Naively comparing generated text suffers from cascading divergence: a single token difference changes the KV cache context for all subsequent tokens. Forced decoding evaluates each position independently, giving a clean per-token accuracy signal.
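The CPU Match percentage then falls out directly from the per-position comparisons. A sketch, assuming the score is a plain unweighted percentage (the real aggregation may differ in rounding):

```javascript
// Percentage of positions where WebGPU's top-1 token agrees with the
// CPU baseline, given parallel arrays of token IDs.
function cpuMatchPercent(cpuTokens, gpuTopTokens) {
  if (cpuTokens.length === 0) return 0;
  let agree = 0;
  for (let i = 0; i < cpuTokens.length; i++) {
    if (cpuTokens[i] === gpuTopTokens[i]) agree++;
  }
  return (100 * agree) / cpuTokens.length;
}
```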

Interpreting CPU Match

100.0%: Numerically identical to CPU — no precision issues
95–99%: A few tokens differ due to near-equal logits — expected for lower-precision quants
< 90%: Systematic precision issues — the GPU kernel may need investigation
0.0%: First token wrong — the quantization kernel is likely broken
(blank): No consistency data — benchmarks were run without --consistency