# Methodology

## How Benchmarks Work
- `build.sh` compiles llama.cpp to WebAssembly with WebGPU support via Emscripten + emdawnwebgpu, producing two WASM variants: JSPI (Chrome) and Asyncify (Firefox, Safari).
- `runner.js` launches Playwright browsers and navigates to `harness.html`.
- `harness.js` detects JSPI support and loads the correct WASM variant (see the sketch after this list).
- The GGUF model is downloaded from HuggingFace directly in the browser.
- Inference runs via WebGPU (or CPU fallback) using llama.cpp's C API with greedy sampling for deterministic output.
- Performance metrics are collected via `llama_perf_context()` and returned to Playwright.
- A fresh browser instance is launched for each variant to prevent WASM memory accumulation (OOM fix).
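JSPI support can be feature-detected from the standard JSPI API objects. A minimal sketch of what that detection step might look like; the variant file names below are an assumption, not the project's actual naming scheme:

```js
// Sketch of the variant-selection step in harness.js. The helper name and
// the output file names are illustrative assumptions.
function detectJSPI() {
  // JSPI exposes WebAssembly.Suspending and WebAssembly.promising;
  // their presence is a reasonable feature-detection signal.
  return typeof WebAssembly.Suspending === "function" &&
         typeof WebAssembly.promising === "function";
}

const variant = detectJSPI() ? "jspi" : "asyncify";
// Assumed naming scheme for the two Emscripten builds:
const script = document.createElement("script");
script.src = `llama-${variant}.js`;
document.head.appendChild(script);
```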
## Dashboard Columns
| Column | Description |
|---|---|
| Machine | Machine slug identifying the hardware (e.g. apple-m3-16gb-darwin) |
| Model | Model name (e.g. Llama-3.2-1B-Instruct) |
| Quant | Quantization variant (e.g. Q4_K_M, Q8_0) |
| Size (MB) | Model file size in megabytes |
| Browser | Browser used for the benchmark (chromium, firefox, webkit) |
| Status | PASS if inference completed successfully, FAIL otherwise |
| Build | jspi or asyncify — which WASM variant was used. Chrome supports JSPI; Firefox and Safari use Asyncify. |
| WebGPU | Whether WebGPU was available in the browser. If not, inference falls back to CPU. |
| Decode tok/s | Token generation speed (tokens/sec) — main performance metric |
| Prefill tok/s | Prompt processing speed (tokens/sec) |
| n_eval | Number of tokens generated during decode |
| t_eval (ms) | Total decode time in milliseconds |
| n_p_eval | Number of prompt tokens processed during prefill |
| t_p_eval (ms) | Total prefill time in milliseconds |
| Wall (s) | Total wall-clock time for the benchmark run in seconds (includes model download, load, and inference) |
| CPU Match | Consistency with CPU baseline — percentage of token positions where WebGPU and CPU agree on the top-1 token. Only present when benchmarks are run with --consistency. See Consistency Measurement below. |
| Error | Error message and category (OOM, WASM Abort, Timeout, etc.) when status is FAIL |
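The two throughput columns are derived from the raw counters above by simple division. A minimal sketch of the arithmetic, assuming the harness returns the perf fields under the names shown:

```js
// How the throughput columns relate to the raw llama_perf_context() counters.
// Field names on `perf` are assumptions about the harness's return shape.
function derivedMetrics(perf) {
  return {
    // Decode tok/s: tokens generated per second of decode time
    decodeTokPerSec: perf.n_eval / (perf.t_eval_ms / 1000),
    // Prefill tok/s: prompt tokens processed per second of prefill time
    prefillTokPerSec: perf.n_p_eval / (perf.t_p_eval_ms / 1000),
  };
}

// Example: 128 tokens decoded in 6400 ms -> 20.0 decode tok/s
console.log(derivedMetrics({ n_eval: 128, t_eval_ms: 6400, n_p_eval: 32, t_p_eval_ms: 200 }));
```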
## Error Categories
| Category | Pattern | Typical Cause |
|---|---|---|
| OOM | out of memory, memory allocation | Model too large for available WASM memory |
| WASM Abort | wasm, abort, unreachable | WASM execution error, often from unsupported operations |
| Timeout | timeout, timed out | Benchmark exceeded time limit (model download or inference) |
| Download Failed | download, fetch, 404, network | Model file not found or network error |
| Other | everything else | Uncategorized errors |
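Categorization amounts to case-insensitive substring matching against the error message. A sketch of that logic, with the pattern lists taken from the table above (the function name and match ordering are illustrative):

```js
// Map an error message to a category by case-insensitive substring match.
// First matching category wins; patterns mirror the table above.
const CATEGORIES = [
  { name: "OOM",             patterns: ["out of memory", "memory allocation"] },
  { name: "WASM Abort",      patterns: ["wasm", "abort", "unreachable"] },
  { name: "Timeout",         patterns: ["timeout", "timed out"] },
  { name: "Download Failed", patterns: ["download", "fetch", "404", "network"] },
];

function categorizeError(message) {
  const msg = message.toLowerCase();
  for (const { name, patterns } of CATEGORIES) {
    if (patterns.some((p) => msg.includes(p))) return name;
  }
  return "Other"; // everything else is uncategorized
}
```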
## Consistency Measurement
The --consistency flag measures how faithfully the WebGPU backend reproduces the CPU computation for each quantization type.
### How it works
For each variant, two runs are performed:
- CPU baseline (`n_gpu_layers=0`): greedy-decodes 128 tokens and records the token ID sequence. Cached to `results/cpu_baselines.json` (a cache sketch follows this list). When testing multiple browsers, the baseline is collected once on the first browser and shared across all browsers (CPU output is identical regardless of JSPI vs Asyncify). When testing a single browser, the baseline runs in that same browser.
- WebGPU run (`n_gpu_layers=999`): performs a forced-decoding pass — feeds the CPU's token sequence one token at a time and checks whether the WebGPU backend independently predicts the same top-1 token at each position.
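A minimal sketch of how the baseline cache could work on the runner side; only the `results/cpu_baselines.json` path comes from above, and the key shape is an assumption:

```js
import fs from "node:fs";

// Load or record the CPU baseline token sequence for one model/quant variant.
// The "model:quant" key shape is assumed; only the cache path is from above.
const CACHE_PATH = "results/cpu_baselines.json";

function loadBaseline(model, quant) {
  if (!fs.existsSync(CACHE_PATH)) return null;
  const cache = JSON.parse(fs.readFileSync(CACHE_PATH, "utf8"));
  return cache[`${model}:${quant}`] ?? null; // array of token IDs, or null
}

function saveBaseline(model, quant, tokenIds) {
  const cache = fs.existsSync(CACHE_PATH)
    ? JSON.parse(fs.readFileSync(CACHE_PATH, "utf8"))
    : {};
  cache[`${model}:${quant}`] = tokenIds;
  fs.writeFileSync(CACHE_PATH, JSON.stringify(cache, null, 2));
}
```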
### Why forced decoding
Naively comparing generated text suffers from cascading divergence: a single token difference changes the KV cache context for all subsequent tokens. Forced decoding evaluates each position independently, giving a clean per-token accuracy signal.
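In code terms, the per-position check might look like the following sketch, where `evalToken` and `argmaxLogit` are hypothetical wrappers over the harness's llama.cpp bindings:

```js
// Forced decoding: feed the CPU's tokens into the WebGPU context and count
// positions where the GPU's own top-1 prediction agrees with the next CPU
// token. evalToken/argmaxLogit stand in for the actual C API bindings.
async function cpuMatchPercent(ctx, cpuTokens) {
  let matches = 0;
  const positions = cpuTokens.length - 1; // each position predicts the next token
  for (let i = 0; i < positions; i++) {
    await evalToken(ctx, cpuTokens[i]); // advance the KV cache with the CPU token
    const predicted = argmaxLogit(ctx); // GPU's independent top-1 choice (greedy)
    if (predicted === cpuTokens[i + 1]) matches++;
  }
  return (100 * matches) / positions;
}
```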
### Interpreting CPU Match
| CPU Match | Interpretation |
|---|---|
| 100.0% | Numerically identical to CPU — no precision issues |
| 95–99% | A few tokens differ due to near-equal logits — expected for lower-precision quants |
| < 90% | Systematic precision issues — GPU kernel may need investigation |
| 0.0% | First token wrong — quantization kernel likely broken |
| — | No consistency data — benchmarks were run without --consistency |