WebGPU Bench

Methodology

How Benchmarks Work

  1. build.sh compiles llama.cpp to WebAssembly with WebGPU support via Emscripten + emdawnwebgpu, producing two WASM variants: JSPI (Chrome) and Asyncify (Firefox, Safari).
  2. runner.js launches Playwright browsers and navigates to harness.html.
  3. harness.js detects JSPI support and loads the correct WASM variant.
  4. The GGUF model is downloaded from HuggingFace directly in the browser.
  5. Inference runs via WebGPU (or CPU fallback) using llama.cpp's C API with greedy sampling for deterministic output.
  6. Performance metrics are collected via llama_perf_context() and returned to Playwright.
  7. A fresh browser instance is launched for each variant so that WASM memory does not accumulate across runs (avoiding out-of-memory failures).
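Step 3 hinges on a runtime feature check: browsers with JS Promise Integration expose `WebAssembly.Suspending`. A minimal sketch of such a check (the function name `detectBuildVariant` is illustrative, not necessarily what harness.js actually calls it):

```javascript
// Pick the WASM build variant based on JSPI availability.
// Browsers with JS Promise Integration (JSPI) expose WebAssembly.Suspending;
// others fall back to the Asyncify build.
// NOTE: detectBuildVariant is an illustrative name, not the real harness.js API.
function detectBuildVariant() {
  const hasJspi =
    typeof WebAssembly !== "undefined" &&
    typeof WebAssembly.Suspending === "function";
  return hasJspi ? "jspi" : "asyncify";
}
```

In a browser, the result selects which of the two compiled WASM bundles to fetch and instantiate.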

Dashboard Columns

Machine: Machine slug identifying the hardware (e.g. apple-m3-16gb-darwin)
Model: Model name (e.g. Llama-3.2-1B-Instruct)
Quant: Quantization variant (e.g. Q4_K_M, Q8_0)
Size (MB): Model file size in megabytes
Browser: Browser used for the benchmark (chromium, firefox, webkit)
Status: PASS if inference completed successfully, FAIL otherwise
Build: jspi or asyncify — which WASM variant was used. Chrome supports JSPI; Firefox and Safari use Asyncify.
WebGPU: Whether WebGPU was available in the browser. If not, inference falls back to CPU.
Decode tok/s: Token generation speed (tokens/sec) — the main performance metric
Prefill tok/s: Prompt processing speed (tokens/sec)
n_eval: Number of tokens generated during decode
t_eval (ms): Total decode time in milliseconds
n_p_eval: Number of prompt tokens processed during prefill
t_p_eval (ms): Total prefill time in milliseconds
Wall (s): Total wall-clock time for the benchmark run in seconds (includes model download, load, and inference)
CPU Match: Consistency with the CPU baseline — the percentage of token positions where WebGPU and CPU agree on the top-1 token. Only present when benchmarks are run with --consistency. See Consistency Measurement below.
Error: Error message and category (OOM, WASM Abort, Timeout, etc.) when status is FAIL
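The two throughput columns are simple derivations from the raw counters: Decode tok/s is n_eval over t_eval, and Prefill tok/s is n_p_eval over t_p_eval, with milliseconds converted to seconds. A minimal sketch (the function name is illustrative):

```javascript
// Convert a token count and a duration in milliseconds into tokens/sec.
// Used for both Decode tok/s (n_eval, t_eval) and Prefill tok/s
// (n_p_eval, t_p_eval). Guards against a zero duration.
function tokensPerSecond(nTokens, tMillis) {
  return tMillis > 0 ? (nTokens / tMillis) * 1000 : 0;
}

// e.g. 128 generated tokens in 4000 ms of decode time -> 32 tok/s
```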

Error Categories

OOM (matches "out of memory", "memory allocation"): model too large for the available WASM memory
WASM Abort (matches "wasm", "abort", "unreachable"): WASM execution error, often from unsupported operations
Timeout (matches "timeout", "timed out"): benchmark exceeded its time limit (model download or inference)
Download Failed (matches "download", "fetch", "404", "network"): model file not found or network error
Other (everything else): uncategorized errors
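Categorization like this is typically a first-match-wins scan over the pattern table; the sketch below assumes that ordering (the patterns come from the table above, but the matching order and function name are assumptions):

```javascript
// Map a raw error message to one of the dashboard's error categories.
// First matching category wins; anything unmatched falls through to "Other".
// NOTE: ordering and the name `categorize` are illustrative assumptions.
const ERROR_CATEGORIES = [
  ["OOM", [/out of memory/i, /memory allocation/i]],
  ["WASM Abort", [/wasm/i, /abort/i, /unreachable/i]],
  ["Timeout", [/timeout/i, /timed out/i]],
  ["Download Failed", [/download/i, /fetch/i, /404/, /network/i]],
];

function categorize(message) {
  for (const [name, patterns] of ERROR_CATEGORIES) {
    if (patterns.some((re) => re.test(message))) return name;
  }
  return "Other";
}
```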

Consistency Measurement

The --consistency flag measures how faithfully the WebGPU backend reproduces the CPU computation for each quantization type.

How it works

For each variant, two runs are performed:

  1. CPU baseline (n_gpu_layers=0): greedy-decodes 128 tokens and records the token ID sequence. Cached to results/cpu_baselines.json. When testing multiple browsers, the baseline is collected once on the first browser and shared across all browsers (CPU output is identical regardless of JSPI vs Asyncify). When testing a single browser, the baseline runs in that same browser.
  2. WebGPU run (n_gpu_layers=999): performs a forced-decoding pass — feeds the CPU's token sequence one token at a time and checks whether the WebGPU backend independently predicts the same top-1 token at each position.
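The forced-decoding pass in step 2 can be sketched as follows; `decodeOne` stands in for a binding that feeds a single token to the WebGPU context and returns the argmax of the next-token logits (it is a placeholder, not a real llama.cpp API name):

```javascript
// Feed the CPU baseline tokens one at a time and record, per position,
// whether the WebGPU backend's top-1 prediction matches the next
// baseline token. `decodeOne` is a hypothetical single-token decode call.
function forcedDecode(decodeOne, cpuTokens) {
  const agree = [];
  for (let i = 0; i < cpuTokens.length - 1; i++) {
    const predicted = decodeOne(cpuTokens[i]); // feed baseline token i
    agree.push(predicted === cpuTokens[i + 1]); // top-1 match at position i+1?
  }
  return agree;
}
```

Because the input at every step is the CPU's token (not WebGPU's own prediction), the KV cache context stays identical to the baseline run throughout.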

Why forced decoding

Naively comparing generated text suffers from cascading divergence: a single token difference changes the KV cache context for all subsequent tokens. Forced decoding evaluates each position independently, giving a clean per-token accuracy signal.
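The CPU Match percentage then falls out directly from the per-position comparisons. A sketch, assuming the score is a plain unweighted percentage (the real aggregation may differ in rounding):

```javascript
// Percentage of positions where WebGPU's top-1 token agrees with the
// CPU baseline, given parallel arrays of token IDs.
function cpuMatchPercent(cpuTokens, gpuTopTokens) {
  if (cpuTokens.length === 0) return 0;
  let agree = 0;
  for (let i = 0; i < cpuTokens.length; i++) {
    if (cpuTokens[i] === gpuTopTokens[i]) agree++;
  }
  return (100 * agree) / cpuTokens.length;
}
```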

Interpreting CPU Match

100.0%: Numerically identical to CPU — no precision issues
95–99%: A few tokens differ due to near-equal logits — expected for lower-precision quants
< 90%: Systematic precision issues — the GPU kernel may need investigation
0.0%: First token wrong — the quantization kernel is likely broken
(blank): No consistency data — benchmarks were run without --consistency