# Benchmark Strategy

## Status

Accepted

## Context
Ferroni's performance claims need reproducible evidence. The question is what to benchmark, how to measure, and how to prevent regressions.
## Decision
A two-suite benchmark architecture:
### Reference suite: `battle_bench`
Real-world scenarios that produce the numbers we publish. Every comparison is Ferroni vs Oniguruma compiled at `-O3`.
| Category | What is measured |
|---|---|
| Syntax highlighting | Full, unmodified Shiki grammars -- TypeScript (279 patterns), CSS (117 patterns), Rust (81 patterns). Compile time, first-match latency, full-line tokenization. |
| Text search | Literal search, no-match rejection, field extraction, timestamp matching on 10-50 KB log inputs. |
| Pattern matching | One representative pattern per regex feature (quantifiers, lookaround, Unicode, backreferences, alternation, named captures). |
| Compilation | Simple to complex patterns, measuring compile latency. |
Key rule: benchmark against complete, unmodified production grammars -- no cherry-picked subsets. The Shiki grammars are committed as-is in `benches/grammars/`.
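To make the setup concrete, here is a minimal sketch of what one `battle_bench` entry could look like. It assumes a `Regex::new`/`find`-style Ferroni API and a hypothetical `onig_shim` wrapper over the `cc`-built Oniguruma; those names and the input path are illustrative, not the actual crate surface.

```rust
// Sketch of one battle_bench entry (text search / field extraction).
// `ferroni::Regex` and `onig_shim` are illustrative names; the input
// path stands in for an artifact pinned in benches/battle_inputs.toml.
use criterion::{criterion_group, criterion_main, Criterion};
use std::hint::black_box;

fn bench_field_extraction(c: &mut Criterion) {
    let haystack = std::fs::read_to_string("benches/inputs/app.log").unwrap();
    let pattern = r"ERROR \[(\w+)\]";

    let mut group = c.benchmark_group("text_search/field_extraction");

    // Ferroni path (API assumed: Regex::new + find).
    let re = ferroni::Regex::new(pattern).unwrap();
    group.bench_function("ferroni", |b| b.iter(|| re.find(black_box(&haystack))));

    // Oniguruma path, only compiled when the optional `ffi` feature is on.
    #[cfg(feature = "ffi")]
    {
        let onig = onig_shim::Regex::new(pattern).unwrap();
        group.bench_function("oniguruma", |b| b.iter(|| onig.find(black_box(&haystack))));
    }

    group.finish();
}

criterion_group!(benches, bench_field_extraction);
criterion_main!(benches);
```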
### Internal suite: `codspeed_bench`
Ferroni-only micro-benchmarks tracked by CodSpeed in CI. These catch performance regressions before they reach `main`. They are intentionally internal-facing: useful for optimizing the parser, executor, scanner, `RegSet`, and public API paths, but not meant for README marketing tables.
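A `codspeed_bench` entry is ordinary Criterion code with a different import. A minimal sketch, assuming a hypothetical `ferroni::parse` entry point for the parser stage:

```rust
// Sketch of an internal micro-benchmark. The only change from plain
// Criterion is the harness import; `ferroni::parse` is an assumed
// entry point used here for illustration.
use codspeed_criterion_compat::{criterion_group, criterion_main, Criterion};
use std::hint::black_box;

fn bench_parser(c: &mut Criterion) {
    // Isolates one internal stage (the parser) rather than end-to-end matching.
    c.bench_function("parser/nested_quantifiers", |b| {
        b.iter(|| ferroni::parse(black_box(r"((a|b)+c){2,5}")))
    });
}

criterion_group!(benches, bench_parser);
criterion_main!(benches);
```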
### Tooling
- Criterion.rs for local measurement and HTML reports (`target/criterion/report/index.html`).
- `codspeed-criterion-compat` for CI integration -- same benchmark code, instrumented for CodSpeed's wall-time tracking.
- C comparison via the optional `ffi` feature. The `cc` crate builds Oniguruma from a pinned local source snapshot, prepared on demand for head-to-head measurement.
- Pinned battle inputs in `benches/battle_inputs.toml`, which records the exact external artifacts behind the publishable suite.
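One way to keep a single benchmark source working both locally and in CI is to select the harness crate at compile time. This is a sketch, not necessarily how the suites are wired: the `codspeed` feature name and `ferroni::Regex::new` are assumptions, and the project may instead use two separate bench targets.

```rust
// Sketch: pick the harness per build so the same code runs under plain
// Criterion locally and under CodSpeed in CI. The `codspeed` feature
// name is illustrative.
#[cfg(not(feature = "codspeed"))]
use criterion::{criterion_group, criterion_main, Criterion};
#[cfg(feature = "codspeed")]
use codspeed_criterion_compat::{criterion_group, criterion_main, Criterion};
use std::hint::black_box;

fn bench_compile(c: &mut Criterion) {
    // Compile latency only: pattern construction, no matching.
    c.bench_function("compile/date_pattern", |b| {
        b.iter(|| ferroni::Regex::new(black_box(r"\d{4}-\d{2}-\d{2}")))
    });
}

criterion_group!(benches, bench_compile);
criterion_main!(benches);
```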
### Build profile
Both the `release` and `bench` profiles use `lto = "thin"` to allow cross-crate inlining (especially for `memchr`) without the compile-time cost of full LTO. This matches realistic deployment conditions.
## Rationale
- Real grammars prevent overfitting. Benchmarking against subsets risks optimizing for patterns that don't matter.
- C comparison keeps claims honest. Every speedup number is relative to the same engine at `-O3`, not a strawman.
- Two suites separate concerns. `battle_bench` is small, stable, and publishable; `codspeed_bench` is free to optimize for regression coverage and engineering feedback.
- Compilation is part of the workload. Syntax highlighters compile grammars at startup; ignoring compile time gives an incomplete picture.
- README numbers should stay human-scale. The README intentionally rounds values; exact raw numbers live in `/perf/benchmark-results`.
## Consequences
- Shiki grammar JSON files are committed to the repository (`benches/grammars/`) and updated when Shiki releases new grammar versions.
- The `ffi` feature adds a C build step: `./scripts/prepare-oniguruma-sources.sh && cargo bench --features ffi --bench battle_bench` requires a C compiler, while `cargo bench` (without `ffi`) runs the Rust-only internal suite.
- Exact external input revisions for `battle_bench` live in `benches/battle_inputs.toml`; each published measurement run should also record machine and toolchain details in `/perf/benchmark-results`.
- Reference benchmark results are documented in `/perf/benchmark-results.md` and summarized in the `README`.
- New optimizations should be validated against the internal suite first; user-facing claims should cite the reference suite.