# Benchmark Strategy

## Status

Accepted

## Context
Ferroni's performance claims need reproducible evidence. The question is what to benchmark, how to measure, and how to prevent regressions.
## Decision
A two-suite benchmark architecture:
### Reference suite: `battle_bench`
Real-world scenarios that produce the numbers we publish. Every comparison is Ferroni vs Oniguruma compiled at `-O3`.
| Category | What is measured |
|---|---|
| Syntax highlighting | Full, unmodified Shiki grammars -- TypeScript (279 patterns), CSS (117 patterns), Rust (81 patterns). Compile time, first-match latency, full-line tokenization. |
| Text search | Literal search, no-match rejection, field extraction, timestamp matching on 10-50 KB log inputs. |
| Pattern matching | One representative pattern per regex feature (quantifiers, lookaround, Unicode, backreferences, alternation, named captures). |
| Compilation | Simple to complex patterns, measuring compile latency. |
Key rule: benchmark against complete, unmodified production grammars -- no cherry-picked subsets. The Shiki grammars are committed as-is in `benches/grammars/`.
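To make the setup concrete, here is a minimal sketch of what one `battle_bench` entry could look like. It assumes a `Regex::new`/`find`-style Ferroni API and a hypothetical `onig_shim` wrapper over the `cc`-built Oniguruma; those names and the input path are illustrative, not the actual crate surface.

```rust
// Sketch of one battle_bench entry (text search / field extraction).
// `ferroni::Regex` and `onig_shim` are illustrative names; the input
// path stands in for an artifact pinned in benches/battle_inputs.toml.
use criterion::{criterion_group, criterion_main, Criterion};
use std::hint::black_box;

fn bench_field_extraction(c: &mut Criterion) {
    let haystack = std::fs::read_to_string("benches/inputs/app.log").unwrap();
    let pattern = r"ERROR \[(\w+)\]";

    let mut group = c.benchmark_group("text_search/field_extraction");

    // Ferroni path (API assumed: Regex::new + find).
    let re = ferroni::Regex::new(pattern).unwrap();
    group.bench_function("ferroni", |b| b.iter(|| re.find(black_box(&haystack))));

    // Oniguruma path, only compiled when the optional `ffi` feature is on.
    #[cfg(feature = "ffi")]
    {
        let onig = onig_shim::Regex::new(pattern).unwrap();
        group.bench_function("oniguruma", |b| b.iter(|| onig.find(black_box(&haystack))));
    }

    group.finish();
}

criterion_group!(benches, bench_field_extraction);
criterion_main!(benches);
```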
### Internal suite: `codspeed_bench`
Ferroni-only micro-benchmarks tracked by CodSpeed in CI. These catch performance regressions before they reach `main`. They are intentionally internal-facing: useful for optimizing the parser, executor, scanner, `RegSet`, and public API paths, but not meant for README marketing tables.
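A `codspeed_bench` entry is ordinary Criterion code with a different import. A minimal sketch, assuming a hypothetical `ferroni::parse` entry point for the parser stage:

```rust
// Sketch of an internal micro-benchmark. The only change from plain
// Criterion is the harness import; `ferroni::parse` is an assumed
// entry point used here for illustration.
use codspeed_criterion_compat::{criterion_group, criterion_main, Criterion};
use std::hint::black_box;

fn bench_parser(c: &mut Criterion) {
    // Isolates one internal stage (the parser) rather than end-to-end matching.
    c.bench_function("parser/nested_quantifiers", |b| {
        b.iter(|| ferroni::parse(black_box(r"((a|b)+c){2,5}")))
    });
}

criterion_group!(benches, bench_parser);
criterion_main!(benches);
```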
### Tooling
- Criterion.rs for local measurement and HTML reports (`target/criterion/report/index.html`).
- `codspeed-criterion-compat` for CI integration -- same benchmark code, instrumented for CodSpeed's wall-time tracking.
- C comparison via the optional `ffi` feature. The `cc` crate builds Oniguruma from a pinned local source snapshot, prepared on demand for head-to-head measurement.
- Pinned battle inputs in `benches/battle_inputs.toml`, which records the exact external artifacts behind the publishable suite.
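One way to keep a single benchmark source working both locally and in CI is to select the harness crate at compile time. This is a sketch, not necessarily how the suites are wired: the `codspeed` feature name and `ferroni::Regex::new` are assumptions, and the project may instead use two separate bench targets.

```rust
// Sketch: pick the harness per build so the same code runs under plain
// Criterion locally and under CodSpeed in CI. The `codspeed` feature
// name is illustrative.
#[cfg(not(feature = "codspeed"))]
use criterion::{criterion_group, criterion_main, Criterion};
#[cfg(feature = "codspeed")]
use codspeed_criterion_compat::{criterion_group, criterion_main, Criterion};
use std::hint::black_box;

fn bench_compile(c: &mut Criterion) {
    // Compile latency only: pattern construction, no matching.
    c.bench_function("compile/date_pattern", |b| {
        b.iter(|| ferroni::Regex::new(black_box(r"\d{4}-\d{2}-\d{2}")))
    });
}

criterion_group!(benches, bench_compile);
criterion_main!(benches);
```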
### Build profile
Both the `release` and `bench` profiles use `lto = "thin"` to allow cross-crate inlining (especially for `memchr`) without the compile-time cost of full LTO. This matches realistic deployment conditions.
## Rationale
- Real grammars prevent overfitting. Benchmarking against subsets risks optimizing for patterns that don't matter.
- C comparison keeps claims honest. Every speedup number is relative to the same engine at `-O3`, not a strawman.
- Two suites separate concerns. `battle_bench` is small, stable, and publishable; `codspeed_bench` is free to optimize for regression coverage and engineering feedback.
- Compilation is part of the workload. Syntax highlighters compile grammars at startup; ignoring compile time gives an incomplete picture.
- README numbers should stay human-scale. The README intentionally rounds values; exact raw numbers live in `/perf/benchmark-results`.
## Consequences
- Shiki grammar JSON files are committed to the repository (`benches/grammars/`) and updated when Shiki releases new grammar versions.
- The `ffi` feature adds a C build step: `./scripts/prepare-oniguruma-sources.sh && cargo bench --features ffi --bench battle_bench` requires a C compiler, while `cargo bench` (without `ffi`) runs the Rust-only internal suite.
- Exact external input revisions for `battle_bench` live in `benches/battle_inputs.toml`; each published measurement run should also record machine and toolchain details in `/perf/benchmark-results`.
- Reference benchmark results are documented in `/perf/benchmark-results.md` and summarized in the `README`.
- New optimizations should be validated against the internal suite first; user-facing claims should cite the reference suite.