SIMD-Accelerated Search via memchr
Status
Accepted
Context
ADR-001 establishes 1:1 structural parity with C Oniguruma. However, the C original's forward search pipeline uses hand-written byte-by-byte loops for literal prefix scanning (`forward_search_range` in `regexec.c`). Modern CPUs provide SIMD instructions (SSE2/AVX2 on x86-64, NEON on aarch64) that can scan memory significantly faster, but C Oniguruma does not use them.
The question is whether to deviate from the C original's search implementation for a performance improvement.
Decision
Replace the hand-written byte-scan loops in the search pipeline with the `memchr` crate, which provides SIMD-vectorized implementations of single-byte, two-byte, and three-byte search.
Specifically:
- 1-byte literal prefix: `memchr::memchr` (replaces byte-by-byte scan)
- 2-byte set: `memchr::memchr2` (replaces manual two-candidate loop)
- 3-byte set: `memchr::memchr3` (replaces manual three-candidate loop)
- Boyer-Moore-Horspool (BMH): kept as-is from C (already efficient for multi-byte patterns)
- Map search (>3 distinct bytes): kept as-is from C (256-entry lookup table)
The structural shape of the search pipeline is unchanged -- the same functions exist, the same decision logic selects the strategy, and the same fallbacks apply. Only the inner byte-scanning loops are replaced.
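The shape of this replacement can be sketched as follows. The scalar helpers below are stand-ins with the same signatures as `memchr::memchr`/`memchr2`/`memchr3` (the crate's versions are SIMD-vectorized); `FirstBytes` and `find_start` are illustrative names, not the port's actual API:

```rust
// Scalar stand-ins mirroring the memchr crate's signatures; in the real
// search pipeline these bodies are the SIMD-vectorized crate functions.
fn memchr(a: u8, h: &[u8]) -> Option<usize> {
    h.iter().position(|&x| x == a)
}
fn memchr2(a: u8, b: u8, h: &[u8]) -> Option<usize> {
    h.iter().position(|&x| x == a || x == b)
}
fn memchr3(a: u8, b: u8, c: u8, h: &[u8]) -> Option<usize> {
    h.iter().position(|&x| x == a || x == b || x == c)
}

/// Possible first bytes of a match, as selected by the (unchanged)
/// strategy logic. Illustrative type, not the port's API.
enum FirstBytes {
    One(u8),
    Two(u8, u8),
    Three(u8, u8, u8),
    /// >3 distinct bytes: 256-entry lookup table, kept as in C.
    Map([bool; 256]),
}

/// Find the next position where a match could start.
fn find_start(fb: &FirstBytes, haystack: &[u8]) -> Option<usize> {
    match fb {
        FirstBytes::One(a) => memchr(*a, haystack),
        FirstBytes::Two(a, b) => memchr2(*a, *b, haystack),
        FirstBytes::Three(a, b, c) => memchr3(*a, *b, *c, haystack),
        FirstBytes::Map(map) => haystack.iter().position(|&x| map[x as usize]),
    }
}
```

Only the three helper bodies differ between the scalar and SIMD versions; the dispatch in `find_start` is the same either way, which is why swapping in the crate changes no behavior.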
Rationale
- Measurable impact: The largest gains appear in full-text no-match scenarios where the engine must scan the entire haystack:
  - 10 KB no-match: 381 ns vs 1.9 µs (5x faster)
  - 50 KB no-match: 1.5 µs vs 9.3 µs (6x faster)
- Minimal code change: The `memchr` crate is a single dependency with zero transitive dependencies and zero `unsafe` in its public API.
- No behavioral change: The SIMD paths find the same bytes at the same positions. All existing tests pass unchanged.
- This is the kind of deviation ADR-001 allows: The control flow and API are identical; only the inner scan loop implementation differs, similar to how a C compiler might auto-vectorize a loop.
Consequences
- The `memchr` crate is a build dependency.
- Benchmarks for literal and RegSet search are 20-60% faster than C. No-match full-text scans are 5-6x faster.
- The Map search path (>3 distinct first bytes) still uses a 256-entry byte map, same as C. SIMD dispatch only covers 1-3 byte sets.
- Thin LTO (`lto = "thin"` in the release profile) is enabled to allow cross-crate inlining of `memchr` calls without the compile-time cost of full LTO.
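In `Cargo.toml` terms, the build setup described above amounts to something like the following sketch (the `memchr` version requirement shown is illustrative, not taken from the project):

```toml
[dependencies]
memchr = "2"          # illustrative version requirement

[profile.release]
lto = "thin"          # cross-crate inlining of memchr calls without full-LTO compile cost
```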
Extension: RegSet First-Byte Dispatch and Skip Needle
The original memchr integration accelerates single-pattern forward search. For multi-pattern workloads (Scanner/RegSet — see ADR-006), two additional SIMD-assisted optimizations were added:
First-byte dispatch table (`first_byte_candidates`)
A `Box<[Vec<u16>; 256]>` lookup table built at RegSet construction time. For each byte value 0-255, it stores the indices of regexes whose first-byte pre-filter does not exclude that byte. At search time, instead of trying all N regexes at each position, only the candidates matching the current input byte are attempted.
Sources for first-byte information (in priority order):
- Exact prefix: Only the first byte of the regex's exact literal prefix
- First-byte map: The regex's 256-entry first-byte prefilter bitmap (covers CClass, multi-byte ranges, etc.)
- Fallback: Regex is added to all 256 slots (no filtering possible)
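Construction from those three sources might look like the following sketch; `RegexInfo`, its fields, and `build_dispatch` are hypothetical names standing in for whatever per-regex prefilter data the real implementation exposes:

```rust
/// Hypothetical per-regex prefilter info: either an exact literal prefix
/// (only its first byte matters here) or a 256-entry first-byte bitmap.
struct RegexInfo {
    exact_first_byte: Option<u8>,
    first_byte_map: Option<[bool; 256]>,
}

/// Build the first-byte dispatch table, in the priority order described
/// above: exact prefix, then first-byte map, then the all-slots fallback.
fn build_dispatch(regexes: &[RegexInfo]) -> Box<[Vec<u16>; 256]> {
    let mut table: Box<[Vec<u16>; 256]> =
        Box::new(std::array::from_fn(|_| Vec::new()));
    for (idx, re) in regexes.iter().enumerate() {
        let idx = idx as u16;
        if let Some(b) = re.exact_first_byte {
            // Exact prefix: this regex can only start at its first byte.
            table[b as usize].push(idx);
        } else if let Some(map) = &re.first_byte_map {
            // First-byte prefilter bitmap: one slot per allowed byte.
            for b in 0..256 {
                if map[b] {
                    table[b].push(idx);
                }
            }
        } else {
            // Fallback: no filtering possible; candidate at every byte.
            for slot in table.iter_mut() {
                slot.push(idx);
            }
        }
    }
    table
}
```

At search time, `table[input[pos] as usize]` yields exactly the regex indices worth attempting at `pos`, which is where the reduction from N attempts per position comes from.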
Skip needle (`SkipNeedle`)
Derived from the dispatch table: it identifies byte values where no regex has candidates. These "impossible" bytes are collected into a SIMD skip needle (1-3 bytes via `memchr`/`memchr2`/`memchr3`). The search loop uses this needle to jump past positions where no regex can possibly match, avoiding per-position candidate lookups entirely.
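One plausible reading of this derivation, sketched with hypothetical names (`derive_skip_needle`, `next_candidate` are not the port's API): the needle holds the byte values that still have candidates, so a memchr-style scan for them jumps directly over runs of bytes where no regex can match. A scalar loop stands in for the 1-3 byte `memchr`/`memchr2`/`memchr3` dispatch:

```rust
/// Derive a skip needle from the dispatch table: collect the "live" bytes
/// (those with at least one candidate regex). Only worthwhile when the
/// live set fits a 1-3 byte SIMD needle (memchr/memchr2/memchr3).
fn derive_skip_needle(table: &[Vec<u16>; 256]) -> Option<Vec<u8>> {
    let live: Vec<u8> = (0usize..256)
        .filter(|&b| !table[b].is_empty())
        .map(|b| b as u8)
        .collect();
    if (1..=3).contains(&live.len()) { Some(live) } else { None }
}

/// Scalar stand-in for the SIMD scan: next position holding any needle
/// byte, i.e. the next position where some regex could possibly match.
fn next_candidate(haystack: &[u8], needle: &[u8]) -> Option<usize> {
    haystack.iter().position(|b| needle.contains(b))
}
```

When `derive_skip_needle` returns `None`, the search loop presumably falls back to per-position candidate lookups, since no small needle exists.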
Impact
Combined, these optimizations reduce RegSet position-lead search overhead by ~44% on TextMate grammar workloads. The memory cost is one 256-entry dispatch table per RegSet (~2-4 KB depending on regex count).
Additional consequences
- RegSet dispatch tables add ~2-4 KB memory per RegSet instance. This is negligible compared to the compiled regex bytecode.