Scanner API for Multi-Pattern TextMate Tokenization
Status
Accepted
Context
Ferroni's primary real-world workload is TextMate grammar tokenization, used by syntax highlighters like Shiki and VS Code. TextMate grammars require a multi-pattern scanning interface: given N compiled regexes and a string position, find which regex matches earliest (and among ties, which has priority).
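The selection rule described above can be illustrated with a minimal sketch. It is not Ferroni's implementation: literal substrings stand in for compiled regexes, the function name `find_earliest` is hypothetical, and it assumes that the lowest pattern index wins when two patterns match at the same position.

```rust
/// Returns (pattern index, match start, match end) of the earliest match at or
/// after `start`. Literal substrings stand in for compiled regexes (assumption
/// for illustration). On ties, the lowest pattern index wins (assumed policy).
fn find_earliest(patterns: &[&str], text: &str, start: usize) -> Option<(usize, usize, usize)> {
    // `get` returns None if `start` is out of range or not on a char boundary.
    let haystack = text.get(start..)?;
    let mut best: Option<(usize, usize, usize)> = None;
    for (index, pattern) in patterns.iter().enumerate() {
        if let Some(offset) = haystack.find(pattern) {
            let match_start = start + offset;
            // Strict `<` keeps the earlier (lower-index) pattern on ties.
            if best.map_or(true, |(_, best_start, _)| match_start < best_start) {
                best = Some((index, match_start, match_start + pattern.len()));
            }
        }
    }
    best
}
```

A tokenizer would call such a function repeatedly, advancing `start` past each match. A real scanner searches all patterns in one pass rather than looping over them as this sketch does.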
C Oniguruma provides OnigRegSet for multi-pattern matching, but it exposes a low-level C API. The JavaScript ecosystem uses Microsoft's vscode-oniguruma (MIT), which wraps Oniguruma in a higher-level OnigScanner interface tailored to TextMate tokenization. Shiki, VS Code, and other grammar-based tools depend on this specific interface.
To serve as a drop-in replacement for vscode-oniguruma in the Rust/NAPI ecosystem, Ferroni needs an API-compatible scanner layer — not just raw regex compilation and matching.
Decision
Add a Scanner API (src/scanner.rs) that wraps Ferroni's C-ported RegSet internals with a vscode-oniguruma-compatible interface.
API surface
- Scanner::new(patterns): Compiles a set of regex patterns into a RegSet with pre-computed first-byte dispatch tables and SIMD skip needles.
- Scanner::find_next_match(text, start_position): Searches all patterns simultaneously and returns a ScannerMatch containing the winning pattern index and all capture group positions. Supports ScannerFindOptions for anchor control (NOT_BEGIN_STRING, NOT_END_STRING, NOT_BEGIN_POSITION), matching vscode-oniguruma's FindOption flags.
- ScannerMatch / CaptureIndex: Result types carrying the matched pattern index and byte-offset capture spans. CaptureIndex fields (start, end, length) mirror vscode-oniguruma's output format.
- ScannerSyntax: Enum selecting the regex syntax variant (Oniguruma, Asis, POSIX Basic/Extended, Java, Perl, Ruby, Python, etc.), matching vscode-oniguruma's Syntax type.
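As a rough picture of the result types listed above, the following sketch shows one plausible shape. It makes several assumptions: the field name `index` and the helpers `CaptureIndex::new` and `matched_text` are hypothetical, and a plain `Vec` stands in for whatever container the real `capture_indices` uses, so the example depends only on the standard library. The actual definitions in src/scanner.rs may differ.

```rust
/// Byte-offset span of one capture group. The three fields follow the
/// (start, end, length) layout named in this ADR.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct CaptureIndex {
    pub start: usize,
    pub end: usize,
    pub length: usize,
}

impl CaptureIndex {
    /// Hypothetical constructor that keeps `length` consistent with the span.
    pub fn new(start: usize, end: usize) -> Self {
        CaptureIndex { start, end, length: end.saturating_sub(start) }
    }
}

/// Result of one scan: which pattern won, plus its capture spans.
/// `index` is an assumed field name; `Vec` is a stand-in container.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ScannerMatch {
    pub index: usize,
    pub capture_indices: Vec<CaptureIndex>,
}

/// Hypothetical helper: slice the whole-match text (capture group 0).
pub fn matched_text<'t>(m: &ScannerMatch, text: &'t str) -> Option<&'t str> {
    let whole = m.capture_indices.first()?;
    text.get(whole.start..whole.end)
}
```

Storing `length` alongside `start` and `end` is redundant, but it lets consumers that expect the vscode-oniguruma output shape read the value directly.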
Layering
Scanner (public API, vscode-oniguruma-compatible)
  └── RegSet (C-ported internal, ADR-001)
        └── onig_search / onig_match (C-ported VM, ADR-001)
The Scanner does not modify the C-ported internals. It is an additive layer that:
- Delegates pattern compilation to onig_new() (same as Regex)
- Delegates multi-pattern search to onig_regset_search_fast() (C-ported RegSet)
- Adds result formatting, option translation, and scratch space management
Relationship to ADR-005
ADR-005 defines the idiomatic Rust API layer (Regex, RegexBuilder, Match, Captures). The Scanner extends this layer with a domain-specific API for TextMate tokenization. It is not a general-purpose regex API but a purpose-built interface for the grammar tokenization use case.
Rationale
- Drop-in compatibility: Shiki and other tools can switch from vscode-oniguruma (WASM) to Ferroni (native) with minimal API translation.
- Performance: Native execution avoids WASM overhead. Combined with RegSet optimizations (first-byte dispatch, skip needle — see ADR-007), the Scanner achieves 4-6x match speedup and 215x compilation speedup compared to the WASM path.
- Zero internal disruption: The Scanner is purely additive — no C-ported code is modified. The full 1,695-test C-parity suite passes unchanged.
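To give a sense of why first-byte dispatch helps, here is a simplified sketch of the general technique. It is not the RegSet optimization itself (ADR-007 covers that): literal patterns stand in for regexes, and the function names are hypothetical. The idea is to precompute, for each possible byte, which patterns could begin with it, so that most patterns are skipped at each text position.

```rust
/// Builds a 256-entry table where `table[b]` lists the indices of patterns
/// whose first byte is `b`. Literal patterns stand in for regexes (assumption).
fn build_first_byte_table(patterns: &[&str]) -> Vec<Vec<usize>> {
    let mut table: Vec<Vec<usize>> = vec![Vec::new(); 256];
    for (index, pattern) in patterns.iter().enumerate() {
        match pattern.as_bytes().first() {
            Some(&byte) => table[byte as usize].push(index),
            // An empty pattern can match at any position, so it must stay a
            // candidate for every byte.
            None => {
                for bucket in table.iter_mut() {
                    bucket.push(index);
                }
            }
        }
    }
    table
}

/// Returns the pattern indices worth trying at byte offset `pos`.
/// `table` must come from `build_first_byte_table` (256 entries).
fn candidates_at<'a>(table: &'a [Vec<usize>], text: &str, pos: usize) -> &'a [usize] {
    match text.as_bytes().get(pos) {
        Some(&byte) => table[byte as usize].as_slice(),
        None => &[],
    }
}
```

With a real regex engine the first-byte set of each pattern has to be derived from the compiled program, and patterns whose first byte cannot be determined must remain candidates everywhere, as the empty pattern does here.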
Consequences
- src/scanner.rs is a substantial module (~66 KB) with its own test suite.
- The smallvec crate is added as a dependency for ScannerMatch::capture_indices.
- The Scanner API is Ferroni-specific: it does not exist in C Oniguruma. This is acceptable because it layers on top of (rather than modifying) the C-ported internals, consistent with ADR-001's parity goal.
- Future Scanner changes should maintain vscode-oniguruma interface compatibility to avoid breaking downstream integrations (Shiki, ferriki).