HTML Entity Trie Optimization — Status & Next Steps
Context
HTML is the only language where Rust loses to JS in codeToHtml benchmarks.
The HTML TextMate grammar contains an entity pattern with ~1,700 alternations in ~11,600 chars — a trie encoded as nested regex:
```
a(s(ymp(eq)?|cr|t)|n(d(slope|[dv]|and)?|g(...)))
```
Ferroni's existing trie optimizer (detect_literal_alternations) only handled flat alternations (word1|word2|word3). It couldn't detect nested trie structures because check_literal_branch() only recognized plain String nodes.
What was implemented (2026-03-03)
All changes in ferroni/src/regcomp.rs:
1. extract_literal_paths() — recursive nested extraction
Walks a nested AST branch and extracts all possible literal byte sequences:
- String: append bytes to each prefix
- List (sequence): thread prefixes through car → cdr
- Alt: fork into each branch, collect all resulting paths
- Bag (Memory if not backreferenced, Option): recurse into body
- `Quant(0,1)` (optional `?`): fork into "with" and "without" paths
- Anything else: return `None` (not a literal)
Bounded by MAX_NESTED_TRIE_PATHS = 8192 to prevent exponential blowup.
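The recursion above can be sketched as a small stand-alone program. This is a toy AST for illustration only, not ferroni's actual node types:

```rust
// Toy AST mirroring the cases extract_literal_paths handles.
enum Node {
    Str(Vec<u8>),   // literal bytes (String)
    Seq(Vec<Node>), // concatenation (List)
    Alt(Vec<Node>), // alternation
    Opt(Box<Node>), // Quant(0,1), i.e. `?`
}

const MAX_PATHS: usize = 8192;

// Returns all literal byte sequences the node can match, or None if
// the node is not purely literal or the path count exceeds the cap.
fn extract_paths(node: &Node, prefixes: Vec<Vec<u8>>) -> Option<Vec<Vec<u8>>> {
    match node {
        Node::Str(bytes) => Some(
            prefixes
                .into_iter()
                .map(|mut p| { p.extend_from_slice(bytes); p })
                .collect(),
        ),
        Node::Seq(parts) => {
            // Thread the prefixes through each element in order.
            let mut cur = prefixes;
            for part in parts {
                cur = extract_paths(part, cur)?;
                if cur.len() > MAX_PATHS { return None; }
            }
            Some(cur)
        }
        Node::Alt(branches) => {
            // Fork: every branch extends every prefix independently.
            let mut all = Vec::new();
            for b in branches {
                all.extend(extract_paths(b, prefixes.clone())?);
                if all.len() > MAX_PATHS { return None; }
            }
            Some(all)
        }
        Node::Opt(body) => {
            // `x?` yields both the "without" and "with" paths.
            let mut all = prefixes.clone();
            all.extend(extract_paths(body, prefixes)?);
            if all.len() > MAX_PATHS { return None; }
            Some(all)
        }
    }
}

fn main() {
    // a(b|c)d? extracts to {"ab", "ac", "abd", "acd"}.
    let ast = Node::Seq(vec![
        Node::Str(b"a".to_vec()),
        Node::Alt(vec![Node::Str(b"b".to_vec()), Node::Str(b"c".to_vec())]),
        Node::Opt(Box::new(Node::Str(b"d".to_vec()))),
    ]);
    let paths = extract_paths(&ast, vec![Vec::new()]).unwrap();
    assert_eq!(paths.len(), 4);
    assert!(paths.contains(&b"abd".to_vec()));
    assert!(paths.contains(&b"ac".to_vec()));
}
```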
2. Top-down processing order
Changed detect_literal_alternations_inner from bottom-up to top-down:
- Before: recurse into children first, then try to optimize the current Alt. Inner Alts got trie-optimized first, making them opaque (`ND_ST_LITERAL_ALT`) to outer extraction.
- After: try `extract_literal_paths` on the current Alt first. If it succeeds, build one big trie covering many nested branches, then recurse into remaining non-literal branches. Falls back to bottom-up for Alts where top-down extraction doesn't meet the threshold.
Refactored into:
- `try_trie_optimize_alt()` — shared logic for branch classification + trie building
- `recurse_into_children()` — extracted recursion helper
- `classify_branch()` — tries the fast path (`check_literal_branch`), then `extract_literal_paths`
3. backrefed_mem propagation
Added backrefed_mem: MemStatusType parameter through the detection chain so extract_literal_paths can safely skip backreferenced capture groups.
4. Tests (6 new, all passing)
- `nested_alt_trie_simple` — `a(b|c|d|e|f)g`
- `nested_alt_trie_optional` — branch with `?` quantifier
- `nested_alt_trie_partial_with_cclass` — mixed literal + CClass
- `nested_alt_trie_entity_like` — large nested pattern (pure-literal subset)
- `nested_alt_trie_backreferenced_capture_skipped` — backref safety
- `nested_alt_trie_non_capturing_group` — `(?:...)` transparency
5. Benchmark update (ferriki)
Updated bench/engines/shiki-rust-napi.bench.ts:
- Replaced the tiny HTML fallback sample with a realistic Wikipedia-style page full of HTML entities (π, —, &, …, ∑, ∞, etc.)
- Added a `createHtmlSample()` generator for ~100KB of entity-heavy HTML
- Added a `large-html-entities-100kb` benchmark section
Results
Pure-literal nested patterns: works perfectly
Entity-like pattern without CClass nodes → 1 trie, 1 AltLiterals op, 0 Push ops (down from ~15 Push ops with the bottom-up approach).
Real HTML entity pattern: limited impact
The actual entity regex has character classes scattered within nested branches:
- `[dv]` (2 members) in `d(slope|[dv]|and)?`
- `[a-h]` (8 members) in `msd(a([a-h]))?`
- `[Ee]` (2 members) in `p(id|os|prox(eq)?|[Ee]|acir)?`
- `[DUdu]` (4 members) — directional variants (Down/Up)
- `[LRlr]` (4 members) — directional variants (Left/Right)
- `[HLRhlr]` (6 members) — multi-directional variants
- `[er]` (2 members) — suffix variants
- `[0-9]` (10 members) — numeric suffixes (e.g. `frac12`, `frac34`)
All CClass nodes in the entity pattern have ≤ 10 single-byte members. None are negated, none contain multi-byte ranges.
These make extract_literal_paths return None for the containing branch, preventing full extraction. Result: partial optimization (some branches become trie, CClass-containing branches remain as backtracking).
E2E benchmark numbers
| Benchmark | Before | After | Change |
|---|---|---|---|
| codeToHtml > html (small sample) | JS 1.20x faster | — | — |
| codeToHtml > html (entity-heavy) | — | JS 1.48x faster | Entities actively hurt Rust |
| large-html-entities-100kb | — | JS ~1.19x faster | — |
The entity-heavy sample makes the gap worse, confirming that entity matching is a real bottleneck and the current partial optimization is insufficient.
Pre-existing ASAN bug (unrelated)
ASAN detected a heap-use-after-free in node_max_byte_len for patterns with named group captures ((.)(((?<_>a)))\k<_>). The freed memory comes from disable_noname_group_capture. This bug exists on main without any of these changes — confirmed by running ASAN on the clean baseline. The trie changes just shift allocation patterns enough to make it crash instead of silently reading stale memory.
Assessment of current changes
Keep. The changes are:
- Correct (213/213 lib tests pass, 43/43 compat_syntax pass; compat_utf8 has only a pre-existing failure)
- Non-regressive (no performance penalty for non-entity patterns)
- Foundation for both next-step approaches below
- The benchmark improvements (realistic HTML sample, large-html benchmark) provide proper measurement infrastructure regardless of which optimization path is pursued
Next Steps: Two approaches
Approach A: Extend extract_literal_paths to handle simple CClass nodes
Idea: When extract_literal_paths encounters a CClass node containing only a small number of single-byte members, enumerate them and fork into one path per member — same as Alt forking but driven by the character class.
Example: `d(slope|[dv]|and)?` currently fails because `[dv]` is a CClass. With expansion:

- `[dv]` → fork into `d` and `v` paths
- The whole branch extracts to: `["dslope", "dd", "dv", "dand", "d"]`
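A stand-alone toy check (plain Rust, independent of ferroni's types) of that expected path set:

```rust
// Illustrative sketch: expand d(slope|[dv]|and)? into literal paths.
// The class [dv] is treated exactly like the two-branch alternation
// (d|v), and the trailing `?` contributes an empty "skip" path.
fn main() {
    let branches = ["slope", "d", "v", "and", ""];
    let paths: Vec<String> = branches.iter().map(|b| format!("d{b}")).collect();
    assert_eq!(paths, ["dslope", "dd", "dv", "dand", "d"]);
}
```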
Implementation sketch (all in regcomp.rs):
```rust
// In extract_literal_paths, add a new arm:
NodeInner::CClass(ref cc) => {
    // Only expand small, single-byte-per-member classes
    let members = cc.enumerate_single_bytes()?; // new helper, returns Option<Vec<u8>>
    if members.len() > 32 {
        return None; // too large to expand into a trie
    }
    let mut all = Vec::new();
    for byte in members {
        for prefix in &current_prefixes {
            let mut p = prefix.clone();
            p.push(byte);
            all.push(p);
        }
    }
    if all.len() > limit { return None; }
    Some(all)
}
```

The `enumerate_single_bytes()` helper on `CClassNode` would:
- Check that the class uses only single-byte (ASCII) ranges
- Return a `Vec<u8>` of all member bytes
- Return `None` if the class is negated, contains multi-byte Unicode ranges, or is empty (the size cap is enforced by the caller)
Risks:
- Path explosion: `[a-z]` has 26 members, `[A-Za-z]` has 52. Combined with optionals and nested alts, this could blow past `MAX_NESTED_TRIE_PATHS`. Mitigation: cap individual CClass expansion at e.g. 16 members.
- The HTML entity regex's classes are all small: `[dv]` and `[Ee]` have 2 members, `[a-h]` has 8, and the largest, `[0-9]`, has 10. A cap of 16 covers every case in the entity pattern.
Expected impact: High. Most CClass nodes in the entity pattern are small (2-8 members). Full extraction would collapse the entity alternation into a single trie with ~2000 entries, reducing entity matching from O(1700 * len) to O(len).
Effort: Small. The internal representation is simple and well-suited for enumeration.
CClassNode internals (verified)
CClassNode in regparse_types.rs:481 has three fields:
```rust
pub struct CClassNode {
    pub flags: u32,          // FLAG_NCCLASS_NOT (negation), FLAG_NCCLASS_SHARE
    pub bs: BitSet,          // [u32; 8] = 256 bits, one per single-byte value
    pub mbuf: Option<BBuf>,  // Multi-byte Unicode ranges (code points ≥ 256)
}
```

`BitSet = [u32; 8]` covers all 256 single-byte values. Membership check via `bitset_at(bs, pos)` in regint.rs:153.
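For illustration, membership and enumeration over that layout can be sketched stand-alone. This assumes bit `pos % 32` of word `pos / 32`, an implementation detail worth verifying against regint.rs:

```rust
// Sketch of BitSet membership, assuming the common layout:
// 8 u32 words, bit (pos % 32) of word (pos / 32).
type BitSet = [u32; 8];

fn bitset_at(bs: &BitSet, pos: usize) -> bool {
    (bs[pos / 32] >> (pos % 32)) & 1 != 0
}

fn bitset_set(bs: &mut BitSet, pos: usize) {
    bs[pos / 32] |= 1 << (pos % 32);
}

fn main() {
    // Build the class [dv] and enumerate it the way
    // enumerate_single_bytes() would.
    let mut bs: BitSet = [0; 8];
    bitset_set(&mut bs, b'd' as usize);
    bitset_set(&mut bs, b'v' as usize);
    let members: Vec<u8> = (0..=255u8)
        .filter(|&b| bitset_at(&bs, b as usize))
        .collect();
    assert_eq!(members, vec![b'd', b'v']);
}
```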
enumerate_single_bytes() implementation — trivial:
```rust
impl CClassNode {
    fn enumerate_single_bytes(&self) -> Option<Vec<u8>> {
        // Bail on negated classes ([^...]) — would expand to ~250 entries
        if self.is_not() { return None; }
        // Bail if multi-byte ranges exist — can't be represented in the trie
        if self.mbuf.is_some() { return None; }
        let mut members = Vec::new();
        for i in 0usize..256 {
            if bitset_at(&self.bs, i) {
                members.push(i as u8);
            }
        }
        if members.is_empty() { return None; }
        Some(members)
    }
}
```

No complex parsing needed — just iterate 256 bits. The guard clauses (`is_not()`, `mbuf.is_some()`) handle all edge cases.
Path explosion analysis for the HTML entity pattern
Worst-case multiplicative effect of CClass expansion on the entity pattern:
| CClass | Members | Context | Multiplier |
|---|---|---|---|
[dv] | 2 | inside optional branch | ×2 |
[Ee] | 2 | inside optional branch | ×2 |
[er] | 2 | suffix variant | ×2 |
[DUdu] | 4 | directional | ×4 |
[LRlr] | 4 | directional | ×4 |
[HLRhlr] | 6 | multi-directional | ×6 |
[a-h] | 8 | inside optional+capture | ×8 |
[0-9] | 10 | numeric suffix | ×10 |
These CClass nodes appear in separate top-level branches of the entity trie (different initial letters), so their multipliers don't compound with each other. The base ~1,700 literal paths would grow to roughly ~2,000-2,200 paths — well within the MAX_NESTED_TRIE_PATHS = 8192 limit.
Recommended CClass expansion limit: 16 members per class. This covers all entity-pattern cases while rejecting [a-z] (26) and [A-Za-z] (52).
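The two caps can be sketched as a single budget check. This is a hypothetical helper (`expand_budget` is not a ferroni function); the constants mirror the values proposed above:

```rust
// Two-level budget check for CClass expansion: a per-class member
// cap and the global path cap. Constants per this document's proposal.
const MAX_CLASS_MEMBERS: usize = 16;
const MAX_NESTED_TRIE_PATHS: usize = 8192;

// Returns the new path count after expanding a class with `members`
// members against `current_paths` prefixes, or None if either budget
// is exceeded.
fn expand_budget(current_paths: usize, members: usize) -> Option<usize> {
    if members == 0 || members > MAX_CLASS_MEMBERS {
        return None; // rejects [a-z] (26) and [A-Za-z] (52)
    }
    let next = current_paths.checked_mul(members)?;
    if next > MAX_NESTED_TRIE_PATHS { return None; }
    Some(next)
}

fn main() {
    // [0-9] on a branch currently contributing 100 paths: fine.
    assert_eq!(expand_budget(100, 10), Some(1000));
    // [A-Za-z] is rejected by the per-class cap.
    assert_eq!(expand_budget(100, 52), None);
    // Even a tiny class fails if the global budget would overflow.
    assert_eq!(expand_budget(5000, 2), None);
}
```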
Approach B: Profile the actual HTML bottleneck
Idea: The entity pattern might not be the primary bottleneck. The 1.48x JS advantage on entity-heavy HTML could come from other grammar rules, the tokenizer loop overhead, or the HTML-to-string conversion cost. Profile before optimizing further.
Implementation sketch:
- Micro-benchmark the entity regex directly in ferroni:

  ```rust
  // In ferroni/benches/codspeed_bench.rs: add a benchmark that compiles the
  // full entity regex and matches it against entity-heavy input
  let re = Regex::new(ENTITY_PATTERN).unwrap();
  b.iter(|| re.find("π — & … ..."));
  ```

  This isolates regex matching cost from tokenizer/rendering overhead.
- Instrument ferriki's tokenizer loop:
  - Count entity-regex attempts vs. matches
  - Measure cumulative time in `match_at` for the entity pattern vs. other patterns
  - Use `perf`/Instruments on the native benchmark
- Compare the JS engine's entity handling:
  - The JS regex engine (Oniguruma WASM or native `RegExp`) may handle the entity pattern differently
  - JS `RegExp` with the same trie-encoded pattern might use V8's internal trie optimization automatically
Expected impact: May reveal that the bottleneck is not the entity regex at all, redirecting effort to where it matters.
Effort: Medium. Requires profiling tooling and analysis.
Recommendation
Start with Approach A — it's a focused, incremental change that directly addresses the known gap (CClass nodes blocking extraction). If CClass expansion doesn't close the gap, Approach B provides the diagnostic data to find the real bottleneck.
The two approaches are complementary: A is a concrete optimization, B is a diagnostic. Even if A provides the expected improvement, B's profiling infrastructure is valuable for future work on other languages.
Decision gate
After implementing Approach A, run the entity-heavy benchmark:
- If Rust closes the gap to ≤ 1.05x JS: Entity trie was the bottleneck. Done.
- If gap remains > 1.10x JS: Entity trie was only part of the problem. Proceed to Approach B.
- If gap is unchanged: CClass branches were rarely hit in practice. Approach B becomes critical to find the real bottleneck.
Additional context: Pattern structure (912 capture groups)
The full entity pattern structure is: (&)(?=[A-Za-z])((nested_trie))(;) with 912 capture groups total. The lookahead (?=[A-Za-z]) provides minimal early filtering — it only checks that the next character is a letter, but doesn't narrow down which of the ~1,700 branches will match. This means the regex engine must attempt the full trie on every & followed by a letter in the input.
Note: Shiki JS uses the same Oniguruma engine (compiled to WASM via vscode-oniguruma), not V8's native RegExp. This means the performance difference is not due to V8's regex optimizations but rather differences in the Oniguruma compilation/execution path between native Rust FFI and WASM. This is worth investigating in Approach B — WASM Oniguruma may have different memory access patterns or the JS wrapper may batch regex operations differently.