Porting Bugs -- Lessons Learned
Status
Accepted
Context
During the 1:1 port from C to Rust, 28 non-trivial bugs were found and fixed. All were logic divergences between the C and Rust implementations, not Rust-specific issues. Documenting these serves future contributors and anyone attempting a similar C-to-Rust port.
Bug Catalog
Address and Offset Arithmetic (5 bugs)
| # | Bug | Root Cause |
|---|---|---|
| 1 | JUMP/PUSH address formulas | Bytecode jump offsets calculated relative to wrong position |
| 2 | u_offset patching | Unicode offset not applied after compilation |
| 3 | usize subtraction overflow | Unsigned subtraction wrapping (C uses signed int, Rust uses usize) |
| 9 | compile_length_quantifier_node off-by-body_len | Body length miscounted by one operation |
| 10 | Alternation JUMP addresses (>2 branches) | Jump target wrong when alternation has 3+ branches |
Lesson: Bytecode address arithmetic is the highest-risk area in the port. Every offset formula must be validated against the C original with multi-branch test cases. Rust's unsigned usize makes subtraction bugs surface as panics rather than silent wraparound -- this is actually helpful for debugging.
Parser Edge Cases (5 bugs)
| # | Bug | Root Cause |
|---|---|---|
| 4 | Parser operates on &[u8], not &str | UTF-8 validation rejected valid byte patterns |
| 5 | Multi-byte codepoint truncation | U+305B has low byte 0x5B = [, confusing the parser |
| 6 | C trigraph escaping in tests | \? in C means ? (trigraph avoidance), not an escape |
| 19 | \pL single-char property tokenizer | Property name tokenizer didn't handle single-char names |
| 20 | check_code_point_sequence must not advance p | Function side-effected the position cursor |
Lesson: The parser must work on raw bytes, not Rust strings. Multi-byte UTF-8 codepoints can have byte values that coincide with ASCII syntax characters -- the parser must always check encoding boundaries.
VM State Machine (10 bugs)
| # | Bug | Root Cause |
|---|---|---|
| 7 | Empty-loop detection | tune_tree must set qn.emptiness for quantifier bodies that can match empty |
| 8 | String-split for quantifiers | check_quantifier must split string nodes so the quantifier applies to the last character only |
| 11 | Missing ?? and greedy expansion paths | Non-greedy optional and greedy quantifier code paths were incomplete |
| 12 | EmptyCheckEnd skip-next | Empty check must skip the next operation (the repeat increment) |
| 13 | backtrack_mem / backrefed_mem | Bitmasks tracking which captures are involved in backtracking were not propagated |
| 14 | EmptyCheckEndMemst | Memory-aware empty check variant not dispatching to correct handler |
| 16 | FIXED_INTERVAL_IS_GREEDY_ONLY | Fixed-interval quantifiers must only apply the greedy optimization path |
| 17 | StrN payload mismatch | String operation payload length field disagreed with actual string length |
| 18 | RepeatIncNg stack ordering | Non-greedy repeat increment pushed stack entries in wrong order |
| 24 | EmptyCheckEnd mem-ID mismatch | Nested quantifiers generated mismatched empty-check IDs |
Lesson: The VM's empty-loop detection and backtracking state are deeply intertwined. A bug in emptiness detection cascades into infinite loops or missed matches. Test with patterns like (a*)*, (a?)*, (a*)+ and their non-greedy variants.
Optimizer and Compiler (5 bugs)
| # | Bug | Root Cause |
|---|---|---|
| 15 | Reversed intervals {5,2} | Interval with lower > upper must be swapped, not rejected |
| 22 | SaveVal ID collision | Save-value IDs for different purposes collided |
| 23 | reduce_string_list loses ND_ST_SUPER | String merge optimization dropped the "super" status flag |
| 25 | Call-node body vs target_node in optimizer | Optimizer followed wrong pointer (call body instead of target group) |
| 26 | (?P:...) parser fallthrough | P case must delegate to prs_options when QMARK_CAPITAL_P_NAME syntax is not set |
Lesson: Optimizer passes must preserve all node flags and status bits. When merging or transforming nodes, every flag on the source node must be checked for carry-over.
Executor Dispatch (3 bugs)
| # | Bug | Root Cause |
|---|---|---|
| 21 | CUT_TO_MARK must void, not pop | Mark-cutting operation must void (discard) stack entries, not pop them |
| 27 | WordBoundary executor ignored mode parameter | ASCII-mode (?W:\b) always used Unicode word check instead of ASCII-only |
| 28 | TextSegmentBoundary compiler used reg.options | Should use per-node status ND_ST_TEXT_SEGMENT_WORD, not global regex options |
Lesson: The VM executor has mode-sensitive opcodes where behavior depends on flags embedded in the operation payload. Always verify that the correct flag source (node status vs. regex options vs. operation mode field) is consulted.
Key Takeaways
-
All 28 bugs were logic divergences, not Rust-specific issues. The 1:1 porting approach (ADR-001) made every bug diagnosable by line-by-line comparison with the C source.
-
Rust caught several bugs earlier than C would have. Unsigned integer overflow panics (bug #3), exhaustive match requirements (bugs #11, #14), and the borrow checker (bug #20) all surfaced issues that might have been silent in C.
-
The highest-risk areas are bytecode address arithmetic, VM empty-loop detection, and optimizer flag preservation. These deserve the most careful review in any future changes.
-
Multi-byte UTF-8 is a persistent source of parser bugs. Any byte in a multi-byte sequence can coincide with an ASCII syntax character. The parser must always respect encoding boundaries.
Consequences
- New contributions to the compiler or VM should include test cases for edge cases listed above (empty quantifiers, nested alternations, mode-sensitive opcodes).
- Code review should cross-reference the C original for any function in
regcomp.rs,regexec.rs, orregparse.rs.