Encoding Scope — ASCII and UTF-8 Only
Status
Accepted
Context
The C original supports 29 character encodings: ASCII, UTF-8, UTF-16 BE/LE, UTF-32 BE/LE, 15 ISO-8859 variants, EUC-JP/KR/TW, Shift-JIS, Big5, GB18030, KOI8, KOI8-R, and CP1251. These are implemented across 32 source files totaling ~7,000 LOC, plus encoding-specific property tables (e.g. euc_jp_prop.c, sjis_prop.c).
The question is which encodings to port.
Decision
The Rust port supports only ASCII and UTF-8. No other encodings will be ported.
What this means
- ONIG_ENCODING_ASCII and ONIG_ENCODING_UTF8 are the only two available encodings.
- The Encoding trait and dispatch infrastructure exist and are fully functional; additional encodings could be added, but we choose not to.
- The 27 missing encodings (~6,600 LOC of C code) are intentionally excluded.
- Encoding-specific public API functions (onigenc_set_default_encoding, onig_initialize_encoding, onigenc_strlen_null, etc.) are not ported.
- C test files that depend on other encodings (testc.c for EUC-JP, testu.c for UTF-16, testp.c for POSIX) are not ported.
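To make the "trait plus dispatch" point concrete, here is a minimal sketch of how two encodings might sit behind one trait. It is not the port's actual code: the trait shape, the method names (name, mbc_enc_len), the unit structs Ascii and Utf8, and the helper char_count are all assumptions chosen for illustration, and the real trait is considerably larger.

```rust
/// Hypothetical, trimmed-down stand-in for the port's Encoding trait.
/// The real trait has many more methods; these two are illustrative only.
pub trait Encoding {
    /// Human-readable name of the encoding.
    fn name(&self) -> &'static str;
    /// Byte length of the character that starts with `lead`.
    /// Assumes well-formed input; invalid bytes are stepped over one at a time.
    fn mbc_enc_len(&self, lead: u8) -> usize;
}

/// Illustrative unit structs; the port exposes ONIG_ENCODING_ASCII and
/// ONIG_ENCODING_UTF8, whose concrete types may look different.
pub struct Ascii;
pub struct Utf8;

impl Encoding for Ascii {
    fn name(&self) -> &'static str {
        "US-ASCII"
    }
    fn mbc_enc_len(&self, _lead: u8) -> usize {
        1 // every character is exactly one byte
    }
}

impl Encoding for Utf8 {
    fn name(&self) -> &'static str {
        "UTF-8"
    }
    fn mbc_enc_len(&self, lead: u8) -> usize {
        match lead {
            0x00..=0x7F => 1, // ASCII range
            0xC0..=0xDF => 2, // two-byte lead
            0xE0..=0xEF => 3, // three-byte lead
            0xF0..=0xF7 => 4, // four-byte lead
            _ => 1,           // continuation or invalid byte: advance one byte
        }
    }
}

/// Counts characters by walking the buffer through a trait object,
/// so the caller picks the encoding at run time.
pub fn char_count(enc: &dyn Encoding, text: &[u8]) -> usize {
    let mut pos = 0;
    let mut count = 0;
    while pos < text.len() {
        pos += enc.mbc_enc_len(text[pos]);
        count += 1;
    }
    count
}
```

Calling char_count(&Utf8, "héllo".as_bytes()) walks five characters, while the same bytes seen through Ascii count as six, because the two-byte "é" is treated as two separate units.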
Rationale
- UTF-8 has won. In modern software, UTF-8 is the dominant encoding for text processing. ASCII is its subset. Together they cover virtually all contemporary use cases.
- Effort vs. value. Porting 27 encodings would add ~6,600 LOC of mechanical translation work with near-zero benefit for the target audience. Each encoding requires implementing 15+ trait methods, porting lookup tables, and writing or adapting tests.
- The encoding trait is the escape hatch. If a specific encoding is ever needed, the infrastructure is in place — it is a contained, additive change that does not affect the core engine.
- Reduced attack surface. Fewer encodings mean fewer code paths to audit and fewer potential encoding-related bugs.
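The "escape hatch" argument can be sketched as follows, assuming a much smaller trait than the real one. The names here (a one-method Encoding trait, mbc_to_code, find_code_point, Latin1) are hypothetical and exist only to show that a new encoding would be a new impl block, with the code written against the trait left unchanged.

```rust
/// Hypothetical one-method trait; per the text above, the real trait
/// requires 15+ methods per encoding.
pub trait Encoding {
    /// Decodes the character at the start of `bytes`.
    /// Returns (code point, byte length), or None if `bytes` is empty or invalid.
    fn mbc_to_code(&self, bytes: &[u8]) -> Option<(u32, usize)>;
}

/// Stand-in for engine code: written once against the trait and not
/// modified when an encoding is added. Returns the byte offset of the
/// first occurrence of `needle`, if any.
pub fn find_code_point<E: Encoding>(enc: &E, haystack: &[u8], needle: u32) -> Option<usize> {
    let mut pos = 0;
    while pos < haystack.len() {
        let (code, len) = enc.mbc_to_code(&haystack[pos..])?;
        if code == needle {
            return Some(pos);
        }
        pos += len;
    }
    None
}

/// The additive part: a new type plus one impl block.
/// ISO-8859-1 maps each byte to the code point with the same value.
pub struct Latin1;

impl Encoding for Latin1 {
    fn mbc_to_code(&self, bytes: &[u8]) -> Option<(u32, usize)> {
        bytes.first().map(|&b| (u32::from(b), 1))
    }
}
```

In this sketch, searching the Latin-1 bytes [0x63, 0x61, 0x66, 0xE9] ("café") for code point 0xE9 yields byte offset 3. A real port of an encoding would also need its lookup tables and tests, which is where most of the estimated effort lies.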
Consequences
- Users who need non-UTF-8 encodings (e.g. Japanese legacy systems using EUC-JP or Shift-JIS) cannot use this port. They should use the C original via FFI bindings.
- onig_new_deluxe (which allows specifying different pattern and string encodings) is not ported; it is only useful with multiple encodings.
- Test parity is 100% for UTF-8 (test_utf8.c: 1,554/1,554), but overall C test coverage is partial (~1,554 of ~3,970 encoding-dependent test cases).