@mznet mznet commented Jan 17, 2026

Calling strip_identifier with an identifier that contains CJK characters causes a Rust panic, as shown below:

assert!(is_valid_javascript_identifier("한글"));
thread 'js_identifiers::tests::test_is_valid_javascript_identifier' (1076350) panicked at src/js_identifiers.rs:49:12:
byte index 4 is not a char boundary; it is inside '글' (bytes 3..6) of `한글`
stack backtrace:
   0: __rustc::rust_begin_unwind
             at /rustc/f8297e351a40c1439a467bbbb6879088047f50b3/library/std/src/panicking.rs:698:5
   1: core::panicking::panic_fmt
             at /rustc/f8297e351a40c1439a467bbbb6879088047f50b3/library/core/src/panicking.rs:75:14
   2: core::str::slice_error_fail_rt
   3: core::str::slice_error_fail
             at /rustc/f8297e351a40c1439a467bbbb6879088047f50b3/library/core/src/str/mod.rs:69:5
   4: core::str::traits::<impl core::slice::index::SliceIndex<str> for core::ops::range::Range<usize>>::index
             at /Users/mjet.plane/.rustup/toolchains/1.91.0-aarch64-apple-darwin/lib/rustlib/src/rust/library/core/src/str/traits.rs:248:21
   5: core::str::traits::<impl core::slice::index::SliceIndex<str> for core::ops::range::RangeInclusive<usize>>::index
             at /Users/mjet.plane/.rustup/toolchains/1.91.0-aarch64-apple-darwin/lib/rustlib/src/rust/library/core/src/str/traits.rs:664:33
   6: core::str::traits::<impl core::slice::index::SliceIndex<str> for core::ops::range::RangeToInclusive<usize>>::index
             at /Users/mjet.plane/.rustup/toolchains/1.91.0-aarch64-apple-darwin/lib/rustlib/src/rust/library/core/src/str/traits.rs:751:24
   7: core::str::traits::<impl core::ops::index::Index<I> for str>::index
             at /Users/mjet.plane/.rustup/toolchains/1.91.0-aarch64-apple-darwin/lib/rustlib/src/rust/library/core/src/str/traits.rs:63:15
   8: sourcemap::js_identifiers::strip_identifier
             at ./src/js_identifiers.rs:49:12
   9: sourcemap::js_identifiers::is_valid_javascript_identifier
             at ./src/js_identifiers.rs:54:5

CJK characters are uncommon in identifiers, but I found real examples where they are used, and those inputs triggered the panic.
JavaScript identifiers are not limited to ASCII and can include Unicode characters.

The current implementation stores only the start byte index of each character in end_idx while iterating, and then slices the string with the inclusive range &s[..=end_idx].

For example, in the string "한글", '한' occupies bytes 0–2 and '글' occupies bytes 3–5. When processing '글', i = 3 is stored as end_idx, and the inclusive slice &s[..=3] is equivalent to &s[..4], which ends at byte 4, inside '글'.
This produces the "byte index 4 is not a char boundary" panic shown above.
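The byte layout can be checked with a small standalone sketch. The helper names char_byte_ranges and inclusive_prefix are hypothetical and are not part of the crate; the sketch uses str::get, which returns None in the case where the indexing form &s[..=end] would panic.

```rust
/// Hypothetical helper: each character with its byte range [start, end) in `s`.
fn char_byte_ranges(s: &str) -> Vec<(char, usize, usize)> {
    s.char_indices()
        // `char_indices` yields the START byte offset of each character;
        // the exclusive end is start + the character's UTF-8 length.
        .map(|(start, c)| (c, start, start + c.len_utf8()))
        .collect()
}

/// Hypothetical helper: non-panicking version of `&s[..=end]`.
/// Returns None when byte `end + 1` is not on a char boundary.
fn inclusive_prefix(s: &str, end: usize) -> Option<&str> {
    s.get(..=end)
}
```

For "한글", char_byte_ranges reports the start offsets 0 and 3, and inclusive_prefix("한글", 3) returns None, because the inclusive range would end at byte 4, in the middle of '글'.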

To fix this, the code now tracks the end position of each character instead of its start position, computing end_idx = i + c.len_utf8(), and slices with the exclusive range &s[..end_idx].

This change covers not only CJK characters but also other non-ASCII Unicode identifiers.
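A minimal sketch of the end-offset approach, under stated assumptions: strip_identifier_sketch, is_valid_identifier_sketch, is_ident_start and is_ident_continue are hypothetical names, and the character checks are simplified stand-ins built on char::is_alphabetic and char::is_numeric, not the classification the crate actually uses. Only the end_idx = i + c.len_utf8() bookkeeping and the exclusive slice mirror the change described here.

```rust
// ASSUMPTION: simplified stand-ins for the real identifier character rules.
fn is_ident_start(c: char) -> bool {
    c == '$' || c == '_' || c.is_alphabetic()
}

fn is_ident_continue(c: char) -> bool {
    is_ident_start(c) || c.is_numeric()
}

/// Hypothetical sketch: returns the longest identifier-like prefix of `s`.
fn strip_identifier_sketch(s: &str) -> Option<&str> {
    let mut end_idx = 0;
    for (i, c) in s.char_indices() {
        let ok = if i == 0 { is_ident_start(c) } else { is_ident_continue(c) };
        if !ok {
            break;
        }
        // Track the exclusive END of the character, not its start,
        // so the slice below always lands on a char boundary.
        end_idx = i + c.len_utf8();
    }
    if end_idx == 0 {
        None
    } else {
        Some(&s[..end_idx])
    }
}

/// Hypothetical sketch: valid only if the whole string is one identifier.
fn is_valid_identifier_sketch(s: &str) -> bool {
    strip_identifier_sketch(s).map_or(false, |prefix| prefix.len() == s.len())
}
```

Because end_idx is always a character's start offset plus its UTF-8 length, it sits on a boundary regardless of whether the character takes one byte or several, which is why the same change handles ASCII and non-ASCII input alike.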

mznet added 2 commits January 17, 2026 15:40
Change "変数名" (Japanese) to "变量名" (Chinese) for better CJK coverage.