- Subject: RE: Should Lua be more strict about Unicode errors?
- From: Richter, Jörg <Joerg.Richter@...>
- Date: Mon, 31 Aug 2015 07:20:54 +0000
> > For example, "\u{d800}" is valid in Lua 5.3, but not in LuaJIT.
> >
> > Should Lua be more strict about Unicode errors?
> >
> > [1] https://github.com/LuaJIT/LuaJIT/issues/72
>
> It depends. I recently read (although I can't seem to find it now) that
> one way to preserve invalid UTF-8 sequences is to encode the invalid bytes
> in the D880 to D8FF range, and to reserve D800 as an alternative NUL byte
> sequence (another NUL byte sequence is the literal byte sequence 0xC0 0x80
> [1]). By doing this, you can transform the "fixed" UTF-8 sequence back
> into the original byte stream.
I think you mean "UTF-8B". Quoting [1]:
"utf-8b is a mapping from byte streams to unicode codepoint streams that provides
an exceptionally clean handling of garbage (i.e., non-utf-8) bytes (i.e., bytes
that are not part of a utf-8 encoding) in the input stream. They are mapped to
256 different, guaranteed undefined, unicode codepoints."
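
As a rough illustration (not taken from [1]), Python's "surrogateescape"
error handler from PEP 383 follows the same idea, so it can serve as a small
sketch of the round trip. Note that Python maps each undecodable byte 0xNN to
the lone surrogate U+DCNN, i.e. the range DC80 to DCFF rather than the D880 to
D8FF range recalled above. The sample bytes and variable names are made up for
the example.

```python
# Sketch of a UTF-8B-style round trip using Python's built-in
# "surrogateescape" error handler (PEP 383).

# 0xE9 (Latin-1 'e' with acute) and 0xFF are not valid UTF-8 in this position.
raw = b"caf\xe9 \xff ok"

# Decoding never fails: each invalid byte 0xNN becomes the code point U+DCNN.
text = raw.decode("utf-8", errors="surrogateescape")
escaped = [hex(ord(ch)) for ch in text if 0xDC80 <= ord(ch) <= 0xDCFF]

# Encoding with the same handler turns the escapes back into the raw bytes.
round_trip = text.encode("utf-8", errors="surrogateescape")

# A strict UTF-8 encoder, by contrast, refuses a lone surrogate such as U+D800.
try:
    "\ud800".encode("utf-8")
    strict_rejects = False
except UnicodeEncodeError:
    strict_rejects = True

print(escaped)            # code points standing in for the two bad bytes
print(round_trip == raw)  # whether the original byte stream was recovered
print(strict_rejects)     # whether strict encoding rejected U+D800
```

The last check is the strict behaviour the original question asks about: a
lone surrogate is only accepted when the escape handler is requested
explicitly.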
- Jörg
[1] http://hyperreal.org/~est/freeware/