lua-l archive

> > For example, "\u{d800}" is valid in Lua 5.3, but not in LuaJIT.
> >
> > Should Lua be more strict about Unicode errors?
> >
> > [1] https://github.com/LuaJIT/LuaJIT/issues/72
> 
>   It depends.  I recently read (although I can't seem to find it now) that
> one way to preserve invalid UTF-8 sequences is to encode the invalid bytes
> in the D880 to D8FF range, and to reserve D800 as an alternative NUL byte
> sequence (another NUL byte sequence is the literal byte sequence 0xC0 0x80
> [1]).  By doing this, you can transform the "fixed" UTF-8 sequence back
> into the original byte stream.

I think you mean "UTF-8B".  Quoting [1]:

"utf-8b is a mapping from byte streams to unicode codepoint streams that provides 
an exceptionally clean handling of garbage (i.e., non-utf-8) bytes (i.e., bytes 
that are not part of a utf-8 encoding) in the input stream. They are mapped to 
256 different, guaranteed undefined, unicode codepoints."
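
As a rough illustration of the round-trip idea, here is a minimal sketch using Python's built-in "surrogateescape" error handler (PEP 383), which applies the same kind of mapping. Note the assumptions: Python escapes each undecodable byte 0x80..0xFF to a lone surrogate in U+DC80..U+DCFF, which is not the range given in the quoted message, and the sample bytes are made up for the example.

```python
# Sketch: round-tripping invalid UTF-8 with Python's "surrogateescape"
# handler. The input bytes below are an invented example.
raw = b"ok \xff\xfe caf\xc3\xa9"   # 0xFF and 0xFE can never occur in UTF-8

# Decode: valid sequences become normal code points; each invalid byte b
# becomes the lone surrogate U+DC00 + b.
text = raw.decode("utf-8", errors="surrogateescape")

# Collect the code points that stand in for the garbage bytes.
escaped = [hex(ord(ch)) for ch in text if 0xDC80 <= ord(ch) <= 0xDCFF]

# Encode: the same handler turns those surrogates back into the original
# bytes, so the byte stream is recovered exactly.
restored = text.encode("utf-8", errors="surrogateescape")

print(escaped)
print(restored == raw)
```

The string is only safe to re-encode with the same handler; a strict UTF-8 encoder rejects the lone surrogates, which is the strictness question raised at the top of this thread.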

- Jörg

[1] http://hyperreal.org/~est/freeware/