lua-users home
lua-l archive


It was thus said that the Great Soni L. once stated:
> LuaJIT recently added Lua 5.3's "\u{}" escapes. It's also more strict 
> about Unicode errors than Lua 5.3[1].
> 
> For example, "\u{d800}" is valid in Lua 5.3, but not in LuaJIT.
> 
> Should Lua be more strict about Unicode errors?
> 
> [1] https://github.com/LuaJIT/LuaJIT/issues/72

  It depends.  I recently read (although I can't seem to find it now) that
one way to preserve invalid UTF-8 sequences is to map each invalid byte to a
code point in the U+D880 to U+D8FF range, and to reserve U+D800 as an
alternative NUL byte sequence (another NUL byte sequence is the literal byte
sequence 0xC0 0x80 [1]).  By doing this, you can transform the "fixed" UTF-8
sequence back into the original byte stream.
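
  As a rough illustration of the round trip described above, here is a
minimal Python sketch.  It is my reading of the scheme, not a reference
implementation: the function names decode_lossless and encode_lossless are
made up, the mapping "invalid byte b becomes code point 0xD800 + b" is an
assumption that happens to land in the D880 to D8FF range, and the D800
alternative-NUL part is left out.

```python
def decode_lossless(data):
    """Decode bytes to a list of code points.

    Well-formed UTF-8 sequences decode normally.  Any byte that is not part
    of a well-formed sequence is escaped as 0xD800 + byte (an assumed
    mapping), which falls in the range 0xD880..0xD8FF.
    """
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:                 # plain ASCII
            out.append(b)
            i += 1
            continue
        # Expected sequence length from the lead byte (0 = not a lead byte).
        if 0xC2 <= b <= 0xDF:
            n = 2
        elif 0xE0 <= b <= 0xEF:
            n = 3
        elif 0xF0 <= b <= 0xF4:
            n = 4
        else:
            n = 0
        chunk = data[i:i + n]
        try:
            if n == 0 or len(chunk) < n:
                raise ValueError("stray or truncated byte")
            # Strict decoding rejects overlongs and encoded surrogates.
            cp = ord(chunk.decode("utf-8"))
        except ValueError:           # UnicodeDecodeError is a ValueError
            out.append(0xD800 + b)   # escape just this one byte
            i += 1
            continue
        out.append(cp)
        i += n
    return out


def encode_lossless(code_points):
    """Inverse of decode_lossless: escaped code points become raw bytes."""
    out = bytearray()
    for cp in code_points:
        if 0xD880 <= cp <= 0xD8FF:
            out.append(cp - 0xD800)  # restore the original invalid byte
        else:
            out += chr(cp).encode("utf-8")
    return bytes(out)


raw = b"ok \xff\xc3\xa9 \xc0\x80"    # contains a stray 0xFF and 0xC0 0x80
assert encode_lossless(decode_lossless(raw)) == raw
```

  The reason this is reversible is that a strict decoder never produces a
surrogate code point from well-formed input, so anything in D880 to D8FF in
the decoded form can only have come from an escaped byte.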

  There's also WTF-8 [2], used to round-trip Windows filenames.
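
  To give a feel for what WTF-8 does, the sketch below leans on Python's
"surrogatepass" error handler, which lets a lone surrogate through as a
three-byte sequence while properly paired surrogates still become one
four-byte sequence.  This only approximates WTF-8 for illustration; the
helper names utf16le_to_wtf8 and wtf8_to_utf16le are made up, and the
filename bytes are an invented example.

```python
def utf16le_to_wtf8(units):
    """Convert possibly ill-formed UTF-16LE bytes to WTF-8-style bytes."""
    # Valid surrogate pairs are combined; lone surrogates pass through.
    text = units.decode("utf-16-le", "surrogatepass")
    return text.encode("utf-8", "surrogatepass")


def wtf8_to_utf16le(data):
    """Inverse direction: WTF-8-style bytes back to UTF-16LE bytes."""
    text = data.decode("utf-8", "surrogatepass")
    return text.encode("utf-16-le", "surrogatepass")


# A hypothetical filename: a lone high surrogate U+D800 followed by "A".
name_utf16 = b"\x00\xd8A\x00"
name_wtf8 = utf16le_to_wtf8(name_utf16)
assert wtf8_to_utf16le(name_wtf8) == name_utf16
```

  A strict UTF-8 encoder would refuse that name outright, which is exactly
the case WTF-8 exists to cover.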

  And don't forget---full and proper Unicode support is expensive in terms
of data *and* code.  It can get pretty insane what with normalization,
combining characters and left-to-right and right-to-left characters.  

 -spc

[1]	https://en.wikipedia.org/wiki/UTF-8#Modified_UTF-8

[2]	https://en.wikipedia.org/wiki/UTF-8#WTF-8