Re: Should Lua be more strict about Unicode errors?

lua-l archive


Subject: Re: Should Lua be more strict about Unicode errors?
From: Sean Conner <sean@...>
Date: Sun, 30 Aug 2015 18:37:31 -0400

It was thus said that the Great Soni L. once stated:
> LuaJIT recently added Lua 5.3's "\u{}" escapes. It's also more strict 
> about Unicode errors than Lua 5.3[1].
> 
> For example, "\u{d800}" is valid in Lua 5.3, but not in LuaJIT.
> 
> Should Lua be more strict about Unicode errors?
> 
> [1] https://github.com/LuaJIT/LuaJIT/issues/72

  It depends.  I recently read (although I can't seem to find it now) that
one way to preserve invalid UTF-8 sequences is to map each invalid byte to a
code point in the U+D880 to U+D8FF range, and to reserve U+D800 as an
alternative NUL byte sequence (another NUL byte sequence is the literal byte
sequence 0xC0 0x80 [1]).  By doing this, you can transform the "fixed" UTF-8
sequence back into the original byte stream.
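
  A minimal sketch of that round trip, in Python rather than Lua, and only
under my own assumptions: each invalid byte b is stored as the code point
0xD800 + b, which lands in U+D880 to U+D8FF because such bytes are always
0x80 or above.  The handler name "d8escape" and the helpers to_text() and
to_bytes() are made up for illustration; the D800-as-NUL part is left out.

```python
import codecs

BASE = 0xD800  # assumption: invalid byte b is stored as code point BASE + b


def _d8_escape(exc):
    """Codec error handler that maps invalid bytes to U+D880..U+D8FF and back."""
    if isinstance(exc, UnicodeDecodeError):
        # Decoding: replace each offending byte with one escape code point.
        bad = exc.object[exc.start:exc.end]
        return "".join(chr(BASE + b) for b in bad), exc.end
    if isinstance(exc, UnicodeEncodeError):
        # Encoding: turn escape code points back into the original raw bytes.
        out = bytearray()
        for ch in exc.object[exc.start:exc.end]:
            cp = ord(ch)
            if not 0xD880 <= cp <= 0xD8FF:
                raise exc  # some other surrogate; not ours to handle
            out.append(cp - BASE)
        return bytes(out), exc.end
    raise exc


codecs.register_error("d8escape", _d8_escape)


def to_text(raw: bytes) -> str:
    """Decode bytes as UTF-8, keeping invalid bytes as escape code points."""
    return raw.decode("utf-8", errors="d8escape")


def to_bytes(text: str) -> bytes:
    """Encode text as UTF-8, restoring any escaped bytes."""
    return text.encode("utf-8", errors="d8escape")
```

  Python ships a similar built-in handler, "surrogateescape", but it uses
U+DC80 to U+DCFF instead of the range above.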

  There's also WTF-8 [2], used to round-trip Windows filenames.
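
  To make the strict-versus-lenient question concrete, here is a small
Python sketch (not Lua) of what happens to a lone surrogate such as U+D800.
The helper names encode_strict() and encode_lenient() are mine.  Python's
"surrogatepass" handler only illustrates the lone-surrogate case; it is not
a full WTF-8 implementation, since WTF-8 also requires that paired
surrogates be encoded as a single four-byte sequence.

```python
LONE = "\ud800"  # a lone high surrogate, like "\u{d800}" in the Lua example


def encode_strict(s: str) -> bytes:
    """Strict UTF-8: raises UnicodeEncodeError on any surrogate code point."""
    return s.encode("utf-8")


def encode_lenient(s: str) -> bytes:
    """Lenient: writes a surrogate as an ordinary three-byte sequence."""
    return s.encode("utf-8", errors="surrogatepass")


def decode_lenient(raw: bytes) -> str:
    """Inverse of encode_lenient for encoded surrogates."""
    return raw.decode("utf-8", errors="surrogatepass")
```

  With these, encode_lenient(LONE) gives the bytes ED A0 80, while
encode_strict(LONE) raises an error.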

  And don't forget---full and proper Unicode support is expensive in terms
of data *and* code.  It can get pretty insane what with normalization,
combining characters and left-to-right and right-to-left characters.  
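
  As one small illustration of the normalization and combining-character
part, here is a Python sketch using the standard unicodedata module.  The
same visible "é" can be one code point or two, and comparing them needs the
normalization tables.  The helper same_text() is a made-up name, and NFC is
just one of the normalization forms one could pick.

```python
import unicodedata

COMPOSED = "\u00e9"     # é as a single precomposed code point
DECOMPOSED = "e\u0301"  # e followed by COMBINING ACUTE ACCENT


def same_text(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)
```

  A plain comparison of COMPOSED and DECOMPOSED is false; same_text() on
the two is true.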

 -spc

[1]	https://en.wikipedia.org/wiki/UTF-8#Modified_UTF-8

[2]	https://en.wikipedia.org/wiki/UTF-8#WTF-8

Follow-Ups:
- RE: Should Lua be more strict about Unicode errors?, Richter, Jörg

References:
- Should Lua be more strict about Unicode errors?, Soni L.
