> Just to clarify: this is true for *arbitrary* (binary) strings. UTF-8
> strings, not being arbitrary, can be used inside any kind of literal in
> Lua ('...', "...", [[...]]) and also in comments without any problems.

This makes me think of a trick that could be useful in some situations.
I know it is illegal according to the UTF-8 specification...
But overlong UTF-8 sequences could be used to *escape* special
characters in string literals in a unified way!
Typically, newline, carriage return and tab are entered as \n, \r
and \t respectively. The NUL byte and other control characters are
written in decimal or hexadecimal form, as \000 or \x01. And the
characters " ' and \ must often be entered as \", \' and \\.

Overlong UTF-8 sequences are characters coded in more bytes than
necessary [1]. For example, characters in the ASCII range 0-127 must
be coded in a single byte in a UTF-8 compliant program, but could
numerically be coded as 2-, 3- or 4-byte sequences as well. The NUL
character 00 would then become C0 80, E0 80 80 or F0 80 80 80
respectively.
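
To make the byte arithmetic concrete, here is a minimal sketch in Python
(chosen only for illustration; nothing here is Lua-specific). The function
name overlong is hypothetical, and the sketch assumes the input is in the
ASCII range 0-127. It just pads the usual UTF-8 bit layout with zero bits.

```python
def overlong(code_point: int, length: int) -> bytes:
    """Encode an ASCII code point as a deliberately overlong UTF-8
    sequence of `length` bytes (2, 3 or 4). Illustration only: the
    result is NOT valid UTF-8."""
    if not 0 <= code_point <= 0x7F:
        raise ValueError("this sketch only handles the ASCII range 0-127")
    high = code_point >> 6           # the single bit above the low six bits (0 or 1)
    low = code_point & 0x3F          # low six bits
    if length == 2:
        # 110xxxxx 10xxxxxx
        return bytes([0xC0 | high, 0x80 | low])
    if length == 3:
        # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0, 0x80 | high, 0x80 | low])
    if length == 4:
        # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0xF0, 0x80, 0x80 | high, 0x80 | low])
    raise ValueError("length must be 2, 3 or 4")


if __name__ == "__main__":
    for n in (2, 3, 4):
        print(n, overlong(0x00, n).hex())
```

For NUL this reproduces the three sequences listed above; for the
backslash (5C) the 2-byte form comes out as C1 9C.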

So for internal use I think they could serve as a simpler and more
general alternative to traditional string escaping (and not only in
Lua), for example in serialization / deserialization of data.
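
A rough sketch of what such a serializer could look like, in Python for
illustration. The names SPECIAL, escape and unescape are hypothetical, and
the choice of which bytes count as "special" is my own assumption. The
sketch also assumes the input is valid UTF-8 text: valid UTF-8 never
contains the bytes C0 or C1, which is what makes the decoding unambiguous.
It would not be safe for arbitrary binary data.

```python
# Bytes we choose to escape: control characters, DEL, and " ' \
# (an assumption for this sketch, adjust to taste).
SPECIAL = set(range(0x20)) | {0x22, 0x27, 0x5C, 0x7F}


def escape(data: bytes) -> bytes:
    """Replace each special ASCII byte with its 2-byte overlong form."""
    out = bytearray()
    for b in data:
        if b in SPECIAL:
            out += bytes([0xC0 | (b >> 6), 0x80 | (b & 0x3F)])
        else:
            out.append(b)
    return bytes(out)


def unescape(data: bytes) -> bytes:
    """Turn 2-byte overlong forms (lead byte C0 or C1) back into ASCII."""
    out = bytearray()
    i = 0
    while i < len(data):
        b = data[i]
        if b in (0xC0, 0xC1) and i + 1 < len(data) and 0x80 <= data[i + 1] <= 0xBF:
            out.append(((b & 0x01) << 6) | (data[i + 1] & 0x3F))
            i += 2
        else:
            out.append(b)
            i += 1
    return bytes(out)


if __name__ == "__main__":
    original = 'say "hi"\n\tcafé\\'.encode("utf-8")
    packed = escape(original)
    print(packed.hex())
    print(unescape(packed) == original)
```

After escaping, the data contains no quote, backslash or control byte at
all, so it could sit between any pair of delimiters without further
escaping, and ordinary multi-byte characters pass through untouched.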

There are a number of drawbacks to that approach. Because overlong
sequences are illegal, most text editors will reject or silently
convert them, and do not offer a way to enter such a sequence either.
Other UTF-8 aware software libraries will also reject overlong
sequences. This seriously limits the number of practical uses!
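
As one concrete example of such a library, Python's built-in UTF-8 codec
refuses all three overlong forms of NUL. A small check (the helper name
is_strict_utf8 is hypothetical):

```python
def is_strict_utf8(data: bytes) -> bool:
    """Return True if Python's strict UTF-8 decoder accepts `data`."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    for sample in (b"\x00", b"\xc0\x80", b"\xe0\x80\x80", b"\xf0\x80\x80\x80"):
        print(sample.hex(), is_strict_utf8(sample))
```

Only the plain single-byte NUL is accepted; any data escaped this way
would have to be decoded by custom code before reaching such a library.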

Is this idea completely stupid, or does it have any practical interest?

[1] http://en.wikipedia.org/wiki/UTF-8