On 02/11/2012 14:27, Rob Hoelz wrote:
Hi list,

A user came on the IRC channel today asking about Unicode support.
When I tried to explain that string.sub(2, 2) and io.read(1) wouldn't
work as expected on UTF-8 data by explaining that Lua only uses 8-bit
clean strings and doesn't understand UTF-8 data beyond a string of
bytes, the user pointed out that the manual speaks in terms of
characters, not bytes.

Would it be a good idea to make a distinction between characters and
bytes, or do you guys feel that this is already clear in the manual
(and PiL)?

-Rob

I think it would not hurt to repeat, where relevant, Lua-char = byte.
However, the manuals should probably not go any further in mentioning Unicode, or they risk introducing the *very* common misconception that confuses characters with so-called "abstract characters", which are in fact an intermediate level between bytes (or other code units) and characters proper. A Lua text is a string of bytes; to get a safe index or pair of indices, run a proper search function (*), et voilà!
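
A minimal sketch of the point, assuming the text is valid UTF-8 and using only the standard string library. The sample string and variable names are made up for illustration. It shows that the length operator and a fixed index work on bytes, while indices returned by a plain search fall on the boundaries of encoded sequences:

```lua
-- "é" (U+00E9) is encoded in UTF-8 as the two bytes 0xC3 0xA9,
-- written here with decimal escapes so the source file stays ASCII.
local s = "caf\195\169 au lait"

-- The length operator counts bytes: 13 here, although a reader sees 12 characters.
local nbytes = #s

-- A fixed index can cut a multi-byte sequence in half:
-- this yields "caf" followed by a lone 0xC3 byte, which is not valid UTF-8.
local broken = s:sub(1, 4)

-- A plain search (4th argument true) returns byte indices. When both the
-- subject and the needle are valid UTF-8, a match cannot start or end
-- in the middle of an encoded sequence.
local first, last = s:find("\195\169", 1, true)
local head = s:sub(1, last)     -- "café", still valid UTF-8
local tail = s:sub(last + 1)    -- " au lait"

print(nbytes, first, last)
print(head .. "|" .. tail)
```

Note that this only keeps encoded sequences intact. It does not by itself handle characters proper, for example a base letter followed by a combining mark.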

Denis

(*) unless you're dealing with machine-generated, user-inaccessible plain ASCII source