On Feb 9, 2012 1:38 PM, "Roberto Ierusalimschy" <roberto@inf.puc-rio.br> wrote:
>
> > Getting lua's core to change its view of strings to being something
> > other than a byte-sequence isn't going to happen, its not the lua way,
>
> Sure.

On reflection, there is an argument that this isn't just a matter of whether one likes Unicode. See: Roberto Ierusalimschy, Luiz Henrique de Figueiredo, and Waldemar Celes, "Passing a Language through the Eye of a Needle: How the embeddability of Lua impacted its design", ACM Queue vol. 9, no. 5, May 2011. http://queue.acm.org/detail.cfm?id=1983083

C strings are arbitrary sequences of bytes. \0 termination is dominant, although explicit length is used as well. If API is destiny, then until C converges on some other kind of representation, there is little or nothing to be done except preserve untyped arrays of bytes. Lua's types are not identical to C's, but they must not differ subtly from them.
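
To make that concrete (just an illustration, nothing new to anyone here): a Lua string already holds any byte values at all, embedded zeros included, and the C API moves it around with an explicit length rather than trusting a terminator.

    -- Lua strings are counted byte arrays; nothing about them implies "text".
    local blob = "\0\255\194\161"   -- an embedded NUL, a stray non-UTF-8 byte,
                                    -- then U+00A1 encoded as UTF-8
    print(#blob)                                   --> 4
    print(blob == string.char(0, 255, 194, 161))   --> true

On the C side the same value travels through lua_pushlstring/lua_tolstring with its length carried alongside it, which is exactly the "untyped array of bytes" contract.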

Besides complexity, this seems like the next strongest argument for avoiding a distinct Unicode string type.

I do not see an obvious way of tagging strings by intent (this one is a byte buffer, that one is text for a window title), which is why I have been focusing on allowing some functions to guarantee they are receiving and returning valid UTF-8 strings; instead of intent being carried in the value, it is denoted by which flavor of functions one uses on the string.
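
Something like this is the flavor I mean, in plain Lua. The names utf8_assert and utf8_len are made up for the example, and the check is simplified (it validates lead and continuation bytes, not every overlong or surrogate corner case):

    -- Illustrative only: refuse to operate on byte strings that are not
    -- well-formed UTF-8.
    local function utf8_assert(s)
      local i, n = 1, #s
      while i <= n do
        local c = s:byte(i)
        local len
        if c < 0x80 then len = 1
        elseif c >= 0xC2 and c <= 0xDF then len = 2
        elseif c >= 0xE0 and c <= 0xEF then len = 3
        elseif c >= 0xF0 and c <= 0xF4 then len = 4
        else error(("not UTF-8 at byte offset %d"):format(i), 2) end
        for j = i + 1, i + len - 1 do
          local cc = s:byte(j)
          if not cc or cc < 0x80 or cc > 0xBF then
            error(("not UTF-8 at byte offset %d"):format(i), 2)
          end
        end
        i = i + len
      end
      return s
    end

    -- A utf8-flavored counterpart to #s: counts code points, not bytes.
    local function utf8_len(s)
      utf8_assert(s)
      local _, count = s:gsub("[^\128-\191]", "")  -- non-continuation bytes
      return count
    end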

Unlike a tagged type, functions that assert validity can't detect type errors at every call; they only discover them as non-ASCII data is passed through. But the errors are discovered precisely at the point where the type assumption fails, rather than causing unpredictable behavior at some later point.
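
With functions of that sort (same made-up names as above), ASCII flows through untouched and the complaint fires at the call that actually received the bad bytes:

    print(utf8_len("hello"))          --> 5
    print(utf8_len("na\195\175ve"))   --> 5   ("naïve" as UTF-8, 6 bytes)
    utf8_len("caf\233")               --> error: not UTF-8 at byte offset 4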

My mind has been focused on this precision-of-diagnosis problem by an awful argument about how to begin teaching programming to electrical engineers. In C, the failure modes of mismanaged pointers, bounds, and memory allocation are "undefined" behavior. "Undefined" means the implementation is allowed to do anything, including starting up a web browser pointing at cutethingsfallingasleep.org precisely 5% of the time, but as Microsoft and Adobe are aware, the consequences are usually more painful. The argument was that making mistakes in pointer discipline *should* be painful, and intro classes should wash out those not suited for the field anyway.

(This wandered into how SICP was a bad book for learning how "real" computers work, and at that point I realized that people who hadn't skimmed to the last half of the book were probably not the ideal people to discuss its pedagogy with, and I took my leave.)

String operations closed under UTF-8 have a lot of nice properties. However, if a non-UTF-8 sequence manages to sneak in, the error may be difficult to localize and may be contagious. It will be detected later when, say, a conformant XML processor attempts to read the output, but I worry this is all too similar to "random broken behavior later forces you to focus on correctness all the time, so it's good for you." I suppose we are surrounded by subtle logic bugs, but it feels bad to add another class.
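
For instance, and this is the shape of failure I worry about rather than anything exotic:

    -- The bad byte arrives from somewhere byte-oriented, survives ordinary
    -- string operations untouched, and only a downstream consumer notices.
    local title = "caf\233"                          -- Latin-1 bytes, not UTF-8
    local doc = ("<title>%s</title>"):format(title)  -- Lua is perfectly happy
    -- ...much later, a conformant XML processor rejects the whole document.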

D did not come up, but yes, every language I enjoy programming in also has GC, bounds checking, etc. Most of them do have distinct Unicode types, and after going through a lot of Java 1.1 and Python IO code I decided many of the complaints were just whining about having to explicitly identify encodings (Java picked the wrong answer) or having to deal with non-byte text at all. As I said, the Tower of Babel incident was unfortunate, but you can't wish it away. Forcing people who are currently ignoring it to write code portable to non-byte locales doesn't seem to be working very well either, though.

BTW, I had this wonderful, awful idea that searching with the utf8-flavored functions could return not 1 or 7 for where a match began, but 1.1 and 7.7. Well, 1.025 or something more mantissa-friendly; the point is just to break accidental integer arithmetic, because there is nothing of use at byte offset 7.7+1, and these index numbers make little sense outside the context of a particular string. I'm probably joking....
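
Just to make the joke concrete (utf8_find is as made-up as everything above, and the fraction is arbitrary):

    -- Mostly a joke: poison the returned offsets against accidental
    -- byte arithmetic.
    local function utf8_find(s, text)
      local i, j = s:find(text, 1, true)   -- plain find; patterns are another story
      if not i then return nil end
      return i + 0.25, j + 0.25            -- a dyadic fraction, kind to the mantissa;
                                           -- "7.25" is useless as a byte offset
    end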

Jay