lua-l archive

On Thu, Feb 23, 2012 at 5:52 AM, Miles Bader <> wrote:
> Henning Diedrich <> writes:
>>>> One example of this is utf8 validity. assert_utf8(s) really wants to
>>>> privately decorate s with s::utf8.is_valid. As it happens there are
>>>> some reserved bits left in Lua string internals so this is one of
>>>> those rare cases of a cooperative v(ictim).
>>> ... except that "is valid utf8" is probably not a sufficiently important
>>> or useful concept to warrant using rare reserved bits for...
>>> -miles
>> I find this bifurcating, the only way to figure the intent of the poster,
>> (ironic | serious) is by looking up the history of his postings.

> I'm sorry, I don't understand what you wrote.  Are you complaining
> about thread drift?

[going to be literal here, since I know a lot of readers will be]

Not knowing you worked in a Japanese-language environment, I took your
statement that "everybody just does UTF-8 without error-checking" to
be grounded in the common ignorance of how ingrained EUC-*/GB2312/JIS
is--from an ocean's distance away in the Americas, many imagine there
really has been an orderly transition to UTF-8, since so many formats
assume it and programming languages map everything to Unicode.[1] In
context, I would classify your statement as gallows humor, as it's
what everybody ends up doing despite better possibilities--except for
the people who *are* using EUC/GB/JIS in Lua; although we don't hear
much from them here, they've gotta exist.

There are plenty of "all the world's ASCII/Latin-1/CP1252"[2] people
around, and there are a decent number of "I write correct code, so
I don't worry about UTF-8 validity" proponents. (The latter probably
have not fretted about the MUST in the last paragraph of section 3 of
RFC 3629, which points straight at the Security Considerations.) Both
of those classes would find assert_utf8 memoization a fringe concern,
and rightfully not deserving of using up bits in the reserved byte in
the string rep.

In addition, there is a group who think assert_utf8 is important but
relatively uncommon, something not likely to be applied to many
strings except on input and output. As we've discussed, many common
string operations are closed over UTF-8 (and extended UTF-8), so it
takes near-zero effort for string primitives to flag their output
based on their inputs; and a style that values precise diagnosis puts
a lot of assert_utf8() calls in the interior of code, not just at the
edges.
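
To make the "near zero effort" point concrete, here is a minimal
sketch in C. It is an illustration under assumptions, not Lua's actual
code: Str is a hypothetical stand-in for a string header (it is not
TString), and utf8_state, str_new, str_concat and str_is_utf8 are
names made up for this example. The scan is the strict RFC 3629 form
(overlongs, surrogates and anything above U+10FFFF are rejected);
extended UTF-8 is not handled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical cached-validity states; these play the role of the flag bits. */
enum { UTF8_UNKNOWN = 0, UTF8_VALID = 1, UTF8_INVALID = 2 };

/* Hypothetical string header. NOT Lua's real TString. */
typedef struct {
    unsigned char utf8_state;  /* memoized result of the validity scan */
    size_t len;
    unsigned char data[];      /* bytes follow the header */
} Str;

/* Strict RFC 3629 scan: returns 1 if s[0..n) is well-formed UTF-8. */
int utf8_scan(const unsigned char *s, size_t n) {
    size_t i = 0;
    while (i < n) {
        unsigned char c = s[i];
        size_t need;
        unsigned long cp, min;
        if (c < 0x80) { i++; continue; }
        else if ((c & 0xE0) == 0xC0) { need = 1; cp = c & 0x1F; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { need = 2; cp = c & 0x0F; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { need = 3; cp = c & 0x07; min = 0x10000; }
        else return 0;                      /* stray continuation or 0xF8..0xFF */
        if (n - i <= need) return 0;        /* sequence truncated at end */
        for (size_t k = 1; k <= need; k++) {
            unsigned char t = s[i + k];
            if ((t & 0xC0) != 0x80) return 0;
            cp = (cp << 6) | (t & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF) return 0;      /* overlong or out of range */
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0;   /* surrogate */
        i += need + 1;
    }
    return 1;
}

Str *str_new(const char *bytes, size_t len) {
    Str *s = malloc(sizeof *s + len);
    if (!s) return NULL;
    s->utf8_state = UTF8_UNKNOWN;
    s->len = len;
    if (len) memcpy(s->data, bytes, len);
    return s;
}

/* Scan at most once per string; later calls just read the cached state. */
int str_is_utf8(Str *s) {
    if (s->utf8_state == UTF8_UNKNOWN)
        s->utf8_state = utf8_scan(s->data, s->len) ? UTF8_VALID : UTF8_INVALID;
    return s->utf8_state == UTF8_VALID;
}

void assert_utf8(Str *s) { assert(str_is_utf8(s)); }

/* Concatenation is closed over UTF-8: valid .. valid is valid, so the
   result can be flagged without a rescan. Every other combination is
   conservatively left UNKNOWN and scanned lazily if anyone asks. */
Str *str_concat(const Str *a, const Str *b) {
    Str *r = malloc(sizeof *r + a->len + b->len);
    if (!r) return NULL;
    r->len = a->len + b->len;
    if (a->len) memcpy(r->data, a->data, a->len);
    if (b->len) memcpy(r->data + a->len, b->data, b->len);
    r->utf8_state = (a->utf8_state == UTF8_VALID && b->utf8_state == UTF8_VALID)
                        ? UTF8_VALID : UTF8_UNKNOWN;
    return r;
}
```

With something like this, an assert_utf8() on the result of
concatenating two already-checked strings costs one byte comparison,
which is what makes sprinkling the assertion through interior code
affordable.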

Worrying about reserved bits is a false economy anyway. The contents
of TString are not in the API, and are not visible outside. It's good
engineering practice to only use reserved bits on really critical
functionality, since running out can be catastrophic. This is a case
where the use of flag bits on the string values themselves can easily
be replaced by weak tables if a better use for those bits comes along.
Again, nobody would know except people who already know what a TString
is.

So given I know you know all of that, it is difficult to tell how
serious you are when you say "that's not a good use for those bits."


[1]: Like I said, I think Ruby's i18n string representation approach
might be an elaborate parody of alphabet-centric attitudes from the
1990s; the analogy certainly is amusing even if it is unintentional.
Parrot is a little more sophisticated, but still seems like a
nightmare. And I've worked with software built by a "Han unification
is contaminating our precious bodily fluids" person, and it was an
ongoing drag.

[2]: Around 2000 I briefly email-interviewed with a company intent on
building a really hardcore HTML app version of their thick client. As
an aside, I mentioned their mail system was misidentifying their
CP1252 text as iso8859-1, and this could pose interoperability
problems. There are a couple of reasonable responses to this, but
"enh, whatever" from an architect designing for the interoperable Web
is not one of them. I didn't go work for them, and their product
didn't ship.