On Tue, Feb 7, 2012 at 5:29 AM, Miles Bader <miles@gnu.org> wrote:
> HyperHacker <hyperhacker@gmail.com> writes:
>> I do think a simple UTF-8 library would be quite a good thing to have
>> - basically just have all of Lua's string methods, but operating on
>> characters instead of bytes. (So e.g. ustring.sub(str, 3, 6) would
>> extract the 3rd to 6th characters of str, not necessarily bytes.) My
>> worry though would be ending up like PHP, where you have to remember
>> to use the mb_* functions instead of the normal ones.

So I started down this path, and realized the same thing Miles did: I
very rarely did this. General multilingual text is far more
complicated than ASCII, and there's not much one really can do in,
say, a loop iteration with COMBINING DOUBLE INVERTED BREVE. Contrary
to monolingual practice, iteration or addressing by code point is just
not that common.

> I think many people looking at the issue try too hard to come up with
> some pretty abstraction, but that the actual benefit to users of these
> abstractions isn't so great... especially for environments (like Lua)
> where one is trying to minimize support libraries.

Yeah. "Just throw it in" seems like a typical disaster like the HTML
DOM. http://c2.com/xp/YouArentGonnaNeedIt.html says:

|    "Always implement things when you actually need them, never when
you just foresee that you need them."
|    Even if you're totally, totally, totally sure that you'll need a
feature later on, don't implement it now. Usually, it'll turn out
either a) you don't need it after all, or b) what you actually need is
quite different from what you foresaw needing earlier.

I like to say that when you build generality for a future you don't
understand, when the future arrives you find out you were right: you
didn't understand it.

> My intuition is that almost all string processing tends to treat
> strings not as sequences of "characters" so much as sequences of other
> strings, many of which are fixed, and so have known properties.

Yeah. string.gmatch is the real string iteration operation, for
multiple reasons. It expresses intent compactly, it's implemented in C
so it's faster than an interpreted loop, and it's common enough that
it should be implemented once by smart, focused people who are
probably going to do a better job at it than you are in each
individual instance.

As I type that, I notice those look remarkably like the arguments for
any inclusion in string.* anyway.
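
To make the gmatch point concrete, here's the sort of loop I mean (a
trivial, untested sketch):

    -- Pull a line apart by intent (the fields between commas) instead
    -- of walking it index by index.
    local line = "name,2012-02-07,42"
    for field in string.gmatch(line, "[^,]+") do
      print(field)
    end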

> It seems much more realistic to me -- and perfectly usable -- to
> simply say that strings contain UTF-8,

...well-formed UTF-8...

> and offer a few functions like:
>
>  utf8.unicode_char (STRING[, BYTE_INDEX = 0]) => UNICHAR
>  utf8.char_offset (STRING, BYTE_INDEX, NUM_CHARS) => NEW_BYTE_INDEX

I agree, although I would prefer we talk about code points, since
people coming from "one glyph, one character" environments (the
precomposed world) are just going to lose when their mental model
encounters U+202B RIGHT-TO-LEFT EMBEDDING or combining characters or
all of the oddities of scripts I never have seen.
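
Something like this Lua-only sketch is what I imagine for those two,
assuming the input is already well-formed UTF-8 (the names and
argument conventions are just my guess at the intent, 1-based as usual
in Lua, and it's untested):

    local utf8 = {}

    -- Decode the code point whose first byte is at byte index i.
    function utf8.unicode_char(s, i)
      i = i or 1
      local c = s:byte(i)
      if c < 0x80 then return c end
      local n, cp
      if c >= 0xF0 then n, cp = 4, c % 0x08
      elseif c >= 0xE0 then n, cp = 3, c % 0x10
      else n, cp = 2, c % 0x20
      end
      for k = 1, n - 1 do
        cp = cp * 64 + s:byte(i + k) % 64  -- fold in each continuation byte
      end
      return cp
    end

    -- Advance num_chars code points from byte index i and return the
    -- new byte index (forward only, no bounds checking).
    function utf8.char_offset(s, i, num_chars)
      for _ = 1, num_chars do
        repeat
          i = i + 1
          local b = s:byte(i)
          -- keep stepping over continuation bytes (0x80..0xBF)
        until not b or b < 0x80 or b > 0xBF
      end
      return i
    end

    print(utf8.unicode_char("Я"))  --> 1071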

> Most existing string functions are also perfectly usable on UTF-8, and
> do something reasonable with it:

...when the functions' domain is well-formed UTF-8...

>   sub
>
>        Works fine if the indices are calculated reasonably -- and I
>        think this is almost always the case.  People don't generally
>        do [[ string.sub (UNKNOWN_STRING, 3, 6) ]], they calculate a
>        string position, e.g. by searching, or string beginning/end,
>        and maybe calculate offsets based on _known_ contents, e.g.
>        [[ string.sub (s, 1, string.find (s, "/") - 1) ]]

In an explicitly strongly-typed language, these numbers, call them
UINDEXs, would belong to a separate type, because you can't do
arithmetic on them in any obviously useful way.

I appreciate that concise examples are hard, but [[
string.sub(s,1,string.find(s,"/")-1) ]] sounds more like a weakness in
the string library. I'm lazy, so I would write [[ string.match(s,
"(.-)/") ]]. This violates the "write code in Lua, not in regexps"
principle, especially if you write it as string.match(s, "([^/]*)/")
or other, worse things that happen before I've finished my coffee.
Sprinkle with assert() to taste.

I have a mental blind spot on non-greedy matching. Iterating on
"(.-)whatever" is one of those things I wish were in that hypothetical
annotated manual.

In regexp languages without non-greedy capture I have to write a
function returning [[ string_before_match, match = sfind(s, "/") ]].
This is the single-step split function, and is a primitive in how I
think about processing strings.[1]
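
In Lua, one plausible shape for that primitive (untested sketch;
split_once is my name for it, not anything standard):

    -- Return what comes before the first (literal) separator plus the
    -- rest, or the whole string and nil if the separator is missing.
    local function split_once(s, sep)
      local i, j = string.find(s, sep, 1, true)  -- plain find, no patterns
      if not i then return s, nil end
      return string.sub(s, 1, i - 1), string.sub(s, j + 1)
    end

    print(split_once("usr/local/share", "/"))  --> usr     local/share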

>        [One exception might be chopping a string to fit some length
>        limit using [[ string.sub (s, 1, LIMIT) ]].  Where it's
>        actually a byte limit (fixed buffers etc), something like [[
>        string.sub (s, 1, utf8.char_offset (s, LIMIT)) ]] suffices,

Agree. A clamp function that returns at most n bytes of valid UTF-8.
This may separate composed characters, but the usage model is you'll
either be concatenating with the following characters later, or you
really don't care about textual fidelity because truncation is the
most important goal.

Because of how UTF-8 works, it is easy to produce valid UTF-8 output
when given valid UTF-8 input. (Get used to that sentence.)
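
As a sketch (untested, and assuming the input really is well-formed
UTF-8), the clamp is just "back up off continuation bytes":

    -- Return at most limit bytes of s, never ending mid-sequence.
    local function clamp_bytes(s, limit)
      if #s <= limit then return s end
      local i = limit + 1
      -- back off any continuation bytes (0x80..0xBF) so we don't cut
      -- a sequence in half
      while i > 1 do
        local b = s:byte(i)
        if b < 0x80 or b > 0xBF then break end
        i = i - 1
      end
      return s:sub(1, i - 1)
    end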

>        but for things like _display_ limits, calculating display
>        widths of unicode characters isn't so easy...even with full
>        tables.]

This looks like it should be done in a library, but there is a useful
approximation. I could see clamping to n code points instead of bytes.
For the precomposed world, this can approximate character cells in an
xterm, especially if you count CJK fullwidth characters as two. So as
an exercise, here's what you'd need to do.

Theoretically you need the whole table, but given the sloppy goal of
"don't run off the end of the 80 'column' CJK line if you can help it"
it's 0x11xx, 0x2Fxx-0x9Fxx, 0xACxx-0xD7xx, and 0xF9xx-0xFAxx. Outside
the BMP, 0x200xx-0x2FFxx. Yes, I shortchanged the I Ching.[3]
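
Reading those as U+1100-U+11FF, U+2F00-U+9FFF, U+AC00-U+D7FF,
U+F900-U+FAFF, and U+20000-U+2FFFF, the sloppy column counter is just
(untested):

    local function approx_width(cp)
      if (cp >= 0x1100 and cp <= 0x11FF)
          or (cp >= 0x2F00 and cp <= 0x9FFF)
          or (cp >= 0xAC00 and cp <= 0xD7FF)
          or (cp >= 0xF900 and cp <= 0xFAFF)
          or (cp >= 0x20000 and cp <= 0x2FFFF) then
        return 2  -- pretend everything in a CJK-ish block is fullwidth
      end
      return 1    -- everything else gets one cell; combining marks ignored
    end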

The arrangement of Unicode does suggest an iterator/mapper primitive:
given a code point c, look up t[c >> 8]. If it's a
number/string/boolean, return it; if it's a table, return
t[c >> 8][c & 0xff]. Presumably these would be gets rather than
rawgets, so subtables would have an __index which could look up and
memoize on the fly. This would handle the jagged U+1160-U+11FF more
correctly. I dunno. I said nobody wants to iterate over strings, and
now I've contradicted myself.
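
In plain Lua 5.1 (no bit operators, so divide/modulo stand in for
>> 8 and & 0xff) that primitive might look like this; the width table
is hypothetical, just to show the jagged-block case (untested):

    local function lookup(t, cp)
      local hi = math.floor(cp / 256)
      local v = t[hi]
      if type(v) == "table" then
        return v[cp % 256]  -- per-code-point subtable
      end
      return v              -- number/string/boolean for a uniform block
    end

    local width = { [0xAC] = 2 }  -- e.g. treat all of 0xACxx as fullwidth
    width[0x11] = setmetatable({}, {
      -- jagged block: compute on first access and memoize
      __index = function(t, lo)
        local w = (lo < 0x60) and 2 or 1  -- roughly: U+1100..U+115F wide
        rawset(t, lo, w)
        return w
      end,
    })

    print(lookup(width, 0xAC01), lookup(width, 0x1100), lookup(width, 0x1175))
    --> 2    2    1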

>   upper
>   lower

Because of how UTF-8 works, it is easy to produce valid UTF-8 output
when given valid UTF-8 input.

>        Works fine, but of course only upcases ASCII characters.

...if you're in the C locale. Amusingly--well, no, frustratingly--the
MacPorts version of lua, run interactively, gives different results
because readline sets the locale:

$ export LC_ALL=en_US.ISO8859-15
$ lua -e 's=string f="%02x" print(f:format(s.byte(s.upper(s.char(0xE9)))))'
e9
$ lua
Lua 5.1.4  Copyright (C) 1994-2008 Lua.org, PUC-Rio
> s=string f="%02x" print(f:format(s.byte(s.upper(s.char(0xE9)))))
c9

>   len
>        [...works] for calculating the string
>        index of the end of the string (for further searching or
>        whatever).

Yeah. It returns that UINDEX opaque number type when used that way.

>   rep
>   format

Because of how UTF-8 works, it is *guaranteed* to produce valid UTF-8
output when given valid UTF-8 input. This guarantee not valid if you
end up in a non-UTF-8 locale with number formatting outside ASCII....

>   byte
>   char
>
>        Work fine

Whaaa? #char(0xE9) needs to be 2 if we're working in the UTF-8 text
domain. Similarly, string.byte("Я") needs to return 1071. Which says
"byte" is a bad name, but those two need to be inverses.

byte(s, i, j) is also only defined when beginning and ending at UINDEX
positions; that is to say, you can't start in the middle of a UTF-8
sequence and you can't stop in the middle of one either. But given the
#char(0xE9)==2 definition, char is not a loophole for bogus UTF-8 to
sneak in.
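
To pin down that semantics, here's a purely hypothetical text_char
(the decode direction is the unicode_char sketch above; untested):

    local function text_char(cp)
      if cp < 0x80 then
        return string.char(cp)
      elseif cp < 0x800 then
        return string.char(0xC0 + math.floor(cp / 64), 0x80 + cp % 64)
      elseif cp < 0x10000 then
        return string.char(0xE0 + math.floor(cp / 4096),
                           0x80 + math.floor(cp / 64) % 64,
                           0x80 + cp % 64)
      else
        return string.char(0xF0 + math.floor(cp / 262144),
                           0x80 + math.floor(cp / 4096) % 64,
                           0x80 + math.floor(cp / 64) % 64,
                           0x80 + cp % 64)
      end
    end

    assert(#text_char(0xE9) == 2)   -- é is two bytes in the text domain
    assert(text_char(1071) == "Я")  -- ...and char/byte stay inverses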

Obviously we still need to keep around bytewise operations on some
stringlike thing. (Wait, obviously? If you're not in a single-byte
locale the candidates for bytewise ops look like things you don't
necessarily want to intern at first sight. The counterexample is
something like binary SHA-1 for table lookup.)

>   find
>   match
>   gmatch
>   gsub
>
>        Work fine for the most part.  The main exception, of course,
>        is single-character wildcards, ".", "[^abc]", etc, when used
>        without a repeat suffix -- but I think in practice, these are
>        very rarely used without a repeat suffix.

Agree. I think "." and friends need to consume a whole UTF-8 code
point, since otherwise they could return confusing values which aren't
UINDEXs, and produce captures with invalid content. I imagine this is
not that hard as long as non-ASCII subpatterns are not allowed. "Я" is
not a single byte, and "Я+" looks painful.) But you could easily
search for a literal "Я" the same way you search for "abc" now.

With the consumption caveat, it is (relatively) easy to produce valid
UTF-8 output when given valid UTF-8 input.

With that definition, I just noticed that string.gmatch(s, ".") *is*
the code point iterator. Hmm.
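
(It's also roughly what you can fake today by spelling out a UTF-8
sequence as an explicit pattern, well-formed input assumed as always:

    local utf8_char = "[\1-\127\194-\244][\128-\191]*"
    for ch in string.gmatch("Яблоко", utf8_char) do
      io.write(ch, " ")  --> Я б л о к о
    end
    io.write("\n")

but having "." do it would be the honest version.)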

>   reverse
>        Now _this_ will probably simply fail for strings containing
>        non-ASCII UTF-8.

And you don't want to reverse your combining code points either.
Nobody will use it.

> IOW, before trying to come up with some pretty (and expensive)
> abstraction, it seems worthwhile to think: in what _real_ situations
> (i.e., actually occur in practice) does simply "doing nothing" not
> work?  In some cases, code might have to be tweaked a little, but I
> suspect it's often enough to just say "so don't do that" (because most
> code doesn't do that anyway).

I'm beginning to think we could get away with doing most of this in
UTF-8 with existing operations, *and* have some hope of retaining
well-formedness, without too much additional code or runtime overhead.
But the catch is that you must guarantee well-formed UTF-8 on the way
in or you'll get garbage out. So that's why I want a memoized
assert_utf8(): defend the border, and a lot of other things take care
of themselves. Otherwise, you'll undoubtedly get invalid singleton
high bytes wandering into a string, causing random other bugs with
little hope of tracing back to the place where you added one to a
UOFFSET.
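
One possible shape for that border guard, as a byte-walking check in
Lua (untested; deliberately naive -- it rejects stray continuation
bytes and truncated sequences, but not overlongs or surrogates, and
the per-string memoization would have to live down in the C layer):

    local function assert_utf8(s)
      local i, n = 1, #s
      while i <= n do
        local b, len = s:byte(i)
        if b < 0x80 then len = 1
        elseif b >= 0xC2 and b <= 0xDF then len = 2
        elseif b >= 0xE0 and b <= 0xEF then len = 3
        elseif b >= 0xF0 and b <= 0xF4 then len = 4
        else error(("bad UTF-8 lead byte at %d"):format(i), 2) end
        for k = i + 1, i + len - 1 do
          local c = s:byte(k)
          if not c or c < 0x80 or c > 0xBF then
            error(("truncated UTF-8 sequence at %d"):format(i), 2)
          end
        end
        i = i + len
      end
      return s
    end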

If we are interning strings with a hash, we have to walk their whole
length anyway; it might be worth checking then. For operations where
the C code is really certain the result is valid UTF-8, they could
tell luaS_newlstr about it. lua_concat of UTF-8 strings is guaranteed
to be UTF-8....

> The main question I suppose is:  is the resulting user code, using
> mostly ordinary string functions plus a little minimal utf8 tweaking,
> going to be significantly uglier/harder-to-maintain/confusing, to the
> point where using a heavier-weight abstraction might be worthwhile?
>
> My suspicion is that for most apps, the answer is no...

Well, that certainly makes Roberto happy. I think after going through
this exercise, the unresolved question is whether there should be a
byte vs text distinction in operations.

I think I've made a good case that text.* would reduce the cost of
bugs by localizing the source of error, and that the complexity of
implementation doesn't look that bad. With such a distinction between
text and bytes, the compatibility cost of switching to (or adding)
UTF-16 or UTF-32 internal representations isn't as high; in the latter
case, it just turns out that every UINDEX is valid.

The elephant in the room is normalization forms; once you've got all
these parts, you're going to want NFC. But that's big-table, and a
loadable library can provide a string-to-string transformation.

Jay

[1]: (string_before, match) is a perl4 habit I think, from the $`
special variable. It has an easy implementation for literal matches,
and *that* habit probably goes back to Applesoft, Commodore, or Harris
VULCAN BASIC. Along with jwz's "now they have two problems"[2] I
believe I've personally hit the Dijkstra trifecta (
http://www.cs.utexas.edu/users/EWD/transcriptions/EWD04xx/EWD498.html
):

"PL/I—'the fatal disease'—belongs more to the problem set than to the
solution set.

"It is practically impossible to teach good programming to students
that have had a prior exposure to BASIC: as potential programmers they
are mentally mutilated beyond hope of regeneration.

"The use of COBOL cripples the mind; its teaching should, therefore,
be regarded as a criminal offence."

[2]: See http://regex.info/blog/2006-09-15/247 for a lot of
original-source background on "now they have two problems". I spent a
bunch of time tracking down a *bogus* jwz attribution which I would in
turn cite, but handhelds.org mail archives have been down for like a
year, and http://article.gmane.org/gmane.comp.handhelds.ipaq.general/12198
and jwz's followup at
http://article.gmane.org/gmane.comp.handhelds.ipaq.general/12226 are
not responding this morning either....

[3]: No blame.