What about sorting/collating? That would be useful. But is that a big-table-thing in Unicode?
Best regards
ignorant Egil


On 2012-02-07 22:40, Jay Carlson wrote:
On Tue, Feb 7, 2012 at 5:29 AM, Miles Bader <miles@gnu.org> wrote:
HyperHacker <hyperhacker@gmail.com> writes:
I do think a simple UTF-8 library would be quite a good thing to have
- basically just have all of Lua's string methods, but operating on
characters instead of bytes. (So e.g. ustring.sub(str, 3, 6) would
extract the 3rd to 6th characters of str, not necessarily bytes.) My
worry though would be ending up like PHP, where you have to remember
to use the mb_* functions instead of the normal ones.
So I started down this path, and realized the same thing Miles did: I
very rarely did this. General multilingual text is far more
complicated than ASCII, and there's not much one really can do in,
say, a loop iteration with COMBINING DOUBLE INVERTED BREVE. Contrary
to monolingual practice, iteration or addressing by code point is just
not that common.

I think many people looking at the issue try too hard to come up with
some pretty abstraction, but that the actual benefit to users of these
abstractions isn't so great... especially for environments (like Lua)
where one is trying to minimize support libraries.
Yeah. "Just throw it in" seems like a typical disaster like the HTML
DOM. http://c2.com/xp/YouArentGonnaNeedIt.html says:

|    "Always implement things when you actually need them, never when
you just foresee that you need them."
|    Even if you're totally, totally, totally sure that you'll need a
feature later on, don't implement it now. Usually, it'll turn out
either a) you don't need it after all, or b) what you actually need is
quite different from what you foresaw needing earlier.

I like to say that when you build generality for a future you don't
understand, when the future arrives you find out you were right: you
didn't understand it.

My intuition is that almost all string processing tends to treat
strings not as sequences of "characters" so much as sequences of other
strings, many of which are fixed, and so have known properties.
Yeah. string.gmatch is the real string iteration operation, for
multiple reasons. It expresses intent compactly, it's implemented in C
so it's faster than an interpreter, and it's common enough that it
should be implemented once by smart, focused people who are probably
going to do a better job at it than you are in each individual
instance.

As I type that, I notice those look remarkably like the arguments for
any inclusion in string.* anyway.
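
To make that concrete, the kind of iteration I mean is the structured
kind, which has nothing to do with characters at all; the keys and
pattern here are only an example:

  -- walk a string as structured substrings rather than characters
  local s = "host=example.org; port=8080; scheme=https"
  for k, v in s:gmatch("(%w+)=([%w%.]+)") do
    print(k, v)    --> host example.org / port 8080 / scheme https
  end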

It seems much more realistic to me -- and perfectly usable -- to
simply say that strings contain UTF-8,
...well-formed UTF-8...

and offer a few functions like:

  utf8.unicode_char (STRING[, BYTE_INDEX = 0]) =>  UNICHAR
  utf8.char_offset (STRING, BYTE_INDEX, NUM_CHARS) =>  NEW_BYTE_INDEX
I agree, although I would prefer we talk about code points, since
people coming from "one glyph, one character" environments (the
precomposed world) are just going to lose when their mental model
encounters U+202B RIGHT-TO-LEFT EMBEDDING or combining characters or
all of the oddities of scripts I never have seen.
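
For what it's worth, here's an untested plain-Lua sketch of roughly
what those two could look like, assuming well-formed UTF-8, 1-based
byte indices (rather than the 0 default above), and forward-only
movement in char_offset; names and defaults are illustrative, not a
settled API:

  local byte = string.byte

  local function unicode_char(s, i)      -- => code point at byte index i
    i = i or 1
    local c = byte(s, i)
    if c < 0x80 then return c end        -- single-byte sequence
    local n, cp
    if c >= 0xF0 then n, cp = 4, c - 0xF0
    elseif c >= 0xE0 then n, cp = 3, c - 0xE0
    else n, cp = 2, c - 0xC0 end
    for k = 1, n - 1 do
      cp = cp * 64 + (byte(s, i + k) - 0x80)  -- fold in continuation bytes
    end
    return cp
  end

  local function char_offset(s, i, num_chars)  -- => new byte index
    local j = i
    for _ = 1, num_chars do
      j = j + 1
      while true do                      -- skip continuation bytes 0x80..0xBF
        local c = byte(s, j)
        if not c or c < 0x80 or c > 0xBF then break end
        j = j + 1
      end
    end
    return j
  end

  -- unicode_char("Я") --> 1071; char_offset("Яблоко", 1, 2) --> 5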

Most existing string functions are also perfectly usable on UTF-8, and
do something reasonable with it:
...when the functions' domain is well-formed UTF-8...

   sub

        Works fine if the indices are calculated reasonably -- and I
        think this is almost always the case.  People don't generally
        do [[ string.sub (UNKNOWN_STRING, 3, 6) ]], they calculate a
        string position, e.g. by searching, or string beginning/end,
        and maybe calculate offsets based on _known_ contents, e.g.
        [[ string.sub (s, 1, string.find (s, "/") - 1) ]]
In an explicitly strongly-typed language, these numbers, call them
UINDEXs, would belong to a separate type, because you can't do
arithmetic on them in any obviously useful way.

I appreciate that concise examples are hard, but [[
string.sub(s,1,string.find(s,"/")-1) ]] sounds more like a weakness in
the string library. I'm lazy so I would write [[ string.match(s,
"(.-)/") ]]. This violates "write code in Lua, not in regexps"
principle, especially if you write it as the string.match(s,
"([^/]*)/") or other worse things that happen before I've finished my
coffee. Sprinkle with assert() to taste.

I have a mental blind spot on non-greedy matching. Iterating on
"(.-)whatever" is one of those things I wish was in that hjypothetical
annotated manual.

In regexp languages without non-greedy capture I have to write a
function returning [[ string_before_match, match = sfind(s, "/") ]].
This is the single-step split function, and is a primitive in how I
think about processing strings.[1]
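
In Lua it's a three-liner; split1 is my name for it, not anything in a
real library, and the separator is taken literally (plain find):

  local function split1(s, sep)
    local i, j = string.find(s, sep, 1, true)   -- plain find, no patterns
    if not i then return s end                  -- no separator present
    return s:sub(1, i - 1), s:sub(i, j), s:sub(j + 1)
  end

  -- split1("usr/local/share", "/")  --> "usr", "/", "local/share"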

        [One exception might be chopping a string to fit some length
        limit using [[ string.sub (s, 1, LIMIT) ]].  Where it's
        actually a byte limit (fixed buffers etc), something like [[
        string.sub (s, 1, utf8.char_offset (s, LIMIT)) ]] suffices,
Agree. A clamp function that returns at most n bytes of valid UTF-8.
This may separate composed characters, but the usage model is you'll
either be concatenating with the following characters later, or you
really don't care about textual fidelity because truncation is the
most important goal.

Because of how UTF-8 works, it is easy to produce valid UTF-8 output
when given valid UTF-8 input. (Get used to that sentence.)
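
Untested sketch of that clamp, assuming well-formed input; "clamp" is
just an illustrative name. It returns at most limit bytes and never
ends inside a multi-byte sequence:

  local function clamp(s, limit)
    if #s <= limit then return s end
    local i = limit + 1
    while i > 1 do
      local c = string.byte(s, i)
      if c < 0x80 or c > 0xBF then break end   -- not a continuation byte
      i = i - 1                                -- back out of the sequence
    end
    return s:sub(1, i - 1)
  end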

        but for things like _display_ limits, calculating display
        widths of unicode characters isn't so easy...even with full
        tables.]
This looks like it should be done in a library, but there is a useful
thing like it. I could see clamping to n code points instead of bytes.
For the precomposed world, this can approximate character cells in an
xterm, especially if you count CJK fullwidth as two. So as an
exercise, here's what you'd need to do.

Theoretically you need the whole table, but given the sloppy goal of
"don't run off the end of the 80 'column' CJK line if you can help it"
it's 0x11xx, 0x2Fxx-0x9Fxx, 0xACxx-0xD7xx, and 0xF9xx-0xFAxx. Outside
the BMP, 0x20000-0x2FFFF. Yes, I shortchanged the I Ching.[3]
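
Written out, the sloppy estimate is just a range check on that high
byte (cp_width is a made-up name; combining marks, ambiguous width,
and the rest of reality are ignored):

  local function cp_width(cp)
    local hi = math.floor(cp / 256)            -- the 0xNNxx high byte
    if hi == 0x11
       or (hi >= 0x2F and hi <= 0x9F)
       or (hi >= 0xAC and hi <= 0xD7)
       or hi == 0xF9 or hi == 0xFA
       or (hi >= 0x200 and hi <= 0x2FF) then   -- outside the BMP
      return 2
    end
    return 1
  end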

The arrangement of Unicode does suggest an iterator/mapper primitive:
given a code point c, look up t[c >> 8]. If it's a
number/string/boolean, return it; if it's a table, return
t[c >> 8][c & 0xff]. Presumably these would be gets rather than
rawgets, so subtables would have an __index which could look up and
memoize on the fly. This would handle the jagged U+1160-U+11FF more
correctly. I dunno. I said nobody wants to iterate over strings, and
now I've contradicted myself.
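
Roughly, in Lua 5.1 terms (divisions standing in for the shifts), and
with full_property() and the row contents as placeholders rather than
real Unicode data:

  local function full_property(cp)
    -- placeholder: treat U+1160..U+11FF (conjoining jungseong/jongseong)
    -- as width 0, the rest of the 0x11xx row as width 2
    if cp >= 0x1160 and cp <= 0x11FF then return 0 end
    return 2
  end

  local jagged_mt = {
    __index = function(row, lo)            -- look up and memoize on the fly
      local v = full_property(row.base + lo)
      rawset(row, lo, v)
      return v
    end,
  }

  local t = {
    [0x4E] = 2,                                          -- uniform row: all of 0x4Exx is wide
    [0x11] = setmetatable({ base = 0x1100 }, jagged_mt), -- jagged row, filled lazily
  }

  local function lookup(t, cp)
    local v = t[math.floor(cp / 256)]      -- t[c >> 8]
    if type(v) == "table" then
      return v[cp % 256]                   -- t[c >> 8][c & 0xff]
    end
    return v                               -- number/string/boolean (or nil)
  end

  -- lookup(t, 0x4E2D) --> 2; lookup(t, 0x1160) --> 0; lookup(t, 0x1100) --> 2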

   upper
   lower
Because of how UTF-8 works, it is easy to produce valid UTF-8 output
when given valid UTF-8 input.

        Works fine, but of course only upcases ASCII characters.
...if you're in the C locale. Amusingly--well, no, frustratingly--the
MacPorts version of lua, run interactively, gives different results
because readline sets the locale:

$ export LC_ALL=en_US.ISO8859-15
$ lua -e 's=string f="%02x" print(f:format(s.byte(s.upper(s.char(0xE9)))))'
e9
$ lua
Lua 5.1.4  Copyright (C) 1994-2008 Lua.org, PUC-Rio
> s=string f="%02x" print(f:format(s.byte(s.upper(s.char(0xE9)))))
c9

   len
        [...works] for calculating the string
        index of the end of the string (for further searching or
        whatever).
Yeah. It returns that UINDEX opaque number type when used that way.

   rep
   format
Because of how UTF-8 works, it is *guaranteed* to produce valid UTF-8
output when given valid UTF-8 input. This guarantee is not valid if you
end up in a non-UTF-8 locale with number formatting outside ASCII....

   byte
   char

        Work fine
Whaaa? #char(0xE9) needs to be 2 if we're working in the UTF-8 text
domain. Similarly, string.byte("Я") needs to return 1071. Which says
"byte" is a bad name, but those two need to be inverses.

byte(s, i, j) is also only defined when beginning and ending at UINDEX
positions; that is to say, you can't start in the middle of a UTF-8
sequence and you can't stop in the middle of one either. But given the
#char(0xE9)==2 definition, char is not a loophole for bogus UTF-8 to
sneak in.
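
For illustration, here's the encoding half of such a pair as an
untested sketch; text_char is a made-up stand-in for the hypothetical
text.* version of char, accepting code points up to U+10FFFF and not
bothering to reject surrogates:

  local floor, char, concat = math.floor, string.char, table.concat

  local function text_char(...)
    local out = {}
    for _, cp in ipairs({...}) do
      if cp < 0x80 then
        out[#out + 1] = char(cp)
      elseif cp < 0x800 then
        out[#out + 1] = char(0xC0 + floor(cp / 64), 0x80 + cp % 64)
      elseif cp < 0x10000 then
        out[#out + 1] = char(0xE0 + floor(cp / 4096),
                             0x80 + floor(cp / 64) % 64,
                             0x80 + cp % 64)
      else
        out[#out + 1] = char(0xF0 + floor(cp / 262144),
                             0x80 + floor(cp / 4096) % 64,
                             0x80 + floor(cp / 64) % 64,
                             0x80 + cp % 64)
      end
    end
    return concat(out)
  end

  -- #text_char(0xE9) == 2, and the matching decoder gives 1071 for "Я",
  -- so the two really are inverses.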

Obviously we still need to keep around bytewise operations on some
stringlike thing. (Wait, obviously? If you're not in a single-byte
locale the candidates for bytewise ops look like things you don't
necessarily want to intern at first sight. The counterexample is
something like binary SHA-1 for table lookup.)

   find
   match
   gmatch
   gsub

        Work fine for the most part.  The main exception, of course,
        is single-character wildcards, ".", "[^abc]", etc, when used
        without a repeat suffix -- but I think in practice, these are
        very rarely used without a repeat suffix.
Agree. I think "." and friends need to consume a whole UTF-8 code
point, since otherwise they could return confusing values which aren't
UINDEXs, and produce captures with invalid content. I imagine this is
not that hard as long as non-ASCII subpatterns are not allowed. ("Я" is
not a single byte, and "Я+" looks painful.) But you could easily
search for a literal "Я" the same way you search for "abc" now.

With the consumption caveat, it is (relatively) easy to produce valid
UTF-8 output when given valid UTF-8 input.

With that definition, I just noticed that string.gmatch(".") *is* the
code point iterator. Hmm.
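
Until "." works that way, the same iterator can be spelled today with
the usual byte-class pattern (assuming well-formed UTF-8: a lead byte
followed by any continuation bytes):

  for ch in ("Яблоко"):gmatch("[%z\1-\127\194-\244][\128-\191]*") do
    io.write(ch, " ")          --> Я б л о к о
  end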

   reverse
        Now _this_ will probably simply fail for strings containing
        non-ASCII UTF-8.
And you don't want to reverse your combining code points either.
Nobody will use it.

IOW, before trying to come up with some pretty (and expensive)
abstraction, it seems worthwhile to think: in what _real_ situations
(i.e., actually occur in practice) does simply "doing nothing" not
work?  In some cases, code might have to be tweaked a little, but I
suspect it's often enough to just say "so don't do that" (because most
code doesn't do that anyway).
I'm beginning to think we could get away with doing most of this in
UTF-8 with existing operations, *and* have some hope of retaining
well-formedness, without too much additional code or runtime overhead.
But the catch is that you must guarantee well-formed UTF-8 on the way
in or you'll get garbage out. So that's why I want a memoized
assert_utf8(): defend the border, and a lot of other things take care
of themselves. Otherwise, you'll undoubtedly get invalid singleton
high bytes wandering into a string, causing random other bugs with
little hope of tracing back to the place where you added one to a
UOFFSET.
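
A sketch of what that border check might look like in plain Lua,
untested and checking well-formedness only (overlong E0/F0 forms and
surrogates slip through); the memo table here is just the naive
stand-in for the interning hook mentioned below:

  local checked = {}    -- memo: strings we have already validated

  local function assert_utf8(s)
    if checked[s] then return s end
    local i, n = 1, #s
    while i <= n do
      local c = s:byte(i)
      local extra
      if c < 0x80 then extra = 0
      elseif c >= 0xC2 and c <= 0xDF then extra = 1
      elseif c >= 0xE0 and c <= 0xEF then extra = 2
      elseif c >= 0xF0 and c <= 0xF4 then extra = 3
      else error("invalid UTF-8 lead byte at index " .. i) end
      for k = 1, extra do
        local cc = s:byte(i + k)
        if not cc or cc < 0x80 or cc > 0xBF then
          error("truncated or invalid UTF-8 sequence at index " .. i)
        end
      end
      i = i + extra + 1
    end
    checked[s] = true
    return s
  end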

If we are interning strings with a hash, we have to walk their whole
length anyway; might be worth checking then. For those operations whose
results the C code is really certain are valid UTF-8, they could tell
luaS_newlstr about it. lua_concat of UTF-8 strings is guaranteed to be
UTF-8....

The main question I suppose is:  is the resulting user code, using
mostly ordinary string functions plus a little minimal utf8 tweaking,
going to be significantly uglier/harder-to-maintain/confusing, to the
point where using a heavier-weight abstraction might be worthwhile?

My suspicion is that for most apps, the answer is no...
Well, that certainly makes Roberto happy. I think after going through
this exercise, the unresolved question is whether there should be a
byte vs text distinction in operations.

I think I've made a good case that text.* would reduce the cost of bugs
by localizing the source of error, and the complexity of implementation
doesn't look that bad.  With such a distinction between text and
bytes, it's not as high a compatibility cost to switch or add UTF-16
or UTF-32 internal representations; in the latter, it just turns out
every UINDEX is valid.

The elephant in the room is normalization forms; once you've got all
these parts, you're going to want NFC. But that's big-table, and a
loadable library can provide a string-to-string transformation.

Jay

[1]: (string_before, match) is a perl4 habit I think, from the $`
special variable. It has an easy implementation for literal matches,
and *that* habit probably goes back to Applesoft, Commodore, or Harris
VULCAN BASIC. Along with jwz's "now they have two problems"[2] I
believe I've personally hit the Dijkstra trifecta (
http://www.cs.utexas.edu/users/EWD/transcriptions/EWD04xx/EWD498.html
):

"PL/I—'the fatal disease'—belongs more to the problem set than to the
solution set.

"It is practically impossible to teach good programming to students
that have had a prior exposure to BASIC: as potential programmers they
are mentally mutilated beyond hope of regeneration.

"The use of COBOL cripples the mind; its teaching should, therefore,
be regarded as a criminal offence."

[2]: See http://regex.info/blog/2006-09-15/247 for a lot of
original-source background on "now they have two problems". I spent a
bunch of time tracking down a *bogus* jwz attribution which I would in
turn cite, but handhelds.org mail archives have been down for like a
year, and http://article.gmane.org/gmane.comp.handhelds.ipaq.general/12198
and jwz's followup at
http://article.gmane.org/gmane.comp.handhelds.ipaq.general/12226 are
not responding this morning either....

[3]: No blame.