HyperHacker <hyperhacker@gmail.com> writes:
> I do think a simple UTF-8 library would be quite a good thing to have
> - basically just have all of Lua's string methods, but operating on
> characters instead of bytes. (So e.g. ustring.sub(str, 3, 6) would
> extract the 3rd to 6th characters of str, not necessarily bytes.) My
> worry though would be ending up like PHP, where you have to remember
> to use the mb_* functions instead of the normal ones.
>
> I suspect this could be accomplished by means of a function that
> "converts" a string to a UTF-8 string, which would be represented as a
> table or userdata with metamethods to make it behave like a string.
> Then you could just write:
> str = U'this is a UTF-8 string'
> print(#str) --gives number of characters, not number of bytes
> the main problem I can see then would be that type(str) ~= "string"...

Even what you're suggesting sounds pretty heavy-weight.

I think many people looking at this issue try too hard to come up
with some pretty abstraction, but the actual benefit of these
abstractions to users isn't so great... especially in environments
(like Lua) where one is trying to minimize support libraries.

For instance, I don't think the illusion of unit characters is
particularly valuable for most apps, and trying to maintain that
illusion is expensive.  Nor does it seem necessary to hide the
encoding unless you're in the position of needing to support legacy
multibyte encodings (and I'm ignoring that case because it adds a
huge amount of hair which I think isn't worth messing up the common
case for).

My intuition is that almost all string processing tends to treat
strings not as sequences of "characters" so much as sequences of other
strings, many of which are fixed, and so have known properties.

It seems much more realistic to me -- and perfectly usable -- to
simply say that strings contain UTF-8, and offer a few functions like:

  utf8.unicode_char (STRING[, BYTE_INDEX = 1]) => UNICHAR
  utf8.char_offset (STRING, BYTE_INDEX, NUM_CHARS) => NEW_BYTE_INDEX

["char_offset" maybe defined to work properly if the input byte index
isn't on a proper character boundary, and with a special case for
NUM_CHARS == 0 to align to the beginning of the character containing
the input byte index.]

verrry simple and light-weight.
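
To make that concrete, here's a rough sketch of how those two
functions might be written in plain Lua.  This is illustrative only,
not tested library code: it assumes valid UTF-8 input, sticks to
Lua's usual 1-based byte indices, and omits range checks and error
handling.

  local utf8 = {}

  -- True if B is a UTF-8 continuation byte (10xxxxxx).
  local function is_cont (b)
    return b ~= nil and b >= 0x80 and b < 0xC0
  end

  -- Decode and return the unicode code point of the character
  -- starting at BYTE_INDEX (default 1).
  function utf8.unicode_char (s, byte_index)
    byte_index = byte_index or 1
    local b = s:byte (byte_index)
    if b < 0x80 then
      return b                        -- one byte: plain ASCII
    end
    local nbytes, code
    if b >= 0xF0 then                 -- 11110xxx: four-byte sequence
      nbytes, code = 4, b % 0x08
    elseif b >= 0xE0 then             -- 1110xxxx: three-byte sequence
      nbytes, code = 3, b % 0x10
    else                              -- 110xxxxx: two-byte sequence
      nbytes, code = 2, b % 0x20
    end
    for i = 1, nbytes - 1 do          -- 6 more bits per continuation byte
      code = code * 0x40 + s:byte (byte_index + i) % 0x40
    end
    return code
  end

  -- Return the byte index NUM_CHARS (default 0) characters after
  -- BYTE_INDEX.  If BYTE_INDEX isn't on a character boundary, first
  -- back up to the start of the containing character; NUM_CHARS == 0
  -- thus just aligns, as described above.
  function utf8.char_offset (s, byte_index, num_chars)
    local i = byte_index
    while is_cont (s:byte (i)) do i = i - 1 end
    for _ = 1, num_chars or 0 do
      i = i + 1
      while is_cont (s:byte (i)) do i = i + 1 end
    end
    return i
  end

  return utf8

E.g., with [[ s = "héllo" ]] (where "é" occupies two bytes),
[[ utf8.unicode_char (s, 2) ]] returns 0xE9, and
[[ utf8.char_offset (s, 1, 2) ]] returns 4, the byte index of the
first "l".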

Most existing string functions are also perfectly usable on UTF-8, and
do something reasonable with it:

   sub

        Works fine if the indices are calculated reasonably -- and I
        think this is almost always the case.  People don't generally
        do [[ string.sub (UNKNOWN_STRING, 3, 6) ]]; rather, they
        calculate a string position, e.g. by searching or from the
        string beginning/end, and maybe calculate offsets based on
        _known_ contents, e.g.
        [[ string.sub (s, 1, string.find (s, "/") - 1) ]]

        [One exception might be chopping a string to fit some length
        limit using [[ string.sub (s, 1, LIMIT) ]].  Where it's
        actually a byte limit (fixed buffers, etc.), something like
        [[ string.sub (s, 1, utf8.char_offset (s, LIMIT)) ]] suffices
        (see the first sketch after this list); but for things like
        _display_ limits, calculating the display width of unicode
        characters isn't so easy... even with full tables.]

   upper
   lower

        Work fine, but of course only case-map ASCII characters.
        However, doing this "properly" requires unicode tables, so
        it isn't appropriate for a minimal library, I guess.

   len

        Works fine for calculating the string's byte length -- which
        is often what is actually wanted -- or the byte index of the
        end of the string (for further searching or whatever).

   rep
   format

        Work fine (only use concatenation)

   byte
   char

        Work fine

   find
   match
   gmatch
   gsub

        Work fine for the most part.  The main exception, of course,
        is single-character wildcards -- ".", "[^abc]", etc. -- when
        used without a repeat suffix, since those match single
        _bytes_; but in practice they are very rarely used bare
        (see the second sketch after this list).

        Some of the patterns are limited to ASCII in their
        interpretation, of course (e.g. "%a"), but this isn't really
        fixable without full unicode tables, and the ASCII-only
        interpretation is not dangerous.

   dump

        N/A

   reverse

        Now _this_ will simply produce garbage for strings containing
        non-ASCII UTF-8: reversing the bytes scrambles each multibyte
        sequence, yielding invalid UTF-8.  But it's also probably not
        very widely used...
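
As promised under "sub" above, here's a sketch of the byte-limit
chop, using the hypothetical utf8.char_offset from earlier.  The one
wrinkle is that you want everything _before_ the character boundary
containing byte LIMIT + 1, so the string is never cut in the middle
of a multibyte character:

  -- Chop S to at most LIMIT bytes without splitting a character.
  local function chop (s, limit)
    if #s <= limit then return s end
    -- Align LIMIT + 1 back to a character boundary, and keep
    -- everything before it.
    return string.sub (s, 1, utf8.char_offset (s, limit + 1) - 1)
  end

  print (chop ("héllo", 2))   --> h    ("é" won't fit in the byte left)
  print (chop ("héllo", 3))   --> hé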
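
And the second sketch: single-character wildcards match single
_bytes_, but with a repeat suffix they just consume whole byte runs,
so the usual idioms keep working on UTF-8 unchanged.  These use only
the standard string library:

  local s = "héllo wörld"

  -- Repeat-suffixed wildcards span whole multibyte characters:
  print (string.match (s, "^(%S+)"))    --> héllo
  print (string.gsub (s, "%s+", "_"))   --> héllo_wörld    1

  -- A bare "." matches one byte: here it captures just the first of
  -- "é"'s two bytes -- half a character, not valid UTF-8 on its own.
  print (#string.match (s, "h(.)"))     --> 1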


IOW, before trying to come up with some pretty (and expensive)
abstraction, it seems worthwhile to ask: in what _real_ situations
(i.e., ones that actually occur in practice) does simply "doing
nothing" not work?  In some cases, code might have to be tweaked a
little, but I suspect it's often enough to just say "so don't do
that" (because most code doesn't do that anyway).

The main question, I suppose, is: is the resulting user code, using
mostly ordinary string functions plus a little minimal utf8 tweaking,
going to be significantly uglier, harder to maintain, or more
confusing -- to the point where a heavier-weight abstraction might be
worthwhile?

My suspicion is that for most apps, the answer is no...

-miles

-- 
Yossarian was moved very deeply by the absolute simplicity of
this clause of Catch-22 and let out a respectful whistle.
"That's some catch, that Catch-22," he observed.
"It's the best there is," Doc Daneeka agreed.