I think the middle ground here is to distinguish between Lua strings as byte arrays and UTF-8 as sequences of code points. Doing full-blown Unicode support is horrendous, but providing library functions that could (perhaps not very efficiently) discern code points within the UTF-8 byte stream would be useful, without providing interpretation of those code points (which is where all the complexity of Unicode lies). One approach to this (not one I am championing, though) would be routines to convert a UTF-8 string to/from a Lua array of code points (each, presumably, a number). A better approach would be to walk the string directly, one code point at a time. None of these are good solutions, and imho Unicode switched from a solution to a problem when they overflowed the BMP.
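
To make that concrete, a rough sketch of the first approach in plain
Lua (assuming valid UTF-8 input; the byte-class pattern is the usual
idiom, and every name here is illustrative rather than proposed):

  -- Convert a UTF-8 string into a Lua array of code points (numbers).
  local function to_codepoints (s)
    local t = {}
    for ch in s:gmatch ("[%z\1-\127\194-\244][\128-\191]*") do
      local cp = ch:byte (1)
      if #ch == 2 then cp = cp % 0x20      -- 110xxxxx lead byte
      elseif #ch == 3 then cp = cp % 0x10  -- 1110xxxx lead byte
      elseif #ch == 4 then cp = cp % 0x08  -- 11110xxx lead byte
      end
      for i = 2, #ch do cp = cp * 64 + ch:byte (i) % 64 end
      t[#t + 1] = cp
    end
    return t
  end

(The gmatch loop is itself the second approach: it walks the string
one code point at a time without building the intermediate array.)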

--Tim


On Feb 6, 2012, at 9:29 PM, Miles Bader wrote:

> HyperHacker <hyperhacker@gmail.com> writes:
>> I do think a simple UTF-8 library would be quite a good thing to have
>> - basically just have all of Lua's string methods, but operating on
>> characters instead of bytes. (So e.g. ustring.sub(str, 3, 6) would
>> extract the 3rd to 6th characters of str, not necessarily bytes.) My
>> worry though would be ending up like PHP, where you have to remember
>> to use the mb_* functions instead of the normal ones.
>> 
>> I suspect this could be accomplished by means of a function that
>> "converts" a string to a UTF-8 string, which would be represented as a
>> table or userdata with metamethods to make it behave like a string.
>> Then you could just write:
>> str = U'this is a UTF-8 string'
>> print(#str) --gives number of characters, not number of bytes
>> the main problem I can see then would be that type(str) ~= "string"...
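>> 
>> A minimal sketch of what I mean (everything here is hypothetical;
>> note that #str via __len only works on tables in Lua 5.2):
>> 
>>   local mt = {}
>>   mt.__index = mt
>>   function mt:__len ()          -- # counts characters, not bytes
>>     local _, n = self.bytes:gsub ("[^\128-\191]", "")
>>     return n
>>   end
>>   function mt:__tostring () return self.bytes end
>>   local function U (s) return setmetatable ({bytes = s}, mt) end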
> 
> Even what you're suggesting sounds pretty heavy-weight.
> 
> I think many people looking at the issue try too hard to come up with
> some pretty abstraction, but that the actual benefit to users of these
> abstractions isn't so great... especially for environments (like Lua)
> where one is trying to minimize support libraries.
> 
> For instance, I don't think the illusion of unit characters is
> particularly valuable for most apps, and trying to
> maintain that illusion is expensive.  Nor does it seem necessary to
> hide the encoding unless you're in the position of needing to support
> legacy multibyte encodings (and I'm ignoring that case because it adds
> a huge amount of hair which I think isn't worth messing up the common
> case for).
> 
> My intuition is that almost all string processing tends to treat
> strings not as sequences of "characters" so much as sequences of other
> strings, many of which are fixed, and so have known properties.
> 
> It seems much more realistic to me -- and perfectly usable -- to
> simply say that strings contain UTF-8, and offer a few functions like:
> 
>  utf8.unicode_char (STRING[, BYTE_INDEX = 0]) => UNICHAR
>  utf8.char_offset (STRING, BYTE_INDEX, NUM_CHARS) => NEW_BYTE_INDEX
> 
> ["char_offset" maybe defined to work properly if the input byte index
> isn't on a proper character boundary, and with a special case for
> NUM_CHARS == 0 to align to the beginning of the character containing
> the input byte index.]
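> 
> [A sketch of the sort of implementation I have in mind, in plain Lua
> (assuming valid UTF-8 and Lua's usual 1-based byte indices; untested,
> and not a proposal verbatim):
> 
>   local utf8 = {}
> 
>   -- Decode the code point whose sequence starts at byte index I.
>   function utf8.unicode_char (s, i)
>     i = i or 1
>     local c = s:byte (i)
>     if c < 0x80 then return c end        -- plain ASCII byte
>     local n, cp
>     if c >= 0xF0 then n, cp = 4, c % 0x08
>     elseif c >= 0xE0 then n, cp = 3, c % 0x10
>     else n, cp = 2, c % 0x20
>     end
>     for j = i + 1, i + n - 1 do cp = cp * 64 + s:byte (j) % 64 end
>     return cp
>   end
> 
>   -- Step N characters forward from byte index I; N == 0 just
>   -- aligns I to the start of the character containing it.
>   function utf8.char_offset (s, i, n)
>     local function cont (b) return b and b >= 0x80 and b < 0xC0 end
>     while cont (s:byte (i)) do i = i - 1 end
>     while n > 0 and i <= #s do
>       repeat i = i + 1 until not cont (s:byte (i))
>       n = n - 1
>     end
>     return i
>   end
> ]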
> 
> verrry simple and light-weight.
> 
> Most existing string functions are also perfectly usable on UTF-8, and
> do something reasonable with it:
> 
>   sub
> 
>        Works fine if the indices are calculated reasonably -- and I
>        think this is almost always the case.  People don't generally
>        do [[ string.sub (UNKNOWN_STRING, 3, 6) ]], they calculate a
>        string position, e.g. by searching, or string beginning/end,
>        and maybe calculate offsets based on _known_ contents, e.g.
>        [[ string.sub (s, 1, string.find (s, "/") - 1) ]]
> 
>        [One exception might be chopping a string to fit some length
>        limit using [[ string.sub (s, 1, LIMIT) ]].  Where it's
>        actually a byte limit (fixed buffers etc), something like [[
>        string.sub (s, 1, utf8.char_offset (s, LIMIT)) ]] suffices,
>        but for things like _display_ limits, calculating display
>        widths of unicode characters isn't so easy...even with full
>        tables.]
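> 
>        [A sketch of the byte-limit case, using the char_offset
>        above (hypothetical, and assuming valid UTF-8):
> 
>          -- keep at most LIMIT bytes, never splitting a character
>          local function truncate (s, limit)
>            if #s <= limit then return s end
>            return s:sub (1, utf8.char_offset (s, limit + 1, 0) - 1)
>          end
>        ]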
> 
>   upper
>   lower
> 
>        Work fine, but of course only upcase ASCII characters.
>        However, doing this "properly" requires unicode tables, so
>        it isn't appropriate for a minimal library, I guess.
> 
>   len
> 
>        Works fine for calculating the string byte length -- which is
>        often what is actually wanted -- or calculating the string
>        index of the end of the string (for further searching or
>        whatever).
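> 
>        [When a code-point count really is wanted, counting the
>        non-continuation bytes does it in one line (assuming valid
>        UTF-8):
> 
>          local _, nchars = s:gsub ("[^\128-\191]", "")
>        ]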
> 
>   rep
>   format
> 
>        Work fine (only use concatenation)
> 
>   byte
>   char
> 
>        Work fine
> 
>   find
>   match
>   gmatch
>   gsub
> 
>        Work fine for the most part.  The main exception, of course,
>        is single-character wildcards, ".", "[^abc]", etc, when used
>        without a repeat suffix -- but I think in practice, these are
>        very rarely used without a repeat suffix.
> 
>        Some of the patterns are limited to ASCII in their
>        interpretation of course (e.g. "%a"), but this isn't really
>        fixable without full unicode tables, and the ASCII-only
>        interpretation is not dangerous.
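> 
>        [For the rare case that really needs a single-character
>        wildcard, a byte-class pattern matching exactly one UTF-8
>        sequence will do (assuming valid UTF-8), e.g.:
> 
>          -- one ASCII/lead byte plus any continuation bytes
>          local ONE_CHAR = "[%z\1-\127\194-\244][\128-\191]*"
>          for ch in s:gmatch (ONE_CHAR) do print (ch) end
>        ]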
> 
>   dump
> 
>        N/A
> 
>   reverse
> 
>        Now _this_ will probably simply fail for strings containing
>        non-ASCII UTF-8.  But it's also probably not very widely
>        used...
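> 
>        [A character-wise reverse is easy enough to sketch, though
>        combining marks would still end up detached from their base
>        characters:
> 
>          local function utf8_reverse (s)
>            local t = {}
>            for ch in s:gmatch ("[%z\1-\127\194-\244][\128-\191]*") do
>              table.insert (t, 1, ch)    -- prepend each character
>            end
>            return table.concat (t)
>          end
>        ]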
> 
> 
> IOW, before trying to come up with some pretty (and expensive)
> abstraction, it seems worthwhile to think: in what _real_ situations
> (i.e., actually occur in practice) does simply "doing nothing" not
> work?  In some cases, code might have to be tweaked a little, but I
> suspect it's often enough to just say "so don't do that" (because most
> code doesn't do that anyway).
> 
> The main question I suppose is:  is the resulting user code, using
> mostly ordinary string functions plus a little minimal utf8 tweaking,
> going to be significantly uglier/harder-to-maintain/confusing, to the
> point where using a heavier-weight abstraction might be worthwhile?
> 
> My suspicion is that for most apps, the answer is no...
> 
> -miles
> 
> -- 
> Yossarian was moved very deeply by the absolute simplicity of
> this clause of Catch-22 and let out a respectful whistle.
> "That's some catch, that Catch-22," he observed.
> "It's the best there is," Doc Daneeka agreed.
>