- Subject: Re: What do you miss most in Lua
- From: Tim Hill <drtimhill@...>
- Date: Mon, 6 Feb 2012 22:23:41 -0800
I think the middle ground here is to distinguish between Lua strings as byte arrays and UTF-8 as sequences of code points. Doing full-blown Unicode support is horrendous, but providing library functions that could (perhaps not very efficiently) discern code points within the UTF-8 byte stream would be useful, without providing interpretation of those code points (which is where all the complexity of Unicode lies).

One approach to this (not one I am championing, though) would be routines to convert a UTF-8 string to/from a Lua array of code points (each, presumably, a number). A better approach would be to walk the string directly, one code point at a time. None of these are good solutions, and imho Unicode switched from a solution to a problem when it overflowed the BMP.
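A rough sketch of the first kind of routine (the name is mine, it assumes valid UTF-8 input, and it does no error checking at all):

```lua
-- Decode a UTF-8 string into a plain Lua array of code points,
-- without interpreting them (assumes valid UTF-8; no validation).
local function utf8_to_codepoints(s)
  local cps, i, n = {}, 1, #s
  while i <= n do
    local b = s:byte(i)
    local cp, len
    if b < 0x80 then cp, len = b, 1          -- ASCII
    elseif b < 0xE0 then cp, len = b % 0x20, 2
    elseif b < 0xF0 then cp, len = b % 0x10, 3
    else cp, len = b % 0x08, 4 end
    for j = i + 1, i + len - 1 do            -- fold in continuation bytes
      cp = cp * 64 + s:byte(j) % 64
    end
    cps[#cps + 1] = cp
    i = i + len
  end
  return cps
end
```

The reverse direction is just string.char-style reassembly; walking the string one code point at a time falls out of the same loop without building the table.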
--Tim
On Feb 6, 2012, at 9:29 PM, Miles Bader wrote:
> HyperHacker <hyperhacker@gmail.com> writes:
>> I do think a simple UTF-8 library would be quite a good thing to have
>> - basically just have all of Lua's string methods, but operating on
>> characters instead of bytes. (So e.g. ustring.sub(str, 3, 6) would
>> extract the 3rd to 6th characters of str, not necessarily bytes.) My
>> worry though would be ending up like PHP, where you have to remember
>> to use the mb_* functions instead of the normal ones.
>>
>> I suspect this could be accomplished by means of a function that
>> "converts" a string to a UTF-8 string, which would be represented as a
>> table or userdata with metamethods to make it behave like a string.
>> Then you could just write:
>> str = U'this is a UTF-8 string'
>> print(#str) --gives number of characters, not number of bytes
>> the main problem I can see then would be that type(str) ~= "string"...
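For what it's worth, a minimal sketch of that U() idea (U, the field name, and the metatable are inventions here; counting code points via #str needs Lua 5.2, where __len is honored for tables):

```lua
-- Wrap a UTF-8 string in a table whose metamethods count code points.
local mt = {}
function mt.__len(u)
  -- count code points by counting non-continuation bytes (0x80-0xBF)
  local n = 0
  for _ in u.s:gmatch("[^\128-\191]") do n = n + 1 end
  return n
end
function mt.__tostring(u) return u.s end
local function U(s) return setmetatable({s = s}, mt) end
```

With that, U("héllo") reports a length of 5 rather than 6 bytes -- but, as noted, type() on the result says "table", and every string-library function has to be wrapped too.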
>
> Even what you're suggesting sounds pretty heavy-weight.
>
> I think many people looking at the issue try too hard to come up with
> some pretty abstraction, but that the actual benefit to users of these
> abstractions isn't so great... especially for environments (like Lua)
> where one is trying to minimize support libraries.
>
> For instance, I don't think the illusion of unit characters is
> particularly valuable for most apps, and trying to maintain that
> illusion is expensive. Nor does it seem necessary to
> hide the encoding unless you're in the position of needing to support
> legacy multibyte encodings (and I'm ignoring that case because it adds
> a huge amount of hair which I think isn't worth messing up the common
> case for).
>
> My intuition is that almost all string processing tends to treat
> strings not as sequences of "characters" so much as sequences of other
> strings, many of which are fixed, and so have known properties.
>
> It seems much more realistic to me -- and perfectly usable -- to
> simply say that strings contain UTF-8, and offer a few functions like:
>
> utf8.unicode_char (STRING[, BYTE_INDEX = 0]) => UNICHAR
> utf8.char_offset (STRING, BYTE_INDEX, NUM_CHARS) => NEW_BYTE_INDEX
>
> ["char_offset" might be defined to work properly even if the input
> byte index isn't on a proper character boundary, with a special
> case for NUM_CHARS == 0 meaning "align to the beginning of the
> character containing the input byte index".]
>
> verrry simple and light-weight.
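For concreteness, one possible (unoptimized) reading of those two signatures -- translated to Lua's usual 1-based byte indices, and assuming valid UTF-8:

```lua
-- A byte in 0x80-0xBF is a UTF-8 continuation byte.
local function is_cont(b) return b ~= nil and b >= 0x80 and b < 0xC0 end

-- Advance NUM_CHARS characters from byte index i; NUM_CHARS == 0
-- just aligns i to the start of the character containing it.
local function char_offset(s, i, nchars)
  while i > 1 and is_cont(s:byte(i)) do i = i - 1 end
  for _ = 1, nchars do
    i = i + 1
    while is_cont(s:byte(i)) do i = i + 1 end
  end
  return i
end

-- Decode the single code point starting at byte index i (default 1).
local function unicode_char(s, i)
  i = i or 1
  local b = s:byte(i)
  if b < 0x80 then return b end
  local len = b < 0xE0 and 2 or b < 0xF0 and 3 or 4
  local cp = b % 2 ^ (7 - len)        -- low bits of the lead byte
  for j = i + 1, i + len - 1 do
    cp = cp * 64 + s:byte(j) % 64     -- fold in continuation bytes
  end
  return cp
end
```

Together they are enough to iterate a string code point by code point without any new string type.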
>
> Most existing string functions are also perfectly usable on UTF-8, and
> do something reasonable with it:
>
> sub
>
> Works fine if the indices are calculated reasonably -- and I
> think this is almost always the case. People don't generally
> do [[ string.sub (UNKNOWN_STRING, 3, 6) ]], they calculate a
> string position, e.g. by searching, or string beginning/end,
> and maybe calculate offsets based on _known_ contents, e.g.
> [[ string.sub (s, 1, string.find (s, "/") - 1) ]]
>
> [One exception might be chopping a string to fit some length
> limit using [[ string.sub (s, 1, LIMIT) ]]. Where it's
> actually a byte limit (fixed buffers etc), something like [[
> string.sub (s, 1, utf8.char_offset (s, LIMIT)) ]] suffices,
> but for things like _display_ limits, calculating display
> widths of unicode characters isn't so easy...even with full
> tables.]
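For the byte-limit case, the back-up over a split character can even be done with no library help at all, just by skipping back over continuation bytes (a sketch; "chop" is an invented name):

```lua
-- Truncate s to at most `limit` bytes without splitting a UTF-8
-- character: back up while the byte just past the cut is a
-- continuation byte (0x80-0xBF).
local function chop(s, limit)
  while limit > 0 do
    local b = s:byte(limit + 1)
    if b == nil or b < 0x80 or b >= 0xC0 then break end
    limit = limit - 1
  end
  return s:sub(1, limit)
end
```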
>
> upper
> lower
>
> Works fine, but of course only upcases ASCII characters.
> However doing this "properly" requires unicode tables, so
> isn't appropriate for a minimal library I guess.
>
> len
>
> Works fine for calculating the string byte length -- which is
> often what is actually wanted -- or calculating the string
> index of the end of the string (for further searching or
> whatever).
>
> rep
> format
>
> Work fine (only use concatenation)
>
> byte
> char
>
> Work fine
>
> find
> match
> gmatch
> gsub
>
> Work fine for the most part. The main exception, of course,
> is single-character wildcards, ".", "[^abc]", etc, when used
> without a repeat suffix -- but I think in practice, these are
> very rarely used without a repeat suffix.
>
> Some of the patterns are limited to ASCII in their
> interpretation of course (e.g. "%a"), but this isn't really
> fixable without full unicode tables, and the ASCII-only
> interpretation is not dangerous.
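As an aside, the usual workaround for the unsuffixed single-character wildcard is a pattern class that matches exactly one UTF-8 character -- one lead byte plus any continuation bytes (assumes valid UTF-8; "\1-\127" skips NUL, which Lua 5.1 patterns can't put in a class literally):

```lua
-- Matches one whole UTF-8 character: a non-continuation lead byte
-- followed by zero or more continuation bytes.
local utf8_char = "[\1-\127\194-\253][\128-\191]*"

local n = 0
for _ in ("h\195\169llo"):gmatch(utf8_char) do n = n + 1 end
-- n is 5: the pattern sees "héllo" as five characters, not six bytes
```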
>
> dump
>
> N/A
>
> reverse
>
> Now _this_ will probably simply fail for strings containing
> non-ASCII UTF-8. But it's also probably not very widely
> used...
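To make the failure concrete: a byte-wise reverse puts continuation bytes in front of their lead byte, so the result is no longer valid UTF-8. A code-point-wise reverse is a few lines, though (a sketch, assuming valid UTF-8; the name is invented):

```lua
-- string.reverse("né") turns bytes n C3 A9 into A9 C3 n -- invalid UTF-8.

-- Reverse by whole UTF-8 characters instead of by bytes.
local function utf8_reverse(s)
  local t = {}
  for c in s:gmatch("[\1-\127\194-\253][\128-\191]*") do
    table.insert(t, 1, c)   -- prepend each character
  end
  return table.concat(t)
end
```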
>
>
> IOW, before trying to come up with some pretty (and expensive)
> abstraction, it seems worthwhile to think: in what _real_ situations
> (i.e., actually occur in practice) does simply "doing nothing" not
> work? In some cases, code might have to be tweaked a little, but I
> suspect it's often enough to just say "so don't do that" (because most
> code doesn't do that anyway).
>
> The main question I suppose is: is the resulting user code, using
> mostly ordinary string functions plus a little minimal utf8 tweaking,
> going to be significantly uglier/harder-to-maintain/confusing, to the
> point where using a heavier-weight abstraction might be worthwhile?
>
> My suspicion is that for most apps, the answer is no...
>
> -miles
>
> --
> Yossarian was moved very deeply by the absolute simplicity of
> this clause of Catch-22 and let out a respectful whistle.
> "That's some catch, that Catch-22," he observed.
> "It's the best there is," Doc Daneeka agreed.
>