Re: Clearing up misconceptions about characters vs bytes in the manual

lua-l archive


Subject: Re: Clearing up misconceptions about characters vs bytes in the manual
From: Rena <hyperhacker@...>
Date: Fri, 2 Nov 2012 17:11:32 -0400

On 2012-11-02 2:56 PM, "spir" <denis.spir@gmail.com> wrote:
>
> On 02/11/2012 17:11, M. Edward (Ed) Borasky wrote:
>>
>> Unicode in general and UTF-8 in particular are quickly becoming
>> indispensable and Lua programmers need a standardized way of dealing
>> with them, either in libraries or in extensions to the language syntax
>> and semantics. Personally I favor libraries since they can be
>> blazingly fast and don't break existing code. But they do need to be
>> there and work.
>
>
> I planned for a while to work only with genuinely Unicode-aware libraries for text processing (and I even had a prototype of one such library, in and for Lua). However, I had to go back to plain byte strings for the following reason: Unicode abstract characters, that is, what a Unicode code represents, are not characters. They are whatever the standards team decided to encode. There are, as one expects, simple base characters such as 'a', control codes, a bunch of esoteric special codes, and tons of *combining* codes which form *actual characters* when composed with base codes.
> This means that a character is represented by a sequence of n codes (n has no formal limit), each encoded as 1-4 bytes in UTF-8. To add a bit of complication, Unicode (or rather UCS) also defines precomposed codes for precomposed characters. This means the letter 'â' may be UCS-coded (in code points, not bytes) as 1 single code or as 2 codes, one for the base 'a' and one for the combining '^'. I guess you start to imagine the mess involved in getting things right and safe.
> For instance, how does one search for a word with 'â'? We first need to normalise to decomposed form (which is faster and also has the advantage of exposing sub-character units such as '^'); but this requires grouping codes into characters and sorting them (yes, the order of combining codes is not defined, except for the base, and there are exceptions). All of this comes after decoding from UTF-8 to a string of Unicode codes.
> This is doable, but it is a lot of complication, I guess. Maybe I used the wrong approach, but after tons of exchanges on the topic with experts, no one could find a better solution.
> There is, I guess, no hope of getting back the ideal simplicity of 1 char <--> 1 repr (and even less representations of equal length) that we lived with in ASCII and ISO Latin times. There is no affordable way to get strings as sequences of chars, with s[i] = the ith char, exactly and completely.
>
> Denis
>
> PS: The reasons why composite codes were introduced (they are the core source of the issue for me; otherwise characters would have a single representation) in addition to the plain decomposed forms that are the base UCS coding, and why a misleading term like "abstract character" is used, remain unknown to me.
>
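
To make the quoted 'â' example concrete, here is a minimal sketch. It is in Python rather than Lua only because Python's standard library includes Unicode normalization (unicodedata.normalize); the helper names describe and contains are made up for this illustration. It shows the two encodings of 'â', their sizes in code points and UTF-8 bytes, and why a search has to normalise both sides first.

```python
import unicodedata

precomposed = "\u00e2"   # 'â' as a single code point
decomposed = "a\u0302"   # base 'a' followed by COMBINING CIRCUMFLEX ACCENT

def describe(s):
    """Return (number of code points, number of UTF-8 bytes) for s."""
    return len(s), len(s.encode("utf-8"))

print(describe(precomposed))      # (1, 2)
print(describe(decomposed))       # (2, 3)
print(precomposed == decomposed)  # False: same character, different codes

def contains(haystack, needle, form="NFD"):
    """Substring test after bringing both strings to the same normal form."""
    norm_hay = unicodedata.normalize(form, haystack)
    norm_needle = unicodedata.normalize(form, needle)
    return norm_needle in norm_hay

# The text stores 'â' decomposed; the search word stores it precomposed.
text = "ch" + decomposed + "teau"
word = "ch\u00e2teau"
print(word in text)          # False: a naive code-by-code search misses it
print(contains(text, word))  # True once both sides are normalised
```

Note that a plain substring match on normalised strings is still not a character-level match: searching for a bare 'a' would also hit the 'a' inside a decomposed 'â', which is the grouping problem described above.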

I think the reason combining characters exist is that in some languages the number of valid combinations is huge. Korean writing, for one example, has each character made by combining multiple base characters.
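
As a small illustration of the Korean case (a sketch in Python, used only because its standard unicodedata module exposes normalization): Unicode encodes Hangul both ways, as precomposed syllables and as sequences of conjoining jamo, and normalization converts between the two. The variable names are just for this example.

```python
import unicodedata

syllable = "\ud55c"  # the single precomposed syllable '한' (U+D55C)

# Canonical decomposition splits the syllable into its three jamo.
jamo = unicodedata.normalize("NFD", syllable)
codes = [f"U+{ord(c):04X}" for c in jamo]
print(codes)  # ['U+1112', 'U+1161', 'U+11AB']

# Canonical composition puts them back together into one code point.
recomposed = unicodedata.normalize("NFC", jamo)
print(recomposed == syllable)  # True
```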

I feel though that if there weren't both precomposed characters *and* combining glyphs in Unicode, *and* the combining glyphs were easily identifiable in some way (maybe some control code marking beginning/end of composite character, or bitflag marking combining glyphs), the issue would be much less a problem... and if multiple combining glyphs always followed an ordering rule, such as order by code point, searching would be easier as well. But I'm not sure the latter can be done, when you consider (again) Korean, where the same glyph can appear in many places in a character, and the ordering might define which goes where... Maybe we should just switch everyone to Lojban?
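
For what it's worth, the Unicode character database does carry some of this information: each code point has a general category (the mark categories start with "M") and a canonical combining class, and normalization sorts adjacent marks by that class. The sketch below, in Python because its standard unicodedata module exposes these properties, shows two different mark orders collapsing to one form. The helper is_mark is hypothetical, and this only covers marks with different combining classes; marks sharing a class keep their relative order.

```python
import unicodedata

def is_mark(ch):
    """True if ch is in one of the Unicode mark categories (Mn, Mc, Me)."""
    return unicodedata.category(ch).startswith("M")

DOT_BELOW = "\u0323"    # COMBINING DOT BELOW
CIRCUMFLEX = "\u0302"   # COMBINING CIRCUMFLEX ACCENT

print(is_mark("a"), is_mark(CIRCUMFLEX))  # False True
print(unicodedata.combining(DOT_BELOW))   # 220
print(unicodedata.combining(CIRCUMFLEX))  # 230

# The same character typed with its two marks in different orders.
one = "a" + DOT_BELOW + CIRCUMFLEX
two = "a" + CIRCUMFLEX + DOT_BELOW
print(one == two)  # False as raw code sequences

# NFD reorders marks by combining class, so both normalise to the same string.
nfd_one = unicodedata.normalize("NFD", one)
nfd_two = unicodedata.normalize("NFD", two)
print(nfd_one == nfd_two)  # True
```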

Anyway, the clusterpork that is Unicode combining glyphs is not really a Lua bug... Probably we should be pestering the authors of the spec.

Follow-Ups:
- Re: Clearing up misconceptions about characters vs bytes in the manual, Coda Highland

References:
- Clearing up misconceptions about characters vs bytes in the manual, Rob Hoelz
- Re: Clearing up misconceptions about characters vs bytes in the manual, Rapin Patrick
- Re: Clearing up misconceptions about characters vs bytes in the manual, M. Edward (Ed) Borasky
- Re: Clearing up misconceptions about characters vs bytes in the manual, spir
