- Subject: Re: UTF-8 patterns in Lua 5.3
- From: William Ahern <william@...>
- Date: Thu, 17 Apr 2014 08:11:56 -0700
On Thu, Apr 17, 2014 at 09:40:26AM +0200, Oliver Kroth wrote:
> to my knowledge, in most "big" OSes there are already libraries for
> handling Unicode semantics.
> I'd like to propose to let Lua do the UTF-8 encoding matters, and use a
> (probably OS-specific) glue library to refer the Unicode semantics to
> the underlying OS. This library may e.g. be named "unicode" to avoid
> name clashes with utf8.
>
> There is no sense in re-inventing the wheel.
Actually, most operating systems don't _fully_ support Unicode out of the
box; neither Windows nor Linux does.
The issue is that Unicode is fundamentally multibyte, yet OS character
primitives assume that every character fits into a single codepoint. Many
characters have no precomposed equivalent, so they're always more than one
codepoint, no matter how wide your datatype (32 bits, 64 bits, etc.). In
other words, something like iswalpha() won't always work, because the actual
grapheme (e.g. letter + combining character) is two codepoints, and a naive
engine will consume the first codepoint but not the next.
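To make that concrete, here's a tiny Lua 5.3 sketch using the stock utf8
library (nothing here is specific to my argument beyond the example string):
a grapheme built from a base letter plus a combining mark is two codepoints,
so any per-codepoint test only ever sees one of them.

    -- "e" followed by U+0301 COMBINING ACUTE ACCENT: one user-visible
    -- character, two codepoints.
    local s = utf8.char(0x65, 0x0301)
    print(utf8.len(s))                    --> 2 (codepoints, not graphemes)
    for _, cp in utf8.codes(s) do
      print(string.format("U+%04X", cp))  --> U+0065, then U+0301
    end
    -- A classifier that inspects U+0065 alone never learns about the accent.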
Proper Unicode libraries, like ICU, allow you to operate on strings (or
vectors of codepoints)--"give me the next paragraph", "give me the next
line", "give me the next series of non-alphabetic characters", etc.
Perl6 solves this in a unique way by inventing a new normalization form. For
every sequence of codepoints that can't be resolved to a precomposed
character, it dynamically generates a new [non-Unicode] precomposed
character. That allows intuitive processing of Unicode text for the most
common cases. For stuff like word boundaries, there's no easy solution,
because some languages (e.g. Thai, IIRC) don't have breaks between words
(word boundaries are based on understanding syllable rules).