- Subject: Re: is string.gmatch(), string.upper() 7-bit ascii only?
- From: Coda Highland <chighland@...>
- Date: Fri, 8 Apr 2016 13:44:50 -0700
On Fri, Apr 8, 2016 at 3:54 AM, Michal Kottman <michal.kottman@gmail.com> wrote:
>
> On 7 April 2016 at 16:40, Marc Balmer <marc@msys.ch> wrote:
>>
>> (Expected is 'ÄÖÜ')
>
>
> You need an external library which understands all of Unicode. Not
> advocating anything in particular, just picking a first 'Lua utf8' library
> that a simple web search returned:
>
> $ sudo luarocks install luautf8
> $ lua
> Lua 5.1.5 Copyright (C) 1994-2012 Lua.org, PUC-Rio
>> utf8 = require 'lua-utf8'
>> = utf8.upper('äöü')
> ÄÖÜ
>> for m in utf8.gmatch('äöü', '%g+') do print(m) end
> äöü
>
It should be noted that "uppercase" isn't always a trivial matter anyway.
In Turkish, and in some other languages that borrow its alphabet, the
uppercase form of i is İ and the lowercase form of I is ı. Dotted and
dotless i are distinct vowels, but Unicode encodes only the glyphs here,
not the semantic distinction: Turkish I and i share U+0049 and U+0069
with the Latin letters of every other language, and only İ (U+0130) and
ı (U+0131) get code points of their own. The correct case mapping
therefore depends on the language of the text, not just on the code
point. (Unicode is an ugly mess in that way: sometimes it encodes
identical but semantically distinct characters separately; sometimes it
combines them.)
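
To make the Turkish case concrete, here is a minimal sketch in plain Lua.
The names turkish_upper and turkish_lower are made up for this example;
they are not part of luautf8 or of any library I know of. It assumes
UTF-8 input and only handles ASCII letters plus the two Turkish i forms,
so treat it as an illustration rather than a real case-mapping routine.

```lua
-- Hypothetical helpers for illustration only; not a library API.
-- UTF-8 byte sequences are written as decimal escapes so the file
-- runs unchanged on Lua 5.1 through 5.4.
local DOTTED_CAPITAL_I = "\196\176"  -- U+0130, LATIN CAPITAL LETTER I WITH DOT ABOVE
local DOTLESS_SMALL_I  = "\196\177"  -- U+0131, LATIN SMALL LETTER DOTLESS I

-- Uppercase using Turkish rules for i; other ASCII letters map as usual.
function turkish_upper(s)
  s = s:gsub(DOTLESS_SMALL_I, "I")      -- dotless i -> plain capital I
  s = s:gsub("i", DOTTED_CAPITAL_I)     -- dotted i  -> dotted capital I
  return (s:gsub("[a-z]", string.upper))
end

-- Lowercase using Turkish rules for I; other ASCII letters map as usual.
function turkish_lower(s)
  s = s:gsub(DOTTED_CAPITAL_I, "i")     -- dotted capital I -> dotted i
  s = s:gsub("I", DOTLESS_SMALL_I)      -- plain capital I  -> dotless i
  return (s:gsub("[A-Z]", string.lower))
end

print(string.upper("istanbul"))   -- language-neutral mapping
print(turkish_upper("istanbul"))  -- Turkish mapping, starts with U+0130
print(turkish_lower("ISPARTA"))   -- Turkish mapping, starts with U+0131
```

The same input gives different results depending on the language of the
text, so no upper() that looks only at code points can be right for
every caller.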
/s/ Adam