- Subject: Re: The Lua utf8 library (Was: Issues: Character 160 ...)
- From: Hisham <h@...>
- Date: Wed, 11 Jul 2018 15:00:01 -0300
On 11 July 2018 at 11:45, Roberto Ierusalimschy <roberto@inf.puc-rio.br> wrote:
>> On Wed, Jul 11, 2018 at 3:10 PM, Hisham <h@hisham.hm> wrote:
>> > Ultimately, the problem is: you would expect utf8.match("name:
>> > %a*%d+", "name: Hélène123") to work, but that doesn't seem feasible to
>> > do without adding Unicode knowledge.
>>
>> Which is a _heavy_ task, given the number of human scripts in common use!
>>
>> By the way, always been curious how non-English Lua people cope with
>> the existing limitations of Lua patterns?
>>
>> Assume 'ASCII' punctuation and work around that?
>
> Mainly. Either your text has all kinds of stuff, and then you need real
> Unicode support, or else everything outside ASCII can be assumed to be
> letters (accented letters and c-cedilla).
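A minimal sketch of that second approach, assuming the input is UTF-8
and that every byte at or above 128 belongs to a letter. The names
`letter`, `pattern` and `s` are illustrative only. The string is written
with byte escapes so the example does not depend on the encoding of the
source file:

```lua
-- Assumption: any byte >= 128 is part of an (accented) letter.
-- This is wrong for non-ASCII punctuation such as character 160 (NBSP).
local letter = "[%a\128-\255]"
local pattern = "name: " .. letter .. "*%d+"

-- "name: Hélène123" encoded as UTF-8 (é = C3 A9, è = C3 A8).
local s = "name: H\195\169l\195\168ne123"

print(s:match(pattern))                  -- matches the whole string
print(s:match("name: %a*%d+"))           -- nil in the default "C" locale
```

The class works byte by byte, so it cannot tell letters from other
non-ASCII characters; it only helps when the assumption above holds.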
Additionally, regional non-UTF-8 locales are not entirely gone, and
Lua supports those via os.setlocale(), based on the C library
setlocale().
$ export HELENE=$(echo "Hélène" | iconv --from-code=UTF-8 --to-code=ISO88591)
$ echo "$HELENE" | hexdump -C
00000000 48 e9 6c e8 6e 65 0a |H.l.ne.|
$ echo "Hélène" | hexdump -C
00000000 48 c3 a9 6c c3 a8 6e 65 0a |H..l..ne.|
$ lua
Lua 5.3.3 Copyright (C) 1994-2016 Lua.org, PUC-Rio
> print(#os.getenv("HELENE"))
6
> helene = "Hélène"
> print(#helene)
8
> print(helene:match("H%a+e"))
nil
> print(os.getenv("HELENE"):match("H%a+e"))
nil
> print(os.getenv("HELENE"):upper())
HéLèNE
$ LC_ALL=pt_BR.iso88591 lua
Lua 5.3.3 Copyright (C) 1994-2016 Lua.org, PUC-Rio
> print(os.getenv("HELENE"):match("H%a+e"))
Hélène
> print(os.getenv("HELENE"):upper())
HÉLÈNE
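The same effect can be had from inside a script instead of through
LC_ALL. Below is a rough sketch, assuming the locale name from the
session above is installed on the system (it often is not).
`match_in_locale` is a hypothetical helper, not part of Lua; only
os.setlocale() and string.match() are real library calls:

```lua
-- "Hélène" as ISO-8859-1 bytes, as in the first hexdump (e9 = é, e8 = è).
local helene_latin1 = "H\233l\232ne"

-- Hypothetical helper: switch the "ctype" category, match, then restore.
local function match_in_locale(locale, s, pattern)
  local previous = os.setlocale(nil, "ctype")   -- nil only queries
  if not os.setlocale(locale, "ctype") then     -- nil: locale unavailable
    return nil, "locale not available: " .. locale
  end
  local result = s:match(pattern)
  os.setlocale(previous, "ctype")
  return result
end

-- Result depends on which locales are installed.
print(match_in_locale("pt_BR.iso88591", helene_latin1, "H%a+e"))
-- In the "C" locale the accented bytes are not letters, so this is nil.
print(match_in_locale("C", helene_latin1, "H%a+e"))
```

Only the "ctype" category is changed here, so number formatting and
collation are left alone.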
As recently as last year we have dealt with (but not resolved!) bug
reports in LuaFileSystem related to handling filenames, with
os.setlocale() having an effect on the behavior:
https://github.com/keplerproject/luafilesystem/pull/57#issuecomment-282027816
I've also gotten recent reports from people using Cyrillic with
non-UTF-8 locales, but I can't recall where.
I recall seeing projects here in Brazil resorting to non-UTF-8 locales
(particularly local projects that only need to support one
language). I would never recommend doing this, but
these days you either use ASCII or have to resort to the full bazooka
of Unicode, so it doesn't surprise me to see people taking the ugly
shortcut of regional locales when having to deal with things like
sorting, etc.
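For the sorting case, Lua compares strings with the C function
strcoll(), so the "collate" category changes what table.sort() does. A
small sketch of that shortcut follows; `sorted_in_locale` is a
hypothetical helper, and any locale name other than "C" is an assumption
about what the system has installed:

```lua
-- Hypothetical helper: return a sorted copy of `words` under `locale`.
local function sorted_in_locale(locale, words)
  local previous = os.setlocale(nil, "collate")  -- query current setting
  if not os.setlocale(locale, "collate") then
    return nil, "locale not available: " .. locale
  end
  local copy = {}
  for i, w in ipairs(words) do copy[i] = w end
  table.sort(copy)               -- the default "<" on strings uses strcoll()
  os.setlocale(previous, "collate")
  return copy
end

-- In the "C" locale the order is plain byte order: uppercase sorts first.
print(table.concat(sorted_in_locale("C", {"b", "a", "Z"}), " "))  -- Z a b
```

With a regional locale the order follows that locale's collation rules
for single-byte text, which is exactly the shortcut described above.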
-- Hisham