- Subject: Re: UTF-8 patterns in Lua 5.3
- From: Hisham <h@...>
- Date: Sat, 19 Apr 2014 16:36:55 -0300
On 19 April 2014 11:03, Dirk Laurie <dirk.laurie@gmail.com> wrote:
> 2014-04-19 10:20 GMT+02:00 Philipp Janda <siffiejoe@gmx.net>:
>> Am 19.04.2014 09:47 schröbte Dirk Laurie:
>>> The proposal allows for customizable character classes. We already
>>> have that. Nothing (except the vast effort of actually doing it) stops you
>>> from defining your own locale ...
>> Do you think that locales were a good idea? We inherited those from C but
>> there's no reason to make the same mistake again just because C made it
>> decades ago.
>
> Whether they are a good idea or not, they are there, accessible from
> Lua. And if they are there, somebody will use them.
>
> You come from a country (I guess) where people would expect
> string.upper"über" to come out as "ÜBER". Is that such a very bad idea?
Unfortunately that doesn't work in modern locales, right? At least I
couldn't get "ÜBER" out of ("über"):upper() here, after trying several
combinations of values of os.setlocale, $LC_ALL and different terminal
emulators. I'm sure I could get it to work if I configured my whole
system (display, encoding, input) to ISO-8859-X, but it's a pain (and
then other things break).
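
The experiment described above can be reproduced with a small helper. This is only a sketch: `upper_in_locale` is a hypothetical name, the locale names in the loop are assumptions that may not be installed on a given system, and the word is written with explicit UTF-8 byte escapes (Lua 5.2 or later) so the result does not depend on the source file's encoding.

```lua
-- Hypothetical helper: switch the "ctype" locale category, call
-- string.upper, then restore the previous setting.
local function upper_in_locale(name, s)
  local previous = os.setlocale(nil, "ctype")   -- query the current setting
  if not os.setlocale(name, "ctype") then
    return nil                                  -- locale not available here
  end
  local result = s:upper()
  os.setlocale(previous, "ctype")               -- restore the old locale
  return result
end

local word = "\xC3\xBCber"  -- "über" as UTF-8 bytes

-- Assumed locale names; availability varies from system to system.
for _, name in ipairs({ "C", "de_DE.UTF-8", "de_DE.ISO-8859-1" }) do
  print(name, upper_in_locale(name, word) or "(locale not available)")
end
```

In the "C" locale only the ASCII letters are converted, so the two bytes that encode "ü" pass through unchanged. What the other locales print depends on the platform's C library.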
I think at some point in the future it won't make sense to talk about
single-byte encodings. (The future doesn't arrive everywhere at the
same time, of course: in some places it has already arrived, in
others it will take a long time.) In this future, three things stop
making sense in the Lua API:
* string.lower
* string.upper
* % character classes in patterns.
As far as I can see these are the only *text* oriented features of the
string library; the rest of it is an 8-bit clean, locale-agnostic,
bytestream library. (I say that as a compliment; the fact that this
list is so small is a testament to the genericity of the string
library!)
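
A minimal sketch of that distinction, assuming Lua 5.2 or later (for the `\x` escapes) and the reference implementation, where `%a` is backed by the C function `isalpha`. The variable names are only illustrative.

```lua
-- The bytestream side: embedded zeros and high bytes are ordinary data.
local blob = "a\0b\xFF"
local blob_length = #blob                       -- counts bytes, zeros included
local zero_at = blob:find("\0", 1, true)        -- plain (non-pattern) search

-- The text side: %a asks the C library whether a byte is a letter,
-- so the answer depends on the current "ctype" locale.
local previous = os.setlocale(nil, "ctype")
os.setlocale("C", "ctype")
local e_acute = "\xE9"                          -- "é" in ISO-8859-1
local high_byte_is_letter = e_acute:match("%a") ~= nil
local ascii_is_letter = ("e"):match("%a") ~= nil
os.setlocale(previous, "ctype")                 -- restore the old locale

print(blob_length, zero_at, high_byte_is_letter, ascii_is_letter)
```

In the "C" locale the high byte is not classified as a letter; under an ISO-8859-1 locale, if one is installed, the same match may succeed.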
Even then, I think bytestream patterns still make sense in the string
library (save for % character classes), as much as codepoint patterns
make sense in the utf8 library.
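
To illustrate the difference between the two levels, here is a small sketch that assumes the `utf8` library as it ships in released Lua 5.3 (`utf8.len` and `utf8.charpattern`). The helper name `split_codepoints` is hypothetical.

```lua
-- Hypothetical helper: split a valid UTF-8 string into one substring
-- per code point, using the pattern provided by the utf8 library.
local function split_codepoints(s)
  local chars = {}
  for ch in s:gmatch(utf8.charpattern) do
    chars[#chars + 1] = ch
  end
  return chars
end

local word = "\xC3\xBCber"            -- "über" as UTF-8 bytes

print(#word)                          -- byte length
print(utf8.len(word))                 -- code point count
print(#word:match("^."))              -- "." in a string pattern is one byte
print(#split_codepoints(word)[1])     -- the first code point spans two bytes
```

The string-library pattern `.` sees only the first byte of "ü", while the code-point view keeps the two-byte sequence together.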
Uppercase, lowercase and character classification belong in Unicode,
at a higher level of abstraction.
Still, having string.upper and string.lower around even when you don't
want a full Unicode library linked in is undoubtedly useful (we use
ASCII a lot when programming after all; keywords, etc.).
But since 8-bit encodings are becoming a relic of the past, I wonder
if at some point it won't be saner and better defined to restrict the
specification of string.upper and string.lower so that they affect
[a-z]/[A-Z] only. I _think_ that's what already happens when running
in UTF-8 locales (it would be nice to be sure).
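
As a rough sketch of what such a restricted specification would mean, the two functions below convert only the 26 ASCII letters and never consult the locale. The names `ascii_upper` and `ascii_lower` are hypothetical and not part of any Lua release; the sketch assumes that the ranges `[a-z]` and `[A-Z]` are compared by byte value, as the reference implementation does.

```lua
-- Hypothetical locale-independent variants: only ASCII letters change,
-- every other byte (including UTF-8 multi-byte sequences) passes through.
local function ascii_upper(s)
  -- 'a'..'z' are 32 above 'A'..'Z' in ASCII
  return (s:gsub("[a-z]", function(c)
    return string.char(c:byte() - 32)
  end))
end

local function ascii_lower(s)
  return (s:gsub("[A-Z]", function(c)
    return string.char(c:byte() + 32)
  end))
end

print(ascii_upper("local function"))
print(ascii_lower("WHILE x DO"))
```

The extra parentheses around `gsub` drop its second return value (the substitution count), so each function returns a single string.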
-- Hisham