On 19 April 2014 11:03, Dirk Laurie <dirk.laurie@gmail.com> wrote:
> 2014-04-19 10:20 GMT+02:00 Philipp Janda <siffiejoe@gmx.net>:
>> On 19.04.2014 at 09:47, Dirk Laurie wrote:
>>> The proposal allows for customizable character classes. We already
>>> have that. Nothing (except the vast effort of actually doing it) stops you
>>> from defining your own locale ...
>> Do you think that locales were a good idea? We inherited those from C but
>> there's no reason to make the same mistake again just because C made it
>> decades ago.
>
> Whether they are a good idea or not, they are there, accessible from
> Lua. And if they are there, somebody will use them.
>
> You come from a country (I guess) where people would expect
> string.upper"über" to come out as "ÜBER". Is that such a very bad idea?

Unfortunately that doesn't work in modern locales, right? At least I
couldn't get "ÜBER" out of ("über"):upper() here, after trying several
combinations of values of os.setlocale, $LC_ALL and different terminal
emulators. I'm sure I could get it to work if I configured my whole
system (display, encoding, input) to ISO-8859-X, but it's a pain (and
then other things break).
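
For anyone who wants to repeat the experiment, here is a rough sketch
of the kind of thing I tried. The locale names and the try_upper
helper are only assumptions for illustration; which locales exist
depends entirely on the system, and os.setlocale returns nil for the
ones that are not installed.

```lua
-- Hypothetical helper: upper-case `word` under each candidate locale.
-- The locale names are assumptions; os.setlocale returns nil for any
-- locale that is not installed, and those are skipped.
local candidates = { "C", "de_DE.UTF-8", "de_DE.ISO-8859-1" }

local function try_upper(word)
  local results = {}
  for _, name in ipairs(candidates) do
    if os.setlocale(name, "ctype") then
      results[name] = word:upper()
    end
  end
  os.setlocale("C", "ctype")  -- restore the default C locale
  return results
end

local word = "\195\188ber"  -- "über" as UTF-8 bytes
for name, upper in pairs(try_upper(word)) do
  -- Print the bytes too, so the terminal encoding cannot hide the result.
  print(name, upper, table.concat({ upper:byte(1, -1) }, " "))
end
```

Printing the raw bytes takes the terminal emulator out of the
picture, which was one of the variables I was juggling.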

I think at some point in the future it won't make sense to talk about
single-byte encodings. (The future doesn't arrive everywhere at the
same time, of course: in some places it has already arrived, in
others it will take a long time.) In this future, three things stop
making sense in the Lua API:

* string.lower
* string.upper
* % character classes in patterns.

As far as I can see these are the only *text* oriented features of the
string library; the rest of it is an 8-bit clean, locale-agnostic,
bytestream library. (I say that as a compliment; the fact that this
list is so small is a testament to the genericity of the string
library!)

Even then, I think bytestream patterns still make sense in the string
library (save for % character classes), as much as codepoint patterns
make sense in the utf8 library.
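
To make the byte/codepoint distinction concrete, here is a small
sketch. It assumes the source is UTF-8 and uses utf8.charpattern when
a utf8 library is present; the fallback pattern and the chars helper
are my own assumptions, not part of any proposal.

```lua
local s = "\195\188ber"  -- "über" as UTF-8 bytes

-- Use the utf8 library's pattern if it exists; otherwise fall back to
-- a hand-written approximation (an assumption, ignores embedded NULs).
local charpattern = (utf8 and utf8.charpattern)
  or "[\1-\127\194-\244][\128-\191]*"

-- Hypothetical helper: split a string into UTF-8 encoded characters.
local function chars(str)
  local out = {}
  for ch in str:gmatch(charpattern) do
    out[#out + 1] = ch
  end
  return out
end

print(#s)          -- length in bytes
print(#chars(s))   -- length in codepoints
print(s:sub(1, 1) == "\195")  -- string.sub cuts through the "ü"
```

The string library happily counts and slices bytes; it is only the
codepoint view that needs to know about the encoding.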

Uppercase, lowercase and character classification belong in Unicode,
at a higher level of abstraction.

Still, having string.upper and string.lower around even when you don't
want a full Unicode library linked in is undoubtedly useful (we use
ASCII a lot when programming after all; keywords, etc.).

But since 8-bit encodings are becoming a relic of the past, I wonder
if at some point it won't be saner/better-defined to restrict the
specification of string.upper and string.lower so that they affect
[a-z]/[A-Z] only. I _think_ that's what already happens when running
in UTF-8 locales (it would be nice to be sure).
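
Just to illustrate what such a restricted behaviour would look like,
here is a minimal sketch in plain Lua. The names ascii_upper and
ascii_lower are made up for this example; this is not how the string
library is implemented.

```lua
-- Hypothetical ASCII-only case conversion. The ranges [a-z] and [A-Z]
-- are compared by byte value, so the current locale plays no part.
local function ascii_upper(s)
  return (s:gsub("[a-z]", function(c)
    return string.char(c:byte() - 32)
  end))
end

local function ascii_lower(s)
  return (s:gsub("[A-Z]", function(c)
    return string.char(c:byte() + 32)
  end))
end

print(ascii_upper("local function"))
print(ascii_lower("END"))
```

Bytes outside those two ranges, including every byte of a multi-byte
UTF-8 sequence, pass through untouched.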

-- Hisham