


Hi

On Friday 18 February 2005 09:06, PA wrote:
> related to I18N (aka internationalization) come knocking on my door
... with an axe, I assume

> a total loss on how to even start to handle any of them in Lua's
> World...
here we go

> First and foremost, character set encoding... I have been through the
> entire mailing list back and forth, but still no luck... Lua is rumored
> to be "8bit clean"... fine... but what does that mean as far as
> character set goes?
It means that it can pass around data in any charset you like.
No less and no more (than "passing around").
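
A small sketch of what "passing around" means, using only the standard
string library: every byte value, including zero, goes into a string and
comes back out unchanged. Lua never looks at what the bytes stand for.

```lua
-- Build a string containing every byte value once, including the zero byte.
local parts = {}
for b = 0, 255 do
  parts[b + 1] = string.char(b)
end
local all = table.concat(parts)

print(string.len(all))   -- 256

-- Every byte comes back out unchanged.
for b = 0, 255 do
  assert(string.byte(all, b + 1) == b)
end
```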

> Bits and bytes (clean or not) are utterly useless
> if I don't know what character set they do represent in practice.
unclean they're even less useful

> character set does Lua use effectively?!?!? How do I find out about it?
Lua uses the locale-sensitive single-byte C API.
So you find out in the C89/C90 standard documents.
(e.g. http://danpop.home.cern.ch/danpop/ansi.c)

There is no built-in way to deal usefully with multibyte charsets;
however, extensions like my UTF-8 stuff can use the built-in string type
without any trouble.

There also is absolutely no recoding support built in;
you will get just whatever character coding came in from your files
or across the wire. Yet, a recoding extension is scheduled,
and wrapping it around standard files and sockets to make them
appear magically recoded is no big deal.
For a truly i18n app it's probably easiest to always use UTF-8
internally.
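
Until a real recoding extension is around, the one conversion that is
trivial to do by hand is Latin-1 to UTF-8. The following is only a rough
sketch; latin1_to_utf8 is a name made up for this example and is not part
of Lua or of any extension mentioned here. It assumes the input really is
ISO-8859-1.

```lua
-- Hypothetical helper: recode an ISO-8859-1 string to UTF-8.
-- Bytes below 128 are plain ASCII and stay as they are; each byte from
-- 128 to 255 becomes a two-byte UTF-8 sequence.
function latin1_to_utf8(s)
  return (string.gsub(s, "[\128-\255]", function(c)
    local b = string.byte(c)
    local hi = math.floor(b / 64)   -- top two bits of the byte: 2 or 3
    return string.char(192 + hi, 128 + (b - 64 * hi))
  end))
end

local converted = latin1_to_utf8("z" .. string.char(228) .. "h")
print(string.byte(converted, 1, -1))   -- 122  195  164  104
```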

The behaviour of your single-byte charset, however,
is affected in a few places by your locale setting,
most importantly the meaning of character classes
and the sorting order in string comparison:

$ echo 'os.setlocale("C") print(string.find(string.char(193),"%a"))'|lua
nil
$ echo 'os.setlocale("en_US.ISO-8859-1") print(string.find(string.char(193),"%a"))'|lua
1       1
$ echo 'os.setlocale("en_US.UTF-8") print(string.find(string.char(193),"%a"))'|lua
nil

Given the single-byte API, you cannot match anything
interesting in UTF-8 (or Big5 ...). Use the UTF-8 extension.
string.len and string.sub work fine for any single-byte charset,
but with a multi-byte encoding len counts bytes, not characters,
and sub will happily cut a character in half.
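
To make that concrete, here is a short sketch with the word "zäh" written
out as UTF-8 bytes. utf8_len is a hypothetical helper made up for this
example (it is not the UTF-8 extension); it assumes well-formed UTF-8 and
simply does not count continuation bytes.

```lua
local s = "z" .. string.char(195, 164) .. "h"   -- "zäh" encoded as UTF-8

print(string.len(s))              -- 4: bytes, although there are 3 characters

local cut = string.sub(s, 1, 2)   -- "z" plus only the first byte of "ä"
print(string.byte(cut, 2))        -- 195: a lead byte with nothing following

-- Hypothetical helper: count characters by skipping continuation bytes
-- (values 128..191). Only meaningful for well-formed UTF-8.
function utf8_len(str)
  local _, continuation = string.gsub(str, "[\128-\191]", "")
  return string.len(str) - continuation
end

print(utf8_len(s))                -- 3
```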

> How do I convert things back and forth to it?
wait for the encoding extension

> How do I set it in the first place?
see above

> When I do aFile:read( "*all" ), what do I get back as far as character
> set goes?
whatever is in there

> My OS do have a default character set encoding, but how do I
> know about it?
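Probably ISO-8859-1 (Latin-1) if you are in Western Europe or the US.
Lua won't tell you; look at your locale settings
(e.g. run "locale charmap" on a POSIX system).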

> When I do aFile:write( aContent ), what am I writing?!?
aContent
> What does string.byte() returns?
the first byte's value
> In what encoding?
Bytes don't have no encoding.
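
For illustration, the same letter "ä" stored in two different encodings.
string.byte reports whatever byte sits at the given position and knows
nothing about which encoding the string was meant to be in.

```lua
local latin1   = string.char(228)        -- "ä" as stored in ISO-8859-1
local utf8_str = string.char(195, 164)   -- "ä" as stored in UTF-8

print(string.byte(latin1))        -- 228
print(string.byte(utf8_str))      -- 195 (the first byte only)
print(string.byte(utf8_str, 2))   -- 164 (the second byte, by explicit index)
```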

> Then there is the fabled "setlocale" and its related idiosyncrasies...
ouch
> The Man hints that this may very well be the answer to many unanswered
> questions...
not really

> but... how precisely does setlocale relates to character
> set encoding, if at all?
It sets character classification (ctype) and collation, as shown above.

> How do I tell Lua that everything I want to
> deal with is UTF-8 encoded and that is it?!?!
you don't - Lua doesn't care. Use the extension.

> Then there is the issue of setlocale scope... does it impact the entire
> VM?
yep

> How do I handle several locales concurrently?
you don't
well, you may switch back and forth,
but don't try to do this in a multithreaded app.
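
If you do want to switch back and forth, something along these lines
should do in a single-threaded program. with_locale is a hypothetical
helper written for this example, not an existing function. It relies on
os.setlocale returning the current locale name when its first argument
is nil. The setting is still process-wide, so the threading warning
applies unchanged.

```lua
-- Hypothetical helper: run f with a temporary locale, then restore the old one.
-- Returns the first result of f, or nil plus a message if the locale is
-- not available on this system.
function with_locale(name, category, f)
  local saved = os.setlocale(nil, category)   -- nil only queries
  if not os.setlocale(name, category) then
    return nil, "locale not available: " .. name
  end
  local ok, result = pcall(f)
  os.setlocale(saved, category)               -- restore even if f failed
  if not ok then error(result, 0) end
  return result
end

-- In the "C" locale byte 193 is not a letter, so string.find returns nil.
print(with_locale("C", "ctype", function()
  return string.find(string.char(193), "%a")
end))
```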

C API locale support is just braindead.
It was meant to enable NLS in existing applications which had been
written without being aware of these issues.
Since NLS is not that simple, it didn't work out.

> For instance, lets
> assume that my application display its data according to HTTP's
> Accept-Language header. One request is in de_DE, the next one in fr_FR
> and so on, while the application default language is en_US. How does
> all this fit together?
It doesn't - you just don't care.
First, all of these are using Latin-1 anyways.
Pick one of "ja" and "oui" and "yerpo".
Second, set the document's content type to "text/html; charset=ISO-8859-1"
in your webserver config and better also in the document's header.
Use other charsets, including UTF-8, accordingly.
If you happen to have Russian text in KOI on your server,
you ought to know that.
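
Picking the words per request needs no setlocale at all; a plain table
lookup will do. A simplified sketch follows, in which the table contents
and the name pick_yes are made up for the example. It looks only at the
first language tag of the header and ignores q-values.

```lua
-- Hypothetical message table; the keys are primary language tags.
local yes_word = { de = "ja", fr = "oui", en = "yes" }

-- Simplified: takes the leading run of letters as the language,
-- falls back to English for anything unknown or missing.
function pick_yes(accept_language)
  local _, _, lang = string.find(accept_language or "", "^%s*(%a+)")
  return yes_word[string.lower(lang or "")] or yes_word.en
end

print(pick_yes("de-DE,de;q=0.9,en;q=0.5"))   -- ja
print(pick_yes("fr_FR"))                     -- oui
print(pick_yes(nil))                         -- yes
```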

> Could any kind soul point me to any coherent 
> resources which may shed some sense on any of this?
http://www.i18nguy.com/

cheers