moin moin

On Friday 18 February 2005 11:53, PA wrote:
> > No less and no more (than "passing around").
> Hmmm... right. Not that helpful, isn't it?
It is quite helpful!
In many applications and especially a webserver everything
works perfectly well if you just shuffle around the data without
touching it!
You can dish out documents written using a dozen different
charsets without any concern! As long as every document
bears the proper content-type (at least in the meta header),
the user's browsers will get it right.
In the same way you use programmatic messages from some
message catalogue (my recommendation is a cdb) and store
and echo user's form input without regard for the encoding.

So where do you have to pay attention to the encoding?
- be careful with string len and sub
 They are ok only for single byte charsets.
 Quite straightforward for double-byte charsets
  - simply use twice the numbers.
 Difficult for multi-byte (variable length) and next to unusable
 for escape-driven encodings. That's why you want to stick
 to UTF-8 and use mine or some other lib.
- be careful with upper and lower
 as they require exact knowledge of the charset
 (but are meaningless in most scripts anyway)
- don't rely too much on character classes
 but usually it's enough to treat everything outside
 ASCII as some kind of "letter". E.g. properly recognizing
 non-Latin digits as such won't buy you much as long
 as you don't know how to get a numerical value
 from them (and, yes, that includes Roman "digits"!).
- avoid relying on collation
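
To illustrate the len and sub caveat from the list above, here is a minimal sketch. It assumes the string is valid UTF-8. utf8_len is a hypothetical helper written for this example; it is not a stock Lua function and not the API of the library mentioned above.

```lua
-- Hypothetical helper: count UTF-8 characters by counting every byte
-- that is not a continuation byte (continuation bytes are 128..191).
-- Assumes the input is valid UTF-8.
local function utf8_len(s)
  local n = 0
  for i = 1, string.len(s) do
    local b = string.byte(s, i)
    if b < 128 or b >= 192 then
      n = n + 1
    end
  end
  return n
end

-- "gruen" with a u-umlaut, in UTF-8: the umlaut is the two bytes 195 188.
local s = "gr\195\188n"

print(string.len(s))  -- 5: string.len counts bytes
print(utf8_len(s))    -- 4: characters

-- string.sub works on byte positions too, so this cuts the
-- two-byte character in half and leaves a dangling lead byte.
local broken = string.sub(s, 1, 3)
print(string.len(broken))  -- 3, and the last byte is 195
```

For a single-byte charset both counts would be identical, which is why len and sub are safe there.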

> In other words, Lua itself is not Unicode safe one way or another?
It is absolutely "Unicode safe"!
Does not break anything (avoiding sub, upper and lower).

> > or across the wire. Yet, a recoding extension is scheduled,
> This is futureware, right?
Right, but not far away.

> What about today?
Use Tcl or Java, as both have fairly complete support for that
- if you really need it.
With Lua, you can deploy external filters like iconv or recode.
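
A rough sketch of the external-filter approach, assuming a POSIX shell, an iconv binary on the PATH, and a Lua build where io.popen is available (it is not on every platform). The helper names shell_quote, iconv_command and recode_file are made up for this example.

```lua
-- Quote a string for a POSIX shell by wrapping it in single quotes
-- and escaping any single quote inside it.
local function shell_quote(s)
  return "'" .. string.gsub(s, "'", "'\\''") .. "'"
end

-- Build the iconv command line: iconv -f FROM -t TO FILE
local function iconv_command(from, to, path)
  return "iconv -f " .. shell_quote(from)
      .. " -t " .. shell_quote(to)
      .. " " .. shell_quote(path)
end

-- Run iconv as a filter and return the converted file contents.
-- Assumption: io.popen exists and iconv is installed.
local function recode_file(from, to, path)
  local pipe = io.popen(iconv_command(from, to, path), "r")
  if not pipe then
    return nil, "could not start iconv"
  end
  local data = pipe:read("*a")
  pipe:close()
  return data
end

print(iconv_command("ISO-8859-1", "UTF-8", "page.html"))
-- iconv -f 'ISO-8859-1' -t 'UTF-8' 'page.html'
```

Calling recode_file("ISO-8859-1", "UTF-8", "page.html") would then hand back the UTF-8 bytes as an ordinary Lua string, which can be passed around like any other.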

> This is what I would like to do, yes. How do I achieve that today with
> the stock Lua distribution?
depends on what you want to do:
- Passing around is fine with the stock.
- sub, len, upper, lower and char are done in

Maybe you could spell out what you really need?
After all, there are a lot of issues like character canonicalization
and bidirectional text that you may never have considered.
If you want to "just have it all in the most proper way",
read all of the Unicode tech reports,
then have a look at IBM's ICU,
and check out why it has to be a multimegabyte library
with a very vast and complicated interface -- pretty unlua.
There is no such thing as a free I18N.

But again, especially for a webserver, you can go to
great lengths in total oblivion.

> > wait for the encoding extension
> Hmmm... where is that fabled "extension"? Got a link?
please allow me one more weekend :)

> >> When I do aFile:read( "*all" ), what do I get back as far as character
> > whatever is in there
> Not very helpful, isn't it?
Yes it is.

> >> My OS do have a default character set encoding, but how do I
> >> know about it?
> > It is ISO-8859-1 (Latin-1).
> Always? ISO-8859-1?
On *your* OS I guess so.

> This is useless for three fourths of the world.
Right, on *their* OS it's different.
If your OS is *nixish, check out the LC_* and LANG* environment variables,
/etc/locale and similarly named files, man locale, setlocale et al.
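
As a hedged sketch of what checking those variables could look like from Lua: the code below assumes a POSIX-style environment and Lua 5.1 or later (for string.match). codeset_of and ctype_locale are hypothetical names, and the result only tells you what the environment claims, not what any given file actually contains.

```lua
-- Extract the codeset part of a POSIX locale name such as
-- "de_DE.UTF-8" or "en_US.ISO-8859-15@euro". Returns nil when the
-- name carries no codeset (for example "C" or "fr_FR").
local function codeset_of(locale)
  if not locale then
    return nil
  end
  return string.match(locale, "%.([^@]+)")
end

-- POSIX precedence for the character type category:
-- LC_ALL overrides LC_CTYPE, which overrides LANG.
-- An empty value counts as unset.
local function ctype_locale()
  local names = { "LC_ALL", "LC_CTYPE", "LANG" }
  for i = 1, 3 do
    local v = os.getenv(names[i])
    if v and v ~= "" then
      return v
    end
  end
  return nil
end

local loc = ctype_locale()
print(loc, codeset_of(loc))
```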

> > Bytes don't have no encoding.
> What about sequences of bytes?
Ask string.byte to spell out the values of multiple bytes
(if you really want to know them).
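
For instance, a small helper along these lines spells out a byte sequence in hex. hex_bytes is a hypothetical name used only here; the sketch relies on nothing beyond string.byte, string.format and table.concat.

```lua
-- Return the bytes of s as space-separated two-digit hex values.
local function hex_bytes(s)
  local out = {}
  for i = 1, string.len(s) do
    out[i] = string.format("%02X", string.byte(s, i))
  end
  return table.concat(out, " ")
end

-- The same five bytes, whatever encoding you believe they are in.
print(hex_bytes("gr\195\188n"))  -- 67 72 C3 BC 6E
```

Whether C3 BC is one UTF-8 character or two Latin-1 characters is something the bytes themselves cannot tell you.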

> Ok. How? I would like my application to always deal with UTF-8
> internally. And convert everything and anything coming its way to it.
> How?
see above

But then, as you are talking about webservers, there's also a simpler
answer working in practice (sorry for becoming very OT here,
we should go private on next occasion):
Convert your static documents once to UTF-8 from whatever charset they
were written in using iconv or recode and change the meta headers
to spell out "text/html; charset=UTF-8".
Most browsers will send any form data in UTF-8 and voila.
(This violates the standards; strictly speaking you would have to use
multipart/form-data, which should send a charset with every little piece.)

> You mean your recently mentioned UTF-8 library?

> This is the extension you are talking about?
no, recoding is a different issue

> Is Lua itself going to ever support Unicode directly one way or another?
"itself"? utf8 is just a lib like string is, which is perfectly fine,
and all-in-one packages a la LuaCheia may one day include it.

> Ok. Any alternatives? How do people deal with locales then?
> Just pretend they are not there?

> >> Accept-Language header. One request is in de_DE, the next one in fr_FR
> >> and so on, while the application default language is en_US. How does
> Hmmm... what if I do care?
All a webserver has to do is pick the right version of the document.
E.g. when asked for /foo/bar.html, see if there is a /fr_FR/foo/bar.html
(or /foo/bar.fr_FR.html if you prefer) and if so, use it, else some default.
That's all.
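
The lookup described above could be sketched roughly as follows. pick_document and file_exists are hypothetical names, the document root is made up, and turning an Accept-Language header into a tag like fr_FR is left out. The tag is checked against a simple pattern so that a request cannot smuggle "../" into the path.

```lua
-- Default existence check: try to open the file for reading.
local function file_exists(path)
  local f = io.open(path, "r")
  if f then
    f:close()
    return true
  end
  return false
end

-- root:   document root on disk, e.g. "/var/www" (hypothetical)
-- path:   request path, e.g. "/foo/bar.html"
-- lang:   language tag taken from the request, e.g. "fr_FR", or nil
-- exists: optional predicate, defaults to file_exists
local function pick_document(root, path, lang, exists)
  exists = exists or file_exists
  -- Accept only letters and underscores in the language tag.
  if lang and string.find(lang, "^[%a_]+$") then
    local localized = root .. "/" .. lang .. path
    if exists(localized) then
      return localized
    end
  end
  -- No localized version: fall back to the default document.
  return root .. path
end

print(pick_document("/var/www", "/foo/bar.html", "fr_FR"))
```

The exists parameter is only there so the lookup can be tried out without touching the file system.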

> > First, all of these are using Latin-1 anyways.
> Latin-1 doesn't work for my potential Japanese users.
If they are asking for fr_FR, they will get by with Latin-1.

> > If you happen to have Russian text in KOI on your server,
> > you ought to know that.
> Sorry, you totally lost me here.
If there is somebody doing the Russian translations for you,
they will know which charset they are using.

> In any case, what seems to emerge from all this mess is that Lua is
> simply not ready for prime time as far as i18n goes.
It is perfectly well ready, as I18N is a mere library issue.

> Is that a fair assessment or did I miss something obvious as usual?
Sounds like you've been expecting the kind of magic the
C locale API promises, but fails to deliver.
It just doesn't work that way.