On Feb 18, 2005, at 14:06, Klaus Ripke wrote:

moin moin

Like in:

No less and no more (than "passing around").
Hmmm... right. Not that helpful, is it?
It is quite helpful!
In many applications and especially a webserver everything
works perfectly well if you just shuffle around the data without
touching it!

Oh, well... my server is of the type which creates the documents in the first place as well.

So where do you have to pay attention to the encoding?
- be careful with string len and sub
 They are ok only for single byte charsets.
 Quite straightforward for double-byte charsets
  - simply use twice the numbers.
 Difficult for multi-byte (variable length) and next to unusable
 for escape-driven encodings. That's why you want to stick
 to UTF-8 and use mine or some other lib.
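
To make the len/sub caveat concrete, here is a minimal sketch, not taken from any library mentioned in this thread. It is written in Python on raw byte strings, which behave like stock Lua strings in this respect. The helper names utf8_len and utf8_sub are hypothetical, and the input is assumed to be valid UTF-8.

```python
def utf8_len(data: bytes) -> int:
    """Count code points by skipping continuation bytes (bit pattern 10xxxxxx)."""
    return sum(1 for b in data if (b & 0xC0) != 0x80)


def utf8_sub(data: bytes, start: int, end: int) -> bytes:
    """Return code points start..end (1-based, inclusive).

    Negative indices are not handled in this sketch.
    """
    # Byte offset of the first byte of every code point, plus an end sentinel.
    offsets = [i for i, b in enumerate(data) if (b & 0xC0) != 0x80]
    offsets.append(len(data))
    start = max(start, 1)
    end = min(end, len(offsets) - 1)
    if start > end:
        return b""
    return data[offsets[start - 1]:offsets[end]]


# "naive" with a diaeresis on the i, a space, then three kanji, encoded as UTF-8.
text = "na\u00efve \u65e5\u672c\u8a9e".encode("utf-8")
print(len(text))       # 16 bytes
print(utf8_len(text))  # 9 code points
print(utf8_sub(text, 7, 9))  # the bytes of the three kanji only
```

A byte-based sub(7, 9) on the same data would cut the first kanji in the middle, which is the breakage being warned about.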

I'm fine with UTF-8. I would love to simply use UTF-8 everywhere. I'm just confused about how to reach that goal in a more or less straightforward fashion.

- be careful with upper and lower
 as they require exact knowledge of the charset
 (but are meaningless in most scripts anyway)

Hmmm... lost me here. Why would case conversion be meaningless? Scripts or no scripts?

- don't rely too much on character classes
 but usually it's enough to treat everything outside
 ASCII as some kind of "letter". E.g. properly recognizing
 non-Latin digits as such won't buy you much as long
 as you don't know how to get a numerical value
 from them (and, yes, that includes roman "digits"!).
- avoid relying on collation

In other words, Lua itself is not Unicode safe one way or another?
It is absolutely "Unicode safe"!
Does not break anything (avoiding sub, upper and lower).

Ok. Do you mean that Lua is 'absolutely "Unicode safe"' as long as you don't touch any of those strings? This would rather limit the usefulness of Lua, no?

What about today?
Use Tcl or Java, as both have fairly complete support for that
- if you really need it.
With Lua, you can deploy external filters like iconv or recode.


You mean those:

What would be the proper way to subvert all of Lua's internal libraries (string, io, etc.) to work transparently in terms of the above (or at least systematically in terms of UTF-8)?

depends on what you want to do:
- Passing around is fine with the stock.
- sub, len, upper, lower and char are done in

Ok. But that seems to imply that I have to forgo Lua's string library altogether and use the above library instead. Is that a drop in replacement? Just asking :))

Maybe you could spell out what you really need?

I have a little app, which creates, manipulates and serves textual information:

And I simply would like to make sure, before it's too late, that I handle character set encoding properly.

After all, there are a lot of issues like character canonicalization
and bidirectional text you maybe haven't ever considered?
If you want to "just have it all in the most proper way",
read all of the Unicode tech reports,
then have a look at IBM's ICU,
and check out why it has to be a multimegabyte library
with a very vast and complicated interface -- pretty unlua.
There is no such thing as a free I18N.

I don't expect it to be free (of pain). Just would like to know where Lua stands in this mess. And act accordingly.

But again, especially for a webserver, you can go to
great lengths in total oblivion.

Not in this case. It's an embedded server which acts as the primary interface to the application.

Very much like this application:

wait for the encoding extension
Hmmm... where is that fabled "extension"? Got a link?
please allow me one more weekend :)

Sorry. I thought this was something coming in the Lua's master plan or something :)

My OS does have a default character set encoding, but how do I
know about it?
It is ISO-8859-1 (Latin-1).
Always? ISO-8859-1?
On *your* OS I guess so.

Ok. What if I don't want to use my OS encoding... perhaps for portability reasons... but would like to systematically use UTF-8 instead... what should I do, if anything?

This is useless for three fourths of the world.
Right, on *their* OS it's different.


Being *nixish, check out LC* and LANG* environment variables,
/etc/locale and similar named, man locale, setlocale et al.
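
As a rough illustration of the environment-variable check, here is a sketch in Python. It assumes the usual POSIX precedence (LC_ALL, then LC_CTYPE, then LANG) and locale names of the form language_TERRITORY.codeset@modifier. The function name charset_from_env and the "ASCII" fallback are assumptions made for this example; a real system may map a locale without an explicit codeset to something else.

```python
import os


def charset_from_env(env=None, default="ASCII"):
    """Guess the codeset from LC_ALL, LC_CTYPE or LANG, in that order."""
    env = os.environ if env is None else env
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(name)
        if not value:
            continue  # unset or empty: fall through to the next variable
        locale_name = value.split("@", 1)[0]  # drop a modifier such as @euro
        if "." in locale_name:
            return locale_name.split(".", 1)[1]
        return default  # e.g. "C" or "fr_FR" with no explicit codeset
    return default


print(charset_from_env({"LANG": "ja_JP.UTF-8"}))
print(charset_from_env())  # whatever the current process environment says
```

This only reports what the environment claims. It says nothing about the encoding of the data the application actually receives.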

Right... but... I simply would like to have my application work in a more or less portable manner. Any way to do that in Lua? If the answer is no, that's fine. I just would like to know so I can decide to ditch one tool in favor of another one if necessary. That's all.

Ask string.byte to spell out the values of multiple bytes
(if you really want to know them).

I'm not that interested in the various ways to encode the same string. I simply would like my app to work in Unicode using UTF-8 as its sole encoding. That's all. If this is too cumbersome to achieve, so be it. I just would like to know :)

But then, as you are talking about webservers, there's also a simpler
answer working in practice (sorry for becoming very OT here,
we should go private on next occasion):
Convert your static documents once to UTF-8 from whatever charset they
were written in using iconv or recode and change the meta headers
to spell out "text/html; charset=UTF-8".
Most browsers will send any form data in UTF-8 and voila.
(This is violating the standards; strictly speaking you would have to use
multipart/form-data, which should send a charset with every little piece).
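
The one-time conversion can be sketched as follows, in Python rather than with iconv or recode. The function name to_utf8 and the sample page are made up for the example, and the source charset has to be known in advance because it cannot be detected reliably from the bytes alone.

```python
def to_utf8(raw: bytes, source_charset: str) -> bytes:
    """Re-encode a document from a known source charset to UTF-8."""
    return raw.decode(source_charset).encode("utf-8")


# Header value to send along with every converted document.
CONTENT_TYPE = "text/html; charset=UTF-8"

latin1_page = b"<p>caf\xe9</p>"  # e-acute is the single byte 0xE9 in Latin-1
utf8_page = to_utf8(latin1_page, "iso-8859-1")

print(CONTENT_TYPE)
print(utf8_page)  # b'<p>caf\xc3\xa9</p>'
```

Any charset declaration inside the document itself would still need to be updated to match; this sketch does not touch the markup.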

Yes. This is usually what I do in the first place by specifying the form's accept-charset in addition to the document and HTTP header character set encoding.

All a webserver has to do is pick the right version of the document.
E.g. when asked for /foo/bar.html, see if there is a /fr_FR/foo/bar.html (or /foo/bar.fr_FR.html if you prefer) and if so, use it, else some default.
That's all.
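
A minimal sketch of that lookup, in Python, trying both naming schemes mentioned above before falling back to the default. The names candidate_paths and pick_document are hypothetical, the exists argument stands in for whatever file check the server uses, and nothing here guards against ".." in the requested path.

```python
import posixpath


def candidate_paths(path, locale):
    """List the localized variants of a request path, then the default."""
    rel = path.lstrip("/")
    stem, ext = posixpath.splitext(rel)
    return [
        posixpath.join("/", locale, rel),  # /fr_FR/foo/bar.html
        "/" + stem + "." + locale + ext,   # /foo/bar.fr_FR.html
        "/" + rel,                         # /foo/bar.html (default)
    ]


def pick_document(path, locale, exists):
    """Return the first candidate for which exists(candidate) is true."""
    for candidate in candidate_paths(path, locale):
        if exists(candidate):
            return candidate
    return None


# A pretend document tree instead of a real file system.
available = {"/fr_FR/foo/bar.html", "/foo/bar.html"}
print(pick_document("/foo/bar.html", "fr_FR", available.__contains__))
print(pick_document("/foo/bar.html", "ja_JP", available.__contains__))
```

With this tree, the fr_FR request resolves to the localized file and the ja_JP request falls back to the default one.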

Perhaps there is no document in the first place? Perhaps I'm generating such a document on the fly? Perhaps I would like to generate it according to whatever accept-language header I have received?

First, all of these are using Latin-1 anyways.
Latin-1 doesn't work for my potential Japanese users.
If they are asking for fr_FR, they will get by with Latin-1.

They are asking for ja_JP. French will simply not do. Nor any other Latin-1. They wrote it in Japanese. They want to have it back in Japanese. Or Hindi. Or whatever they use. Or perhaps one document is in Korean. And the next in Sanskrit. And they both need to be displayed on the same "page". Such is life :P

Sounds like you've been expecting the kind of magic the
C locale API promises, but fails to deliver.
It just doesn't work that way.

Ok. I'm fine with that. I just would like to know how it does work in practice, if it works at all. If Lua is not meant to be used that way, so be it. I just would like to know.

Thanks :)


PA, Onnay Equitursay