Re: The World According to Lua: How To?

lua-l archive


Subject: Re: The World According to Lua: How To?
From: PA <petite.abeille@...>
Date: Fri, 18 Feb 2005 16:18:40 +0100


On Feb 18, 2005, at 14:06, Klaus Ripke wrote:

moin moin


Like in:

http://www.answers.com/moin+moin&r=67

No less and no more (than "passing around").

Hmmm... right. Not that helpful, is it?

It is quite helpful!
In many applications and especially a webserver everything
works perfectly well if you just shuffle around the data without
touching it!

Oh, well... my server is of the type which creates the documents in the first place as well.

So where do you have to pay attention to the encoding?
- be careful with string len and sub
 They are ok only for single byte charsets.
 Quite straightforward for double-byte charsets
  - simply use twice the numbers.
 Difficult for multi-byte (variable length) and next to unusable
 for escape-driven encodings. That's why you want to stick
 to UTF-8 and use mine or some other lib.

I'm fine with UTF-8. I would love to simply use UTF-8 everywhere. I'm just confused about how to reach that goal in a more or less straightforward fashion.
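To make the len/sub caveat concrete, here is a minimal sketch of code-point-aware length and substring functions for UTF-8. It assumes Lua 5.1 or later (for the # operator) and well-formed UTF-8 input; the names utf8_len and utf8_sub are made up for this illustration and are not the API of slnutf8 or any other library.

```lua
-- A byte starts a new code point unless it is a continuation byte (128..191).
local function is_lead_byte(b)
  return b < 128 or b >= 192
end

-- Count code points instead of bytes.
local function utf8_len(s)
  local count = 0
  for pos = 1, #s do
    if is_lead_byte(string.byte(s, pos)) then
      count = count + 1
    end
  end
  return count
end

-- Substring by code point index (1-based, inclusive), not by byte index.
local function utf8_sub(s, i, j)
  local starts = {}  -- byte offset at which each code point begins
  for pos = 1, #s do
    if is_lead_byte(string.byte(s, pos)) then
      starts[#starts + 1] = pos
    end
  end
  j = j or #starts
  if i < 1 then i = 1 end
  if j > #starts then j = #starts end
  if i > j then return "" end
  local last = (starts[j + 1] or (#s + 1)) - 1
  return string.sub(s, starts[i], last)
end

local s = "na\195\175ve"     -- "naïve": the ï is the two bytes 195 175
print(string.len(s))         -- number of bytes
print(utf8_len(s))           -- number of code points
print(utf8_sub(s, 1, 3))     -- keeps the two-byte ï intact
```

Note that string.sub(s, 1, 3) on the same string would cut the ï in half, which is exactly the kind of breakage described above.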

- be careful with upper and lower
 as they require exact knowledge of the charset
 (but are meaningless in most scripts anyway)

Hmmm... lost me here. Why would case conversion be meaningless? Scripts or no scripts?
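One way to sidestep the charset dependency of upper and lower is to map only the ASCII letters and leave every other byte alone. The sketch below does that with byte arithmetic, so it does not depend on the C locale at all; ascii_upper is a hypothetical helper name, and real case mapping for non-ASCII letters would still need Unicode tables.

```lua
-- Uppercase ASCII letters only; bytes >= 128 (all UTF-8 multi-byte
-- sequences) pass through untouched, so the string stays valid UTF-8.
local function ascii_upper(s)
  return (string.gsub(s, "[a-z]", function(c)
    return string.char(string.byte(c) - 32)
  end))
end

local s = "\195\169t\195\169"   -- "été" in UTF-8
print(ascii_upper(s))           -- only the "t" is changed
```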

- don't rely too much on character classes
 but usually it's enough to treat everything outside
 ASCII as some kind of "letter". E.g. properly recognizing
 non-Latin digits as such won't buy you much as long
 as you don't know how to get a numerical value
 from them (and, yes, that includes roman "digits"!).
- avoid relying on collation
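The advice to treat everything outside ASCII as some kind of "letter" can be sketched with a plain Lua pattern. This assumes Lua 5.1 or later (for string.gmatch) and UTF-8 input; the function name words is chosen for this illustration only.

```lua
-- Split text into "words", counting ASCII letters and every byte >= 128
-- as letter material. Multi-byte UTF-8 sequences therefore stay inside
-- the word they belong to.
local function words(s)
  local result = {}
  for w in string.gmatch(s, "[A-Za-z\128-\255]+") do
    result[#result + 1] = w
  end
  return result
end

local list = words("caf\195\169 au lait, 2 sucres")
for i = 1, #list do
  print(list[i])
end
```

The obvious cost is that non-ASCII punctuation and digits are lumped in with letters, which is the trade-off the list above accepts.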

In other words, Lua itself is not Unicode safe one way or another?

It is absolutely "Unicode safe"!
Does not break anything (avoiding sub, upper and lower).

Ok. Do you mean that Lua is 'absolutely "Unicode safe"' as long as you don't touch any of those strings? This would rather limit the usefulness of Lua, no?

What about today?

Use Tcl or Java, as both have fairly complete support for that
- if you really need it.
With Lua, you can deploy external filters like iconv or recode.


Ok.

You mean those:

http://www.gnu.org/software/libiconv/
http://www.gnu.org/software/recode/recode.html

What would be the proper way to subvert all Lua's internal libraries (string, io, etc) to work transparently in terms of the above (or at least systematically in terms of UTF-8)?

depends on what you want to do:
- Passing around is fine with the stock.
- sub, len, upper, lower and char are done in
  http://malete.org/tar/slnutf8.0.8.tar.gz

Ok. But that seems to imply that I have to forgo Lua's string library altogether and use the above library instead. Is that a drop-in replacement? Just asking :))


Maybe you could spell out what you really need?

I have a little app, which creates, manipulates and serves textual information:


http://alt.textdrive.com/lua/23/lua-lupad

And I simply would like to make sure, before it's too late, that I handle character set encoding properly.

After all, there are a lot of issues like character canonicalization
and bidirectional text you maybe haven't ever considered?
If you want to "just have it all in the most proper way",
read all of the Unicode tech reports,
then have a look at IBM's ICU,
and check out why it has to be a multimegabyte library
with a very vast and complicated interface -- pretty unlua.
There is no such thing as a free I18N.

I don't expect it to be free (of pain). Just would like to know where Lua stands in this mess. And act accordingly.

But again, especially for a webserver, you can go to
great lengths in total oblivion.

Not in this case. It's an embedded server which acts as the primary interface to the application.


Very much like this application:

http://zoe.nu/
http://zoe.nu/itstories/story.php?data=stories&num=23&sec=2

wait for the encoding extension

Hmmm... where is that fabled "extension"? Got a link?

please allow me one more weekend :)

Sorry. I thought this was something coming in Lua's master plan or something :)

My OS does have a default character set encoding, but how do I
know about it?

It is ISO-8859-1 (Latin-1).

Always? ISO-8859-1?

On *your* OS I guess so.

Ok. What if I don't want to use my OS encoding... perhaps for portability reasons... but would like to systematically use UTF-8 instead... what should I do, if anything?

This is useless for three fourths of the world.

Right, on *their* OS it's different.


"Theirs"?

Being *nixish, check out LC* and LANG* environment variables,
/etc/locale and similar named, man locale, setlocale et al.

Right... but... I simply would like to have my application work in a more or less portable manner. Any way to do that in Lua? If the answer is no, that's fine. I just would like to know so I can decide to ditch one tool in favor of another one if necessary. That's all.
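For the *nix case, the environment lookup suggested above can be sketched as follows. The order LC_ALL, then LC_CTYPE, then LANG is the usual POSIX precedence for the character-type category. The function name locale_charset and the injectable getenv parameter are assumptions made for this illustration, and the result says nothing about non-POSIX systems.

```lua
-- Return the locale name that governs character handling and, if the
-- name carries one, its codeset (the part after the dot, e.g. "UTF-8").
-- getenv defaults to os.getenv; it is a parameter only to ease testing.
local function locale_charset(getenv)
  getenv = getenv or os.getenv
  local names = { "LC_ALL", "LC_CTYPE", "LANG" }
  for i = 1, #names do
    local value = getenv(names[i])
    if value and value ~= "" then
      -- Locale names look like language_TERRITORY.codeset@modifier.
      local codeset = string.match(value, "%.([^@]+)")
      return value, codeset
    end
  end
  return nil, nil
end

print(locale_charset())
```

A name such as "C" or "fr_FR" carries no codeset, in which case the second result is nil and the application has to fall back on its own default.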

Ask string.byte to spell out the values of multiple bytes
(if you really want to know them).

I'm not that interested in the various ways to encode the same string. I simply would like my app to work in Unicode using UTF-8 as its sole encoding. That's all. If this is too cumbersome to achieve, so be it. I just would like to know :)
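For completeness, this is roughly what asking string.byte for the byte values looks like. The helper name byte_values is made up for this illustration, and the # operator assumes Lua 5.1 or later.

```lua
-- List the numeric value of every byte in a string, separated by spaces.
local function byte_values(s)
  local out = {}
  for i = 1, #s do
    out[i] = string.byte(s, i)
  end
  return table.concat(out, " ")
end

print(byte_values("\195\169"))   -- the two bytes of "é" in UTF-8
print(byte_values("\233"))       -- the single byte of "é" in Latin-1
```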

But then, as you are talking about webservers, there's also a simpler
answer working in practice (sorry for becoming very OT here,
we should go private on next occasion):
Convert your static documents once to UTF-8 from whatever charset they
were written in using iconv or recode and change the meta headers
to spell out "text/html; charset=UTF-8".
Most browsers will send any form data in UTF-8 and voila.
(This is violating the standards; strictly speaking you would have to use
multipart/form-data, which should send a charset with every little piece).

Yes. This is usually what I do in the first place by specifying form's accept-charset in addition to the document and HTTP header character set encoding.
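The one-time conversion is best left to iconv or recode, as suggested. For the single case where the source charset is known to be ISO-8859-1, the mapping is simple enough to sketch in plain Lua, since every Latin-1 byte is also the Unicode code point of the same number. latin1_to_utf8 is a hypothetical helper for illustration, not a replacement for a real converter.

```lua
-- Convert ISO-8859-1 bytes to UTF-8. Bytes below 128 are unchanged;
-- bytes 128..255 become a two-byte sequence: a lead byte (194 or 195)
-- followed by one continuation byte.
local function latin1_to_utf8(s)
  return (string.gsub(s, "[\128-\255]", function(c)
    local b = string.byte(c)
    return string.char(192 + math.floor(b / 64), 128 + b % 64)
  end))
end

local latin1 = "caf\233"          -- "café" in ISO-8859-1
local utf8 = latin1_to_utf8(latin1)
print(#latin1, #utf8)             -- byte counts before and after
```

Running this on text that is already UTF-8 would double-encode it, so the source charset has to be known beforehand.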

All a webserver has to do is pick the right version of the document.
E.g. when asked for /foo/bar.html, see if there is a /fr_FR/foo/bar.html (or /foo/bar.fr_FR.html if you prefer) and if so, use it, else some default.
That's all.
Really.
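The lookup described here could be sketched like this. The docroot value "/srv/www", the function names, and the injectable exists parameter are all assumptions made for this illustration; the existence check simply tries to open the file for reading.

```lua
-- Default existence check: a file "exists" if it can be opened for reading.
local function file_exists(path)
  local f = io.open(path, "r")
  if f then
    f:close()
    return true
  end
  return false
end

-- Prefer <docroot>/<locale><path>; fall back to <docroot><path>.
-- exists defaults to file_exists and is a parameter only to ease testing.
local function pick_document(docroot, path, locale, exists)
  exists = exists or file_exists
  local localized = docroot .. "/" .. locale .. path
  if exists(localized) then
    return localized
  end
  return docroot .. path
end

print(pick_document("/srv/www", "/foo/bar.html", "fr_FR"))
```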

Perhaps there is no document in the first place? Perhaps I'm generating such a document on the fly? Perhaps I would like to generate it according to whatever accept-language header I have received?
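For the generate-on-the-fly case, the Accept-Language header has to be turned into an ordered list of preferences first. A rough sketch, assuming Lua 5.1 or later and the common "tag;q=value" syntax; parse_accept_language is a hypothetical name, and the parsing is deliberately lenient rather than a full implementation of the HTTP grammar.

```lua
-- Parse an Accept-Language header into a list of { tag, q } entries,
-- sorted by descending q. Entries with equal q keep their header order.
local function parse_accept_language(header)
  local prefs = {}
  for item in string.gmatch(header, "[^,]+") do
    local tag = string.match(item, "^%s*([%a%d%-%*]+)")
    local qs = string.match(item, ";%s*q%s*=%s*([%d%.]+)")
    local q = qs and tonumber(qs) or 1   -- a missing q means 1
    if tag then
      prefs[#prefs + 1] = { tag = tag, q = q, order = #prefs }
    end
  end
  table.sort(prefs, function(a, b)
    if a.q ~= b.q then
      return a.q > b.q
    end
    return a.order < b.order
  end)
  return prefs
end

local prefs = parse_accept_language("en;q=0.5, ja-JP, ja;q=0.8")
for i = 1, #prefs do
  print(prefs[i].tag, prefs[i].q)
end
```

The application can then walk the list and generate the document in the first language it actually supports.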

First, all of these are using Latin-1 anyways.

Latin-1 doesn't work for my potential Japanese users.

If they are asking for fr_FR, they will get by with Latin-1.

They are asking for ja_JP. French will simply not do. Nor any other Latin-1. They wrote it in Japanese. They want to have it back in Japanese. Or Hindi. Or whatever they use. Or perhaps one document is in Korean. And the next in Sanskrit. And they both need to be displayed on the same "page". So is life :P

Sounds like you've been expecting the kind of magic the
C locale API promises, but fails to deliver.
It just doesn't work that way.

Ok. I'm fine with that. I just would like to know how it does work in practice, if it works at all. If Lua is not meant to be used that way, so be it. I just would like to know.


Thanks :)

Cheers

--
PA, Onnay Equitursay
http://alt.textdrive.com/

Follow-Ups:
- Re: The World According to Lua: How To?, Klaus Ripke
- Re: The World According to Lua: How To?, Bernardo Signori

References:
- The World According to Lua: How To?, PA
- Re: The World According to Lua: How To?, Klaus Ripke
- Re: The World According to Lua: How To?, PA
- Re: The World According to Lua: How To?, Klaus Ripke
