On Friday 18 February 2005 16:18, PA wrote:
> http://www.answers.com/moin+moin&r=67
Right, Low Saxon from East Frisia :)

> Oh, well... my server is of the type which creates the documents in the
> first place as well.
but gathering text from some source ...

> I'm fine with UTF-8. I would love to simply use UTF-8 everywhere.
Absolutely no problem, and Lua is perfect for that.
We have three projects right now where we need it,
so you can kind of count on me getting it done.

> > - be careful with upper and lower
> >  (but are meaningless in most scripts anyway)
>
> Hmmm... lost me here. Why would case conversion be meaningless. Scripts
> or no scripts?
Oh, script in the sense of Phoenician vs. Sinitic or Brahmi.
Only the small Greek-derived fork (Greek, Latin, Cyrillic and a few relatives) uses case.
http://web.archive.org/web/20030806053052/czyborra.com/unicode/characters.html#scripts

> Ok. Do you mean that Lua is 'absolutely "Unicode safe"' as long as you
> don't touch any of those strings?
no, as long as you use the proper functions to touch them.
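
To illustrate what "proper functions" means here: string.len counts bytes, not characters. The following is a minimal sketch, not the utf8 library mentioned in this thread; utf8_len is a hypothetical helper name, and it assumes Lua 5.1 or later and input that is already valid UTF-8.

```lua
-- Count UTF-8 characters by counting every byte that is NOT a
-- continuation byte (continuation bytes are 0x80..0xBF, i.e. 128..191).
-- Assumption: s is valid UTF-8; no validation is done here.
local function utf8_len(s)
  -- gsub's second return value is the number of matches
  local _, n = string.gsub(s, "[^\128-\191]", "")
  return n
end

local word = "na\195\175ve"   -- "naive" with i-diaeresis (bytes C3 AF)
print(string.len(word))       -- byte count
print(utf8_len(word))         -- character count
```

The string itself passes through Lua untouched; only the counting has to know about the encoding.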

> What would be the proper way to subvert all Lua's internal libraries
> (string, io, etc) to work transparently in terms of the above (or at
> least systematically in terms of UTF-8).
For string, use my little utf8 stuff instead.

For io and sockets it's probably best to use an ltn12-style filter.
The question of what the most generic read/write interface should be,
regardless of whether it sits on io (file), a socket or a filter around
either of those, is more general and afaik not yet settled (?);
once it's there, a recoding filter is just one special case.
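
As a rough sketch of such a recoding filter: the function below follows the LTN12 convention as I understand it (a string chunk is data, nil marks end of stream) and recodes ISO-8859-1 to UTF-8. The source charset is only an example, pump is a hypothetical helper for demonstration, and Lua 5.1 or later is assumed.

```lua
-- LTN12-style filter: ISO-8859-1 chunk in, UTF-8 chunk out.
-- Every Latin-1 byte maps to one code point on its own, so chunk
-- boundaries need no special handling and the filter keeps no state.
local function latin1_to_utf8(chunk)
  if chunk == nil then return nil end   -- end of stream: pass it on
  local out = string.gsub(chunk, "[\128-\255]", function(c)
    local b = string.byte(c)
    -- U+0080..U+00FF become two bytes: 110000xx 10xxxxxx
    return string.char(192 + math.floor(b / 64), 128 + b % 64)
  end)
  return out
end

-- Hypothetical helper: run a list of chunks through a filter and join them.
local function pump(chunks, filter)
  local parts = {}
  for _, chunk in ipairs(chunks) do
    parts[#parts + 1] = filter(chunk)
  end
  return table.concat(parts)
end
```

The charset is stated explicitly by choosing the filter; nothing is guessed from the environment. A filter for a multi-byte source encoding would additionally have to buffer a partial character at the end of a chunk.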

Anyway, there's no need to subvert anything here.
If you really want all of io to recode transparently, that is,
without explicitly stating which charset is on the other end
(which involves some guesswork about the "system encoding"),
it can still be done using the usual wrapping techniques.
But I'd always go for an explicit interface, as guesswork is bound to fail.

Nothing else would need to be touched.


> Ok. But that seems to imply that I have to forgo Lua's string library
> altogether and use the above library instead. Is that a drop in
> replacement? Just asking :))
In principle yes -- to the extent that any
data used with string really is UTF-8 character data
(and once the mentioned todos are done).
In practice this is looking for trouble, as it would break other
libs that use string to cut arbitrary bytes out of something.
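
A small sketch of why the two should live side by side rather than one replacing the other: string.sub slices bytes, which is exactly what binary-handling code wants, while a character-aware slice has to respect multi-byte sequences. utf8_sub is a hypothetical stand-in, not the library discussed here; it assumes Lua 5.1 or later, valid UTF-8 input and positive indices only.

```lua
-- Character-aware substring, kept apart from the standard string table
-- so that code relying on byte semantics is not affected.
-- Assumptions: s is valid UTF-8; i and j are positive character indices.
local function utf8_sub(s, i, j)
  local chars = {}
  -- one lead byte (anything that is not a continuation byte)
  -- followed by its continuation bytes (128..191)
  for ch in string.gmatch(s, "[^\128-\191][\128-\191]*") do
    chars[#chars + 1] = ch
  end
  i = math.max(i or 1, 1)
  j = math.min(j or #chars, #chars)
  return table.concat(chars, "", i, j)
end

local word = "na\195\175ve"             -- 5 characters, 6 bytes
local by_bytes = string.sub(word, 1, 3) -- cuts the 2-byte character in half
local by_chars = utf8_sub(word, 1, 3)   -- keeps it whole
```

Here by_bytes ends in a lone lead byte and is no longer valid UTF-8, whereas by_chars holds the first three characters. Both behaviours are correct for their own purpose, which is why swapping one for the other globally breaks things.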

> http://alt.textdrive.com/lua/23/lua-lupad
fine, fine TextDrive staff :)

> And I simply would like to make sure, before it's too late, that I
> handle character set encoding properly.
doesn't look like there's too much need to manipulate the texts? 

> Just would like to know where Lua stands in this mess.
I'd say in the middle of nowhere, which is a fine place to be.

> Ok. What if I don't want to use my OS encoding...
right. forget about the OS.

> but would like to systematically use UTF-8 instead...
> what should I do, if anything?
gather data in UTF-8 from the first minute
and save it as is without applying any recoding.
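
A minimal sketch of "save it as is", assuming Lua 5.1 or later: Lua strings are plain byte sequences, so writing and reading in binary mode round-trips UTF-8 unchanged. save_bytes and load_bytes are hypothetical helper names.

```lua
-- Write a string to disk exactly as it is, byte for byte.
-- "wb"/"rb" avoid newline translation on platforms that do it in text mode.
local function save_bytes(path, data)
  local f = assert(io.open(path, "wb"))
  f:write(data)
  f:close()
end

-- Read a whole file back as one string, with no recoding step.
local function load_bytes(path)
  local f = assert(io.open(path, "rb"))
  local data = f:read("*a")
  f:close()
  return data
end
```

No locale or "system encoding" is consulted anywhere; what was gathered as UTF-8 is what comes back.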

> Right... but... I simply would like to have my application work in a
> more or less portable manner. Anyway to do that in Lua?
The key to portability is simply not to ask the OS for it.
Then all is fine with Lua, as it won't break anything either.


> such a document on the fly? Perhaps I would like to generate it
> according to whatever accept-language header I have received?
you're going for automatic translation?

> They want to have it back in Japanese. Or Hindu. Or whatever they use.
> Or perhaps one document is in Korean. And the next in Sanskrit.
> And they both need to be displayed on the same "page". So is life :P
no problem