lua-l archive


On Feb 18, 2005, at 17:40, Klaus Ripke wrote:

Oh, well... my server is of the type which creates the documents in the
first place as well.
but gathering text from some source ...

Yes. Usually an HTTP POST.

I'm fine with UTF-8. I would love to simply use UTF-8 everywhere.
Absolutely no problem and lua is perfect for that.
We're just right now having three projects where we need that,
so you can kind of count on me getting it done.

Good :)

- be careful with upper and lower
 (but are meaningless in most scripts anyway)

Hmmm... lost me here. Why would case conversion be meaningless? Scripts
or no scripts?
oh, script in the sense of Phoenician vs. Sinitic or Brahmi.
Only the tiny Greek fork uses case. characters.html#scripts

I see. I don't mind too much if Coptic subtleties are not fully accounted for :P
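
To make the "be careful with upper and lower" point concrete, here is a minimal sketch, assuming Lua 5.1 or later and the "C" locale. The sample string is my own; it only shows that the stock string.upper works byte by byte and leaves a non-ASCII letter untouched, so the result is not a real uppercase form.

```lua
-- Force the "C" locale so that only ASCII letters are case-mapped.
os.setlocale("C")

-- "café" as UTF-8 bytes: the final letter is the two bytes 0xC3 0xA9.
local s = "caf\195\169"

-- string.upper maps each byte on its own; the two bytes of the
-- accented letter are not ASCII letters, so they come back unchanged.
local up = string.upper(s)

-- The ASCII part is uppercased, the accented letter is not.
print(up == "CAF\195\169")
```

Under a single-byte locale such as Latin-1 the same call may rewrite individual bytes of a multi-byte sequence, which is the case to watch out for.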

For string, use my little utf8 stuff instead.

Ok. Will take a closer look.

For io and sockets it's probably best to use a ltn12-style filter.
The question of what should be the most generic read/write
interface regardless of io(file), socket or filter around any of
those is more general and afaik not yet settled (?);
once it's there, a recoding filter is just one special case.
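
A recoding filter in that style could look roughly like the following. This is only a sketch, assuming Lua 5.1 or later and the ltn12 convention as I understand it (a filter is a function that takes a chunk and returns the processed chunk, with nil marking end of data). The names latin1_byte_to_utf8 and latin1_to_utf8 are hypothetical, not part of LuaSocket or any existing library, and Latin-1 is picked only because it can be recoded one byte at a time.

```lua
-- Hypothetical helper: map one Latin-1 byte >= 0x80 to its
-- two-byte UTF-8 sequence.
local function latin1_byte_to_utf8(c)
  local b = string.byte(c)
  return string.char(0xC0 + math.floor(b / 64), 0x80 + b % 64)
end

-- Hypothetical filter: string chunk in, string chunk out.
-- nil (end of data) is passed through unchanged.
local function latin1_to_utf8(chunk)
  if chunk == nil then return nil end
  -- Extra parentheses drop gsub's second return value (the match count).
  return (string.gsub(chunk, "[\128-\255]", latin1_byte_to_utf8))
end

-- Usage: feed chunks as they arrive, e.g. from a file or a socket.
local out = {}
for _, chunk in ipairs({ "caf\233 ", "cr\232me" }) do
  out[#out + 1] = latin1_to_utf8(chunk)
end
local result = table.concat(out)
print(result == "caf\195\169 cr\195\168me")
```

Because each Latin-1 byte is converted independently, the filter needs no state between chunks. A filter that decodes a multi-byte encoding would have to buffer incomplete sequences at chunk boundaries.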

Anyway, there's no need to subvert anything here.
If you really want all of io to transparently recode,
that would mean without explicitly stating which charset is
on the other end (involving some guesswork about "system encoding"),
then it still can be done using the usual wrapping techniques.
But I'd always go for an explicit interface, as guesswork is bound to fail.

Ok. I would rather use an explicit encoding everywhere. UTF-8 is just fine with me.

Nothing else would need to be touched.


Ok. But that seems to imply that I have to forgo Lua's string library
altogether and use the above library instead. Is that a drop in
replacement? Just asking :))
In principle yes -- to the extent that any
data used with string is really UTF-8 character data
(and once the mentioned todos are done).
In practice this is looking for trouble, as it would break other
libs using string to cut arbitrary bytes out of something.
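
A small illustration of the byte vs. character distinction, as a sketch assuming Lua 5.1 or later. The helper utf8_len is hypothetical and is not the API of the utf8 library mentioned above; it just counts the bytes that are not UTF-8 continuation bytes.

```lua
-- "café" as UTF-8: 4 characters, 5 bytes.
local s = "caf\195\169"

-- Hypothetical helper: count characters by counting every byte
-- that is not a continuation byte (0x80..0xBF).
local function utf8_len(str)
  local _, n = string.gsub(str, "[^\128-\191]", "")
  return n
end

print(#s)           -- byte length, as the stock string library sees it
print(utf8_len(s))  -- character count

-- Cutting by byte offset can split a character in two:
local cut = string.sub(s, 1, 4)
print(cut == "caf\195")  -- ends in a lone lead byte, no longer valid UTF-8
```

Code that slices binary data relies on exactly these byte semantics, which is why replacing string wholesale with a character-based version would break it.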

So... what does that mean? Most (all?) Lua extensions are built in terms of the default string API, no? Does the entire string library need to be hijacked?
fine, fine TextDrive staff :)

Staff or stuff? :P

And I simply would like to make sure, before it's too late, that I
handle character set encoding properly.
doesn't look like there's too much need to manipulate the texts?

Indexing? Search? Reading and writing to disk? Reading and writing to sockets?

Just would like to know where Lua stands in this mess.
I'd say in the midst of nowhere, which is a fine place to be


Ok. What if I don't want to use my OS encoding...
right. forget about the OS.

but would like to systematically use UTF-8 instead...
what should I do, if anything?
gather data in UTF-8 from the first minute
and save it as is without applying any recoding.

Ok. I can do that. I think :P

Right... but... I simply would like to have my application work in a
more or less portable manner. Any way to do that in Lua?
key to portability is to just not ask the OS to get it.
Then all is fine with Lua, as it won't break anything either.

Well... I need to read this data back from the disk at least.
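
For what it's worth, the disk round trip can be sketched as follows, assuming Lua 5.1 or later. The sample text and the temporary file are my own choices; the point is only that bytes written in binary mode come back exactly as they were stored, with no recoding involved.

```lua
-- "日本語" as UTF-8 bytes (three characters, three bytes each).
local text = "\230\151\165\230\156\172\232\170\158"

-- Temporary file for the demonstration only.
local path = os.tmpname()

-- Write the bytes as-is; "b" avoids newline translation on
-- platforms that distinguish text and binary mode.
local f = assert(io.open(path, "wb"))
f:write(text)
f:close()

-- Read everything back, again in binary mode.
f = assert(io.open(path, "rb"))
local back = f:read("*a")
f:close()
os.remove(path)

print(back == text)
```

Lua strings are plain byte sequences, so nothing here depends on the OS or locale encoding.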

such a document on the fly? Perhaps I would like to generate it
according to whatever accept-language header I have received?
you're going for automatic translation?

No, no. This is a simple application to write textual notes. The notes can be in any language. The client accessing the notes can be in any language. The notes are always displayed in their original format (e.g. UTF-8). The UI itself can change depending on the client language header.

They want to have it back in Japanese. Or Hindi. Or whatever they use.
Or perhaps one document is in Korean. And the next in Sanskrit.
And they both need to be displayed on the same "page". So is life :P
no problem

Ok. This is what I mean:

In the application above, everything is indeed seamlessly handled as far as encoding goes.


PA, Onnay Equitursay