Re: question about Unicode

lua-l archive

Subject: Re: question about Unicode
From: David Jones <drj@...>
Date: Thu, 7 Dec 2006 12:52:49 +0000


On 4 Dec 2006, at 18:36, Roberto Ierusalimschy wrote:

It depends on whether you want to use the encoding specified by the
current locale, or always use UTF-8.  The former is a more general
solution and is probably preferred on Unix; GNU/Linux distributions are
moving toward UTF-8 anyway.  However, it's problematic on Windows;
someone please correct me if I'm wrong, but I believe that UTF-8 is
never (or rarely) the encoding associated with the system locale on
Windows.  So if you always want to use UTF-8, it's probably better to
use a hand-written converter.


This is actually part of my question :) I guess I would prefer to use
the current locale. But I know nothing about other multibyte encodings,
and so I have no idea whether my code would work for them. For
instance, may I assume that any 0 ends the string?


Yes.

What if the
encoding is state dependent?


Yes.  0 still ends the string.
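
As a rough illustration of why this is safe: standard C requires that a
byte with all bits zero is the null character regardless of shift state,
and that such a byte never occurs inside another multibyte character. A
minimal sketch, assuming the string is in the current locale's encoding
(count_mb_chars is a hypothetical helper name, not something from Lua):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: counts the characters in a 0-terminated multibyte
 * string in the current locale's encoding.
 * Returns (size_t)-1 on an invalid or incomplete sequence. */
size_t count_mb_chars(const char *s)
{
    assert(s != NULL);

    mbstate_t st;
    memset(&st, 0, sizeof st);         /* zeroed object = initial shift state */

    /* strlen is safe here: a 0 byte always terminates the string,
     * even in a state-dependent encoding. Include the terminator. */
    size_t remaining = strlen(s) + 1;
    size_t count = 0;

    for (;;) {
        wchar_t wc;
        size_t n = mbrtowc(&wc, s, remaining, &st);
        if (n == 0)
            return count;              /* decoded the terminating null */
        if (n == (size_t)-1 || n == (size_t)-2)
            return (size_t)-1;         /* invalid or incomplete sequence */
        s += n;                        /* n bytes consumed, shifts included */
        remaining -= n;
        count++;
    }
}
```

In the default "C" locale, count_mb_chars("hello") gives 5; results for
non-ASCII text depend on the locale selected with setlocale.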

(It seems a nightmare to handle shift
states when doing backtracking and the like...)


Yes.
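
One way to make backtracking tolerable, sketched here under the
assumption that the matcher decodes with mbrtowc: keep the shift state
next to the byte position and save or restore the two together.
mbstate_t is a non-array object type, so plain struct assignment copies
it. The mb_cursor type and its functions are hypothetical names used
only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical cursor: a byte position plus the shift state there. */
typedef struct {
    const char *pos;    /* next byte to decode */
    const char *end;    /* one past the terminating 0 byte */
    mbstate_t   state;  /* shift state at pos */
} mb_cursor;

mb_cursor mb_cursor_start(const char *s)
{
    assert(s != NULL);
    mb_cursor c;
    c.pos = s;
    c.end = s + strlen(s) + 1;
    memset(&c.state, 0, sizeof c.state);   /* initial shift state */
    return c;
}

/* Decode one character and advance.
 * Returns 1 on success, 0 at the end of the string, -1 on error
 * (after an error the stored state is no longer usable). */
int mb_cursor_next(mb_cursor *c, wchar_t *out)
{
    size_t n = mbrtowc(out, c->pos, (size_t)(c->end - c->pos), &c->state);
    if (n == (size_t)-1 || n == (size_t)-2)
        return -1;
    if (n == 0)
        return 0;
    c->pos += n;
    return 1;
}
```

A backtracking point is then just a copy: `mb_cursor saved = cur;` before
trying an alternative and `cur = saved;` to rewind. This costs one
mbstate_t per saved position, which is part of why it still feels like a
nightmare compared with a stateless encoding.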

Actually dealing with shift-state dependent multi-byte encodings in a
portable way in C makes the infinite horrors of Unicode and UTF-8
seem very attractive.

drj
