I disagree with the statement that UTF-8 is the best option for supporting
Unicode characters. It would be more accurate to say that UTF-8 is a good
choice for systems that cannot (or don't want to) be modified to use 16-bit
values for a character.

To support my point of view, I will quote Mark Davis of IBM, who is also the
president of the Unicode Consortium:

"Ultimately, the choice of which encoding format to use will depend heavily
on the programming environment. For systems that only offer 8-bit strings
currently, but are multi-byte enabled, UTF-8 may be the best choice. For
systems that do not care about storage requirements, UTF-32 may be best. For
systems such as Windows, Java, or ICU that use UTF-16 strings already,
UTF-16 is the obvious choice. Even if they have not yet upgraded to fully
support surrogates, they will be before long. 

If the programming environment is not an issue, UTF-16 is recommended as a
good compromise between elegance, performance, and storage."

(more info about the differences between UTF-16 and UTF-8 on
http://www-106.ibm.com/developerworks/library/utfencodingforms/)

Microsoft's OS APIs are natively Unicode, as they use the wchar_t type to
represent characters. wchar_t directly supports Unicode through UCS-2 and
UTF-16. The Win32 functions currently use UCS-2 internally because this is
more convenient from a programming point of view (i.e., characters are fixed
length).

I think that the statement "storing data on file with anything other than
UTF-8 would IMHO be a mistake" only holds if you're thinking of storing text
in English or other Latin-script languages. For Japanese and Chinese, UTF-8
takes 3 bytes per character, which is more than the 2 bytes that UTF-16 and
UCS-2 use.
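
To make the size comparison concrete, here is a small sketch in Python
(chosen only for illustration). The sample strings and the helper name
encoded_sizes are my own assumptions; the byte counts come from Python's
standard codecs.

```python
# Sample strings are illustrative only; any text in these scripts would do.
samples = {
    "English": "Hello",
    "Japanese": "こんにちは",
    "Chinese": "你好世界",
}

def encoded_sizes(text):
    """Return (code points, UTF-8 bytes, UTF-16 bytes without a BOM)."""
    return (
        len(text),
        len(text.encode("utf-8")),
        len(text.encode("utf-16-le")),  # the "-le" codec does not emit a BOM
    )

for name, text in samples.items():
    chars, utf8, utf16 = encoded_sizes(text)
    print(f"{name}: {chars} chars, UTF-8 {utf8} bytes, UTF-16 {utf16} bytes")
```

For these samples the English text is half the size in UTF-8, while the
Japanese and Chinese text is 50% larger in UTF-8 than in UTF-16.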

It makes a lot of sense for an operating system to prefer a fixed-length
scheme, because variable-length schemes introduce several issues, such as
the performance impact and the changes required to low-level algorithms
(that rely on a known character size). This also seems to be the case for
other classes of systems, e.g. DB servers: MS SQL Server 2000 uses UCS-2
internally, and this approach provides many advantages for a db system. A
non-MS reference is:
http://www-106.ibm.com/developerworks/library/unicode-db-process/index.html
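
As a rough illustration of the "known character size" point, the sketch
below (Python, with hypothetical helper names of my own) finds where the
n-th character starts. With a fixed 2-byte encoding this is a
multiplication; with UTF-8 the bytes have to be scanned. It assumes the
input is valid UTF-8.

```python
def utf8_offset_of(data: bytes, index: int) -> int:
    """Byte offset of the index-th code point in valid UTF-8 (linear scan)."""
    seen = 0
    for offset, byte in enumerate(data):
        # Continuation bytes look like 10xxxxxx; any other byte starts a character.
        if (byte & 0xC0) != 0x80:
            if seen == index:
                return offset
            seen += 1
    raise IndexError("code point index out of range")

def ucs2_offset_of(index: int) -> int:
    """Byte offset of the index-th character in UCS-2 (constant time)."""
    return index * 2

data = "aé日b".encode("utf-8")  # 1 + 2 + 3 + 1 bytes
print(utf8_offset_of(data, 2))  # start of the third character, found by scanning
print(ucs2_offset_of(2))        # start of the third character, computed directly
```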

I think it would be better for Lua to support Unicode via wchar_t if all the
target underlying systems could support this. Because this is not the case,
the use of UTF-8 sounds like a reasonable approach. 

Thanks,
-- Anna



-----Original Message-----
From: Jean-Claude Wippler [mailto:jcw@equi4.com]
Sent: Tuesday, February 20, 2001 8:57 AM
To: Multiple recipients of list
Subject: Re: Windows CE 


Roberto Ierusalimschy <roberto@inf.puc-rio.br> wrote:

[UTF-8]
>But, then, my other question: what is the relationship between Windows CE
>and Unicode? Why did everybody that tried to port Lua to Windows CE come up
>with this subject? Why can't they just use this approach (UTF-8)?
>(this is pure ignorance on my part; I know nothing about Windows CE...)

The machine is based on wchar_t as the way to pass strings in and out, so
people tend to think it needs to be that way inside their code as well.
IMO, this is not the case - it's a conversion issue just like converting
numbers to/from printable form is.  Or closer to home: just like Lua
converts everything back and forth from doubles when it needs to
interface with things outside it.  If conversions are used only for
information going to/from the user interface, and things like file names,
then they need not become a bottleneck.  As I said before, storing data
on file with anything other than UTF-8 would IMHO be a mistake.
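
A minimal sketch of that boundary-conversion idea, in Python for brevity.
The names to_wide and from_wide are hypothetical, and UTF-16LE is assumed
as the layout a wchar_t-based API would expect; text stays UTF-8 everywhere
else.

```python
def to_wide(utf8_bytes: bytes) -> bytes:
    """Convert internal UTF-8 to UTF-16LE just before calling a wide-char API."""
    return utf8_bytes.decode("utf-8").encode("utf-16-le")

def from_wide(wide_bytes: bytes) -> bytes:
    """Convert UTF-16LE received from a wide-char API back to internal UTF-8."""
    return wide_bytes.decode("utf-16-le").encode("utf-8")

internal = "report-日本.txt".encode("utf-8")  # e.g. a file name held as UTF-8
wide = to_wide(internal)                      # only done at the OS boundary
assert from_wide(wide) == internal            # the round trip loses nothing
```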

I'd say that if WinCE is considered the main universe, then wchar_t makes
sense, but in a broader perspective it does less so.  The choice of
encoding things as 16-bit shorts already causes trouble with >65k char
codes.  UTF-8 is compact, portable, endian-neutral, and capable of
storing unlimited char sets.  It's the equivalent of people writing words
by stringing characters together.
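
To illustrate the >65k and endianness points, here is a small Python sketch.
The helper utf16_code_units and the choice of U+1D11E as the sample
character are assumptions made for the example.

```python
def utf16_code_units(text: str) -> list:
    """Split text into its 16-bit UTF-16 code units."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]

clef = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, a code point above U+FFFF

# One character needs two 16-bit units (a surrogate pair): D834 DD1E.
print([f"{unit:04X}" for unit in utf16_code_units(clef)])

# UTF-16 has two byte orders, so the bytes on disk differ between them...
print(clef.encode("utf-16-be").hex())
print(clef.encode("utf-16-le").hex())

# ...while UTF-8 is a plain byte sequence with a single layout.
print(clef.encode("utf-8").hex())
```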

-jcw