lua-l archive


These are my opinions, but they are the result of lots of time
working with these issues.

0. is very much worth reading.

1.  You should give up on trying to write an identifier name
in one character set in one file and then refer to it using
a different character set in another source file.
That's asking far too much.

2.  The Plan 9 compilers (the first compilers to support UTF-8)
have had good success with just treating any byte in the
range 0x80-0xFF as a valid identifier character, just as was
proposed in this thread.  The beauty of this is that it is
locale-independent and works for any locale preserving
7-bit ASCII.  If other people want to write all their source files
in Latin N, that's okay too.  The compiler won't care.
(Ditto if the Unicode consortium decides to add even more
code points.)
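That rule can be sketched in a few lines of C.  The function
names and the exact ASCII classes below are my own illustration,
not taken from the Plan 9 sources:

```c
#include <stddef.h>

/* Plan 9 style identifier scanning: a byte is an identifier
 * character if it is an ASCII letter, digit, underscore, or
 * any byte in 0x80-0xFF (i.e. part of some multibyte UTF-8
 * sequence -- the lexer never needs to decode it). */
static int is_ident_byte(unsigned char c)
{
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
        || (c >= '0' && c <= '9') || c == '_' || c >= 0x80;
}

/* Scan one identifier starting at s; return its length in bytes. */
static size_t scan_ident(const char *s)
{
    size_t n = 0;
    while (is_ident_byte((unsigned char)s[n]))
        n++;
    return n;
}
```

Note that the lexer stays locale-independent: it accepts UTF-8,
Latin-N, or any other ASCII-preserving encoding without caring
which one it is.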

3.  The only sane external format is UTF-8.  Otherwise you
have to worry about byte order and also getting lost midway
through an encoding.  Just use UTF-8.  Thankfully this is rather
hard *not* to do, except on Windows, and Windows is slowly
getting better.
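Part of why "getting lost midway" is a non-problem with UTF-8 is
that continuation bytes are self-marking (10xxxxxx), so a reader
dropped into the middle of a character can always skip forward to
the next character boundary.  A minimal sketch (the helper name
is mine):

```c
/* UTF-8 continuation bytes all match 10xxxxxx, so a decoder that
 * lands mid-character can resynchronize by skipping forward to
 * the next lead byte (or ASCII byte).  Illustrative helper. */
static const char *utf8_resync(const char *p)
{
    while (((unsigned char)*p & 0xC0) == 0x80)  /* continuation byte */
        p++;
    return p;
}
```

There is also no byte-order question at all, since UTF-8 is a
byte stream, not a sequence of wider units.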

4.  Applications that need to treat strings as just opaque
identifiers that can be copied and perhaps compared but not
otherwise broken down should use UTF-8 internally as char* strings.
Functions like strcmp, strchr (for ASCII characters), etc. just work
(as long as your tools are 8-bit safe, which they all are now).
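To sketch why this works: every byte of a multibyte UTF-8
sequence is in 0x80-0xFF, so searching for an ASCII byte can
never match in the middle of a character, and byte-wise
comparison of UTF-8 strings matches code-point order.  The helper
name below is illustrative, not a standard API:

```c
#include <string.h>

/* Find the final path component of a UTF-8 path.  Safe because
 * the ASCII byte '/' cannot occur inside a multibyte sequence,
 * so strrchr never splits a character. */
static const char *basename_utf8(const char *path)
{
    const char *slash = strrchr(path, '/');
    return slash ? slash + 1 : path;
}
```

Likewise strcmp("caf\xc3\xa9", "caf\xc3\xaa") is negative, in
agreement with U+00E9 < U+00EA.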

5.  Applications whose job is text processing typically find it
easier to work with internal arrays of characters rather than
UTF-8 (but they should still read and write UTF-8 externally!).
The exact data type you use to hold your character values is up
to your application.  A 16-bit integer (if you don't care about
the code points beyond U+FFFF), a 32-bit integer, and even
double-precision floating point (if you use Lua) are all
perfectly fine, with 16-bit being perhaps somewhat less than
ideal (now that Unicode has bloated some) but still more
efficient.
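A minimal sketch of the read-UTF-8-in, work-on-code-points
pattern, assuming 32-bit integers for the internal array.  This
toy decoder skips the validation (overlong forms, surrogates,
truncated input) that a real one would need, and the names are
mine:

```c
#include <stddef.h>
#include <stdint.h>

/* Decode a NUL-terminated UTF-8 string into an array of 32-bit
 * code points.  Returns the number of code points written,
 * at most max.  No validation -- illustration only. */
static size_t utf8_decode(const char *s, uint32_t *out, size_t max)
{
    const unsigned char *p = (const unsigned char *)s;
    size_t n = 0;
    while (*p && n < max) {
        uint32_t cp;
        int extra;
        if (*p < 0x80)                { cp = *p;        extra = 0; }
        else if ((*p & 0xE0) == 0xC0) { cp = *p & 0x1F; extra = 1; }
        else if ((*p & 0xF0) == 0xE0) { cp = *p & 0x0F; extra = 2; }
        else                          { cp = *p & 0x07; extra = 3; }
        p++;
        while (extra-- > 0 && (*p & 0xC0) == 0x80)
            cp = (cp << 6) | (*p++ & 0x3F);
        out[n++] = cp;
    }
    return n;
}
```

Once decoded, indexing, slicing, and case tables all operate on
fixed-width values, and the UTF-8 details stay at the edges of
the program.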