Re: question about Unicode

lua-l archive


Subject: Re: question about Unicode
From: David Given <dg@...>
Date: Fri, 08 Dec 2006 00:25:13 +0000

Russ Cox wrote:
> These are my opinions, but they are the result of lots of time
> working with these issues.

Eek! I didn't mean to start such a debate... I appear to have struck a nerve!

[...]
> 1.  You should give up on trying to write an identifier name
> in one character set in one file and referring to it using
> a different character set in another source file.

I think this is reasonable. It fits the Lua philosophy to declare a simple
mechanism (the one I proposed) that *allows* source files to be in whatever
ASCII-compatible encoding you like, but doesn't require one. This allows users
to use UTF-8 or Latin-1 or EUC-JP or whatever they want. It doesn't solve
the issue of what happens if the user wants to do something complicated, like
mix encodings --- I think it's fair to require the user to think first when
doing that. If all else fails, it's easy enough to just run your source
through iconv first.
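
To make the idea concrete, here is a minimal sketch in Python rather than in the C of the real Lua lexer. It assumes the mechanism amounts to "treat every byte with the top bit set as an identifier character"; the function names are made up for illustration and are not part of Lua.

```python
def is_ident_start(b: int) -> bool:
    """ASCII letter, underscore, or any byte with the top bit set."""
    return b == 0x5F or 0x41 <= b <= 0x5A or 0x61 <= b <= 0x7A or b >= 0x80


def is_ident_char(b: int) -> bool:
    """Same as is_ident_start, plus ASCII digits."""
    return is_ident_start(b) or 0x30 <= b <= 0x39


def scan_identifier(src: bytes, pos: int = 0) -> bytes:
    """Return the identifier starting at pos, or b"" if there is none.

    The scanner never decodes anything: it only looks at individual bytes,
    so it neither knows nor cares which encoding the file is in.
    """
    if pos >= len(src) or not is_ident_start(src[pos]):
        return b""
    end = pos + 1
    while end < len(src) and is_ident_char(src[end]):
        end += 1
    return src[pos:end]


# The same source text in two encodings gives two different byte strings,
# but each one scans as a single identifier.
utf8_name = scan_identifier("naïve = 1".encode("utf-8"))
latin1_name = scan_identifier("naïve = 1".encode("latin-1"))
```

Here `utf8_name` and `latin1_name` are different byte strings, which is exactly why mixing encodings between files needs a conversion step such as iconv first.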

It also doesn't solve the normalisation problem, which is potentially quite
serious, but I don't think that's solvable without introducing UTF-8 specific
behaviour.
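
For anyone who hasn't run into the normalisation problem, a small Python sketch of it follows. The helper names are hypothetical; `unicodedata.normalize` is the standard library call. Two spellings of the same visible name compare unequal as bytes, and only Unicode-aware normalisation makes them match.

```python
import unicodedata

# "café" written two ways: precomposed é (U+00E9) versus
# plain e followed by a combining acute accent (U+0301).
composed = "caf\u00e9"
decomposed = "cafe\u0301"


def same_bytes(a: str, b: str) -> bool:
    """What a byte-oriented lexer sees: a plain comparison of UTF-8 bytes."""
    return a.encode("utf-8") == b.encode("utf-8")


def same_after_nfc(a: str, b: str) -> bool:
    """What a Unicode-aware lexer could do: normalise to NFC, then compare."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


byte_match = same_bytes(composed, decomposed)      # False
nfc_match = same_after_nfc(composed, decomposed)   # True
```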

It will also fail on any encoding that uses low-bit characters as part of an
extended sequence. If there's an encoding that uses <high> <low1> <low2> as
part of a single character, then <low1> and <low2> may potentially confuse the
parser. This scheme would only work on encodings where *all* bytes of an
extended character have the top bit set. That includes UTF-8 and EUC-JP, but
not Shift-JIS, whose second byte can fall in the ASCII range (0x40-0x7E).
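
The top-bit condition is easy to test for a given encoding. The sketch below uses Python's built-in codecs; the helper name is hypothetical. It is worth noting that Shift-JIS does not pass for every character: the katakana ソ encodes as 0x83 0x5C, and 0x5C is an ASCII backslash.

```python
def all_extended_bytes_high(text: str, encoding: str) -> bool:
    """True if every multi-byte character in text encodes to bytes >= 0x80.

    Single-byte (plain ASCII) characters are ignored, since those are
    supposed to be visible to the parser.
    """
    for ch in text:
        encoded = ch.encode(encoding)
        if len(encoded) > 1 and any(b < 0x80 for b in encoded):
            return False
    return True


sample = "ソ"  # KATAKANA LETTER SO, U+30BD

utf8_ok = all_extended_bytes_high(sample, "utf-8")      # True
eucjp_ok = all_extended_bytes_high(sample, "euc_jp")    # True
sjis_ok = all_extended_bytes_high(sample, "shift_jis")  # False
```

A byte-oriented lexer reading that Shift-JIS sequence inside a string literal would treat the trailing 0x5C as the start of an escape sequence.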

And it doesn't make the string library support anything other than ASCII, but
then I don't think it's the default string library's *job* to do that.

I agree with everything else you say, BTW, except that I usually like
processing strings as UTF-8. It's slower than UTF-16, but it does force you to
get it right in order to get it done at all, it's much less memory-hungry
(particularly for western languages), and in most cases it's fast enough.
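
As a rough illustration of the memory point, this Python sketch compares encoded sizes for two sample strings. The helper name and the samples are my own choice, and it measures only the encoded payload, not any per-string overhead.

```python
def encoded_sizes(text: str) -> dict:
    """Byte counts for the same text in UTF-8 and UTF-16 (no byte order mark)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
    }


western = encoded_sizes("Hello, world")  # {'utf-8': 12, 'utf-16': 24}
japanese = encoded_sizes("日本語")        # {'utf-8': 9, 'utf-16': 6}
```

ASCII-heavy text takes half the space in UTF-8, while CJK text is about 50% larger, which is why the saving mostly applies to western languages.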

...

BTW, if you want to see true madness, check out the other UTF forms. UTF-7 is
bizarre enough. There were plans for UTF-5 for legacy teletype systems and
radio (it's compatible with Baudot code). And as for UTF-EBCDIC...

-- 
╭─┈David Given┈──McQ─╮ "...electrons, nuclei and other particles are good
│┈ dg@cowlark.com┈┈┈┈│ approximations to perfectly elastic spherical
│┈(dg@tao-group.com)┈│ cows." --- David M. Palmer on r.a.sf.c
╰─┈www.cowlark.com┈──╯

