Unicode (long, semi-rant)

lua-l archive


Subject: Unicode (long, semi-rant)
From: RLake@...
Date: Fri, 24 May 2002 16:07:37 -0500

Curt Carpenter said:

> I don't understand your comment about concatenating unicode strings.
> What's wrong with wcscat?

Nothing if you don't mind buffer-overflow errors :-)

However, wide character support is one thing and Unicode is another.

A wide character (as opposed to a multi-byte character sequence) is a
fixed-length representation of characters from a large character set. The
type of wide character string supported by wcscat is null(0)-terminated.
This is *not* a Unicode string, although it may be a representation of a
sequence of characters from the Unicode character set. It's not a Unicode
string because:

-- certain codes are not valid Unicode characters. There are quite a number
of these, actually, scattered around the Unicode code space. These are not
unassigned characters (there are lots of those); rather, they are codes
which *cannot* appear in any Unicode string.

-- the full Unicode code-space requires a little more than 20 bits to
represent. A given implementation might only use 16 bits, in which case it
could only represent a subset of Unicode strings. Or it might use 24 or
even 32 bits, in which case the vast majority of codes would be invalid
(the code space ends at 0x10FFFF, and since the last two codes are among
the invalid ones, the largest possible Unicode character is 0x10FFFD).

-- Unicode strings are *not* null-terminated.
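
To make the three points above concrete, here is a minimal sketch in Python
(my choice for illustration; nothing in this thread uses it). The helper
name is_valid_character is hypothetical. It rejects the codes that Unicode
sets aside as surrogates or noncharacters, and it deliberately treats
unassigned codes as valid, since those are a different matter.

```python
MAX_CODE_POINT = 0x10FFFF  # top of the Unicode code space


def is_valid_character(code: int) -> bool:
    """Hypothetical helper: may `code` appear in a Unicode string?"""
    if not 0 <= code <= MAX_CODE_POINT:
        return False  # outside the code space entirely
    if 0xD800 <= code <= 0xDFFF:
        return False  # surrogate codes, reserved for UTF-16
    if 0xFDD0 <= code <= 0xFDEF:
        return False  # a contiguous block of noncharacters
    if (code & 0xFFFE) == 0xFFFE:
        return False  # the last two codes of every plane
    return True


# The code space needs 21 bits, so 16 is too few and 32 mostly invalid.
bits_needed = MAX_CODE_POINT.bit_length()

# U+0000 is an ordinary character, so it cannot serve as a terminator.
text_with_nul = "a\0b"
```

Under these assumptions is_valid_character(0x10FFFD) is true while
is_valid_character(0x10FFFF) and is_valid_character(0xD800) are false, and
text_with_nul has three characters rather than one.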

Unicode defines four types of normalisation -- you can find the details in
the Unicode book (or on the internet). Normalisation is important (and
normative). For example, there are at least two ways of representing the
glyph ñ: first, there is a code for it; second, you could represent it as
an "n" followed by a combining tilde. (That's not the same as the ~
character, which is not a combining character.) Glyphs are composed from a
base character (normally, but see below) and any number of combining
characters. Some of the combining characters are independent of each other.
It's going to be hard to illustrate this with ANSI mail, but I'll try to
give you the idea: consider the cedilla in ç (the little tail thing) and
the acute accent in á. Both the cedilla and the acute accent have Unicode
representations as combining characters. Now suppose there were a language
in which a vowel could appear with an optional accent above and an optional
cedilla below. (For example, Gwich'in, which is spoken in northern Canada.)
I could code this character in three ways: [á, cedilla] [a, ´, cedilla] [a,
cedilla, ´]. However, they would appear identically in any output. (That's
a rule.) Normalisation provides a specific canonical representation.
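
The three codings can be tried out with a short sketch, assuming Python's
standard unicodedata module (an assumption of mine, not something from this
thread). The names ACUTE, CEDILLA and spellings are made up for the
illustration.

```python
import unicodedata

ACUTE = "\u0301"    # COMBINING ACUTE ACCENT
CEDILLA = "\u0327"  # COMBINING CEDILLA

# The three codings of "a with acute above and cedilla below".
spellings = [
    "\u00e1" + CEDILLA,     # [á, cedilla]
    "a" + ACUTE + CEDILLA,  # [a, ´, cedilla]
    "a" + CEDILLA + ACUTE,  # [a, cedilla, ´]
]

# Normalising every spelling collapses them to a single representation.
composed = {unicodedata.normalize("NFC", s) for s in spellings}
decomposed = {unicodedata.normalize("NFD", s) for s in spellings}

# The same holds for the two ways of writing ñ.
n_tilde = unicodedata.normalize("NFC", "n\u0303")
```

The three spellings are distinct as raw sequences, but composed and
decomposed each end up holding exactly one string, and n_tilde is the
single code for ñ.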

It's a bit more complex than that, because there are four normalisation
forms and hence four canonical representations, although in the example
above they yield only two distinct results. To compare strings, you have to
normalise both strings according to one of the normalisation forms, and
then compare the normalised values. A Unicode implementation must be able
to deal with this, but there are obvious reasons why I would not want that
to happen for arbitrary octet-sequences (and not just because of the
overhead).
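
As a rough sketch of what normalise-then-compare looks like, again assuming
Python's unicodedata module: the function name unicode_equal is
hypothetical, and the choice of NFC as the default form is arbitrary.

```python
import unicodedata


def unicode_equal(a: str, b: str, form: str = "NFC") -> bool:
    """Compare two strings after normalising both to the same form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


precomposed = "\u00f1"   # ñ as a single code
combining = "n\u0303"    # n followed by a combining tilde

same_raw = precomposed == combining                # plain comparison
same_text = unicode_equal(precomposed, combining)  # normalised comparison

# The compatibility forms go further: they fold the "fi" ligature into
# the two letters, which the canonical forms do not.
ligature_nfc = unicode_equal("\ufb01", "fi", "NFC")
ligature_nfkc = unicode_equal("\ufb01", "fi", "NFKC")

# Octet sequences are compared as they are, with no normalisation step.
same_octets = precomposed.encode("utf-8") == combining.encode("utf-8")
```

Here same_raw and same_octets are false while same_text is true, which is
the behaviour one wants from text but not from arbitrary octet-sequences.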

Now, the comment about concatenation not being closed. What I said was that
normalised Unicode strings are not closed under concatenation. Here's one
example.

It is not illegal to start a Unicode string with a combining character. (I
think it should be but I'm not the Unicode camelxxxxcommittee.) Such a
string is to be interpreted as though the combining character had a (normal
variable-width breaking) space as its base character. So a string might be
[´, b, a, r] which is normalised (in all normalisation forms). (The
interpretation rule does not imply a normalisation.) Another string might
be [f, o, o]. That is also normalised in all normalisation forms. The
sequence [f, o, o, ´, b, a, r] is not normalised in every form: the
composing forms would replace [o, ´] with ó. So the concatenation might not
produce a normalised string; hence,
my claim that normalised Unicode strings are not closed under
concatenation. (One might also ask whether the concatenation ought to be
considered as foo´bar or foóbar, but that's another issue.)
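
The foo/bar example can be checked with a small sketch, assuming Python's
unicodedata module. The helper is_normalised is hypothetical; it simply
tests whether normalising a string leaves it unchanged.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def is_normalised(s: str, form: str) -> bool:
    """True if normalising `s` to `form` leaves it unchanged."""
    return unicodedata.normalize(form, s) == s


head = "foo"
tail = "\u0301bar"    # starts with COMBINING ACUTE ACCENT
joined = head + tail  # plain concatenation of the code sequences

# Which forms still hold after concatenation?
status = {form: is_normalised(joined, form) for form in FORMS}
recomposed = unicodedata.normalize("NFC", joined)
```

Both head and tail pass is_normalised for all four forms, yet status comes
out false for NFC and NFKC, because recomposed is "foóbar" with the
precomposed ó. In this particular example the decomposed forms happen to
survive the concatenation.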

There is quite a bit more to this stuff, like the bizarre rules on Jamo
characters and the issues around certain scripts in which combining
characters effectively invert character order but not always. And then
there are the two ways of doing bidirectionality (old and new). I could go
on, but that would be getting into my Unicode rant.

I don't actually object to Unicode. I think it is probably a step forward.
It would be a good thing if every living language could be dealt with by
computer programs, and I applaud the initiatives of the ISO and the
Unicoders, even if I think they produced a bit of a monster.

But what becomes crystal clear when you look at Unicode is that a Unicode
string is *not* just a variable-length array. So Unicode support by all
means -- but not as a replacement for the octet sequences.

Rici
