- Subject: Re: unicode support in lua
- From: David Given <dg@...>
- Date: Thu, 26 Apr 2007 14:51:38 +0100
David Jones wrote:
> When I implemented Lua in Java, strings were implemented using
> java.lang.String (so using Java's 16-bit unsigned char type). I took
> a similar position, string.byte returned an integer between 0 and 65535.
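For contrast, in stock (C) Lua strings are plain byte strings, so `string.byte` yields values 0-255 no matter what encoding the bytes happen to be in. A minimal sketch (the literal below is the two-byte UTF-8 encoding of U+00E9, "é"):

```lua
-- In stock Lua, strings are raw byte strings: string.byte returns
-- the individual bytes (0-255), with no notion of encoding.
local s = "\195\169"          -- UTF-8 bytes of U+00E9 ("é")
local a, b = string.byte(s, 1, 2)
print(a, b)                   -- 195  169: two bytes, not one "character"
```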
The problem is that Unicode is *hard*, and a Unicode string is a fundamentally
different kind of thing from a byte string. You can consider a byte string to be
an array of bytes; you can't consider a Unicode string to be an array of
characters.
The biggest issue is that there's nothing you can really point at in a Unicode
string and say, "that is a character". You may be talking about a single code
point, or a single code point expressed as two surrogates, or a grapheme
cluster expressed as a group of code points; and the same character may be
validly represented by several different combinations of code points.
So I think that just saying I-want-to-be-able-to-access-the-fifth-character is
a fairly doomed thing to want to do.
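The several-representations problem is easy to demonstrate. In modern Lua (5.3+, which has `\u{}` escapes and a `utf8` library; an assumption, since the original discussion predates it), the precomposed and decomposed spellings of "é" differ at every level Lua can observe, even though they denote the same character:

```lua
-- Two valid encodings of the same user-perceived character "é":
local precomposed = "\u{00E9}"          -- one code point: é
local decomposed  = "\u{0065}\u{0301}"  -- two code points: e + combining acute

print(#precomposed, #decomposed)                    -- UTF-8 bytes: 2  3
print(utf8.len(precomposed), utf8.len(decomposed))  -- code points: 1  2
print(precomposed == decomposed)                    -- false
```

So "the fifth character" depends entirely on which of these layers you count in, and on how the string happens to be normalized.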
I suspect that the Natural(TM) way to deal with Unicode strings is to ditch
the concept of character offsets completely, and deal instead with high-level
concepts. So instead of using string slicing to parse a string, you use
regular expressions and captures.
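In Lua terms that means patterns and captures rather than fixed offsets. A minimal sketch of the difference (the date string here is just an illustrative example):

```lua
local date = "2007-04-26"

-- Positional slicing: depends on characters living at fixed offsets.
local year_by_offset = string.sub(date, 1, 4)

-- Pattern captures: describe the shape of the data, let the
-- matcher find the parts; no character offsets involved.
local y, m, d = string.match(date, "(%d+)-(%d+)-(%d+)")
print(y, m, d)   -- 2007  04  26
```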
Of course, I don't know whether this actually works in real life --- in most
of the work I do with strings, I'm only interested in the ASCII subset...
┌── ｄｇ＠ｃｏｗｌａｒｋ．ｃｏｍ ─── http://www.cowlark.com ───────────────────
│ "Feminism encourages women to leave their husbands, kill their children,
│ practice witchcraft, destroy capitalism and become lesbians." --- Rev. Pat