David Jones wrote:
[...]
> When I implemented Lua in Java, strings were implemented using  
> java.lang.String (so using Java's 16-bit unsigned char type).  I took  
> a similar position, string.byte returned an integer between 0 and 65535.

The problem is that Unicode is *hard*, and a Unicode string is a fundamentally
different kind of thing from a byte string. You can consider a byte string to
be an array of bytes; you can't consider a Unicode string to be an array of
anything sensible.

The biggest issue is that there's nothing you can really point at in a Unicode
string and say: that is a character. You may be talking about a single code
point, or a single code point expressed as two surrogates, or a grapheme
cluster expressed as a group of code points; and it's possible that the same
character can be validly represented by several different combinations of
code points. So I think that just saying
I-want-to-be-able-to-access-the-fifth-character is a fairly doomed thing to
want to do.
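
For example (a rough sketch, assuming strings hold UTF-8 bytes as they do in
stock Lua), the character "é" can be spelled either as the precomposed code
point U+00E9 or as "e" followed by the combining acute accent U+0301; the two
render identically but are different byte sequences:

  -- precomposed form: U+00E9, which UTF-8 encodes as the bytes 0xC3 0xA9
  local precomposed = "\195\169"
  -- decomposed form: "e" followed by U+0301 (UTF-8 bytes 0xCC 0x81)
  local decomposed = "e\204\129"
  print(#precomposed, #decomposed)    --> 2       3
  print(precomposed == decomposed)    --> false, yet both display as "é"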

I suspect that the Natural(TM) way to deal with Unicode strings is to ditch
the concept of character offsets completely, and deal instead with high-level
concepts. So instead of using string slicing to parse a string, you use
regular expressions and captures.
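
Something like this (a quick sketch using stock Lua 5.1 string.match; the
captures are still plain byte strings) pulls a key and a value out of a line
without ever counting characters:

  local line = "name = José"
  -- the captures do the slicing; no character offsets involved
  local key, value = line:match("^(%w+)%s*=%s*(.*)$")
  print(key, value)    --> name    José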

Of course, I don't know whether this actually works in real life --- in most
of the work I do with strings, I'm only interested in the ASCII subset...

--
┌── dg@cowlark.com ─── http://www.cowlark.com ───────────────────
│ "Feminism encourages women to leave their husbands, kill their children,
│ practice witchcraft, destroy capitalism and become lesbians." --- Rev. Pat
│ Robertson