Subject: Re: Features you would like to see
From: David Given <dg@...>
Date: Mon, 20 Aug 2007 01:11:27 +0100
Duck wrote:
[...]
> At the moment, the ANSI-based IO lib uses FILE* objects for files, and
> uses fseek() to navigate around in them. But fseek() takes a 'long'
> offset, not an 'off_t' one, so unless your longs are (or are forced to)
> 64 bits throughout, file sizes are limited to 31 bits.
ANSI C has fsetpos/fgetpos, which use opaque fpos_t objects to store
positions, but unfortunately they're not very useful, as you can't turn an
fpos_t into an integer or back again. And fseeko/ftello come from POSIX
rather than ANSI; as far as I can tell they're not in the new ANSI standard
either.
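For what it's worth, the portable ANSI version looks something like this
(an untested sketch; the function name is just for illustration). It also
shows the limitation: all you can do with an fpos_t is save it and hand it
back.

    #include <stdio.h>

    /* Save the current file position, do some work, then return to it.
       fpos_t is opaque: there's no portable way to turn one into a byte
       offset, compare two of them, or do arithmetic on one. */
    int save_and_restore(FILE *f)
    {
        fpos_t pos;
        if (fgetpos(f, &pos) != 0)      /* remember where we are */
            return -1;
        /* ... read or write something ... */
        if (fsetpos(f, &pos) != 0)      /* jump back */
            return -1;
        return 0;
    }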
[...]
> Go west from, say, California for a few thousand kilometres. See how
> your 8-bit chars burst at the seams as you travel :-)
Well, I'm about 75% of the way through writing a Unicode-aware word processor,
from scratch, in almost pure Lua, using standard Lua strings, and they're
doing just fine, thanks very much!
The thing is, switching to 16-bit or 32-bit characters doesn't help. Really.
It looks like it does, but it doesn't. The problem is that in *any*
Unicode-based encoding, you can't necessarily express a character as a single
value. Even the concept 'character' is extraordinarily fuzzy and ill-defined.
A concrete example: surrogates. Surrogates allow UTF-16 to represent a Unicode
code point that won't fit in a 16-bit word. The code point in question is
represented as two UTF-16 values. If your application's not aware that these
two values represent the *same* code point, all kinds of nasty things can
happen, including mangling your data (if you split the string between those
two values, you'll end up with two invalid UTF-16 strings and will have lost
the character in question).
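The arithmetic is worth seeing once; here's a sketch in C (the code point
is just an example):

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        /* U+1D11E (MUSICAL SYMBOL G CLEF) won't fit in 16 bits. */
        uint32_t cp   = 0x1D11E;
        uint32_t v    = cp - 0x10000;          /* 20 payload bits */
        uint16_t high = 0xD800 | (v >> 10);    /* high surrogate  */
        uint16_t low  = 0xDC00 | (v & 0x3FF);  /* low surrogate   */
        printf("U+%lX -> %04X %04X\n",
               (unsigned long)cp, (unsigned)high, (unsigned)low);
        /* Prints: U+1D11E -> D834 DD1E. Split a string between the two
           and each half is malformed UTF-16 on its own. */
        return 0;
    }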
Even UCS-4 doesn't help --- combining characters suffer from much the same
problem. ('é' may be a single code point, or an 'e' followed by a combining
acute accent; split the string between those two and you've mangled the text.)
Half the Windows and Java programs out there break in this way, simply because
they use UTF-16 throughout, which makes people believe this sort of thing
isn't a problem.
If you actually want to handle Unicode *properly*, you need to be prepared to
handle variable-length characters in your strings... which means you might as
well do everything in UTF-8, which is nicely compatible with ASCII and compact
to boot. (Lua is perfectly happy with UTF-8.)
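To make 'variable-length' concrete, here's a minimal sketch in C of the
decoding step (a made-up helper with no validation; a real decoder must
reject overlong, truncated and out-of-range sequences):

    #include <stdint.h>

    /* Decode one code point from UTF-8 and advance *s past it. Each
       character occupies one to four bytes; the first byte's high bits
       say how many continuation bytes follow. Illustration only. */
    static uint32_t utf8_decode(const unsigned char **s)
    {
        static const uint32_t mask[] = { 0x7F, 0x1F, 0x0F, 0x07 };
        uint32_t c = *(*s)++;
        int cont = (c >= 0xF0) ? 3 : (c >= 0xE0) ? 2 :
                   (c >= 0xC0) ? 1 : 0;
        uint32_t cp = c & mask[cont];
        while (cont--)
            cp = (cp << 6) | (*(*s)++ & 0x3F);
        return cp;
    }

The nice property is that any byte tells you on its own whether it starts a
character (below 0x80 or at least 0xC0), so code that just passes bytes
through untouched --- like Lua's string library --- never corrupts anything.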
(However, it *would* be nice to have a drop-in Lua regex replacement that
understood UTF-8; but I appreciate that it's a huge step up in complexity.)
[...]
> So just as Lua proudly provides integers which exceed 32 bits of
> precision on all sorts of platforms (by standardising on 'double' as a
> number type), it ought these days to have a standardised alternative
> which can exceed 64 bits of precision.
This is easy: simply redefine lua_Number to be whatever type you like. It's
only double by default. Redefining it as long double should get you your 64
bits of precision, at least where long double is the x87 80-bit format (whose
mantissa is 64 bits); beware that on some compilers long double is just double.
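For Lua 5.1 that means editing luaconf.h; roughly like this (a sketch from
memory, so check against your copy --- the companion macros that assume
plain double, such as LUA_NUMBER_DOUBLE, also need attention):

    /* In luaconf.h, replace the stock number-type definitions: */
    #define LUA_NUMBER          long double
    #define LUA_NUMBER_SCAN     "%Lf"        /* for scanf           */
    #define LUA_NUMBER_FMT      "%.19Lg"     /* for tostring/format */
    #define lua_str2number(s,p) strtold((s), (p))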
--
┌── dg@cowlark.com ─── http://www.cowlark.com ───────────────────
│
│ "There does not now, nor will there ever, exist a programming language in
│ which it is the least bit hard to write bad programs." --- Flon's Axiom