lua-l archive

(note: I'm not recommending this as a processing strategy, but
exploring what can be done with a mechanism of memoized type
assertions on strings. this is lua, we give you the rope, you go do
your own thing.)

On Feb 10, 2012 8:00 PM, "William Ahern" <william@25thandclement.com> wrote:
>
> On Sat, Feb 11, 2012 at 12:20:42AM +0000, David Given wrote:
> > On 10/02/12 19:25, William Ahern wrote:
> > [...]
> > > Not unlike the way strings are internalized in Lua, each unique grapheme
> > > cluster is dynamically assigned a codepoint at runtime, so that clusters can
> > > be easily compared.
> >
> > Now, that's a disturbingly cunning idea. Easy, too. And there's plenty
> > of space at the top of Unicode code point range for these synthetic code
> > points. I might have to steal that for a non-Lua project; thanks.

I like that. Everything beyond plane 16 isn't Unicode, but the things
the UTF-8 convention can express beyond U+10FFFF (call it DUTF-8 for
denormalized maybe) are the ultimate Private Use Area. Just make sure
you never emit them.

In UTF-8 space, about half of the otherwise legal four-byte sequences
are unused. We get "U"+110000 to "U"+1FFFFF handling for free if we'd
like. If we supported five-byte sequences, the additional five bits
Should be Enough for Anyone. How many grapheme clusters are you going
to use, anyway?
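
To make the arithmetic concrete, here is a minimal sketch of an encoder that uses the original UTF-8 bit layout without the U+10FFFF cap. The name dutf8_encode and the constants are made up for illustration, it is written in Python rather than Lua for brevity, and it does not reject surrogates; it only shows where the spare four-byte and five-byte values come from.

```python
UNICODE_MAX = 0x10FFFF      # last real Unicode code point
FOUR_BYTE_MAX = 0x1FFFFF    # 21 payload bits: 3 + 6 + 6 + 6
FIVE_BYTE_MAX = 0x3FFFFFF   # 26 payload bits: 2 + 6 + 6 + 6 + 6


def dutf8_encode(cp: int) -> bytes:
    """Encode one code point using the uncapped UTF-8 bit layout.

    Hypothetical helper: values above U+10FFFF are accepted, which real
    UTF-8 (RFC 3629) forbids. Surrogates are not checked here.
    """
    if cp < 0:
        raise ValueError("negative code point")
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        lead, ncont = 0xC0, 1
    elif cp < 0x10000:
        lead, ncont = 0xE0, 2
    elif cp <= FOUR_BYTE_MAX:
        lead, ncont = 0xF0, 3
    elif cp <= FIVE_BYTE_MAX:
        lead, ncont = 0xF8, 4
    else:
        raise ValueError("needs more than five bytes")
    # Lead byte carries the top bits, each continuation byte carries six more.
    out = [lead | (cp >> (6 * ncont))]
    for i in range(ncont - 1, -1, -1):
        out.append(0x80 | ((cp >> (6 * i)) & 0x3F))
    return bytes(out)


# Values a four-byte sequence can carry that Unicode does not use.
spare_four_byte = FOUR_BYTE_MAX - UNICODE_MAX        # 0x110000..0x1FFFFF
all_four_byte = FOUR_BYTE_MAX - 0x10000 + 1          # 0x10000..0x1FFFFF
```

With these definitions spare_four_byte is 983040 out of 2031616 four-byte values, a little under half, and the first spare value 0x110000 encodes as the bytes F4 90 80 80.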

> > Of course, in Lua, since strings are all internalised anyway, you might
> > as well use string pointers instead of synthetic codepoints; that way
> > you let the string internaliser do the heavy lifting and avoid needing a
> > specialised API, and you still get to do comparisons by comparing
> > pointers...
> >
>
> Yeah. I didn't connect the dots on the string interning till after I sent
> the message.

Switching to strings as Lua tables (with either bytestring or numeric
content) makes UTF-32 look lightweight.

Still, a global table of {DUTF-8 \u11F8ED -> UTF-8 composed grapheme
cluster} would work nicely.
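
A rough sketch of such a table, under my own assumptions: synthetic values are handed out sequentially from 0x110000, single-code-point clusters keep their real code point, and the caller has already split the text into grapheme clusters. ClusterTable and its method names are hypothetical, and Python stands in for Lua here.

```python
SYNTHETIC_BASE = 0x110000   # first value past the Unicode range
SYNTHETIC_MAX = 0x1FFFFF    # last value a four-byte sequence can carry


class ClusterTable:
    """Hypothetical global map between clusters and synthetic code points."""

    def __init__(self):
        self._by_cluster = {}   # cluster string -> synthetic code point
        self._by_code = {}      # synthetic code point -> cluster string
        self._next = SYNTHETIC_BASE

    def intern(self, cluster: str) -> int:
        """Return a stable code for a cluster, assigning one on first sight."""
        if len(cluster) == 1:
            return ord(cluster)         # already a single real code point
        code = self._by_cluster.get(cluster)
        if code is None:
            if self._next > SYNTHETIC_MAX:
                raise OverflowError("out of four-byte synthetic code points")
            code = self._next
            self._next += 1
            self._by_cluster[cluster] = code
            self._by_code[code] = cluster
        return code

    def expand(self, code: int) -> str:
        """Map a code back to its text; real code points map to themselves."""
        if code < SYNTHETIC_BASE:
            return chr(code)
        return self._by_code[code]


table = ClusterTable()
e_acute = table.intern("e\u0301")   # 'e' + combining acute, two code points
```

Comparing two clusters is then an integer comparison, and expand() is what you would call on the way out so that the synthetic values are never emitted.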

Note that the presence of beyond-UTF-8 content is still a feature
of the bytestring value, not any declaration or box. So since I'm
using up bits in the reserved byte in strings, we can have a two-bit
field:

00: Unknown contents of string; nobody cared to run assert_utf8() or
assert_dutf8() on it.
01: Not in UTF-8, and not in DUTF-8 either. Will break {f: domain
DUTF-8 -> range DUTF-8} and hence also UTF-8 guarantees.
10: In DUTF-8 but not UTF-8. DUTF-8 is closed under a (slight)
superset of the same operations as UTF-8.
11: In DUTF-8 and UTF-8 (including blocking the surrogate area). In
conformance with http://tools.ietf.org/html/rfc3629 , legal for
interchange.

For the most part, if a string function f takes DUTF-8 strings as its
domain and has DUTF-8 as its range, then for strictly UTF-8 parameters
f's range will be UTF-8.
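
As an illustration of how the memoized bits could propagate, here is a sketch for one such f, plain concatenation. The flag names and concat_flag are hypothetical, and the rule is my reading of the two-bit field above rather than anything implemented; Python is used for brevity.

```python
# Two-bit field values, named here for illustration only.
UNKNOWN, INVALID, DUTF8_ONLY, UTF8 = 0b00, 0b01, 0b10, 0b11


def concat_flag(a: int, b: int) -> int:
    """Flag for the concatenation of two strings whose flags are a and b."""
    if a >= DUTF8_ONLY and b >= DUTF8_ONLY:
        # Both operands are well formed, so the result is too. It is strict
        # UTF-8 only if both were (11 and 11), otherwise DUTF-8 only.
        return min(a, b)
    # Anything else has to be rechecked. Even two invalid halves can join
    # into something valid, e.g. b"\xc3" + b"\xa9" is a well-formed U+00E9.
    return UNKNOWN
```

So UTF-8 plus UTF-8 stays 11 with no rescan, one DUTF-8-only operand drags the result down to 10, and an unknown or invalid operand just resets the field to 00.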

If you're checking whether something is in DUTF-8 or UTF-8 it costs
almost nothing more to check the other since you'll be walking the
whole string regardless. Yes, the UTF-8 check alone can bomb out early
on some input, but I bet it doesn't happen much aside from CESU-8 bogons.
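
A minimal sketch of that single pass, with several assumptions of mine spelled out: DUTF-8 here stops at four-byte sequences, overlong forms are rejected outright, and encoded surrogates (the CESU-8 case) or values above U+10FFFF only downgrade the result from 11 to 10. The classify function is hypothetical and written in Python rather than C or Lua.

```python
UNKNOWN, INVALID, DUTF8_ONLY, UTF8 = 0b00, 0b01, 0b10, 0b11

# (lead mask, lead pattern, continuation count, smallest non-overlong value)
_FORMS = (
    (0x80, 0x00, 0, 0x0),
    (0xE0, 0xC0, 1, 0x80),
    (0xF0, 0xE0, 2, 0x800),
    (0xF8, 0xF0, 3, 0x10000),
)


def classify(data: bytes) -> int:
    """Walk the bytes once and return the two-bit flag (never UNKNOWN)."""
    strict = True                       # still strict UTF-8 so far?
    i, n = 0, len(data)
    while i < n:
        lead = data[i]
        for mask, pattern, ncont, minimum in _FORMS:
            if (lead & mask) == pattern:
                break
        else:
            return INVALID              # stray continuation or 0xF8..0xFF lead
        if i + ncont >= n:
            return INVALID              # sequence truncated at end of string
        cp = lead & ~mask & 0xFF        # payload bits of the lead byte
        for b in data[i + 1 : i + 1 + ncont]:
            if (b & 0xC0) != 0x80:
                return INVALID          # not a continuation byte
            cp = (cp << 6) | (b & 0x3F)
        if cp < minimum:
            return INVALID              # overlong encoding
        if cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
            strict = False              # fine for DUTF-8, not for UTF-8
        i += 1 + ncont
    return UTF8 if strict else DUTF8_ONLY
```

The only extra cost over a plain UTF-8 validator is the one boolean: instead of returning at the first surrogate or out-of-range value, the loop notes it and keeps going, so both answers fall out of the same walk.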

What am I missing? Aside from any hope for space efficiency for
non-ASCII alphabetic scripts of course.

Jay