On Sun, Feb 12, 2012 at 08:51, Jay Carlson <nop@nop.com> wrote:
> (note: I'm not recommending this as a processing strategy, but
> exploring what can be done with a mechanism of memoized type
> assertions on strings. this is lua, we give you the rope, you go do
> your own thing.)
>
> On Feb 10, 2012 8:00 PM, "William Ahern" <william@25thandclement.com> wrote:
>>
>> On Sat, Feb 11, 2012 at 12:20:42AM +0000, David Given wrote:
>> > On 10/02/12 19:25, William Ahern wrote:
>> > [...]
>> > > Not unlike the way strings are internalized in Lua, each unique grapheme
>> > > cluster is dynamically assigned a codepoint at runtime, so that clusters can
>> > > be easily compared.
>> >
>> > Now, that's a disturbingly cunning idea. Easy, too. And there's plenty
>> > of space at the top of Unicode code point range for these synthetic code
>> > points. I might have to steal that for a non-Lua project; thanks.
>
> I like that. Everything beyond plane 16 isn't Unicode, but the things
> the UTF-8 convention can express beyond U+10FFFF (call it DUTF-8 for
> denormalized maybe) are the ultimate Private Use Area. Just make sure
> you never emit them.
>
> In UTF-8 space, about half of the otherwise legal four-byte sequences
> are unused. We get "U"+110000 to "U"+1FFFFF handling for free if we'd
> like. If we supported five-byte sequences, the additional five bits
> Should be Enough for Anyone. How many grapheme clusters are you going
> to use, anyway?
>
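
[Illustration of the four-byte arithmetic above. This is a rough sketch
in Python rather than Lua, and the names encode4, decode4, UNICODE_MAX,
FOUR_BYTE_MAX and SPARE are made up for the example. It assumes the usual
UTF-8 bit layout is simply extended to the full 21 bits a four-byte
sequence can carry.]

```python
UNICODE_MAX = 0x10FFFF    # last real Unicode code point
FOUR_BYTE_MAX = 0x1FFFFF  # 21 payload bits: 3 in the lead byte, 6 in each of 3 trailers

# Values a four-byte sequence can carry that Unicode never assigns.
SPARE = FOUR_BYTE_MAX - UNICODE_MAX          # 0xF0000 = 983040
USED = UNICODE_MAX - 0x10000 + 1             # 0x100000 = 1048576, so "about half" holds


def encode4(cp):
    """Encode a value in 0x10000..0x1FFFFF as a four-byte UTF-8-style sequence."""
    if not 0x10000 <= cp <= FOUR_BYTE_MAX:
        raise ValueError("value does not map to a four-byte sequence")
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


def decode4(b):
    """Inverse of encode4. Checks byte structure only, not overlong forms."""
    if len(b) != 4 or (b[0] & 0xF8) != 0xF0:
        raise ValueError("not a four-byte lead")
    if any((x & 0xC0) != 0x80 for x in b[1:]):
        raise ValueError("bad continuation byte")
    return ((b[0] & 0x07) << 18) | ((b[1] & 0x3F) << 12) \
        | ((b[2] & 0x3F) << 6) | (b[3] & 0x3F)
```

For values up to U+10FFFF the output matches ordinary UTF-8. Anything
above that round-trips through decode4 but is rejected by a strict
decoder, which is the "never emit them" caveat in practice.
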
>> > Of course, in Lua, since strings are all internalised anyway, you might
>> > as well use string pointers instead of synthetic codepoints; that way
>> > you let the string internaliser do the heavy lifting and avoid needing a
>> > specialised API, and you still get to do comparisons by comparing
>> > pointers...
>> >
>>
>> Yeah. I didn't connect the dots on the string interning till after I sent
>> the message.
>
> Switching to strings as Lua tables (with either bytestring or numeric
> content) makes UTF-32 look lightweight.
>
> Still, a global table of {DUTF-8 \u11F8ED -> UTF-8 composed grapheme
> cluster} would work nicely.
>
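
[Illustration of the global table idea above, sketched in Python. The
class ClusterTable and its methods are hypothetical names. It assumes
the caller has already split text into grapheme clusters (segmentation
is not shown) and that synthetic code points are handed out sequentially
starting just past U+10FFFF.]

```python
class ClusterTable:
    """Assigns each distinct grapheme cluster a synthetic code point."""

    def __init__(self, first=0x110000, last=0x1FFFFF):
        self._next = first        # next unassigned synthetic code point
        self._last = last         # top of the four-byte range
        self._by_cluster = {}     # cluster string -> synthetic code point
        self._by_code = {}        # synthetic code point -> cluster string

    def intern(self, cluster):
        """Return the code point for cluster, assigning one on first sight."""
        code = self._by_cluster.get(cluster)
        if code is None:
            if self._next > self._last:
                raise OverflowError("out of synthetic code points")
            code = self._next
            self._next += 1
            self._by_cluster[cluster] = code
            self._by_code[code] = cluster
        return code

    def lookup(self, code):
        """Return the composed cluster for a synthetic code point."""
        return self._by_code[code]
```

Two clusters are equal exactly when their synthetic code points are
equal, so comparison becomes an integer compare. The mapping only means
something inside the process that built the table.
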
> Note that the presence of beyond-UTF-8 content is still a feature
> of the bytestring value, not any declaration or box. So since I'm
> using up bits in the reserved byte in strings, we can have a two-bit
> field:
>
> 00: Unknown contents of string; nobody cared to run assert_utf8() or
> assert_dutf8() on it.
> 01: Not in UTF-8, and not in DUTF-8 either. Will break {f: domain
> DUTF-8 -> range DUTF-8} and hence also UTF-8 guarantees.
> 10: In DUTF-8 but not UTF-8. DUTF-8 is closed under a (slight)
> superset of the same operations as UTF-8.
> 11: In DUTF-8 and UTF-8 (including blocking the surrogate area). In
> conformance with http://tools.ietf.org/html/rfc3629 , legal for
> interchange.
>
> For the most part if a string function f has DUTF-8 string parameters
> as its domain and DUTF-8 as its range, for strictly UTF-8 parameters,
> f's range will be UTF-8.
>
> If you're checking whether something is in DUTF-8 or UTF-8 it costs
> almost nothing more to check the other since you'll be walking the
> whole string regardless. Yes, you can bomb out early on some UTF-8 but
> I bet it doesn't happen much aside from CESU-8 bogons.
>
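
[Illustration of the two-bit field and the single-pass check, sketched
in Python. The function classify and the constant names are made up for
the example. The exact definition of DUTF-8 is an assumption here:
structurally well-formed, shortest-form sequences of up to four bytes,
with surrogates and values above U+10FFFF allowed. Strict UTF-8 then
excludes those two cases.]

```python
UNKNOWN = 0b00     # nobody has checked the string yet (the default state)
INVALID = 0b01     # neither UTF-8 nor DUTF-8
DUTF8_ONLY = 0b10  # DUTF-8 but not UTF-8
UTF8 = 0b11        # both DUTF-8 and UTF-8


def classify(data):
    """Walk the bytes once and return INVALID, DUTF8_ONLY or UTF8."""
    strict = True              # cleared once something outside RFC 3629 is seen
    i, n = 0, len(data)
    while i < n:
        lead = data[i]
        if lead < 0x80:        # ASCII
            i += 1
            continue
        if 0xC2 <= lead <= 0xDF:
            need, cp, minimum = 1, lead & 0x1F, 0x80
        elif 0xE0 <= lead <= 0xEF:
            need, cp, minimum = 2, lead & 0x0F, 0x800
        elif 0xF0 <= lead <= 0xF7:
            need, cp, minimum = 3, lead & 0x07, 0x10000
        else:
            return INVALID     # stray trailer, overlong C0/C1, or a 5+ byte lead
        if i + need >= n:
            return INVALID     # sequence truncated by end of string
        for k in range(1, need + 1):
            b = data[i + k]
            if (b & 0xC0) != 0x80:
                return INVALID
            cp = (cp << 6) | (b & 0x3F)
        if cp < minimum:
            return INVALID     # overlong encoding
        if 0xD800 <= cp <= 0xDFFF or cp > 0x10FFFF:
            strict = False     # fine for DUTF-8, not legal for interchange
        i += need + 1
    return UTF8 if strict else DUTF8_ONLY
```

A UTF-8-only checker could stop at the first surrogate or out-of-range
value. This one keeps walking so that both answers come out of the same
pass, which is the trade-off described above.
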
> What am I missing? Aside from any hope for space efficiency for
> non-ASCII alphabetic scripts of course.
>
> Jay
>

Sounds like a nice idea, but I can't help but be reminded of old
programs that used ASCII 128-255 for special purposes...

-- 
Sent from my toaster.