On Wed, Feb 8, 2012 at 12:17 AM, HyperHacker <hyperhacker@gmail.com> wrote:
>> Obviously we still need to keep around bytewise operations on some
>> stringlike thing. (Wait, obviously? If you're not in a single-byte
>> locale the candidates for bytewise ops look like things you don't
>> necessarily want to intern at first sight. The counterexample is
>> something like binary SHA-1 for table lookup.)
>>
>
> Mind, Lua strings are often not strings of text, but strings of binary data...

I think I was getting ahead of myself there. Let me explain the whole
mess for people not following the situation.

Lua strings are nice, no-nonsense bags of arbitrary bytes with value
semantics, which so many other languages lack. It would be a shame to
give that up.[1] In the discussion of the hash collision problem, an
implementation strategy of simply not hashing long strings until
needed was discussed. Your code wouldn't notice, except that the
performance characteristics would change. This is similar to the array
part of tables. You can use any key you want in a table, but if you
use ascending integers, performance will be better.
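
To make "your code wouldn't notice" concrete, here is a minimal sketch
of the behavior that would have to stay the same whether or not long
strings are hashed eagerly. The sizes and names are arbitrary choices
of mine, and nothing here can observe how the implementation stores
the strings or the table.

```lua
-- Two long strings built independently from the same 9-byte pattern,
-- including an embedded zero byte and a non-ASCII byte.
local a = ("\0\255payload"):rep(10000)

local parts = {}
for i = 1, 10000 do parts[i] = "\0\255payload" end
local b = table.concat(parts)

-- Value semantics: equal bytes mean equal strings, however they were built.
assert(#a == 90000)
assert(a == b)

-- Either string selects the same table slot.
local seen = {}
seen[a] = "first"
assert(seen[b] == "first")

-- The table analogy: any mix of keys works; ascending integer keys are
-- merely the pattern the implementation can handle more efficiently.
local t = {}
for i = 1, 4 do t[i] = i * i end
t.name = "squares"
assert(#t == 4 and t[3] == 9 and t.name == "squares")
```

Only the cost of the comparison and the lookup could differ between
the two strategies, not the results.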

One use of strings is to slurp whole files in and hand them to
external code, never to be seen again. Another is to serve as the
stereotypical 4 KB input buffer. In neither case is the identity of
the string important. You never say "is this 4k I read from the
network the same as another transient buffer?" Nor do you use them as
keys in tables. If the only operations on long strings are to search
and extract substrings, then adding them to the global string table
has little benefit--the only use is if their lifetime overlaps with a
completely identical string, in which case they can share storage.
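
A rough sketch of that transient-buffer pattern, under my own
assumptions: make_reader is a hypothetical stand-in for a file or
socket that hands back 4 KB chunks, and count_newlines is just an
example consumer. The chunks are only searched; they are never
compared with each other and never used as table keys.

```lua
-- Hypothetical chunk source: returns successive pieces of `data`,
-- then nil, much like reading fixed-size blocks from a file or socket.
local function make_reader(data, chunk_size)
  local pos = 1
  return function()
    if pos > #data then return nil end
    local chunk = data:sub(pos, pos + chunk_size - 1)
    pos = pos + chunk_size
    return chunk
  end
end

-- Example consumer: counts newline bytes by plain (non-pattern) search.
-- Each chunk is dropped as soon as the next one is read.
local function count_newlines(read)
  local count = 0
  for chunk in read do
    local init = 1
    while true do
      local s = chunk:find("\n", init, true)
      if not s then break end
      count = count + 1
      init = s + 1
    end
  end
  return count
end

local data = ("line\n"):rep(1000)   -- 5000 bytes, so two chunks at 4096
local total = count_newlines(make_reader(data, 4096))
assert(total == 1000)
```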

The other benefit of interning everything is implementation
simplicity. Without it, there would have to be two different kinds of
strings behind the scenes, and they'd have to work identically.

What I was thinking is that the bag-of-bytes usage often matches up
with the uninterned string usage pattern and wondering how much of the
current string.* is really useful in both cases. In particular, I was
wondering out loud if there was reason to have two distinct
user-visible types to reduce the complexity of making the two work
identically, or if they should be one type with two distinct
implementations.

I'm thinking out loud in a lot of this, and if it seems a little jumpy,
it's because my thoughts are not settled.

How often do you use the blobs as table keys or compare them for equality?

Jay
[1]: Well, give that up again perhaps. Userdata used to have value
semantics. People didn't like it. But interned bags of bytes look a
lot like that, just with accessors.