- Subject: Re: Mutable strings (was: VM object types) (was: various other posts)
- From: Rici Lake <lua@...>
- Date: Fri, 13 Jul 2007 13:42:42 -0500
On 13-Jul-07, at 12:47 PM, Jerome Vuarand wrote:
> It could be possible to override lua_isstring, lua_tostring and
> lua_pushstring to accept/generate such string userdata (which would be
> identified with a __isstring boolean metamember). That way many binary
> modules designed for non-mutable strings could work transparently with
> mutable ones.
That seems a bit complicated to me, and also fraught with problems.
It's fine for a function which does not retain the string, but
what of functions like string.gmatch which do keep a reference?
Suppose that the mutable string were an input buffer, which is
a likely scenario. Or suppose that the function might call a
Lua callback (string.gsub, for example).
I'd suggest that lua_tostring should coerce a mutable string
into a regular string, and that a new interface be defined,
something like:
    const char *lua_transientpointer(lua_State *, int, size_t *)
Functions which know absolutely that their use of the string
will be short-term could use that interface to get a pointer,
with the understanding that it could be invalidated by any
Lua API call. That would certainly cover the cases where the
function was immediately sending the string to a file or socket,
or copying it into a std::string, etc.
That API is typed as returning const char * so that it can work
with both mutable and immutable strings. In the inverse case,
where the function really needs the string to be mutable, one
would want:
    char *lua_tobuffer(lua_State *, int, size_t *)
which would coerce immutable strings into mutable ones (that is,
by copying.)
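A minimal sketch of how the two accessors could behave, written as a
standalone toy model rather than against the real Lua API. Everything
here is an assumption made for illustration: the ToyString type and the
toy_new, toy_transientpointer and toy_tobuffer functions are
hypothetical stand-ins, and the "slot" plays the role of a Lua stack
index.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a string value; NOT Lua's real TString. */
typedef struct ToyString {
    int is_mutable;   /* 1 = mutable string, 0 = ordinary immutable string */
    size_t len;       /* payload length, excluding the trailing '\0' */
    char data[];      /* payload (C99 flexible array member) */
} ToyString;

/* Allocate a toy string holding a copy of len bytes from src. */
static ToyString *toy_new(const char *src, size_t len, int is_mutable) {
    ToyString *s = malloc(sizeof *s + len + 1);
    assert(s != NULL);
    s->is_mutable = is_mutable;
    s->len = len;
    memcpy(s->data, src, len);
    s->data[len] = '\0';
    return s;
}

/* Analogue of the proposed lua_transientpointer: a read-only view with
 * no copy, usable for both kinds of string. The caller must treat the
 * pointer as short-lived. */
static const char *toy_transientpointer(const ToyString *s, size_t *len) {
    if (len != NULL) *len = s->len;
    return s->data;
}

/* Analogue of the proposed lua_tobuffer: a writable pointer. If the
 * slot holds an immutable string, the slot is replaced by a mutable
 * copy and the original is left untouched for its other holders. */
static char *toy_tobuffer(ToyString **slot, size_t *len) {
    if (!(*slot)->is_mutable)
        *slot = toy_new((*slot)->data, (*slot)->len, 1);
    if (len != NULL) *len = (*slot)->len;
    return (*slot)->data;
}
```

In this model a caller that only reads pays nothing for either kind of
string, and a caller that writes pays for one copy only when it was
handed an immutable string.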
> For the lua_tostring it's straightforward, for lua_pushstring it may be
> desirable to activate it on a per function basis (which could have a
> __pushmutablestrings boolean member in the function metatable) by the
> module user.
There is only one metatable for all functions, but in any event,
I think the interface I sketched out above would be sufficient, plus
of course:
    char *lua_newmutablestring(lua_State *, size_t)
Personally, I would be quite content with fixed length mutable
strings, although I know some people find that odd. My reasoning
is as follows:
1) A fixed-length mutable string can be stored in the same
representation as a normal Lua string, so it can be modified
in place into a Lua string without copying.
2) There are a number of cases where the length of the string
is known in advance (or can be guessed with reasonable
reliability). In particular, that covers the cases of input
and output buffers, and those implementations of deserialization
("unpickling") where the string length precedes the string in the
stream.
3) Where the string's length cannot be predicted accurately in
advance, it is likely that the application will use some sort
of exponential realloc. In such cases, the additional overhead
of copying the string header is minimal. In addition, if the
string is later to be interned (and thus made immutable), a
final realloc (i.e. copy) will be done to free the overhead of
the last realloc, so an additional copy will be necessary anyway.
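Points 1) and 3) can be sketched with a toy header-plus-payload
layout. This is a sketch under stated assumptions, not Lua's internal
representation: ToyStr, toy_newmutable, toy_append and toy_freeze are
hypothetical names, and hashing and interning are left out entirely.
The idea shown is that a buffer allocated at the right size is frozen
without its payload moving, while a buffer grown by doubling pays one
final realloc to drop the slack.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical layout: one allocation holds header and payload. */
typedef struct ToyStr {
    int is_mutable;   /* cleared by toy_freeze */
    size_t len;       /* bytes in use */
    size_t cap;       /* payload bytes allocated, excluding the '\0' */
    char data[];      /* payload (C99 flexible array member) */
} ToyStr;

/* Create an empty mutable string with room for cap bytes. */
static ToyStr *toy_newmutable(size_t cap) {
    ToyStr *s = malloc(sizeof *s + cap + 1);
    assert(s != NULL);
    s->is_mutable = 1;
    s->len = 0;
    s->cap = cap;
    s->data[0] = '\0';
    return s;
}

/* Append n bytes. When the buffer is too small, capacity is doubled
 * until the data fits; the small header travels with the payload in
 * the same realloc. */
static ToyStr *toy_append(ToyStr *s, const char *src, size_t n) {
    assert(s->is_mutable);
    if (s->len + n > s->cap) {
        size_t cap = s->cap ? s->cap : 16;
        while (cap < s->len + n) cap *= 2;
        s = realloc(s, sizeof *s + cap + 1);
        assert(s != NULL);
        s->cap = cap;
    }
    memcpy(s->data + s->len, src, n);
    s->len += n;
    s->data[s->len] = '\0';
    return s;
}

/* Make the string immutable. If there is slack from the last growth
 * step, one final realloc releases it; if the size was exact, the
 * block is reused as-is and nothing is copied. */
static ToyStr *toy_freeze(ToyStr *s) {
    if (s->cap > s->len) {
        ToyStr *t = realloc(s, sizeof *s + s->len + 1);
        if (t != NULL) s = t;   /* a failed shrink keeps the old block */
        s->cap = s->len;
    }
    s->is_mutable = 0;
    return s;
}
```

With a known length, toy_newmutable(n) followed by toy_freeze returns
the same pointer it was given. With an unknown length, the only extra
work compared with a plain char buffer is moving the header on each
growth step.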
All of this is based on my theory that the cost of interning
long strings is not the interning per se, since the Lua hashing
function only examines a maximum of 32 characters. In the case
of input buffers, which are typically fairly long, the interning
usually does not involve a full comparison of the string with
any existing string, because no identical string exists, and
the memcmp() terminates quite early.
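To make the sampling concrete, here is a small standalone function
modeled on the hashing loop in Lua 5.1's string creation code as I
understand it. Treat it as an illustration rather than a verbatim copy
of the Lua source: the name toy_hash and the examined counter are
additions for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hash a string of length l, visiting only every step-th byte from the
 * end. The number of bytes actually read is stored in *examined. */
static unsigned int toy_hash(const char *str, size_t l, size_t *examined) {
    unsigned int h = (unsigned int)l;   /* seed with the length */
    size_t step = (l >> 5) + 1;         /* longer string, bigger stride */
    size_t l1, n = 0;
    assert(str != NULL || l == 0);
    for (l1 = l; l1 >= step; l1 -= step) {
        h = h ^ ((h << 5) + (h >> 2) + (unsigned char)str[l1 - 1]);
        n++;
    }
    if (examined != NULL) *examined = n;
    return h;
}
```

For a 1000-byte buffer the stride is 32, so only 31 bytes are read,
and the first byte is not among them. Two buffers that differ only in
that byte therefore hash identically, and it is the follow-up memcmp()
that tells them apart, stopping at the first differing byte.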
Rather, the predominant cost of interning long strings is
the cost of copying the string into a newly-allocated interned
string structure. Consequently, I favour solutions which avoid
the copy, and am not too concerned about the occasional
"unnecessary" hash. I'm particularly interested in the use case
of passing input buffers to output sockets without incurring
copy overhead, which in my measurements is significant.
I've posted some benchmarking exercises in previous posts, and
I'd be interested in any real results which show that I'm
incorrect in the above assumption.