Speeding Up Strings

lua-users home
wiki

Summary

There have been repeated discussions on lua-l (for example [1]) about the cost of passing strings from C into Lua, with the implication that it is a major overhead. This page seeks to document an experiment in modifying Lua to reduce string creation overhead by implementing "prestrings" -- lazily interned strings. A sample API is outlined, and the results of a benchmark test of a proof-of-concept implementation of the API are presented.

The proposed API cannot be used safely in a threaded environment without severe restrictions which might be difficult to guarantee. In addition, the benchmark results do not indicate the potential for significant speed-up in common use cases.

In short, this is more of a failed experhiment than a suggestion, but it's useful to report failures as well in order to avoid reinventing flat wheels.

Stringing along

Basically, the process the Lua API follows when importing a string into Lua, either with lua_pushlstring() or as the result of a primitive which creates a new string, is:

1) A hash is computed out of a sample of no more than 32 characters of the string.

2) Using the hash, the string is looked up in the global interned-string hash table.

3) If an existing string is found, then it is used.

4) Otherwise, a new TString object is allocated and the string is copied into it. The new object is inserted into the intern table using the hash value computed at step 1.

This works particularly well for shortish strings which are likely to already exist in the intern table, such as global variable names and other table keys, since it avoids unnecessarily allocating or copying the provided string, although it will incur the cost of a full comparison of the new string with one existing string.

It also works reasonably well for strings which are unlikely to already exist in the intern table since the hash lookup will probably never compare more than a few characters of any existing string on the chain, although it will incur the cost of a memory allocation and a copy.

Nonetheless, there remain some use cases where these overheads seem like they might be excessive. Some people worry that the cost of interning is too high for a string which will never be used as a table key, while others focus on the copy in step 4, above, which may be unnecessary if the string is going to be constructed piecemeal or created by some other library which expects a fixed-length buffer (for example, fread.)

The evolution of an API

Since I belong to the second group, I thought it might be useful to find a way to allow the creation of a new string without copying. On the surface, this looks easy: you simply create something which looks like a string (or at least is the same size as a string would be) and provide the starting address; the API consumer can then fill the contents in as desired and then convert the object into a real string, by interning it.

That suggested a simple API:

   lua_newprestring(L, size);
   lua_tostring(L, index);
   /* Note that lua_tostring is a macro which expands
    * to lua_tolstring(L, index, NULL);
    */
The first call creates a "prestring" (i.e. something which can be converted to a string) of a given size, and the second one converts the prestring to a string. Using lua_tolstring for this purpose seemed clear and obvious.

Now, the prestring is a real Lua object, so it could theoretically be returned from a C function as an uninterned string. And since it is turned into a real string by calling lua_tolstring on it, and since lua_tostring is the API for extracting a string's address and length, no modification of standard library functions is necessary in order to use prestrings, although it turns out to be useful to have an API call which allows the use of the prestring without interning it:

   lua_rawtolstring(L, index, &len);

I could have hijacked lua_topointer, but lua_topointer doesn't return the string length; furthermore, it's good to know that the object at a stack index is actually a string, and not just anything which might have a pointer.

My first attempt to use the API was to write a file method :buffers, which is a fixed-length analog to :lines:

   for b in f:buffers(2000) do
     -- #b is 2000 unless there are fewer than
     -- 2000 characters to read
   end
:buffers could just return a string, but it seemed more interesting to return a prestring, in case the use of the return value didn't require interning: say, if it were just going to be sent to some output. In that case, :buffer could even reuse the same prestring on each iteration, provided that it hadn't been stringified. (This is not entirely safe since the client program could keep a reference to the buffer, even though it hasn't been interned, and might get surprised that the value changes.)

That meant that the implementation of :buffer needs to be able to know whether or not the prestring has been stringified, and that required another API:

   lua_rawtoprestring(L, index, &len);
which returns the address of the contents of the prestring at index, if it is a prestring, and otherwise NULL.

Using lua_rawtoprestring and lua_newprestring, :buffer can decide to either recycle a prestring or create a new one, but for reasons which now seem hard to justify, I added yet another API which combines the two actions; the most semantically complicated API call in the implementation:

   lua_toprestring(L, index, &len);
which examines the object at stack slot index and:

1) if it is a prestring, does not modify it;

2) if it is a string, creates a new prestring with the same length (without copying contents) and replaces stack slot index with the new prestring;

3) if it is anything else, creates a new prestring whose length is the value currently pointed to by the &len argument, and replaces stack slot index with the new prestring;

4) finally, regardless of which of the above was performed, returns the address and length of the prestring at stack slot index

Although that's hard to explain, and maybe hard to understand, it's quite easy to use. For example, the implementation of :buffers is as follows:

static int io_readbuf (lua_State *L) {
  FILE *f = *(FILE **)lua_touserdata(L, lua_upvalueindex(1));
  if (f == NULL)
    luaL_error(L, "file is already closed");
  else {
    size_t nread;
    size_t toread = lua_tointeger(L, lua_upvalueindex(2));
    char *buf = lua_toprestring(L, lua_upvalueindex(3), &toread);
    nread = fread(buf, 1, toread, f);
    if (nread == toread)
      lua_pushvalue(L, lua_upvalueindex(3));
    else if (nread > 0)
      lua_pushlstring(L, buf, nread);
    else if (ferror(f))
      luaL_error(L, "%s", strerror(errno));
    else
      lua_pushnil(L);
  }
  return 1;
}

static int f_buffers (lua_State *L) {
  size_t bufsize;
  tofile(L);
  bufsize = luaL_optinteger(L, 2, LUAL_BUFFERSIZE);
  lua_settop(L, 1);
  lua_pushinteger(L, bufsize);
  lua_pushnil(L);
  lua_pushcclosure(L, io_readbuf, 3);
  return 1;
}

A prestring is just a nascent string

A key part of this entire process is that the prestring is converted in place into a string either explicitly or when needed. Prestrings don't have a separate type; they look at all times like ordinary strings, unless one of the above prestring-aware API calls is used. Existing C modules and Lua programs do not need to be modified in any way to take into account the existence of prestrings, and if you want a C module to avoid autointerning a prestring, one only needs to replace lua_tolstring with lua_rawtolstring.

Since a prestring is really a lazily-interned string, it has to act as though it were a string the Lua environment. That means that it has to be interned in order to be used as a table key. (The proof-of-concept implementation also auto-interns prestrings which are arguments to ==.)

Once a prestring is interned into a string, it must not be changed. So it's not really safe to modify a prestring which has been released into the Lua environment. You must finish creating the contents of the prestring before letting any Lua or CFunction see the prestring.

That's fine in a non-threaded environment, and the above implementation of :buffers will work; it is careful not to modify the prestring without checking to make sure that it still is a prestring. But it cannot be made to work safely in a threaded environment; the :buffers iteration function does not run inside of the lua_lock and another process which had a reference to the buffer could stringify it while the call to fread were in progress.

That's unfortunate because the benchmarks I ran against the proof-of-concept implementation show that the only significant speed-up available is through recycling prestrings in a hard loop. The overhead of interning and even copying strings is just not enough to justify the complexity of the implementation (in my opinion).

Trying to benchmark

The first benchmark I tried was reading and writing file buffers, using a function rather similar to :buffers as shown above. The iolib was modified by adding the :buffers iterator and replacing a few calls to lua_tolstring with lua_rawtostring. However, all of the timings were dominated by the I/O system calls, so the results were not conclusive, except in the sense that they demonstrate that in a normal usage with file I/O, the savings are not even going to be noticeable.

In order to try to measure more exactly what the benefits might be, I created a sort of pseudo-file, which consists of just under 16K random printable characters in a ringbuffer. The ring buffer is "read" by copying the number of requested bytes and advancing the internal "file" position, always modulo the ringbuffer size (which is a prime). Periodically, the ring buffer is rerandomized; the end result is that it is extremely unlikely that a given "read" will result in an existing string.

Using the above API, there are at least four possible ways of returning a buffer of characters:

1) "recycle": Always return the same prestring, with new contents

2) "prestring": Return a new prestring each time, but don't intern it

3) "inplace": Create the new string by first creating a prestring of the correct size, then reading data into it, and finally turning the prestring into a string before returning it.

4) "string": The current mechanism: the file is read into a prestring (although it could just as well have been a preallocated buffer), and then lua_pushlstring is called to create a string in the normal way.

Comparing the times for these four strategies reveals the cost of

1) allocating memory (the difference between "prestring" and "recycle")

2) interning (the difference between "inplace" and "prestring"); and

3) copying (the difference between "string" and "inplace").

The benchmark was done rather unscientifically on a laptop running Xubuntu (I tried to avoid using the machine while the benchmark was running, but there were still lots of processes active). Lua was compiled with -O2 -DLUA_USE_APICHECK (it should have been compiled without LUA_USE_APICHECK although I don't think that affects the relative timings; obviously, the execution times reported are slower than they should be in a production build. The machine is an AMD Turion running in a 64-bit environment. For each buffer size, the simulated file size was set to 1,000,000 buffers and execution time was collected with os.clock, so the results can be read as µs per buffer. Each test was run ten times and the fastest time was selected for the summary; the results are summarized in Files:wiki_insecure/users/rici/buftest.pdf. As can be seen, the interning time is on the order of 200-250 nanoseconds per string, which is only noticeable for very short strings; the copying time is at worst 13% of the total, with the largest buffers tested.

Proof of Concept

The implementation is not particularly polished, and since the results were not spectacular, I'm unlikely to continue develop it. In case anyone is interested, I've put a patch at Files:wiki_insecure/users/rici/prestring.diff

Some notes on the implementation:

Garbage Collection

Every garbage collectable object has a next member which Lua uses to perform the sweep phase of the garbage collection. Objects other than strings are kept in a singly-linked list whose head is in the global state. Strings, however, are kept in the string intern table, and the next member is used to link together hash chains. (In effect, the head of each hash chain is part of the garbage collection root.)

Once a prestring has been interned, it must be removed from the allocated object linked list and linked into the appropriate hash chain in the intern table. The only way I could think of doing that was to doubly-link the list of allocated prestrings. The back pointer is kept in a union with the hash member, which is not needed by prestrings. On 32-bit architectures, this has no cost, but on 64-bit architectures it increases the string overhead from 24 bytes to 32 bytes, since hash is an int. Possibly it would have been better to just assume that there will never be a lot of prestrings, and to keep them in a singly-linked list which is just searched linearly when a prestring is interned.

Other than that, I don't think there are any GC issues. Like strings, prestrings are always created on the stack and can be created white since the stack is marked atomically.

Locking

The lock mechanism offered by the lua_lock and lua_unlock macros is designed to avoid garbage collection or table resizing from interfering with API calls or VM execution. Lua does not attempt, even in theory, to prevent race conditions from other object mutations.

Prestrings introduce a new form of mutation: changing a mutable object (a prestring) into an immutable object (a string). As noted above, the lua_lock mechanism cannot prevent a race between changing the contents of a prestring and changing a prestring to a string.

Other notes

Prestrings have the same type as strings (LUA_TSTRING) and have the same memory layout other than the hash member. One flag was added to the object header to distinguish prestrings from strings.

When a prestring is interned, it is possible that the result will be an existing string. In this case, we'll end up with two objects: the prestring and the old string. This could result in the same prestring being interned multiple times. To reduce the possibility of this happening, the autointerning done by key lookup and the equality operator both reset the string pointer to the old string; implementing that required changing a number of internal prototypes from const to non-const, and without doing much to solve the problem.

I think it would probably better to stash a forwarding pointer in the prestring (using the overloaded back pointer/hash) and rewriting the string pointer during garbage collection. (The garbage collector could not rewrite pointers in the stack frame of a C call but ought to be able to rewrite them elsewhere. Rewriting a pointer in a C call stack frame could result in the prestring being freed by the garbage collector while the C function was using a pointer to the prestring.)

--RiciLake


RecentChanges · preferences
edit · history
Last edited December 15, 2007 3:08 am GMT (diff)