It's based on an idea by Rici Lake (briefly mentioned in the 'Lua strings as "atoms"' mailing list thread in May 2005). It was further discussed and refined on irc #lua. I decided to do a prototype implementation to test the validity of the concept.
This patch is intended to gain feedback on the usefulness of fast strings for Lua applications. Any and all comments are welcome. Especially needed are benchmark comparisons for real applications that use many (different) small strings. Thank you!
To avoid any misunderstanding: the patch is *NOT* intended to be merged into the Lua 5.1 standard distribution nor would I recommend to do so (at this time). I leave it entirely up to the judgement of the Lua authors whether anything from this patch is worthwhile to be included into any future Lua version.
Please send your comments to the Lua mailing list, since the Wiki is not well suited for discussions.
License notice: I hereby declare that all work I contributed to the Lua core is to be governed under the same license as the Lua core itself.
Click to download the [patch itself] or a [patched-up Lua 5.1-work6 distribution].
A number of fixes for the baseline have been included in the patch: MSVC number2int fix, *s performance improvement, remove undefined lua_checkpc assertion.
Lua stores all objects in tagged value slots on the Lua stack or in Lua tables. These slots are usually 12 or 16 bytes long (depending on the architecture and the type of lua_Number). The size is dominated by the requirement to store properly aligned pointers and lua_Number's. The tag itself is a small integer and easily fits into a single byte. That leaves us either 11 or 15 bytes of free space that can be put to good use.
The basic idea is to store fast strings (i.e. short strings of up to 11 or 15 bytes) directly into tagged value slots instead of allocating them on the heap.
This means that fast strings behave more like numbers in that they do not need to be allocated, garbage collected or freed. This significantly reduces VM overhead.
Neither the Lua API not the Lua/C API have been changed. There is no need to change anything in your Lua or C modules (but see below under "Caveats" for some notes about Lua/C API conformance).
Fast strings behave and look exactly like regular strings outside the Lua core. You do not need any special provisions for using them and there is no externally visible difference.
Yes, this means type() returns "string" and lua_type() returns LUA_TSTRING, no matter how the string is stored internally. You can use all Lua functions and all Lua/C API functions for strings without restrictions.
Of course your application may be faster and it may use less memory. So you may notice a difference after all. But that's kind of the point of this patch ...
Many applications use short strings extensively. It's well known that there is a significant skew towards short strings in the distribution of string sizes used by typical applications (e.g. names, tags, identifiers, natural language words, ...). And many of these short strings have only a very limited lifetime (e.g. temporaries, used only in local scopes).
As a consequence, short strings usually incur a high overhead for memory management (allocate and free), for garbage collection (mark and sweep) and for managing a shared string hash table. There is additional overhead due to contention with other managed objects:
Fast strings avoid most of this overhead:
Two key advantages of a shared string table are that string equality and table lookup are cheap. The former needs a type plus pointer comparison, the latter can reuse the precomputed hash value (but it needs to be stored with the allocated string).
Equality checks on fast strings are equally cheap when done right: bulk comparison of a tagged value slot is fast and tag number comparison can be optimized, too (e.g. by holding the fast string length). Fast string hashing can use a fast bulk hash of the whole slot. Any remaining hash overhead can be reduced by avoiding the hash generation for lookups in small tables (a linear bulk comparison is faster).
Obviously one can prove almost anything when selecting the proper
benchmark. So I left out some targetted benchmarks like:
lua -e 'for i in io.lines() do end' </usr/share/dict/words
or excessive use of
These show tremendous speedups (anywhere from +60% to +200% depending on the architecture), but probably do not exercise typical application behaviour.
On the converse you won't see any difference on standard benchmarks that do not use short strings extensively (e.g. fibionacci numbers or matrix multiplication).
So I've just taken the obvious candidates from the "great computer language shootout" benchmarks and compared them on different architectures:
Benchmark | x86 x64 PPC32 PPC64 ------------+---------------------------------- hash | +38.2% +55.8% +40.9% +38.7% spellcheck | +23.6% +45.8% +17.7% +25.2% reversefile | +6.2% +22.7% +10.2% +13.6% wordfreq | +2.2% +8.3% +10.6% +8.6%Well ... not bad, huh?
But of course it would be best to see some benchmark results from real applications. Please share any insights you may have with the mailing list. Thank you!
16 bit systems are not supported right now and you'll get a compiler error if you try. This is more a matter of testing and supporting 16 bit size_t's (note that some 16 bit systems use 32 bit size_t's, but the current implementation is not tuned for this and refuses to compile on any 16 bit system).
One important requirement for the implementation of fast strings was to not change the Lua/C API. This avoids the necessity to rewrite any C modules and helps with getting quick feedback for the prototype implementation.
However fast strings and regular strings need to coexist inside the Lua core and must be treated differently almost everywhere. But there should not be any visible difference between them from the outside and they should be treated the same. This necessitated a split between internal and external tag numbers.
Internal tag numbers are only visible inside the Lua core and are stored in all tagged value slots (stack and table slots) and GC objects. External tag numbers are only visible outside of the Lua/C API and are used by all C modules.
The internal tag numbers are defined in lobject.h. All of their names start with 'TT' (e.g. TTNIL, TTBOOLEAN, ...). They are stored in a single unsigned byte (tt in struct TValue). The tag number is split into a 4 bit major number and a 4 bit minor number. The major number holds the basic object type, the minor number holds additional details. The minor number is currently only used for booleans (to hold false (0) or true (1)) and fast strings (to hold the string length).
At the Lua/C API boundary all internal tag numbers are mapped to the external tag numbers defined in lua.h. Their names start with 'LUA_T' (E.g. LUA_TNIL, LUA_TBOOLEAN, ...) and keep the same meaning as in a standard Lua core. Both fast strings and regular strings map to the same external type (LUA_TSTRING). The same API calls work with both kinds of strings and well written C modules will never notice any difference (but see below under "Caveats").
Clever allocation of internal tag numbers helps to speed up many common checks. E.g. ttixstring() checks for both kinds of strings in one operation and with only one conditional. Dito for the performance sensitive l_isfalse() macro. The tag number for a fast string of maximum length is always zero (0x00) and serves as a terminator for C strings, too. This means that lua_tolstring() can simply return a pointer to the stack slot for fast strings.
Due to the string pointer stability guarantees given by the existing Lua/C API (see below) it's required to avoid stack reallocation. For the prototype implementation I decided to use a fixed stack size (see LUAI_FIXEDSTACK in luaconf.h). This is of course not desirable in the long term and several better solutions exist: stack chunking, lazy stack copying/freeing or changing/augmenting the the Lua/C API guarantees. However this was just not within the scope of this prototype implementation (sorry).
Apart from the changes needed for the internal type system, all places where strings are created or accessed needed to be changed to handle both variants. This is what the ttixstring(), xsvalue() and xslen() macros and the luaS_set* functions and macros are for. Object copying has been changed to copy the whole tagged value slot. Some changes for object comparisons and hash tables were needed, too. Dumping/undumping a function always stores the external type value in pre-compiled chunks.
The Lua compiler (llex.c, lparser.c, lcode.c) only deals with strings in the form of TString pointers. To avoid extensive modifications without any perceived benefit, this has not been changed. This means the compiler always uses regular strings internally (even for short strings). They are only converted to tagged values (regular or fast strings) when they are used as constants in function prototoypes. All strings used for debug information (like variable names) are still regular strings. This turns out not to be a problem because the debug API only returns string pointers, anyway.
The patch looks lengthy because of the many tiny changes. But it has little impact on the resulting code size: it adds only ~1.8K to the Lua core (with default settings on x86 + gcc 3.3).
There is one tunable (LUAI_FSTRINGWORDS in luaconf.h) that allows you to change the default maximum size of fast strings from 11 to 15 characters on 32 bit systems. Note that 64 bit systems always use a limit of 15 characters (which is the maximum, too). Please read the description in luaconf.h carefully before setting this value.
This may be helpful when a significant percentage of your strings falls into the range of 12..15 characters. But I have not found any noticeable performance gains in my tests. So I've opted for smaller stack/table slots in the default configuration (leaving LUAI_FSTRINGWORDS undefined).
Fast strings behave like regular strings when used from outside the Lua/C API boundary. But there are a few subtle differences that may affect programs that (maybe unknowingly) take advantage of a few implementation details of the standard Lua VM:
1. The specification for lua_tostring()/lua_tolstring() says: "Because Lua has garbage collection, there is no guarantee that the pointer returned by lua_tostring will be valid after the corresponding value is removed from the stack."
Previously you could sometimes get away with destroying the stack slot for regular strings and still reference the pointer (because the GC was not run immediately). However this is an API violation and may cause hard to find bugs.
Since fast strings return a pointer to the stack slot itself, non-conformant C modules *definitely* need to be fixed.
An additional restriction is that a fast string pointer may get stale if stack slots are moved (lua_insert()). However this is very rare and can be avoided easily.
2. The Lua/C API does not guarantee string _pointer_ equality for identical strings.
Some authors may have wrongly assumed that because all strings are shared, the same pointer is always returned for identical strings. This is not true in the general case. A string may have been deallocated and newly allocated if there was no reference to it anywhere in the Lua stack or in a table.
With fast strings you may get a different pointer for identical strings, depending on the referenced stack slot. And you may get the same pointer for different strings, if you happen to reuse the same stack slot.
Anyway, nowhere does the Lua/C API specification guarantee string _pointer_ equality. The only proper test for equality is to use the lua_equal() or lua_rawequal() API functions.
C modules making invalid assumptions about string pointer equality will *definitely* break with the fast string patch.
3. The Lua/C API specification makes no explicit statement about Lua stack reallocation. It is implied that string pointers retrieved from live stack slots are stable even when the Lua stack is reallocated (moved).
This is a pure necessity since almost any API call may reallocate the stack. Some C library calls (e.g. string.gsub()) may break without this guarantee.
This is the motivation for the fixed stack size workaround in the prototype implementation (see above under "Implementation Details").