lua-users home
lua-l archive

[Date Prev][Date Next][Thread Prev][Thread Next] [Date Index] [Thread Index]


I do agree that Lua should have an additional type for mutable strings (or more generaly vectors, not just tables: this could be done using the internal representation of tables already optimizing the dense index of integer keys, internally using 3 arrays, 1 for the dense part, 1 for the hashmap, 1 for the values mapped from the hashmap, but with a bit more deeply representation so that the dense part cannot contain any nil value).

This does not necessrily needs a new exposed type (this could be still exposed as a Lua table with an internal subtype, autodetected by key types forming a contiguous range of integers).
The values array in that case can directly contain the 21-bit codepoints if this is restricted to valid UTF-32, or could do like _javascript_ using an array of 16 bit integers, compatible with UTF-16 even if it allows unpaired surrogates, or an array of 32-bit integers, compatible with UTF-32 even if it contains out-of-range UCS-4 non-characters outside the valid 17 planes of the standard UCS.

When this is done, there's a way to create strings that would be "automutable", and a way to transform a muted string to an atomized string in the global shared table of immutable strings. As well any string could be derived: extracting substrings would just reuse the internal arrays and only the assignment in specific key positions (or other functions like insertion/replacements) would duplicate the string to make it mutable.

Anyway the problem is not only there: incremental concatenation also has a cost because there are multiple reallocations (but these reallocations are not optimal when they are immediately "atomized" to immutable strings, causing lot of work in the garbage collector: reallocation made on mutable strings would be more lazy, allocating lengths with some gaps at end (that the garbage collector could still reduce when needed).

This optimization should be generalized for tables (and all arrays), allowing them to become atomized on demand to shared immutable instances (but still without blocking them to be deatomized when motified).

The complication however is in the GC: it would have to handle mutable and immutable tables, and mutable tables would need to be collected by compressing their unused part (and possibly also recomputing its best "dense" part of integer keys. There should also exist an optimizing hint about how many "nil" keys are allowed in the dense part (of a string, this does not make sense as their character index normally forms a continuous range from 1 to #string, but we can imagine situations where #string counts only the first part of the string where all keys in 1..N are assigned a non-nil value, but still allows other indexes to be assigned in the dense part without being part of the canonical string. With this hint (e.g. allow up to 25% of nil keys in the dense part, storing other keys using the hashmap, or no more than 0% for strict dense representation; my opinion is that 50% is enough to avoid most pathetic reallocations; but we could still repack the unused part on demand, notably at end of strings/arrays, if we need to preserve memory, but the GC could do that work itself without any explicit demand by the Lua code, and it is intructed to repack strings using a smaller threshold such as 12.5% or even 0% if this only repacks the starting or ending range of integer keys because it implies to growth of the hashmap for the non-dense part).


Le ven. 7 déc. 2018 à 18:07, William Ahern <william@25thandclement.com> a écrit :
On Thu, Dec 06, 2018 at 03:22:45PM -0800, Sven Olsen wrote:
> >
> > Are you an advocate only for +=. or for similar operators for all
> > binary operators?
> >
>
> As someone who's implemented and used `..=`, my 2 cents would be that
> adding it to the language is actually unwise.  If you're going to create a
> large string through repeated concatenations, you should almost certainly
> be using some kind of pattern that leverages table.concat().  It's far more
> efficient because it avoids the allocation of intermediate strings.  `..=`
> just creates the temptation to build long strings in a way that wastes a
> lot of memory and processing time. I'm still cleaning up ancient parts of
> my own codebase where I used `..=` when I should have used table.concat().
>
> Arithmetic compound assignment ops are nice to have, but, the
> generalization to a compound assignment for all binary ops is a bad idea,
> imo.
>

FWIW, string concatenation operators in Perl, for example, are very
efficient because Perl strings are mutable, and the parser and interpreter
are very clever about optimizing string ops, including implementing CoW.
Code like

  $foo .= $a . $b . $c

is basically a memcpy from $a into $foo, $b into $foo and then $c into $foo.
Much of Perl's crazy syntax, semantics, and insanely ugly internals is a
product of the emphasis on optimizing string operations.

For a language like Lua I think a better solution to performant string
construction would be something like Java's StringBuffer. This is trivial to
do in Lua because the language emphasizes making use of the C API, and
userdata objects are first-class in terms of language treatment. The real
issue is that the Lua ecosystem isn't a batteries included kind of
environment.