On Mon, 17 Jun 2019 at 12:57, 云风 Cloud Wu <cloudwu@gmail.com> wrote:
>
> Lua had a single string type before 5.2.1; all strings were interned
> for fast comparison.
>
> Lua 5.2.1 added a new internal long string type because of hash DoS
> attacks, but short strings are still interned. I guess the reason is
> that string comparison performance is very important to Lua; string
> interning reduces string equality comparison from O(n) to O(1).
There is also a benefit in reduced RAM usage (in some applications).
But importantly a string's hash is also used for table lookups, which
is quite a key part of Lua.
Regards,
Matthew
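
To make the interning argument above concrete, here is a minimal sketch in Python rather than in Lua's C internals. The `InternTable` class and its `intern` method are hypothetical names for illustration only and are not taken from the Lua source. The idea is that the O(n) content comparison is paid once, when a string enters the table, and every later equality test becomes an identity check.

```python
class InternTable:
    """Toy intern table: equal strings map to one shared object."""

    def __init__(self):
        self._table = {}

    def intern(self, data: bytes) -> bytes:
        # The dict lookup hashes and compares the contents here, once.
        # setdefault returns the already stored object if an equal one exists.
        return self._table.setdefault(data, data)


table = InternTable()
a = table.intern(bytes([108, 117, 97]))  # b"lua", built at run time
b = table.intern("lua".encode())         # a separate but equal object

# After interning, equality reduces to an identity (pointer) comparison.
same_object = a is b
```

The same table is also what saves memory: the second copy of `b"lua"` is dropped as soon as it has been interned.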
No pun intended? XD
The reduced RAM usage is more widespread than you might imagine. Consider Zipf's Law, which observes that the most common words in a data set are WAY more common than the least common ones. The Pareto principle lets us approximate it as "80% of the words in a piece of text come from 20% of the vocabulary." This means something as simple as splitting a string will benefit from short string interning.
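
As a rough illustration of the splitting case, the sketch below counts how many words a split produces against how many distinct strings would remain after interning. The function name `intern_stats` and the sample sentence are made up for this example; real text will give different ratios.

```python
def intern_stats(text):
    """Return (total_words, distinct_words) for a whitespace split."""
    words = text.split()
    # With interning, only the distinct words need their own storage.
    return len(words), len(set(words))


sample = "the cat and the dog and the bird saw the cat"
total, distinct = intern_stats(sample)
print(f"{total} words, {distinct} distinct strings after interning")
```

Even in this tiny sample, the most frequent word accounts for a large share of the tokens, which is the pattern Zipf's Law describes.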
The hash is retained, but it can be calculated lazily, as is already done for long strings.
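
A minimal sketch of the lazy-hash idea, assuming a string object that carries an empty hash slot until a table lookup first needs it. The class `LazyHashString` and the multiply-by-31 hash are stand-ins chosen for illustration; this is not Lua's actual hash function or object layout.

```python
class LazyHashString:
    """String wrapper that computes its hash only on first use."""

    def __init__(self, data: bytes):
        self.data = data
        self._hash = None           # not computed at creation time
        self.hash_computations = 0  # counter kept only for demonstration

    def hash(self) -> int:
        if self._hash is None:
            self.hash_computations += 1
            h = len(self.data)
            for byte in self.data:
                # Simple stand-in hash, truncated to 32 bits.
                h = (h * 31 + byte) & 0xFFFFFFFF
            self._hash = h
        return self._hash


s = LazyHashString(b"hello")
first = s.hash()   # computes and stores the hash
second = s.hash()  # reuses the stored value
```

A string that is never used as a table key never pays for hashing at all.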
To reduce memory usage, we can do string interning at the parser stage (it’s the main source of string objects) to remove duplicate strings. We can also use a cache, as lua_pushstring does now, to avoid pushing the same string repeatedly, or merge identical strings during GC.
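
One way to picture the cache idea is a small fixed-size table that remembers recently created strings and hands back the existing object on a hit. The sketch below is an assumption made for illustration: `StringCache`, its slot count, and keying by content hash are all hypothetical and do not describe how lua_pushstring's cache is actually implemented.

```python
class StringCache:
    """Small direct-mapped cache of recently seen strings."""

    def __init__(self, slots=64):
        self._slots = [None] * slots
        self.hits = 0

    def get(self, data: str) -> str:
        index = hash(data) % len(self._slots)
        cached = self._slots[index]
        if cached is not None and cached == data:
            # Reuse the existing object instead of keeping a duplicate.
            self.hits += 1
            return cached
        # Miss: remember this string, evicting whatever was in the slot.
        self._slots[index] = data
        return data


cache = StringCache()
x = cache.get("".join(["na", "me"]))  # built at run time, stored
y = cache.get("".join(["na", "me"]))  # equal content, served from cache
```

Unlike a full intern table, a cache like this gives no uniqueness guarantee, so equality still has to fall back to a content comparison when two objects differ.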
I disagree that the parser stage is going to be the main source of string objects in general. Almost any program that has to read data from a file is going to use a lot of small strings. (It is, of course, possible to do it with a single long string and then only process it using numeric data types, but that technique only makes sense for packed binary data.)
/s/ Adam