Joining is simple to define because there are not many ways to do it. Except that you may wonder what the effect is of joining two tables whose order is not defined (order is defined only on sequences), and you will run into the problem of how to resolve conflicting keys (if you want to keep them in the joined table) or how to compute new keys in the joined table (even when joining sequences, at least one of them must be renumbered if you want to keep all values, which implies at least an arithmetic transform of the keys).
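
As a rough sketch of these two cases (an illustration only, not a proposal for the standard library), joining could look like this in plain Lua. The names join_sequences and join_tables and the resolve callback are made up for the example:

```lua
-- Join two sequences: the second one is renumbered to follow the first.
local function join_sequences(a, b)
  local out = {}
  for i = 1, #a do out[i] = a[i] end
  local offset = #a
  for i = 1, #b do out[offset + i] = b[i] end  -- arithmetic transform of the keys
  return out
end

-- Join two general tables, keeping the keys.
-- `resolve(key, old, new)` (hypothetical callback) decides what to store on a conflict.
local function join_tables(a, b, resolve)
  local out = {}
  for k, v in pairs(a) do out[k] = v end
  for k, v in pairs(b) do
    if out[k] ~= nil then
      out[k] = resolve(k, out[k], v)
    else
      out[k] = v
    end
  end
  return out
end

local s = join_sequences({ "a", "b" }, { "c", "d" })
print(table.concat(s, ","))  --> a,b,c,d

local t = join_tables({ x = 1, y = 2 }, { y = 10, z = 3 },
  function(_, old, new) return old + new end)
print(t.x, t.y, t.z)  --> 1  12  3
```

Other conflict policies (keep the first value, keep the last, collect both in a list) only change the resolve function, which is exactly why there is no single obvious "join" for general tables.
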
For splitting, the problem is: what is the splitting criterion? A criterion on keys, or a criterion on values? And there are endless variations of both.
However, a common splitting pattern is the "select" operation, which splits a table in two parts: one for all pairs that match a condition, and another for all pairs that don't.

And even in this case, you may want either to keep the original keys, or to discard them and generate new sequences, in which case you also need a mapping from the old keys to the new ones.
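
The two variants of "select" could be sketched as follows. The function names and the keymap return value are assumptions made for this example, not an existing API:

```lua
-- Variant 1: keep the original keys in both result tables.
local function select_keep_keys(t, pred)
  local matched, rest = {}, {}
  for k, v in pairs(t) do
    if pred(k, v) then matched[k] = v else rest[k] = v end
  end
  return matched, rest
end

-- Variant 2: discard the keys and build two new sequences.
-- keymap[i] is the new index that old index i received in its result table.
local function select_renumber(seq, pred)
  local matched, rest, keymap = {}, {}, {}
  for i, v in ipairs(seq) do
    local dest = pred(i, v) and matched or rest
    dest[#dest + 1] = v
    keymap[i] = #dest
  end
  return matched, rest, keymap
end

local function is_even(_, v) return v % 2 == 0 end

local m, r = select_keep_keys({ 10, 11, 12, 13 }, is_even)
-- m = { [1] = 10, [3] = 12 }   r = { [2] = 11, [4] = 13 }

local m2, r2, keymap = select_renumber({ 10, 11, 12, 13 }, is_even)
-- m2 = { 10, 12 }   r2 = { 11, 13 }   keymap = { 1, 1, 2, 2 }
```

Note that the first variant returns tables with holes, so # and ipairs() are no longer reliable on the results.
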
For a string, the problem is equivalent: just think about how items in a string should be indexed or counted. Do you mean splitting by byte offset, or by substrings that match a criterion, such as a UTF-8 sequence?

There's no general solution that matches all problems. The existing implementations of split() operations on strings are based on an implicit encoding model: the split is defined according to the encoding it is based on, which defines the pattern, or according to the underlying low-level representation of "characters" ("char" in C/C++/Java, which are integers with known bit lengths). In Lua there's no "char"; there's just an operation that indexes strings by byte offsets and returns the value of individual bytes (which are unsigned 8-bit integers). But splitting a string arbitrarily on byte offsets is often wrong if you have to handle multibyte encodings (Lua is neutral about encodings: it could be UTF-8, some variant of UTF-16, GBK, SJIS, modified UTF-8...). The unsigned 8-bit chars are just implied by the default pattern-matching library and by the default unary "#" and binary "[]" operators (which can be overridden).
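
A small illustration of the byte-offset problem, assuming Lua 5.3 or later (for the utf8 library and the \u{...} escape) and assuming the string really is UTF-8. The helper split_at_char is a made-up name:

```lua
local s = "h\u{E9}llo"   -- "héllo": the é takes two bytes in UTF-8

print(#s)                --> 6   (bytes)
print(utf8.len(s))       --> 5   (code points)

-- Cutting at byte offset 2 lands in the middle of the é:
local broken = s:sub(1, 2)
print(utf8.len(broken))  --> nil  2   (byte 2 starts an incomplete sequence)

-- Split after the first n code points instead of after n bytes.
local function split_at_char(str, n)
  local pos = utf8.offset(str, n + 1)   -- byte position where character n+1 starts
  if not pos then return str, "" end    -- fewer than n characters: nothing to cut
  return str:sub(1, pos - 1), str:sub(pos)
end

local head, tail = split_at_char(s, 2)
print(head, tail)        --> hé  llo
```

Even this only handles code points; it knows nothing about combining sequences or about other encodings, which would each need their own notion of "item".
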

You're not even required to "split" a table physically: you could just view tables or strings as general indexed collections of pairs of arbitrary types, whose contents can be enumerated with a pairs() or ipairs() iterator, which can also be overridden.
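
For instance, a "view" of the matching half of a table can be a filtering iterator that never copies anything. This is only a sketch, and filtered is a hypothetical name:

```lua
-- Iterate over the pairs of t that satisfy pred, without building a new table.
local function filtered(t, pred)
  local iter, state, init = pairs(t)   -- honours a __pairs metamethod if one is defined
  return function(_, k)
    local v
    repeat
      k, v = iter(state, k)
    until k == nil or pred(k, v)
    return k, v
  end, state, init
end

local t = { a = 1, b = 2, c = 3, d = 4 }
local sum = 0
for _, v in filtered(t, function(_, v) return v > 2 end) do
  sum = sum + v
end
print(sum)  --> 7
```

The complementary half is just the same iterator with the predicate negated, so the "split" exists only as two ways of traversing the same table.
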

To make a very general split() you should take into account the pairs() or ipairs() operations defined on the collections, or their # and [] operators, and you need to define several mappings: one for the selection criterion, one for the generation of keys, and a final one that combines, in the new table, the possibly multiple pairs from the original table that got mapped to the same new key.
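
To give an idea of how many knobs that means, here is a minimal sketch of such a general operation. Everything in it (the name general_split and the option fields iterate, keep, newkey and combine) is invented for the illustration:

```lua
local function general_split(t, opts)
  local iterate = opts.iterate or pairs                             -- how to enumerate
  local keep    = opts.keep                                         -- selection criterion
  local newkey  = opts.newkey or function(k, v) return k end        -- key generation
  local combine = opts.combine or function(old, new) return new end -- key collisions
  local matched, rest = {}, {}
  for k, v in iterate(t) do
    local dest = keep(k, v) and matched or rest
    local nk = newkey(k, v)
    if dest[nk] == nil then
      dest[nk] = v
    else
      dest[nk] = combine(dest[nk], v)
    end
  end
  return matched, rest
end

-- Split words into long and short ones, re-keyed by their first letter.
local words = { "apple", "avocado", "fig", "banana", "blueberry", "apricot" }
local long, short = general_split(words, {
  iterate = ipairs,
  keep    = function(_, w) return #w > 5 end,
  newkey  = function(_, w) return w:sub(1, 1) end,
  combine = function(old, new) return old .. "," .. new end,
})
print(long.a, long.b)    --> avocado,apricot  banana,blueberry
print(short.a, short.f)  --> apple  fig
```

Every pair goes through up to three callbacks, which gives a feeling for the cost of generality compared with a hand-written loop for one specific case.
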

So it's much harder than you think: a very general operation (considering all the possible options) would be very slow. It's simpler not to put it in the core library and to let Lua programs define their own split/join operations for what they expect.


On Sat, 1 June 2019 at 05:25, Hugo <tkokof@163.com> wrote:
Hi, everyone

The Lua standard library does not provide a "split" function, and there are a lot of implementations with different limitations; there is even a wiki page about this topic (http://lua-users.org/wiki/SplitJoin).

I've wondered: since a "join" function (table.concat) is there, why isn't there a "split" function?

After some searching, I found a related (and old) answer from Roberto: http://lua-users.org/lists/lua-l/2002-11/msg00157.html.

It seems the reason is just that the performance gain would be trivial?

I'm curious about more details (in my case, which may not be common, "split" is used a lot and should be well defined).

And there is also a common (and big) question here: what is the basis for deciding which functions should be added to the standard library and which should not?

tkokof1