lua-l archive



Correction: in the lenientUtf8 table, replace '[\248-\245]' with '[\248-\255]'.

On Sat, 1 June 2019 at 08:16, Philippe Verdy <verdy_p@wanadoo.fr> wrote:
And then we have the same logic as table.split() with this new breaker model, applied to strings.

For strings, the function can still return aggregates of arbitrary type (not just strings), the length of that aggregate being used as the default length indicator.

But string.split(string, breaker) could also take a "breaker" that is not necessarily a function: it could be interpreted as a standard Lua pattern (implicitly anchored at the start with '^'), or as a sequence of patterns (because Lua still does not support '|') that are tested in order. If there is a match, the matched substring is "returned" with its length. A match of length 0 is an error treated like an EOF: it stops the parsing and ignores the rest of the string given to string.split().

Given this:
  lenientUtf8 = { '[\0-\191]', '[\192-\223][\128-\191]?', '[\224-\239][\128-\191]{0,2}', '[\240-\247][\128-\191]{0,3}', '[\248-\245]' },
  strictUtf8 = { '[\0-\127]', '[\194-\223][\128-\191]', '[\224-\239][\128-\191]{2}', '[\240-\247][\128-\191]{3}' },

then:
  string.split('déjà\193\130été\128', lenientUtf8) would return { 'd', 'é', 'j', 'à', '\193\130', 'é', 't', 'é', '\128' }
  string.split('déjà\193\130été\128', strictUtf8) would return { 'd', 'é', 'j', 'à' }
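
The pattern-list breaker above can be sketched in plain Lua roughly as follows. This is only an illustration under several assumptions: split_by_patterns is a hypothetical helper name (nothing like it exists in the standard library), Lua 5.2 or later is assumed because of the '\0' inside a pattern, and the '{0,2}' / '{2}' quantifiers are expanded by hand because Lua patterns have no counted repetition. The three-byte lead range is taken as '\224-\239', the last lenient class as '\248-\255', and the strict two-byte lead range as '\194-\223' (so that the overlong '\193\130' is rejected, as the strict example expects). The input is written with byte escapes so it does not depend on the encoding of the source file.

```lua
-- Hypothetical helper, not part of standard Lua: try each pattern in order
-- at the current position and keep the first non-empty match.
local function split_by_patterns(s, patterns)
  local out, pos = {}, 1
  while pos <= #s do
    local token
    for _, p in ipairs(patterns) do
      -- With an init argument, '^' anchors the match at position pos.
      local m = s:match('^' .. p, pos)
      if m and #m > 0 then token = m; break end
    end
    if not token then break end  -- no non-empty match: treated like EOF
    out[#out + 1] = token
    pos = pos + #token
  end
  return out
end

local C = '[\128-\191]'  -- one UTF-8 continuation byte
local lenientUtf8 = {
  '[\0-\191]',
  '[\192-\223]' .. C .. '?',
  '[\224-\239]' .. C .. '?' .. C .. '?',
  '[\240-\247]' .. C .. '?' .. C .. '?' .. C .. '?',
  '[\248-\255]',
}
local strictUtf8 = {
  '[\0-\127]',
  '[\194-\223]' .. C,
  '[\224-\239]' .. C .. C,
  '[\240-\247]' .. C .. C .. C,
}

-- 'déjà' .. '\193\130' .. 'été' .. '\128', spelled as UTF-8 bytes.
local input = 'd\195\169j\195\160\193\130\195\169t\195\169\128'
local lenient = split_by_patterns(input, lenientUtf8)  -- 9 tokens
local strict = split_by_patterns(input, strictUtf8)    -- 4 tokens, stops at '\193'
```

The lenient table never fails on any byte, so it always consumes the whole string; the strict table stops at the first byte that starts no valid sequence.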


On Sat, 1 June 2019 at 07:50, Philippe Verdy <verdy_p@wanadoo.fr> wrote:
We could as well be even more general for the breaker function:

Instead of receiving a pair of arguments, it will receive only one: the *whole* (non-empty) sequence that remains to be processed.

And instead of returning a boolean, it will return a pair of values:
- an *aggregate* (possibly a substring, or a computed sum or average, or any other value representing the "parsed" token) that will be added to the new sequence built and returned by split();
- a *length* indicator, i.e. an integer > 0 indicating how many items of the sequence were matched. If it is 0, it is like an EOF condition, meaning: stop processing here. If it is nil or the second value is not returned, it is assumed to be the length of the returned aggregate (assumed to be 1 if that aggregate is a number, a table or a function). It is treated like 0 if the value is not a finite number (i.e. a NaN or an infinity) or if it is negative; if the value is a number > 0 but not an integer, it is truncated to an integer.

This length indicator is then used to define how many items to skip in the input before scanning the rest. The aggregate (the first returned value) is what will be stored in the returned sequence if the indicator was a number >= 1.

Now the breaker function can use any rule it wants for its internal lexer, and preprocess the matched subsequence to return it with any transform. It can be greedy or not, or decide to process only 1 element at a time without any parsing and return that element directly.

So:
   table.split({20, 11, 50, 38, ['A']=7}, function (a) return a[1] end)
will return {20, 11, 50, 38}.

And:
   table.split({20, 11, 50, 38, ['A']=7}, function (a) return a[1]+1 end)
will return {21, 12, 51, 39}.

And
   table.split({20, 11, 50, 38, ['A']=7}, function (a) return a[1]+a[2], 2 end)
will return {31, 88}.

A more complex example would return aggregates for all values that have the same parity (i.e. whose value modulo 2 is equal), or that are in the same range (e.g. the same integer quotient of the value divided by 100), and could also aggregate them or put them in a sequence that the breaker would build itself, and then return that sequence and its length.
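
The breaker model above can be sketched as follows. This is only one possible reading of the proposal: split, effective_length and same_parity_run are hypothetical names, and stopping when the breaker returns a nil aggregate is an extra assumption not stated in the text. The remaining sequence is copied on every step, which keeps the sketch short but makes it quadratic.

```lua
-- Hypothetical sketch of the proposed table.split(); not a standard function.
local function effective_length(agg, len)
  if len == nil then
    -- Default: length of a string aggregate, 1 for numbers/tables/functions.
    return type(agg) == 'string' and #agg or 1
  end
  if type(len) ~= 'number' or len ~= len or len == math.huge or len < 0 then
    return 0  -- NaN, infinite, negative or non-numeric: treated like EOF
  end
  return math.floor(len)  -- non-integers are truncated
end

local function split(seq, breaker)
  local out, pos, n = {}, 1, #seq
  while pos <= n do
    -- The breaker receives the whole remaining (non-empty) sequence.
    local rest = {}
    for i = pos, n do rest[#rest + 1] = seq[i] end
    local agg, len = breaker(rest)
    len = effective_length(agg, len)
    -- Assumption: a nil aggregate also stops the scan.
    if agg == nil or len < 1 then break end
    out[#out + 1] = agg
    pos = pos + len
  end
  return out
end

local t = { 20, 11, 50, 38, ['A'] = 7 }
local r1 = split(t, function (a) return a[1] end)            -- {20, 11, 50, 38}
local r2 = split(t, function (a) return a[1] + 1 end)        -- {21, 12, 51, 39}
local r3 = split(t, function (a) return a[1] + a[2], 2 end)  -- {31, 88}

-- A breaker that aggregates the leading run of values with the same parity.
local function same_parity_run(a)
  local run = { a[1] }
  for i = 2, #a do
    if a[i] % 2 ~= a[1] % 2 then break end
    run[#run + 1] = a[i]
  end
  return run, #run
end
local groups = split(t, same_parity_run)  -- {{20}, {11}, {50, 38}}
```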





On Sat, 1 June 2019 at 07:12, Philippe Verdy <verdy_p@wanadoo.fr> wrote:
Now let's consider that table.join() is only defined as the operation of transforming a sequence of sequences into a single sequence, preserving the total length.

What would table.split() be? It would transform a sequence into a sequence of sequences. But then you need a "break" condition, which can be formulated as a function break(a,b) returning a boolean: "do we break between values a and b?".

The simplest function is a constant function returning true: table.split() would then create a new sequence (as a table indexed by integers) where all elements of the initial sequence (the part of the source table that has integer keys in a contiguous range starting at 1) will be replaced by a singleton table.

So table.split({21,20,['A']=6}) would then return {{21}, {20}}, discarding the pair ('A', 6).

A more useful table.split() would take a break function (or constant) as a parameter, so that table.split({21,20,['A']=6}, true) would be the equivalent of table.split({21,20,['A']=6}).

If the break function is the false constant, no break will ever occur, so that table.split({21,20,['A']=6}, false) will return {{21, 20}}, i.e. a sequence of length 1 containing the sequence of the two values.
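
A minimal sketch of this pairwise variant, assuming the break argument may be a function, a boolean constant, or omitted (defaulting to true). The name split_pairs is hypothetical; only the sequence part of the table is visited.

```lua
-- Hypothetical sketch: split a sequence wherever brk(a, b) is true for two
-- adjacent values a and b. brk may also be a constant true/false.
local function split_pairs(seq, brk)
  if brk == nil then brk = true end  -- default: break everywhere
  local out, cur = {}, nil
  for i = 1, #seq do
    local cut = brk
    if cur and type(brk) == 'function' then
      cut = brk(seq[i - 1], seq[i])
    end
    if cur and not cut then
      cur[#cur + 1] = seq[i]            -- no break: extend the current group
    else
      if cur then out[#out + 1] = cur end
      cur = { seq[i] }                  -- start a new group
    end
  end
  if cur then out[#out + 1] = cur end
  return out
end

local singles = split_pairs({ 21, 20, ['A'] = 6 })         -- {{21}, {20}}
local whole = split_pairs({ 21, 20, ['A'] = 6 }, false)    -- {{21, 20}}
-- Break whenever the sequence stops ascending.
local runs = split_pairs({ 1, 2, 5, 3, 4 }, function (a, b) return b < a end)
```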

This can be generalized to strings as well (the items of their sequences are substrings of length 1, each holding one 8-bit byte) to create a split() operation that implements the common breaker algorithms used in decoders for common multibyte encodings (UTF-8, UTF-16(BE), UTF-32(BE), HKSCS, GBK/GB2312...).

So string.split('été', utf8Break), where 'été' is assumed here to be in UTF-8, would return {'é', 't', 'é'}. We just need the function utf8Break to test pairs of strings and return whether we must break between them or not. The first string grows as long as the function returns false, and is reset to an empty string after it returns true; the split() function then needs to enumerate the next two characters. The break function is never called with a pair containing an empty string.
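
As a rough sketch of that idea: split_string is a hypothetical driver, and this utf8Break is a deliberately lenient guess at the break rule (it breaks before any byte that is not a continuation byte and does no validation). The input is spelled with byte escapes so it does not depend on the source file encoding.

```lua
-- Hypothetical driver: grows an accumulator byte by byte and asks brk
-- whether to cut between the accumulated string and the next byte.
local function split_string(s, brk)
  local out, acc = {}, ''
  for i = 1, #s do
    local c = s:sub(i, i)
    -- brk is never called with an empty accumulator.
    if acc ~= '' and brk(acc, c) then
      out[#out + 1] = acc
      acc = c
    else
      acc = acc .. c
    end
  end
  if acc ~= '' then out[#out + 1] = acc end
  return out
end

-- Lenient rule: break unless the next byte is a continuation byte (128..191).
local function utf8Break(acc, nxt)
  local b = nxt:byte()
  return b < 128 or b > 191
end

local chars = split_string('\195\169t\195\169', utf8Break)  -- 'été' in UTF-8
```

Here chars holds three items, the two-byte 'é', then 't', then 'é' again. A stricter break function could also inspect the accumulator to check the expected sequence length.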

The break function can then be any suitable "lexer".


On Sat, 1 June 2019 at 06:51, Philippe Verdy <verdy_p@wanadoo.fr> wrote:
Note that for strings, the solution may be simpler: splitting can be viewed as equivalent to a process of "reading" items from it using a lexical scanner (but that will also possibly drop items that don't match the lexical rules).

So the split of strings could be formulated using as a parameter a regexp that matches individual items in the string. Lua already has that with the match() operation, except that it can match a regexp anywhere, on arbitrary lengths. The regexp could be anchored at the start or end of the string, and it also needs to be greedy or not (match the longest substring or the shortest one). If it is not greedy, it may end up discarding from the matches the trailing parts that do not match the regexp: e.g. it could match the start of a word, anchored by the start of the string or by a preceding whitespace or punctuation, but then the rest of the word won't match the anchor. And there remains the question of how to handle the word separators!
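
A few standard calls illustrate these points; scan is just a hypothetical wrapper around string.gmatch that collects the matches.

```lua
-- Hypothetical wrapper: collect every match of a Lua pattern into a table.
local function scan(s, pattern)
  local items = {}
  for item in s:gmatch(pattern) do items[#items + 1] = item end
  return items
end

-- Items that do not match the lexical rule (here the separators) are dropped.
local words = scan('one, two;three', '%a+')  -- {'one', 'two', 'three'}
-- The empty field between the two commas is silently lost.
local fields = scan('a,,b', '[^,]+')         -- {'a', 'b'}

-- Greedy versus non-greedy repetition, both anchored at the start.
local longest = ('aaa'):match('^a+')   -- 'aaa'
local shortest = ('aaa'):match('^a-')  -- '' (the shortest match is empty)
```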

Once again we fall in too many possible combinations.

Even the default table.join() operation could be criticized for not fitting all problems. It does not care at all about existing keys: it is assumed to handle only sequences of values (not necessarily unique, ordered by a key of arbitrary value), and to preserve only that order and the total number of elements in the resulting sequence. It also assumes that all items from the first sequence come before those of the second sequence, meaning that table.join() assumes it works on a sequence of sequences. table.join() is then not a general joining operation for tables, only for sequences of sequences; for the rest, its behavior is not clearly defined on arbitrary tables.
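
Under that reading, a join over a sequence of sequences might look like the sketch below. The function name join is only illustrative (it is not a standard library function); non-sequence keys are simply ignored, which is exactly the limitation discussed above.

```lua
-- Hypothetical sketch: flatten a sequence of sequences into one sequence,
-- preserving order and total length; non-integer keys are ignored.
local function join(seqs)
  local out = {}
  for _, seq in ipairs(seqs) do
    for _, v in ipairs(seq) do out[#out + 1] = v end
  end
  return out
end

local joined = join({ { 21, 20, ['A'] = 6 }, { 7 } })  -- {21, 20, 7}
```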


On Sat, 1 June 2019 at 06:25, Philippe Verdy <verdy_p@wanadoo.fr> wrote:
Joining is simple to define because there are not a lot of ways to do it... except that you may wonder what the effect is of joining two tables whose order is not defined (order is defined only on sequences), and you'll run into the problem of how to resolve conflicting keys (if you want to keep them in the joined table) or how to compute new keys in the joined table (even when joining sequences, at least one of them must be renumbered if you want to keep all values but at a different set of keys; this implies at least an arithmetic transform).
For splitting, the problem is: what is the splitting criterion? A criterion on keys, or a criterion on values? And there are endless variations of both.
However, a common splitting pattern is the "select" operation, which splits a table into two parts: one for all pairs that match a condition, and another for all pairs that don't.

And even in this case, you may want to keep the original keys, or discard the keys and generate new sequences, so you also need a mapping from the old keys to the new ones.
In a string, the problem is equivalent: just think about how items in a string should be indexed/counted. Do you mean splitting by byte offset, or by substrings that match a criterion, such as a UTF-8 sequence?
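
The two flavours of "select" for tables could be sketched like this. Both function names are hypothetical: select_pairs keeps the original keys, while select_seq walks only the sequence part and renumbers.

```lua
-- Hypothetical sketch: split a table in two, keeping the original keys.
local function select_pairs(t, pred)
  local matched, rest = {}, {}
  for k, v in pairs(t) do
    if pred(k, v) then matched[k] = v else rest[k] = v end
  end
  return matched, rest
end

-- Hypothetical sketch: same idea on the sequence part only, discarding the
-- old keys and generating two new sequences.
local function select_seq(seq, pred)
  local matched, rest = {}, {}
  for i, v in ipairs(seq) do
    local dst = pred(i, v) and matched or rest
    dst[#dst + 1] = v
  end
  return matched, rest
end

local t = { 20, 11, 50, 38, ['A'] = 7 }
local is_even = function (_, v) return v % 2 == 0 end
local even_kv, odd_kv = select_pairs(t, is_even)  -- keys 1, 3, 4 / keys 2, 'A'
local even_seq, odd_seq = select_seq(t, is_even)  -- {20, 50, 38} / {11}
```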

There's no general solution to match all problems. The existing implementations of split() operations on strings are based on an implicit encoding model (the split is defined according to the encoding on which it is based, which defines the pattern, or on the underlying low-level representation of "characters", or "char" in C/C++/Java, which are integers with known bit lengths). In Lua there's no "char"; there's just an operation for indexing strings by byte offsets and requesting the value of individual bytes (which are unsigned 8-bit integers). But splitting a string arbitrarily on byte offsets is often wrong if you have to handle multibyte encodings (Lua is neutral about encodings: it could be UTF-8, some variant of UTF-16, GBK, SJIS, modified UTF-8...). The unsigned 8-bit chars are just implied by the default regexp library and by the default unary "#" and binary "[]" operators (which can be overridden).
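
The gap between byte offsets and encoded characters is easy to see with the utf8 library. This small illustration assumes Lua 5.3 or later, where that library exists; the string is spelled with byte escapes so it does not depend on the source file encoding.

```lua
-- Assumes Lua 5.3+ (utf8 library).
local s = '\195\169t\195\169'        -- 'été' encoded in UTF-8

local byte_len = #s                  -- 5: '#' counts bytes
local char_len = utf8.len(s)         -- 3: utf8.len counts encoded characters

-- Cutting at an arbitrary byte offset yields an invalid UTF-8 fragment:
local first_byte = s:sub(1, 1)       -- '\195', a lead byte alone
local ok = utf8.len(first_byte)      -- a false value: invalid sequence

-- Splitting per encoded character with the library's own pattern:
local chars = {}
for ch in s:gmatch(utf8.charpattern) do chars[#chars + 1] = ch end
```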

You're not even required to "split" a table physically, and could just view tables or strings as general indexed collections of pairs of arbitrary types, for which there exists a way to enumerate their contents, with a pairs() or ipairs() iterator, which can also be overridden.

To make a very general split() you should take into account the pairs() or ipairs() operations defined on them, or their # and [] operators, and you need to define several mappings: one for the selection criteria, one for the generation of keys, and a final one that combines in the new table the possibly multiple pairs from the original table that got mapped to the same new key.
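
To give an idea of how many knobs such a general operation needs, here is a sketch with the three mappings passed explicitly. The function general_split and the option names select, key and combine are hypothetical. Because pairs() enumerates in an unspecified order, the combine step should not depend on order (a sum is used here).

```lua
-- Hypothetical sketch: opts.select filters pairs, opts.key computes the new
-- key, opts.combine merges values that land on the same new key.
local function general_split(t, opts)
  local out = {}
  for k, v in pairs(t) do
    if opts.select(k, v) then
      local nk = opts.key(k, v)
      if out[nk] == nil then
        out[nk] = v
      else
        out[nk] = opts.combine(out[nk], v)
      end
    end
  end
  return out
end

-- Group numeric values by their integer quotient by 100 and sum each group.
local grouped = general_split({ 120, 150, 230, ['A'] = 7 }, {
  select = function (_, v) return v >= 100 end,
  key = function (_, v) return math.floor(v / 100) end,
  combine = function (a, b) return a + b end,
})
-- grouped[1] is 120 + 150, grouped[2] is 230; the pair ('A', 7) is filtered out.
```

Every pair goes through up to three function calls, which hints at why such a fully general operation would be slow compared to a purpose-built split.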

So it's much harder than what you think: a very general operation (considering all the possible options) would then be very slow. It's then simpler to not consider it in the core library and let Lua programs define their own split/join operations for what they expect.


On Sat, 1 June 2019 at 05:25, Hugo <tkokof@163.com> wrote:
Hi, everyone

The Lua standard library does not provide a "split" function, and there are a lot of implementations with different limitations; there is even a wiki page about this topic (http://lua-users.org/wiki/SplitJoin).

I've wondered: since a "join" (table.join) function is there, why isn't there a "split" function?

After some searching, I found a related (and old) answer from Roberto: http://lua-users.org/lists/lua-l/2002-11/msg00157.html.

It seems the reason is just the trivial performance gain?

I'm curious about more details (in my case, which may not be common, "split" is frequently used and should be well defined).

And there is also a common (and big) question here: what is the basis for deciding which functions should be added to the standard library and which should not?

tkokof1