On 2015-12-10 03:56, Jay Carlson wrote:
> On 2015-12-09, at 9:32 PM, Jonathan Goble <jcgoble3@gmail.com> wrote:
>
>> On Wed, Dec 9, 2015 at 9:29 PM, Jay Carlson <nop@nop.com> wrote:
>>> Given a string where is_utf8(s) is false, it might be nice to be able to find the byte offset of the first non-UTF-8 sequence.
>>
>> utf8.len() already does this.
>
> utf8.len doesn't match the standard definition of UTF-8. Consider this example of an invalid sequence, taken from https://tools.ietf.org/html/rfc3629#section-4 :

utf8.len() should stay as it is: a fast version for well-formed strings. There is no need to use heavy validators if we know that a given string is valid (or "lightly" ill-formed). utf8.len() correctly returns an error if it does not know what to do with the supplied data. I think the returned error state is intended to say "hey, I don't know what to do with your data", rather than to check the validity of UTF-8 strings.

The scenario should be:

1) Make sure the string is valid.
2) Do something fast to the valid string.
3) Do something else fast to the valid string.
...

If there are no intervening steps where the string could be invalidated, it is a waste of time to validate the string each time.
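
A minimal sketch of that scenario, assuming Lua 5.3 or later with the standard utf8 library. The helper name check_once is hypothetical. Note that the check here is only utf8.len()'s own notion of "decodable", which, as Jay points out, is not the same thing as strict RFC 3629 validity; a strict validator would take its place in step 1.

```lua
-- Step 1: check the string once, at the boundary where it enters the program.
-- check_once is a hypothetical helper name, not part of the utf8 library.
local function check_once(s)
  -- utf8.len returns the number of characters, or a false value plus the
  -- byte position of the first sequence it cannot decode.
  local n, badpos = utf8.len(s)
  if not n then
    return nil, badpos
  end
  return n
end

local s = "z\xC5\x82o\xC5\x9B\xC4\x87"   -- five characters, eight bytes
assert(check_once(s))

-- Steps 2, 3, ...: fast operations that rely on the earlier check and do
-- not validate again.
local count = 0
for _ in utf8.codes(s) do
  count = count + 1
end

-- Fetch the second character without re-checking the whole string.
local second = utf8.char(utf8.codepoint(s, utf8.offset(s, 2)))
```

As long as nothing between these steps can modify s, there is nothing to gain from repeating step 1.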

For Jay's idea: (1) let utf8.validate() return:

str, false  --> if there was no ill-formed sequence (false is used because the number 0 is not false in Lua)

str, number --> if there was an ill-formed sequence; number is the position of the first invalid byte (in the source string)

and/or (2) a third parameter, stoponerror (or flags packed into one integer parameter?): if somebody wants to write their own make_safe, then it is a good idea to get the first well-formed part of the string instead of a whole validated string.
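
To make the proposal concrete, here is a rough pure-Lua sketch of how such a function could behave. Everything in it is an assumption for illustration: validate does not exist in Lua's utf8 library, the flag is passed as a plain boolean second argument for simplicity, and replacing each offending byte with U+FFFD is only one possible reading of a "validated string". The byte ranges follow the well-formed sequences allowed by RFC 3629 (no overlong forms, no surrogates, nothing above U+10FFFF).

```lua
-- Length in bytes of the well-formed sequence starting at byte i, or nil.
local function seqlen(s, i)
  local b1 = s:byte(i)
  if b1 < 0x80 then return 1 end
  local n, lo, hi   -- total length and allowed range of the second byte
  if b1 >= 0xC2 and b1 <= 0xDF then n, lo, hi = 2, 0x80, 0xBF
  elseif b1 == 0xE0 then n, lo, hi = 3, 0xA0, 0xBF   -- no overlong forms
  elseif b1 == 0xED then n, lo, hi = 3, 0x80, 0x9F   -- no surrogates
  elseif b1 >= 0xE1 and b1 <= 0xEF then n, lo, hi = 3, 0x80, 0xBF
  elseif b1 == 0xF0 then n, lo, hi = 4, 0x90, 0xBF   -- no overlong forms
  elseif b1 >= 0xF1 and b1 <= 0xF3 then n, lo, hi = 4, 0x80, 0xBF
  elseif b1 == 0xF4 then n, lo, hi = 4, 0x80, 0x8F   -- max U+10FFFF
  else return nil end
  local b2 = s:byte(i + 1)
  if not b2 or b2 < lo or b2 > hi then return nil end
  for k = i + 2, i + n - 1 do
    local b = s:byte(k)
    if not b or b < 0x80 or b > 0xBF then return nil end
  end
  return n
end

-- Hypothetical validate(): returns (str, false) for a well-formed string,
-- otherwise (str, position of the first invalid byte).
-- With stoponerror, str is the well-formed prefix; without it, str is the
-- input with every offending byte replaced by U+FFFD (an assumed policy).
local function validate(s, stoponerror)
  local out, firstbad = {}, false
  local i, start = 1, 1   -- start marks the beginning of the current good run
  while i <= #s do
    local n = seqlen(s, i)
    if n then
      i = i + n
    else
      firstbad = firstbad or i
      out[#out + 1] = s:sub(start, i - 1)
      if stoponerror then return table.concat(out), firstbad end
      out[#out + 1] = "\xEF\xBF\xBD"   -- U+FFFD REPLACEMENT CHARACTER
      i = i + 1
      start = i
    end
  end
  if not firstbad then return s, false end
  out[#out + 1] = s:sub(start)
  return table.concat(out), firstbad
end
```

Under these assumptions, a caller writing their own make_safe could call validate(s, true), keep the returned prefix, and decide for themselves what to do with the bytes from the reported position onward.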

???

-- best regards

Cezary H. Noweta