lua-users home
lua-l archive



On 2015-12-10 04:49, Cezary H. Noweta wrote:

On Wed, Dec 9, 2015 at 9:29 PM, Jay Carlson <nop@nop.com> wrote:
Given a string where is_utf8(s) is false, it might be nice to be
able to find the byte offset of the first non-UTF-8 sequence.
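
For reference, the stock Lua 5.3 utf8.len already reports such an offset: when it meets an invalid sequence it returns a false value plus the position of the first invalid byte. A minimal sketch, assuming Lua 5.3 or later (first_invalid_offset is a hypothetical helper name, not part of any library):

```lua
-- Returns the 1-based byte offset of the first invalid sequence in s,
-- or nil if utf8.len accepts the whole string.
local function first_invalid_offset(s)
  local len, pos = utf8.len(s)
  if len then return nil end   -- whole string decoded without error
  return pos                   -- position reported by utf8.len
end

print(first_invalid_offset('abcdef'))          --> nil
print(first_invalid_offset('abc\xC0\x80def'))  --> 4
```

This only locates the problem; it does not produce a cleaned-up string.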

For Jay's idea: (1) let utf8.validate() return:

str false --> if there was no ill-formed sequence
              (the number 0 is not false in Lua)

str number --> position of the first invalid byte (in the source string),
               if there was an ill-formed sequence

and/or (2) a third parameter, stoponerror (or flags packed into one
integer parameter?): if somebody wants to write their own make_safe,
then it is a good idea to have the first well-formed part of the string
instead of the whole validated string.

OK - now the function returns:

str false --> if everything is OK

str number --> if there was an error;
               number is the position in the source string
               of the first invalid character

There is a third optional parameter, stoponerror, which (when true)
makes the function stop as soon as an invalid character is encountered.
(The returned string then contains the valid characters up to that
point, including re-encoded NULs and surrogates, if any.)

utf8.validate(s [, allowlongnul [, allowsurrogates [, stoponerror]]])

utf8.validate('abc\xC0\x80def'); --> 'abcdef' 4

utf8.validate('abc\xC0\x80def', true); --> 'abc\x00def' false

utf8.validate('abc\xC0\x80def', false, false, true); --> 'abc' 4

utf8.validate('\xED\xA0\x80\xED\xB0\x80', false, true);
  --> '\xF0\x90\x80\x80' false
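
The behaviour shown in these four examples can be approximated in pure Lua. What follows is only a rough sketch, assuming Lua 5.3 or later (for utf8.char), and is not the C code linked below. The names decode and validate are hypothetical locals. Two details are my assumptions rather than anything stated above: a lone or disallowed surrogate is dropped as invalid, and an undecodable byte is skipped one byte at a time.

```lua
-- Decode one sequence at byte i. Returns code point and length, or nil.
-- Overlong forms are rejected, except C0 80, reported as (0, 2, true).
local function decode(s, i)
  local b = s:byte(i)
  if not b then return nil end
  if b < 0x80 then return b, 1 end
  local n, cp, min
  if b >= 0xC0 and b < 0xE0 then n, cp, min = 2, b - 0xC0, 0x80
  elseif b >= 0xE0 and b < 0xF0 then n, cp, min = 3, b - 0xE0, 0x800
  elseif b >= 0xF0 and b < 0xF8 then n, cp, min = 4, b - 0xF0, 0x10000
  else return nil end            -- stray continuation byte or 0xF8..0xFF
  for k = 1, n - 1 do
    local c = s:byte(i + k)
    if not c or c < 0x80 or c > 0xBF then return nil end
    cp = cp * 64 + (c - 0x80)
  end
  if cp > 0x10FFFF then return nil end
  if cp < min then
    if b == 0xC0 and cp == 0 then return 0, 2, true end  -- overlong NUL
    return nil
  end
  return cp, n
end

local function validate(s, allowlongnul, allowsurrogates, stoponerror)
  local out, firstbad, i = {}, false, 1
  while i <= #s do
    local cp, n, longnul = decode(s, i)
    local piece                  -- stays nil if the sequence is rejected
    if not cp then
      n = 1                      -- assumption: skip one offending byte
    elseif longnul then
      if allowlongnul then piece = "\0" end
    elseif cp >= 0xD800 and cp <= 0xDFFF then
      local lo, n2 = decode(s, i + n)
      if allowsurrogates and cp <= 0xDBFF
         and lo and lo >= 0xDC00 and lo <= 0xDFFF then
        -- combine the high/low pair into one supplementary code point
        piece = utf8.char(0x10000 + (cp - 0xD800) * 0x400 + (lo - 0xDC00))
        n = n + n2
      end
    else
      piece = s:sub(i, i + n - 1)  -- well-formed: copy the bytes unchanged
    end
    if piece then
      out[#out + 1] = piece
    else
      firstbad = firstbad or i     -- remember only the first bad position
      if stoponerror then break end
    end
    i = i + n
  end
  return table.concat(out), firstbad
end

print(validate('abc\xC0\x80def'))  --> abcdef  4
```

Returning false rather than 0 for "no error" lets a caller write `if pos then ... end`, since 0 would count as true in Lua.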

http://lua.chncc.eu/utf8/201512100653/lutf8lib.c

MD5 33e229ccb8199ece764bf6eef3f8c00a

-- best regards

Cezary H. Noweta