Re: UTF-8 validation

lua-l archive


Subject: Re: UTF-8 validation
From: "Cezary H. Noweta" <chn@...>
Date: Thu, 10 Dec 2015 04:49:29 +0100

On 2015-12-10 03:56, Jay Carlson wrote:
> On 2015-12-09, at 9:32 PM, Jonathan Goble <jcgoble3@gmail.com> wrote:
>> On Wed, Dec 9, 2015 at 9:29 PM, Jay Carlson <nop@nop.com> wrote:
>>> Given a string where is_utf8(s) is false, it might be nice to be able to find the byte offset of the first non-UTF-8 sequence.
>>
>> utf8.len() already does this.
>
> utf8.len doesn't match the standard definition of UTF-8. Consider this example of an invalid sequence, taken from https://tools.ietf.org/html/rfc3629#section-4 :

utf8.len() should stay as it is: a fast version for well-formed strings. There is no need to use heavy validators if we know that a given string is valid (or "lightly" ill-formed). utf8.len() correctly returns an error if it does not know what to do with the supplied data. I think the returned error state is intended to say "hey, I don't know what to do with your data" rather than to check the validity of UTF-8 strings.
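To illustrate the error state discussed above, here is a minimal sketch of how utf8.len() reports a byte it cannot decode. It assumes Lua 5.3 or later; the sample strings are made up for illustration.

```lua
-- On a well-formed string, utf8.len returns the number of characters.
local n = utf8.len("h\xC3\xA9llo")      -- "héllo": 6 bytes, 5 characters

-- On a byte it cannot decode, it returns a false value plus the
-- position (in bytes) of the first offending byte.
local len, badpos = utf8.len("abc\xFFdef")

print(n, len, badpos)
```

Which sequences count as undecodable depends on the Lua version, so this shows only the shape of the result, not a statement about strict conformance.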


The scenario should be:

1) Make sure the string is valid.
2) Do something fast to a valid string.
3) Do something else fast to a valid string.
...

If there are no intervening places where the string could be invalidated, it is a waste of time to validate the string each time.
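The scenario above could be sketched as follows. The names accept, count_chars and first_char are hypothetical helpers, not part of any library, and utf8.len() stands in for whatever validity check is chosen in step 1 (Lua 5.3 or later assumed).

```lua
-- Step 1: check validity once, at the point where the data enters the program.
local function accept(s)
  local len, badpos = utf8.len(s)
  if not len then
    return nil, "ill-formed UTF-8 at byte " .. badpos
  end
  return s
end

-- Steps 2 and 3: fast operations that trust their input and do not re-check it.
-- utf8.charpattern matches one UTF-8 byte sequence, assuming a valid subject.
local function count_chars(s)
  local n = 0
  for _ in s:gmatch(utf8.charpattern) do n = n + 1 end
  return n
end

local function first_char(s)
  return s:match(utf8.charpattern)
end

local s = assert(accept("h\xC3\xA9llo"))
print(count_chars(s), first_char(s))
```

As long as nothing modifies s between the calls, only accept() pays the cost of checking.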


For Jay's idea: (1) let utf8.validate() return:

str, false  --> if there was no ill-formed sequence (note that the number 0 is not false in Lua)

str, number --> if there was an ill-formed sequence; number is the position of the first invalid byte (in the source string)

and/or (2) a third parameter (flags in one integer parameter?), stoponerror: if somebody wants to write his own make_safe, then it is a good idea to have the first well-formed part of the string instead of a whole validated string.
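A rough sketch of how such an interface might look in pure Lua follows. Everything in it is an assumption made for illustration: the names validate and seqlen, passing stoponerror as a plain boolean second argument instead of a flags integer, and replacing each invalid byte with U+FFFD when stoponerror is not set. The byte ranges follow the UTF8-char grammar in RFC 3629, section 4.

```lua
-- Hypothetical sketch; not part of Lua's utf8 library.
-- Returns the byte length of the well-formed sequence starting at s[i], or nil.
local function seqlen(s, i)
  local b1 = s:byte(i)
  if b1 < 0x80 then return 1 end
  local n, lo, hi          -- n continuation bytes; lo..hi bounds the second byte
  if b1 >= 0xC2 and b1 <= 0xDF then n, lo, hi = 1, 0x80, 0xBF
  elseif b1 == 0xE0 then n, lo, hi = 2, 0xA0, 0xBF   -- rejects overlong forms
  elseif b1 == 0xED then n, lo, hi = 2, 0x80, 0x9F   -- rejects surrogates
  elseif b1 >= 0xE1 and b1 <= 0xEF then n, lo, hi = 2, 0x80, 0xBF
  elseif b1 == 0xF0 then n, lo, hi = 3, 0x90, 0xBF   -- rejects overlong forms
  elseif b1 >= 0xF1 and b1 <= 0xF3 then n, lo, hi = 3, 0x80, 0xBF
  elseif b1 == 0xF4 then n, lo, hi = 3, 0x80, 0x8F   -- nothing above U+10FFFF
  else return nil end      -- 0x80..0xC1 and 0xF5..0xFF never start a sequence
  for k = 1, n do
    local b = s:byte(i + k)
    if not b or b < lo or b > hi then return nil end
    lo, hi = 0x80, 0xBF    -- later continuation bytes use the full range
  end
  return n + 1
end

-- Returns (str, false) if s is well-formed, else (str, pos), where pos is the
-- position of the first invalid byte in s. With stoponerror, str is the
-- well-formed prefix; otherwise invalid bytes are replaced with U+FFFD.
local function validate(s, stoponerror)
  local out, first, i = {}, false, 1
  while i <= #s do
    local len = seqlen(s, i)
    if len then
      out[#out + 1] = s:sub(i, i + len - 1)
      i = i + len
    else
      first = first or i
      if stoponerror then break end
      out[#out + 1] = "\xEF\xBF\xBD"   -- U+FFFD REPLACEMENT CHARACTER
      i = i + 1
    end
  end
  return table.concat(out), first
end

print(validate("ab\xFFcd", true))
```

Returning false rather than 0 for the no-error case lets a caller write `if pos then ... end`, since 0 would count as true in Lua.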


???

-- best regards

Cezary H. Noweta
