Re: UTF-8 validation

lua-l archive

Subject: Re: UTF-8 validation
From: Jay Carlson <nop@...>
Date: Wed, 9 Dec 2015 21:29:13 -0500

On 2015-12-09, at 8:50 PM, Coda Highland <chighland@gmail.com> wrote:
> 
> On Wed, Dec 9, 2015 at 5:39 PM, Javier Guerra Giraldez
> <javier@guerrag.com> wrote:
>> On Wed, Dec 9, 2015 at 7:19 PM, Cezary H. Noweta <chn@poczta.onet.pl> wrote:
>>> The simple process [ill-formed] => [well-formed] can be named validation
>> 
>> Unfortunately, "validation" is taken in many circles as "verifying the
>> validity of input", without changing said input in any form.  And to
>> make it worse, Unicode also defines "normalization" (which would be
>> the right term in most other contexts) as something else.
>> 
>> I propose coining the ugly term "validification", meaning "making the
>> (possibly invalid) input valid"
> 
> utf8.make_safe?

I think there are too many different definitions of safety for that name to be useful. Is there consensus even for "is_utf8(s)"? Roberto didn't seem convinced last time. [1] 

Given a string where is_utf8(s) is false, it might be nice to be able to find the byte offset of the first non-UTF-8 sequence. Then people could write their own make_safe functions based on how they want to respond to syntactically invalid sequences.
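
As a rough illustration of that idea (in Python rather than Lua, and not a proposal for the actual utf8 library API), the sketch below reports the byte offset of the first ill-formed sequence and builds one possible make_safe on top of it. The names first_invalid_utf8 and the replacement parameter are hypothetical; the replace-one-byte-and-rescan policy is just one assumption among many reasonable ones.

```python
def first_invalid_utf8(data: bytes) -> int:
    """Return the byte offset of the first ill-formed UTF-8 sequence, or -1."""
    try:
        data.decode("utf-8")  # strict decoding; raises at the first bad sequence
    except UnicodeDecodeError as err:
        return err.start      # offset where the offending sequence begins
    return -1


def is_utf8(data: bytes) -> bool:
    """True when the whole byte string is well-formed UTF-8."""
    return first_invalid_utf8(data) < 0


def make_safe(data: bytes, replacement: bytes = b"?") -> bytes:
    """One possible policy (an assumption): substitute each offending byte."""
    out = bytearray()
    while True:
        pos = first_invalid_utf8(data)
        if pos < 0:
            out += data           # the remainder is well-formed; keep it as-is
            return bytes(out)
        out += data[:pos]         # keep the valid prefix
        out += replacement        # stand-in for the bad byte
        data = data[pos + 1:]     # skip one byte, then rescan the rest
```

A caller who would rather drop bad bytes can pass replacement=b"", and a caller who would rather reject the input outright only needs is_utf8 or the offset itself.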

Jay

[1]: http://lua-users.org/lists/lua-l/2014-05/msg00303.html
