- Subject: Re: UTF-8 validation
- From: Jay Carlson <nop@...>
- Date: Wed, 9 Dec 2015 21:29:13 -0500
On 2015-12-09, at 8:50 PM, Coda Highland <chighland@gmail.com> wrote:
>
> On Wed, Dec 9, 2015 at 5:39 PM, Javier Guerra Giraldez
> <javier@guerrag.com> wrote:
>> On Wed, Dec 9, 2015 at 7:19 PM, Cezary H. Noweta <chn@poczta.onet.pl> wrote:
>>> The simple process [ill-formed] => [well-formed] can be named validation
>>
>> Unfortunately, "validation" is taken in many circles as "verifying the
>> validity of input", without changing said input in any form. And to
>> make it worse, Unicode also defines "normalization" (which would be
>> the right term in most other contexts) as something else.
>>
>> I propose coining the ugly term "validification", meaning "making the
>> (possibly invalid) input valid"
>
> utf8.make_safe?
I think there are too many different definitions of safety for that name to be useful. Is there consensus even for "is_utf8(s)"? Roberto didn't seem convinced last time. [1]
Given a string where is_utf8(s) is false, it might be nice to be able to find the byte offset of the first non-UTF-8 sequence. Then people could write their own make_safe functions based on how they want to respond to syntactically invalid sequences.
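As a rough illustration of that idea (written in Python rather than Lua, purely as a sketch), the offset lookup and one possible policy built on top of it might look like this. The names first_invalid_utf8_offset and make_safe are hypothetical, not an existing or proposed Lua API, and replacing each offending byte with "?" is only one of many possible responses to invalid input.

```python
from typing import Optional


def first_invalid_utf8_offset(data: bytes) -> Optional[int]:
    """Return the 0-based byte offset of the first invalid UTF-8 sequence.

    Returns None when the whole input is well-formed. Hypothetical helper:
    it leans on Python's strict decoder, which also rejects overlong forms
    and encoded surrogates.
    """
    try:
        data.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError as err:
        return err.start  # offset where the decoder gave up
    return None


def make_safe(data: bytes, replacement: bytes = b"?") -> bytes:
    """One possible policy: swap each offending byte for `replacement`.

    Assumption for illustration only; another caller might truncate,
    raise an error, or drop the bytes instead.
    """
    out = bytearray()
    while True:
        bad = first_invalid_utf8_offset(data)
        if bad is None:
            out += data  # the remainder is well-formed
            return bytes(out)
        out += data[:bad]  # keep the valid prefix
        out += replacement  # respond to the invalid byte
        data = data[bad + 1:]  # resume scanning after it
```

The point of the split is that only the first function needs to know what counts as UTF-8; everything about how to react lives in the caller, so different programs can choose different behaviour without agreeing on a single definition of "safe".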
Jay
[1]: http://lua-users.org/lists/lua-l/2014-05/msg00303.html