- Subject: Re: UTF-8 validation
- From: Hisham <h@...>
- Date: Thu, 10 Dec 2015 17:12:41 -0200
On 10 December 2015 at 04:22, Cezary H. Noweta <chn@poczta.onet.pl> wrote:
>
> On 2015-12-10 04:49, Cezary H. Noweta wrote:
>
>>>> On Wed, Dec 9, 2015 at 9:29 PM, Jay Carlson <nop@nop.com> wrote:
>>>>>
>>>>> Given a string where is_utf8(s) is false, it might be nice to be
>>>>> able to find the byte offset of the first non-UTF-8 sequence.
>
>
>> For Jay's idea: (1) let utf8.validate() return:
>>
>> str false --> if there was not ill-formed (number 0 is not false)
>>
>> str number --> number of first invalid byte (in src str)
>> --> if there was ill-formed
>>
>> and/or (2) third parameter (flags in one integer parameter?) stoponerror
>> - if somebody want to write his own make_safe, then it is good idea to
>> have the first well-formed part of string instead of a whole
>> validificated string.
>
>
> OK - now the function returns:
>
> str false --> if every thing is ok
>
> str number --> if there was an error;
> number is position in the source string
> of invalid character
Minor detail, but I've never seen a Lua API use `false` to indicate
"no error". (In C it's common for 0 to mean OK, but `false` as OK
strikes me as strange in Lua.) I think nil would be more idiomatic
there, indicating the absence of an error position. And you could
still do:
    local s, errpos = utf8.validate("bla")
    if errpos then
       -- got an error position
    end
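
To make the nil convention concrete, here is a minimal sketch of such a wrapper. It is not the code from lutf8lib.c: it assumes Lua 5.3 or later and leans on the stock utf8.len(), which returns nil plus the byte position of the first invalid byte when it fails. The names `utf8check` and `validate` are placeholders, and utf8.len()'s idea of "well-formed" may be looser than what the proposed function checks.

```lua
-- Hypothetical standalone table; the name "utf8check" is only illustrative.
local utf8check = {}

-- Returns s, nil when s decodes cleanly with the stock utf8.len(),
-- or s, errpos, where errpos is the byte position of the first invalid byte.
function utf8check.validate(s)
  local len, errpos = utf8.len(s)
  if len then
    return s, nil      -- no error: the position is simply absent
  end
  return s, errpos     -- error: position in the source string
end

local s, errpos = utf8check.validate("ab\xFFcd")
if errpos then
  print("first invalid byte at offset " .. errpos)
end
```

Because nil is falsy, the `if errpos then` test reads the same as with `false`, but the second result now means "no position" rather than "no".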
> http://lua.chncc.eu/utf8/201512100653/lutf8lib.c
It would be nice if this function were implemented as part of a
standalone module that could be deployed separately, instead of
being injected into the standard `utf8` table. (That would also make
it nicer to distribute with LuaRocks.) utf8check, maybe?
Also, "sanitize" sounds like a good name for the function itself.
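
A rough sketch of what I mean by a standalone module, in pure Lua rather than C. Everything here is an assumption for illustration: the module name `utf8check`, the `sanitize` signature, and the choice of replacing each offending byte with U+FFFD. It needs Lua 5.3 or later and uses the stock utf8.len() to find each invalid byte, so it is not the algorithm from lutf8lib.c.

```lua
-- utf8check.lua (hypothetical file name); nothing is added to the
-- standard `utf8` table.
local utf8check = {}

-- Returns a copy of s in which every byte that utf8.len() rejects is
-- replaced by repl (default: U+FFFD REPLACEMENT CHARACTER).
function utf8check.sanitize(s, repl)
  repl = repl or "\u{FFFD}"
  local out, pos = {}, 1
  while pos <= #s do
    local len, errpos = utf8.len(s, pos)
    if len then
      out[#out + 1] = s:sub(pos)            -- the rest is well-formed
      break
    end
    out[#out + 1] = s:sub(pos, errpos - 1)  -- well-formed part before the error
    out[#out + 1] = repl
    pos = errpos + 1                        -- skip the offending byte and rescan
  end
  return table.concat(out)
end

return utf8check
```

A caller would then write `local utf8check = require("utf8check")` and call `utf8check.sanitize(str)`, which keeps the module easy to package as its own rock.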
-- Hisham