Re: UTF-8 validation

lua-l archive

Subject: Re: UTF-8 validation
From: Coda Highland <chighland@...>
Date: Wed, 9 Dec 2015 14:58:47 -0800

On Wed, Dec 9, 2015 at 2:55 PM, Cezary H. Noweta <chn@poczta.onet.pl> wrote:
> Hello,
>
> In Lua's core I have not found a way to validate UTF-8 strings coming
> from unknown sources. The Unicode Standard and UTR #36
> (http://www.unicode.org/reports/tr36/#UTF-8_Exploit) require rejecting
> non-shortest forms, but the built-in implementation does not detect them.
>
> I have implemented a function utf8.validate(s [, allowlongnul [,
> allowsurrogates]]), which takes a string, silently discards invalid
> bytes, and returns a valid UTF-8 string together with a boolean
> value that tells whether the source string contained valid characters only.
> The optional parameter "allowlongnul" supports Java's embedded NULs
> ('\xC0\x80'), and "allowsurrogates" is for 16-bit Windows remnants, which
> until Win98 (or even WinME, AFAIR) did not support Unicode characters
> beyond the BMP. In both cases, the problematic sequences are converted to
> valid UTF-8 sequences, for example:
>
> utf8.validate('\xC0\x80abc'); -- => 'abc' false
>
> utf8.validate('\xC0\x80abc', true); -- => '\x00abc' true
>
> If you find the above useful, take the attached "lutf8lib.c". The file is
> originally taken from Lua 5.3.2, and everything that was added is between
> "/* CHN BEGIN */" and "/* CHN END */".
>
> -- best regards
>
> Cezary H. Noweta

utf8.len() returns a false value (nil) and the position of the first
invalid byte when given an invalid UTF-8 string.
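
As a minimal sketch of that check (assuming Lua 5.3 or later, where the
utf8 library is available; check_utf8 is a hypothetical helper name, not
part of the standard library):

```lua
-- Hypothetical helper: reports whether s is accepted by utf8.len().
-- On success returns true and the number of characters;
-- on failure returns false and the position of the first invalid byte.
local function check_utf8(s)
  local n, badpos = utf8.len(s)
  if n then
    return true, n
  end
  return false, badpos
end

print(check_utf8("abc"))        --> true   3
print(check_utf8("ab\xFFcd"))   --> false  3
```

This only detects the first problem and does not repair the string the
way the proposed utf8.validate() does. Exactly which sequences utf8.len()
rejects depends on the Lua version.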

You're right that it doesn't handle normalization or flags.

/s/ Adam
