- Subject: Re: UTF-8 validation
- From: Coda Highland <chighland@...>
- Date: Wed, 9 Dec 2015 14:58:47 -0800
On Wed, Dec 9, 2015 at 2:55 PM, Cezary H. Noweta <firstname.lastname@example.org> wrote:
> In Lua's core I have not found a way to validate UTF-8 strings coming
> from unknown sources, as recommended by the Unicode Standard and UTR #36
> (http://www.unicode.org/reports/tr36/#UTF-8_Exploit). The built-in
> implementation does not detect non-shortest forms.
> I have implemented a function utf8.validate(s [, allowlongnul [,
> allowsurrogates]]), which takes a string, silently gets rid of invalid
> trash, and returns a perfectly valid UTF-8 string together with a boolean
> value which indicates whether the source string contained only valid
> characters. The optional parameter "allowlongnul" is for supporting Java's
> embedded NULs ('\xC0\x80'), and "allowsurrogates" is for 16-bit Windows
> remnants which until Win98 (or even WinME, AFAIR) did not support Unicode
> characters beyond the BMP. In both cases, the problematic sequences are
> converted to valid UTF-8 sequences, for example:
> utf8.validate('\xC0\x80abc'); -- => 'abc' false
> utf8.validate('\xC0\x80abc', true); -- => '\x00abc' true
> If you find the above useful, take the attached "lutf8lib.c". The file is
> originally taken from Lua 5.3.2, and everything that was added is between
> "/* CHN BEGIN */" and "/* CHN END */".
> -- best regards
> Cezary H. Noweta
utf8.len() will return a false value (nil) and the position of the first
invalid byte for an invalid UTF-8 string.
You're right that it doesn't handle normalization or flags.
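
A rough sketch of using utf8.len() as a validity check, assuming Lua 5.3 or
later. The helper name is_valid_utf8 is made up for illustration and is not
part of Lua or of the attached patch; it only reports the problem and does
not repair the string the way utf8.validate() is described to.

```lua
-- Hypothetical helper (not part of Lua): wraps the stock utf8.len().
-- Returns true plus the character count for a valid string,
-- or false plus the byte position of the first invalid byte.
local function is_valid_utf8(s)
  local len, badpos = utf8.len(s)
  if len then
    return true, len
  end
  return false, badpos
end

print(is_valid_utf8("abc"))      --> true   3
print(is_valid_utf8("ab\xFFc"))  --> false  3  (0xFF can never start a sequence)
```

Exactly which sequences are rejected (surrogates, for instance) depends on
the Lua version, so this is not a substitute for the stricter checks
discussed above.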