- Subject: Re: UTF-8 validation
- From: Coda Highland <chighland@...>
- Date: Wed, 9 Dec 2015 15:55:20 -0800
On Wed, Dec 9, 2015 at 3:32 PM, Cezary H. Noweta <chn@poczta.onet.pl> wrote:
> On 2015-12-09 23:58, Coda Highland wrote:
>
>> utf8.len() will return a false value (nil) and the position of the
>> first invalid byte for an invalid UTF-8 string.
>
>
> Indeed, however my function's purpose is not to test whether a string is
> valid but to implement the following flow:
>
> [unknown string] => [black box] => [valid string].
>
> in one simple step. This comes from a Unicode recommendation. After that,
> I know that there are no 4/6-byte backslashes or quotes for SQL injection
> and other fancy pitfalls.
>
> Today, non-shortest forms are very dangerous - Lua's utf8_decode is
> susceptible to this (there is no need to correct this as long as a string is
> valid). Conciseness of UTF-8 allows treating strings as plain ASCII ones -
> this is frequently done and can be very dangerous.
>
> The first thing to do with an unknown string (just after its length is
> determined) is to validate it. After you have run a string through my
> utf8.validate, you can apply less secure but very efficient functions (like
> utf8_decode above, for example).
>
>
> -- best regards
>
> Cezary H. Noweta
>
Then I submit that your function would be better named "normalize"
than "validate" since its intent in the workflow is to provide you
with a safe, canonical form of the string, and it just also lets you
know if it encountered problems along the way.
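
To make the distinction concrete, here is a rough sketch of what I mean by
a "normalize" step. It is not your utf8.validate and not part of Lua's utf8
library: the names normalize and sequence_length are hypothetical, and
replacing each offending byte with U+FFFD is just one possible policy.
It assumes Lua 5.2 or later for the "\x" string escapes.

```lua
-- Hypothetical sketch: turn an unknown byte string into well-formed UTF-8.
local REPLACEMENT = "\xEF\xBF\xBD" -- U+FFFD encoded as UTF-8

-- Returns the byte length (1-4) of the well-formed sequence starting at
-- position i, or nil if the bytes there are not well-formed UTF-8.
local function sequence_length(s, i)
  local b1 = s:byte(i)
  if b1 < 0x80 then return 1 end
  local n, min, cp
  if b1 >= 0xC2 and b1 <= 0xDF then
    n, min, cp = 2, 0x80, b1 - 0xC0
  elseif b1 >= 0xE0 and b1 <= 0xEF then
    n, min, cp = 3, 0x800, b1 - 0xE0
  elseif b1 >= 0xF0 and b1 <= 0xF4 then
    n, min, cp = 4, 0x10000, b1 - 0xF0
  else
    return nil -- lone continuation byte, 0xC0/0xC1, or 0xF5 and above
  end
  for k = 1, n - 1 do
    local b = s:byte(i + k)
    if not b or b < 0x80 or b > 0xBF then return nil end
    cp = cp * 64 + (b - 0x80)
  end
  -- Reject non-shortest forms, surrogates, and values above U+10FFFF.
  if cp < min or cp > 0x10FFFF or (cp >= 0xD800 and cp <= 0xDFFF) then
    return nil
  end
  return n
end

-- Returns the cleaned string, plus true if the input was already
-- well-formed or false if any byte had to be replaced.
local function normalize(s)
  local out, ok, i = {}, true, 1
  while i <= #s do
    local n = sequence_length(s, i)
    if n then
      out[#out + 1] = s:sub(i, i + n - 1)
      i = i + n
    else
      out[#out + 1] = REPLACEMENT
      ok = false
      i = i + 1
    end
  end
  return table.concat(out), ok
end
```

Called as `local clean, ok = normalize(input)`, the first result is always
safe to hand to the faster functions, and the second tells you whether
anything had to be repaired along the way.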
/s/ Adam