Tim Hill wrote:
> The real problem is badly-formed UTF-8 .. and there is too much of it to
> just bail with errors. Some common oddities I have encountered:
> 
> -- UTF-16 surrogate pairs encoded as UTF-8 (rather than the underlying
> code point)
> -- UTF-16 BOM encoded in UTF-8 (which of course has no use for a BOM)
> -- Non-canonical UTF-8 encodings (for example, encoding in 5 bytes
> instead of 4)
> 
> To be honest, I'm not sure how I would approach an "IsValidUTF8()"
> function .. I always tend to fall back on the original TCP/IP
> philosophy: be rigorous in what you generate, and forgiving in what you
> accept.

The BOM in UTF-8 is mainly annoying for plain-ASCII applications where
UTF-8 should be transparent in strings.  But as far as I remember it is
not invalid UTF-8 (though its only use is to signal that the text is
indeed UTF-8).  A Unicode-aware application can just ignore it.

The last point, the non-canonical (overlong) UTF-8 encodings, is
actually a serious security risk that has already opened holes in the
field.

UTF-8 is quite often used as just an extension of ASCII (which is what
it was meant to be), and so some filters merely checked that URLs don't
contain "../" to access upper directories.  They did not check all the
non-canonical ways of encoding dots and slashes, so those paths went
through the filter.  At some point (I guess just before the OS API) the
UTF-8 was "forgivingly" converted to UTF-16, and suddenly the dangerous
paths took effect.

That is the reason why the standards say that a conversion from UTF-8
to code points must not tolerate non-canonical encodings, but must
either reject the string completely or substitute a code point that
signals an encoding error (if I remember correctly, the replacement
character U+FFFD).
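Regarding the IsValidUTF8() question above, a rough sketch of a strict
validator along these lines (plain Lua, no libraries assumed; it
rejects overlong forms, UTF-16 surrogates, and anything above
U+10FFFF):

  local function is_valid_utf8(s)
    local i, n = 1, #s
    while i <= n do
      local b = s:byte(i)
      local len, cp, min
      if b < 0x80 then
        i = i + 1                       -- plain ASCII byte
      else
        if b >= 0xC2 and b <= 0xDF then len, cp, min = 2, b - 0xC0, 0x80
        elseif b >= 0xE0 and b <= 0xEF then len, cp, min = 3, b - 0xE0, 0x800
        elseif b >= 0xF0 and b <= 0xF4 then len, cp, min = 4, b - 0xF0, 0x10000
        else return false end           -- 0x80..0xC1, 0xF5..0xFF never start
        if i + len - 1 > n then return false end
        for j = i + 1, i + len - 1 do
          local c = s:byte(j)
          if c < 0x80 or c > 0xBF then return false end
          cp = cp * 64 + (c - 0x80)     -- accumulate 6 payload bits
        end
        if cp < min then return false end                -- overlong
        if cp >= 0xD800 and cp <= 0xDFFF then return false end -- surrogate
        if cp > 0x10FFFF then return false end
        i = i + len
      end
    end
    return true
  end

  print(is_valid_utf8("hello"))        --> true
  print(is_valid_utf8("\192\175"))     --> false (overlong '/')
  print(is_valid_utf8("\237\160\128")) --> false (U+D800 surrogate)

Rejecting the whole string like this is the safe default; the
alternative the standards allow is to keep decoding and emit U+FFFD for
each bad sequence.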

Best regards,

David Kolf