The real problem is badly formed UTF-8, and there is too much of it to just bail with errors. Some common oddities I have encountered:

-- UTF-16 surrogate pairs encoded as UTF-8 (rather than the underlying code point)
-- UTF-16 BOM encoded in UTF-8 (which of course has no use for a BOM)
-- Non-canonical (overlong) UTF-8 encodings (for example, encoding in 5 bytes instead of 4)
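
To make the oddities listed above concrete, here is a rough sketch of a permissive scanner that reports them instead of raising. It is written in Python purely for illustration (it is not Lua and not any existing library's API), and the names `_lead_info` and `find_utf8_oddities` are made up for this example. It assumes the old 5- and 6-byte lead bytes should be decoded and flagged rather than rejected outright.

```python
def _lead_info(b):
    """Return (payload bits, continuation count, minimum code point) for a
    lead byte, or None if the byte cannot start a sequence."""
    if b & 0xE0 == 0xC0:
        return b & 0x1F, 1, 0x80
    if b & 0xF0 == 0xE0:
        return b & 0x0F, 2, 0x800
    if b & 0xF8 == 0xF0:
        return b & 0x07, 3, 0x10000
    if b & 0xFC == 0xF8:          # obsolete 5-byte form
        return b & 0x03, 4, 0x200000
    if b & 0xFE == 0xFC:          # obsolete 6-byte form
        return b & 0x01, 5, 0x4000000
    return None                   # stray continuation byte, 0xFE or 0xFF


def find_utf8_oddities(data):
    """Scan bytes and return a list of (byte offset, description) pairs."""
    found = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:              # plain ASCII
            i += 1
            continue
        info = _lead_info(b)
        if info is None:
            found.append((i, "invalid lead byte"))
            i += 1
            continue
        cp, extra, minimum = info
        tail = data[i + 1 : i + 1 + extra]
        if len(tail) < extra or any(t & 0xC0 != 0x80 for t in tail):
            found.append((i, "truncated or broken sequence"))
            i += 1
            continue
        for t in tail:            # fold in 6 payload bits per continuation byte
            cp = (cp << 6) | (t & 0x3F)
        if extra > 3:
            found.append((i, "5- or 6-byte form"))
        elif cp < minimum:
            found.append((i, "overlong encoding"))
        elif 0xD800 <= cp <= 0xDFFF:
            found.append((i, "UTF-16 surrogate"))
        elif cp > 0x10FFFF:
            found.append((i, "beyond U+10FFFF"))
        elif cp == 0xFEFF and i == 0:
            found.append((i, "byte order mark"))
        i += 1 + extra
    return found


# Hypothetical usage: a leading BOM is reported, the rest is clean.
print(find_utf8_oddities(b"\xef\xbb\xbfhi"))
```

Because the scanner returns a list rather than failing, the caller can decide case by case whether to repair, skip, or reject the input.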

To be honest, I'm not sure how I would approach an "IsValidUTF8()" function. I tend to fall back on the original TCP/IP philosophy: be rigorous in what you generate, and forgiving in what you accept.

--Tim

On Jun 15, 2013, at 1:08 PM, Jay Carlson <nop@nop.com> wrote:

I don't understand where "false" instead of an error would be useful. Once you've decided to iterate over a string as UTF-8, it is a surprise when the string turns out not to be UTF-8, and it's unlikely your code will do anything useful. There could be a separate utf8.isvalid(s, [byteoffset [, bytelen]]) for when you're testing.
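
As a rough illustration of the separate validity check suggested above, the following sketch mirrors the proposed `(s, byteoffset, bytelen)` shape. It is Python, not Lua, and `is_valid_utf8` is a hypothetical name. Two assumptions: the offset is 0-based (a Lua version would presumably be 1-based), and "valid" means whatever Python's strict UTF-8 decoder accepts, which rejects surrogates, overlong forms and 5-byte sequences but accepts a leading BOM.

```python
def is_valid_utf8(s, byteoffset=0, bytelen=None):
    """Return True if the selected byte range of s is strictly valid UTF-8.

    byteoffset is 0-based here; bytelen=None means "to the end of s".
    """
    end = len(s) if bytelen is None else byteoffset + bytelen
    try:
        # bytes.decode uses errors="strict" by default.
        s[byteoffset:end].decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Hypothetical usage: test first, then iterate without surprises.
data = b"\xffabc"
print(is_valid_utf8(data), is_valid_utf8(data, 1))
```

A boolean check like this keeps the iteration functions free to raise on bad input, while still giving callers a way to test up front.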