Hi Tim,

> Some common oddities I have encountered:

I've also seen quotes copied from text with other encodings: ISO
8859-1 has grave and acute accents with codes 0x91 and 0x92 and
Windows CP1250 has single and double quotation marks with codes
0x91-0x94, with all these codes being invalid in UTF-8 when they appear
on their own ("an unexpected continuation byte").
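
To illustrate why those bytes get rejected: UTF-8 continuation bytes have the bit pattern 10xxxxxx, i.e. the range 0x80-0xBF, and 0x91-0x94 fall inside it. A minimal sketch (the helper name isContinuationByte is made up for this example):

```lua
-- Hypothetical helper: true if b has the bit pattern 10xxxxxx,
-- which UTF-8 reserves for the 2nd..4th byte of a multi-byte sequence.
local function isContinuationByte(b)
  return b >= 0x80 and b <= 0xBF
end

-- The CP125x quotation mark codes all land in the continuation range,
-- so a decoder that meets one without a lead byte has to reject it.
for b = 0x91, 0x94 do
  print(string.format("0x%02X is a continuation byte: %s",
                      b, tostring(isContinuationByte(b))))
end
```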

> The real problem is badly-formed UTF-8 .. and there is too much of it to just bail with errors.

I've implemented a fixUTF8 method in ZBS as described here:
http://notebook.kulchenko.com/programming/fixing-malformed-utf8-in-lua;
the same logic can be used for isValidUTF8(). It can probably be done
in a faster way, but it's been working well for me so far.
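
For illustration, here is a rough sketch of what a strict isValidUTF8() could look like in plain Lua (5.1 or later). This is not the fixUTF8 code from the post linked above; the function body and its return convention (false plus the offset of the first bad sequence) are assumptions made for this example. It rejects stray continuation bytes, overlong forms, 5- and 6-byte forms, encoded surrogates and truncated sequences:

```lua
-- Hypothetical validator, not the ZBS implementation.
-- Returns true, or false plus the byte offset of the first bad sequence.
local function isValidUTF8(s)
  local i, n = 1, #s
  while i <= n do
    local b = s:byte(i)
    local len, min, cp
    if b < 0x80 then
      len, min, cp = 1, 0x00, b
    elseif b >= 0xC2 and b <= 0xDF then
      len, min, cp = 2, 0x80, b - 0xC0
    elseif b >= 0xE0 and b <= 0xEF then
      len, min, cp = 3, 0x800, b - 0xE0
    elseif b >= 0xF0 and b <= 0xF4 then
      len, min, cp = 4, 0x10000, b - 0xF0
    else
      -- stray continuation byte (0x80-0xBF), overlong lead (0xC0, 0xC1),
      -- or a lead byte for 5/6-byte forms and beyond (0xF5-0xFF)
      return false, i
    end
    if i + len - 1 > n then return false, i end  -- truncated sequence
    for j = i + 1, i + len - 1 do
      local c = s:byte(j)
      if c < 0x80 or c > 0xBF then return false, i end  -- not 10xxxxxx
      cp = cp * 64 + (c - 0x80)
    end
    if cp < min then return false, i end                       -- overlong
    if cp >= 0xD800 and cp <= 0xDFFF then return false, i end  -- surrogate
    if cp > 0x10FFFF then return false, i end                  -- out of range
    i = i + len
  end
  return true
end

print(isValidUTF8("caf\195\169"))   --> true
print(isValidUTF8("\145hi\146"))    --> false   1
```

Note that an encoded BOM (EF BB BF) is well-formed UTF-8, so a check like this accepts it; stripping it would be a separate step.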

Paul.

On Sat, Jun 15, 2013 at 1:37 PM, Tim Hill <drtimhill@gmail.com> wrote:
> The real problem is badly-formed UTF-8 .. and there is too much of it to
> just bail with errors. Some common oddities I have encountered:
>
> -- UTF-16 surrogate pairs encoded as UTF-8 (rather than the underlying code
> point)
> -- UTF-16 BOM encoded in UTF-8 (which of course has no use for a BOM)
> -- Non-canonical UTF-8 encodings (for example, encoding in 5 bytes instead
> of 4)
>
> To be honest, I'm not sure how I would approach an "IsValidUTF8()" function
> .. I always tend to fall back on the original TCP/IP philosophy: be rigorous
> in what you generate, and forgiving in what you accept.
>
> --Tim
>
> On Jun 15, 2013, at 1:08 PM, Jay Carlson <nop@nop.com> wrote:
>
> I don't understand where "false" instead of an error would be useful. Once
> you've decided to iterate over a string as UTF-8, it is a surprise when the
> string turns out not to be UTF-8, and it's unlikely your code will do
> anything useful. There could be a separate utf8.isvalid(s, [byteoffset [,
> bytelen]]) for when you're testing.
>
>