- Subject: Re: Of Unicode in the next Lua version
- From: Paul K <paulclinger@...>
- Date: Sat, 15 Jun 2013 14:52:54 -0700
> Some common oddities I have encountered:
I've also seen quotes copied from text with other encodings: ISO
8859-1 has grave and acute accents with codes 0x91 and 0x92 and
Windows CP1250 has single and double quotation marks with codes
0x91-0x94, with all these codes being invalid in UTF-8 ("an unexpected
> The real problem is badly-formed UTF-8 .. and there is too much of it to just bail with errors.
I've implemented a fixUTF8 method in ZBS as described here:
the same logic can be used for isValidUTF8(). It can probably be done
in a faster way, but it's been working well for me so far.
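
For illustration, here is a minimal Lua sketch of that kind of scan. It is not the ZBS code: the names utf8SeqLen, fixUTF8Sketch and isValidUTF8Sketch are hypothetical, and replacing each bad byte with U+FFFD is only an assumption about what "fixing" should mean. One helper recognizes a well-formed sequence; the fixer substitutes anything else, and the validator just asks whether anything had to be substituted.

```lua
-- Sketch only, not the ZBS implementation; all names here are hypothetical.
-- Returns the length of the well-formed UTF-8 sequence starting at byte i,
-- or nil if the bytes at i do not form one.
local function utf8SeqLen(s, i)
  local b = s:byte(i)
  local len, lo, hi = 1, 0x80, 0xBF          -- lo..hi bounds the second byte
  if b < 0x80 then return 1                  -- ASCII
  elseif b >= 0xC2 and b <= 0xDF then len = 2
  elseif b == 0xE0 then len, lo = 3, 0xA0    -- no overlong 3-byte forms
  elseif b == 0xED then len, hi = 3, 0x9F    -- no UTF-16 surrogates
  elseif b >= 0xE1 and b <= 0xEF then len = 3
  elseif b == 0xF0 then len, lo = 4, 0x90    -- no overlong 4-byte forms
  elseif b >= 0xF1 and b <= 0xF3 then len = 4
  elseif b == 0xF4 then len, hi = 4, 0x8F    -- nothing above U+10FFFF
  else return nil end                        -- 0x80-0xC1 and 0xF5-0xFF
  for k = 1, len - 1 do
    local c = s:byte(i + k)
    if not c or c < lo or c > hi then return nil end
    lo, hi = 0x80, 0xBF                      -- later bytes use the full range
  end
  return len
end

-- Replaces every byte that is not part of a well-formed sequence with
-- `repl` (default: U+FFFD encoded as UTF-8, i.e. EF BF BD).
-- Returns the fixed string and the number of bytes replaced.
local function fixUTF8Sketch(s, repl)
  repl = repl or "\239\191\189"
  local out, bad, i = {}, 0, 1
  while i <= #s do
    local len = utf8SeqLen(s, i)
    if len then
      out[#out + 1] = s:sub(i, i + len - 1)
      i = i + len
    else
      out[#out + 1] = repl
      bad = bad + 1
      i = i + 1
    end
  end
  return table.concat(out), bad
end

-- Same scan, reused as a validator: valid means nothing was replaced.
local function isValidUTF8Sketch(s)
  local _, bad = fixUTF8Sketch(s)
  return bad == 0
end
```

With this sketch, fixUTF8Sketch("it" .. string.char(0x92) .. "s", "?") returns "it?s" and a count of 1, so a stray CP125x quote byte is replaced rather than raising an error. The code avoids "\x" escapes so that it also runs on Lua 5.1.
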
On Sat, Jun 15, 2013 at 1:37 PM, Tim Hill <firstname.lastname@example.org> wrote:
> The real problem is badly-formed UTF-8 .. and there is too much of it to
> just bail with errors. Some common oddities I have encountered:
> -- UTF-16 surrogate pairs encoded as UTF-8 (rather than the underlying code
> point)
> -- UTF-16 BOM encoded in UTF-8 (which of course has no use for a BOM)
> -- Non-canonical UTF-8 encodings (for example, encoding in 5 bytes instead
> of 4)
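
The three oddities in this list have recognizable byte signatures, which the following Lua sketch tries to spot. The function name describeOddities and the field names in its result are hypothetical, and the patterns are a rough heuristic under the assumption that the rest of the string is otherwise UTF-8, not a complete validator.

```lua
-- Sketch only; `describeOddities` and its result fields are hypothetical.
-- Decimal escapes are used so the code also runs on Lua 5.1.
local function describeOddities(s)
  local found = {}
  -- U+FEFF encoded as UTF-8 (EF BB BF) at the very start of the string.
  -- Note that this is well-formed UTF-8, just unnecessary.
  if s:sub(1, 3) == "\239\187\191" then found.bom = true end
  -- A UTF-16 surrogate (U+D800-U+DFFF) encoded as three bytes:
  -- ED, then A0..BF, then any continuation byte.
  if s:find("\237[\160-\191][\128-\191]") then found.surrogate = true end
  -- Overlong or over-length forms: lead bytes C0/C1 (overlong 2-byte),
  -- F8..FD (5- and 6-byte sequences), E0 followed by 80..9F (overlong
  -- 3-byte), or F0 followed by 80..8F (overlong 4-byte).
  if s:find("[\192\193\248-\253]")
     or s:find("\224[\128-\159]")
     or s:find("\240[\128-\143]") then
    found.overlong = true
  end
  return found
end
```

For example, describeOddities(string.char(0xED, 0xA0, 0xBD, 0xED, 0xB8, 0x80)) sets only the surrogate field, since those six bytes are a surrogate pair encoded byte-for-byte instead of a single 4-byte sequence.
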
> To be honest, I'm not sure how I would approach an "IsValidUTF8()" function
> .. I always tend to fall back on the original TCP/IP philosophy: be rigorous
> in what you generate, and forgiving in what you accept.
> On Jun 15, 2013, at 1:08 PM, Jay Carlson <email@example.com> wrote:
> I don't understand where "false" instead of an error would be useful. Once
> you've decided to iterate over a string as UTF-8, it is a surprise when the
> string turns out not to be UTF-8, and it's unlikely your code will do
> anything useful. There could be a separate utf8.isvalid(s, [byteoffset [,
> bytelen]]) for when you're testing.