lua-users home
lua-l archive



On Jun 15, 2013, at 4:37 PM, Tim Hill wrote:

> The real problem is badly-formed UTF-8 .. and there is too much of it to just bail with errors. Some common oddities I have encountered:
> 
> -- UTF-16 surrogate pairs encoded as UTF-8 (rather than the underlying code point)
> -- UTF-16 BOM encoded in UTF-8 (which of course has no use for a BOM)
> -- Non-canonical UTF-8 encodings (for example, encoding in 5 bytes instead of 4)
> 
> To be honest, I'm not sure how I would approach an "IsValidUTF8()" function .. I always tend to fall back on the original TCP/IP philosophy: be rigorous in what you generate, and forgiving in what you accept.

If you don't decide on ingress, IsValidUTF8() still gets decided; its definition just becomes an implicit global property of the codebase.[1] Similarly, if you don't decide what to do with pseudo-UTF-8 surrogates ("CESU-8"), the program as a whole ends up with that knowledge smeared all over it.
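For what it's worth, deciding it once at the byte level doesn't take much. Here's a sketch of what an explicit IsValidUTF8() could look like in plain Lua (the code is mine, not anything standard; it rejects the oddities from the list above: truncated sequences, overlong "non-canonical" encodings, CESU-8 surrogates, and the old 5/6-byte forms):

```lua
-- Sketch of a strict validator: true iff s is well-formed UTF-8.
local function IsValidUTF8(s)
  local i, n = 1, #s
  while i <= n do
    local c = s:byte(i)
    local len, cp, min
    if c < 0x80 then
      i = i + 1                                   -- ASCII
    elseif c >= 0xC2 and c <= 0xDF then
      len, cp, min = 2, c % 0x20, 0x80
    elseif c >= 0xE0 and c <= 0xEF then
      len, cp, min = 3, c % 0x10, 0x800
    elseif c >= 0xF0 and c <= 0xF4 then
      len, cp, min = 4, c % 0x08, 0x10000
    else
      -- 0x80..0xC1 (stray continuation / overlong lead) and
      -- 0xF5..0xFF (incl. old 5- and 6-byte forms) never start a sequence
      return false
    end
    if len then
      if i + len - 1 > n then return false end    -- truncated at end of string
      for j = i + 1, i + len - 1 do
        local b = s:byte(j)
        if b < 0x80 or b > 0xBF then return false end
        cp = cp * 64 + b % 0x40
      end
      if cp < min then return false end           -- overlong encoding
      if cp >= 0xD800 and cp <= 0xDFFF then return false end -- CESU-8 surrogate
      if cp > 0x10FFFF then return false end      -- beyond Unicode
      i = i + len
    end
  end
  return true
end
```

Whether you then reject, replace, or pass through is the policy question; the point is that the predicate itself fits on one screen, so there's no excuse for letting every module answer it differently.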

RFC 3629 uses the normative "MUST protect against decoding invalid sequences"; its changelog from RFC 2279 is even stronger, saying, "Turned the note warning against decoding of invalid sequences into a normative MUST NOT."[2] 

I think the right thing to do is to make the choice of how to be "forgiving in what you accept" explicit on ingress. Protocols are gonna have to nuke any BOM at start-of-stream anyway (well, the RFC has advice). There are a bunch of choices of what to do with CESU-8 and other coding errors, and they can have broad implications; notoriously, overlong encodings have slipped past security-sensitive checks in a few old cases.
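Making the policy explicit can be as simple as one ingress function that everything else calls. A sketch (names and policy choices are mine, purely illustrative):

```lua
-- Sketch of an explicit ingress policy: strip a leading UTF-8 BOM,
-- then either reject or repair.  Only this function knows the policy.
local function ingest(s)
  -- Drop a UTF-8-encoded BOM (EF BB BF) at start-of-stream only;
  -- a BOM anywhere else is treated like any other code point.
  if s:sub(1, 3) == "\239\187\191" then
    s = s:sub(4)
  end
  -- Policy choice A: bail loudly, e.g.
  --   assert(IsValidUTF8(s), "invalid UTF-8 on ingress")
  -- Policy choice B: repair, e.g. replace bad sequences with U+FFFD.
  return s
end
```

Either policy is defensible; what isn't defensible is having both, chosen ad hoc, in different corners of the program.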

Regardless of whether you normalize on input, you still have to choose what to do with bad input at some point if you want to be rigorous in what you generate later.

I regularly have to deal with documents that are impossible to interpret in any normal coded character set. And it's fun to watch line noise blow up Python 3 programs reading from serial ports. There are a lot of really dirty hacks I've written to recover meaning from arbitrary octet streams. I'd rather not think about it. If I can keep those octet-level hacks on the outer shell of the program, I can have a single consistent definition of UTF-8-ness on the inside.
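The least dirty of those outer-shell hacks looks roughly like this (a sketch, assuming U+FFFD replacement is the chosen policy; the name scrub is mine):

```lua
-- Sketch: scrub an arbitrary octet stream into well-formed UTF-8 once,
-- on the outer shell, so the inside of the program never sees raw
-- line noise.  Every invalid byte becomes U+FFFD (EF BF BD).
local function scrub(s)
  local out, i, n = {}, 1, #s
  while i <= n do
    local c = s:byte(i)
    local len, cp, min
    if c < 0x80 then len = 1
    elseif c >= 0xC2 and c <= 0xDF then len, cp, min = 2, c % 0x20, 0x80
    elseif c >= 0xE0 and c <= 0xEF then len, cp, min = 3, c % 0x10, 0x800
    elseif c >= 0xF0 and c <= 0xF4 then len, cp, min = 4, c % 0x08, 0x10000
    end
    local ok = len ~= nil and i + len - 1 <= n
    if ok and len > 1 then
      for j = i + 1, i + len - 1 do
        local b = s:byte(j)
        if b < 0x80 or b > 0xBF then ok = false break end
        cp = cp * 64 + b % 0x40
      end
      if ok and (cp < min
                 or (cp >= 0xD800 and cp <= 0xDFFF)
                 or cp > 0x10FFFF) then
        ok = false
      end
    end
    if ok then
      out[#out + 1] = s:sub(i, i + len - 1)
      i = i + len
    else
      out[#out + 1] = "\239\191\189"   -- U+FFFD REPLACEMENT CHARACTER
      i = i + 1                        -- resynchronize one byte at a time
    end
  end
  return table.concat(out)
end
```

Once that runs at the boundary, "UTF-8-ness" has exactly one definition on the inside, which is the whole point.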

I mostly do plumbing. People who spend more time working with *text* care about things like canonicalization forms; there are all kinds of correctness issues above the codepoint level. For most plumbing I can ignore them and treat Unicode as a stream of codepoints, since everybody working above that level[3] is already in a world of pain. I try not to make it worse.

Jay

[1]: Well, IsValidUTF8 may be defined in the same way much of C is: arbitrary program behavior. In Lua it won't bring down the runtime without help from C though.

[2]: Uh, WTF, IETF? Is the (presumptively non-normative) changelog making stronger claims about the normativeness of its subjects?

[3]: Above the codepoint level is where you need those character property tables to survive. I'm not counting on Lua getting them, even if slnunicode/tcl did get the size down. Canonicalization is big; the only way Lua would get that is with #ifdefs to haul in the platform code the same way dlopen has to work. Something like dynamic grapheme cluster coding does seem like a win for people who want to work at the "character" level, but I don't have enough experience to be sure.