- Subject: Re: [ANN] Winapi - a minimal but useful Windows API binding
- From: Peter Cawley <lua@...>
- Date: Thu, 9 Jun 2011 14:32:22 +0100
On Thu, Jun 9, 2011 at 2:16 PM, steve donovan <steve.j.donovan@gmail.com> wrote:
> Ah, but any plain ASCII is a degenerate (and valid) kind of UTF-8, so
> I have the old problem of how to decide:
>
> http://stackoverflow.com/questions/1031645/how-to-detect-utf-8-in-plain-c
I don't follow your issue here. As clearly explained on Wikipedia [1],
not all byte sequences are valid UTF-8. Byte sequences consisting
entirely of values between 0 and 127 are fine as they have the same
meaning in UTF-8 as in ASCII. The assumption that people make is that
if text is in a legacy 8-bit encoding (an "extended ASCII" code page)
and uses codes between 128 and 255, then at least once one of those
codes will appear where UTF-8 does not allow it, for example a single
high byte with plain ASCII on both sides, and thus the text will be an
invalid UTF-8 byte sequence. Obviously there are examples of exotic
8-bit strings which *are* valid UTF-8 byte streams and have a
different meaning when interpreted as UTF-8, but they are generally
ignored due to being uncommon in real-world usage.
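
To make the heuristic concrete, here is a minimal sketch in Python of
a strict well-formedness check. The function name looks_like_utf8 is
made up for illustration and is not part of Winapi or any library; the
byte ranges are the usual well-formed UTF-8 ranges (no overlong forms,
no surrogates, nothing above U+10FFFF). Treat it as a sketch rather
than a reference implementation.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if data is a well-formed UTF-8 byte sequence.

    Hypothetical helper for illustration only.
    """
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b < 0x80:
            # Plain ASCII byte: same meaning in ASCII and UTF-8.
            i += 1
            continue
        # Work out how many continuation bytes the lead byte needs and
        # the allowed range for the first of them.
        if 0xC2 <= b <= 0xDF:
            need, lo, hi = 1, 0x80, 0xBF
        elif 0xE0 <= b <= 0xEF:
            need = 2
            lo = 0xA0 if b == 0xE0 else 0x80   # rejects overlong forms
            hi = 0x9F if b == 0xED else 0xBF   # rejects surrogates
        elif 0xF0 <= b <= 0xF4:
            need = 3
            lo = 0x90 if b == 0xF0 else 0x80   # rejects overlong forms
            hi = 0x8F if b == 0xF4 else 0xBF   # rejects > U+10FFFF
        else:
            # Stray continuation byte, or 0xC0, 0xC1, 0xF5..0xFF.
            return False
        if i + need >= n:
            return False                       # sequence is truncated
        if not lo <= data[i + 1] <= hi:
            return False
        for k in range(2, need + 1):
            if not 0x80 <= data[i + k] <= 0xBF:
                return False
        i += need + 1
    return True


latin1_text = "café".encode("latin-1")   # b"caf\xe9": lone high byte
utf8_text = "café".encode("utf-8")       # b"caf\xc3\xa9"
exotic = "Ã©".encode("latin-1")          # b"\xc3\xa9": valid as both

print(looks_like_utf8(b"plain ASCII"))   # True
print(looks_like_utf8(latin1_text))      # False
print(looks_like_utf8(utf8_text))        # True
print(looks_like_utf8(exotic))           # True
```

The last case is the "exotic" kind of string mentioned above: the two
Latin-1 characters "Ã©" happen to form the UTF-8 encoding of "é", so
the check cannot tell them apart. In practice Python's own strict
data.decode("utf-8") performs the same validation; the loop is spelled
out only to show where an isolated high byte gets rejected.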
[1] http://en.wikipedia.org/wiki/UTF-8#Invalid_byte_sequences