lua-users home
lua-l archive



On Thu, Jun 9, 2011 at 2:16 PM, steve donovan <steve.j.donovan@gmail.com> wrote:
> Ah, but any plain ASCII is a degenerate (and valid) kind of UTF-8, so
> I have the old problem of how to decide:
>
> http://stackoverflow.com/questions/1031645/how-to-detect-utf-8-in-plain-c

I don't follow your issue here. As clearly explained on Wikipedia [1],
not all byte sequences are valid UTF-8. Byte sequences consisting
entirely of values between 0 and 127 are fine, as they have the same
meaning in UTF-8 as in ASCII. The assumption people make is that if
text is in an 8-bit encoding such as Latin-1 and uses bytes between
128 and 255, then at least once a high byte will appear without the
continuation bytes UTF-8 requires after a lead byte, and thus the text
will be an invalid UTF-8 byte sequence. Obviously there are exotic
Latin-1 strings which *are* valid UTF-8 byte streams and have a
different meaning when interpreted as UTF-8, but they are generally
ignored as being uncommon in real-world usage.

[1] http://en.wikipedia.org/wiki/UTF-8#Invalid_byte_sequences