- Subject: Re: htmlentities table
- From: Rici Lake <lua@...>
- Date: Fri, 28 Oct 2005 16:13:12 -0500
On 28-Oct-05, at 2:56 PM, David Given wrote:
> Yes, but the pattern won't actually pick up a valid UTF8 character,
> it's missing the prefix byte; which means you can't compare it with a
> constant containing a UTF8 character. (Unless you string slice your
> constant to remove the first character, of course.)
Look again. This is the prefix byte: [^\128-\191]
The full pattern: [^\128-\191][\128-\191]*
"Not a continuation byte" followed by 0 or more "continuation bytes"
That will match any 7-bit character *OR* any multibyte prefix character.
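
As a minimal sketch of how that pattern can be used to walk a string one character at a time (assuming Lua 5.1 or later for string.gmatch and the # operator, and assuming the input is valid utf-8; the function name utf8_chars is made up for illustration):

```lua
-- Split a string into utf-8 characters using the pattern
-- "[^\128-\191][\128-\191]*": one byte that is not a continuation
-- byte, followed by 0 or more continuation bytes.
-- No validation is done; invalid input gives incorrect results.
local function utf8_chars(s)
  local out = {}
  for ch in string.gmatch(s, "[^\128-\191][\128-\191]*") do
    out[#out + 1] = ch
  end
  return out
end

-- "a" (1 byte), U+00E9 (2 bytes), U+20AC (3 bytes)
local chars = utf8_chars("a\195\169\226\130\172")
print(#chars)
```

Each element of chars is a complete byte sequence, prefix byte included, so it can be compared directly with a constant containing the same utf-8 character.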
As I said, it will produce incorrect results on invalid inputs; for
example, the sequence:
is not a valid utf-8 sequence but it would be returned by the pattern I
gave.
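
To illustrate the point about invalid input, here is a small sketch. The two byte sequences are chosen for illustration only and are not necessarily the one from the message: an overlong two-byte form, and a 7-bit character followed by a stray continuation byte. Both are invalid utf-8, and both come back from the pattern as a single "character" (assuming Lua 5.1 or later for string.match):

```lua
local pat = "[^\128-\191][\128-\191]*"

-- Illustrative invalid inputs (assumptions, not taken from the message).
local overlong = "\192\128"  -- byte 192 never appears in valid utf-8
local stray    = "a\128"     -- 7-bit character plus a stray continuation byte

-- The pattern does no validation, so each is matched whole.
local m1 = string.match(overlong, pat)
local m2 = string.match(stray, pat)
print(m1 == overlong, m2 == stray)
```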
If you didn't want the 7-bit characters to show up, say because you had
no need to translate them, you could use a different one:
which will skip over some incorrect multibyte utf-8 sequences but will
certainly produce all the valid ones if the string contains only valid
sequences.
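
One possible pattern with that behaviour, offered as an assumption rather than as the pattern from the message, requires the first byte to be in the range 192-255, so 7-bit characters are never matched. A rough sketch (Lua 5.1 or later; the function name multibyte_chars is hypothetical):

```lua
-- Hypothetical pattern: a byte in 192-255 followed by 0 or more
-- continuation bytes. 7-bit characters never match, and stray
-- continuation bytes that follow a 7-bit character are skipped.
local function multibyte_chars(s)
  local out = {}
  for ch in string.gmatch(s, "[\192-\255][\128-\191]*") do
    out[#out + 1] = ch
  end
  return out
end

-- Only the two multibyte characters are returned; "a" and "b" are skipped.
local found = multibyte_chars("a\195\169b\226\130\172")
print(#found)
```

Like the first pattern, this does not validate anything; it only picks out the multibyte sequences.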
If you want the full validating utf-8 sequencer in Lua (which I have
tested against the reference you mentioned), email me.