- Subject: Re: htmlentities table
- From: Rici Lake <lua@...>
- Date: Fri, 28 Oct 2005 16:13:12 -0500
On 28-Oct-05, at 2:56 PM, David Given wrote:
> Yes, but the pattern won't actually pick up a valid UTF8 character, because
> it's missing the prefix byte; which means you can't compare it with a string
> constant containing a UTF8 character. (Unless you string slice your string
> constant to remove the first character, of course.)
Look again. This is the prefix byte:
[^\128-\191]
The full pattern: [^\128-\191][\128-\191]*
matches:
"Not a continuation byte" followed by 0 or more "continuation bytes"
The leading class [^\128-\191] will match any 7-bit character *OR* any multibyte prefix byte.
As I said, it will produce incorrect results on invalid inputs; for
example, the sequence:
0x60 0x83
is not a valid utf-8 sequence but it would be returned by the pattern I
presented.
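A minimal sketch of how that pattern could be used to walk a string one
sequence at a time. It assumes Lua 5.1 or later (string.gmatch and the #
operator); the function name utf8_chars is made up for illustration. It also
shows the invalid 0x60 0x83 pair coming back as a single "character".

```lua
-- One non-continuation byte followed by zero or more continuation
-- bytes (0x80-0xBF). No validation is done.
local UTF8_CHAR = "[^\128-\191][\128-\191]*"

-- Hypothetical helper: collect every sequence matched by the pattern.
local function utf8_chars(s)
  local out = {}
  for ch in string.gmatch(s, UTF8_CHAR) do
    out[#out + 1] = ch
  end
  return out
end

-- "a", then U+00E9 (0xC3 0xA9), then U+20AC (0xE2 0x82 0xAC)
local chars = utf8_chars("a\195\169\226\130\172")

-- Invalid input is not rejected: 0x60 0x83 is returned as one sequence.
local bad = utf8_chars("\96\131")
```

Here chars holds three entries (one, two and three bytes long), and bad holds
a single two-byte entry.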
If you didn't want the 7-bit characters to show up, say because you had
no need to translate them, you could use a different pattern:
[\194-\244][\128-\191]+
which will skip over some incorrect multibyte utf-8 sequences but will
certainly produce all the valid ones if the string contains only valid
sequences.
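As a rough sketch of how the multibyte-only pattern might feed an entity
lookup: the two-entry table and the name encode_entities below are
assumptions for illustration, not the table discussed in this thread.
Sequences that are not in the table are passed through unchanged.

```lua
-- A lead byte 0xC2-0xF4 followed by one or more continuation bytes.
local MULTIBYTE = "[\194-\244][\128-\191]+"

-- Hypothetical lookup table keyed by the UTF-8 bytes of each character.
local entities = {
  ["\195\169"]     = "&eacute;",  -- U+00E9
  ["\226\130\172"] = "&euro;",    -- U+20AC
}

-- Replace each multibyte sequence found in the table; 7-bit characters
-- are never matched by the pattern, so they are left alone.
local function encode_entities(s)
  return (string.gsub(s, MULTIBYTE, function(seq)
    return entities[seq] or seq
  end))
end

local encoded = encode_entities("caf\195\169 5\226\130\172")
```

The extra parentheses around string.gsub drop its second return value (the
match count), so encode_entities returns only the string.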
If you want the full validating utf-8 sequencer in Lua (which I have
tested against the reference you mentioned), email me.