On 28-Oct-05, at 2:56 PM, David Given wrote:

Yes, but the pattern won't actually pick up a valid UTF8 character, because it's missing the prefix byte; which means you can't compare it with a string constant containing a UTF8 character. (Unless you string slice your string constant to remove the first character, of course.)

Look again. This is the prefix byte:

[^\128-\191]

The full pattern: [^\128-\191][\128-\191]*

matches:

"Not a continuation byte" followed by 0 or more "continuation bytes"

The first class will match any 7-bit character *OR* any multibyte prefix byte.
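
As a rough sketch of how the pattern might be used to walk a string one character at a time (this assumes Lua 5.1 or later for string.gmatch and valid UTF-8 input; utf8_chars is a made-up helper name, not something from the thread):

```lua
-- Split a string into UTF-8 characters using the pattern
-- "non-continuation byte followed by zero or more continuation bytes".
-- utf8_chars is a hypothetical helper; it does no validation.
local function utf8_chars(s)
  local chars = {}
  for ch in string.gmatch(s, "[^\128-\191][\128-\191]*") do
    chars[#chars + 1] = ch
  end
  return chars
end

-- "a", then U+00E9 (2 bytes), then U+20AC (3 bytes), written as byte escapes
local sample = "a\195\169\226\130\172"
local chars = utf8_chars(sample)
print(#chars)  -- 3
```

Each element of chars includes its prefix byte, so it can be compared directly with a string constant containing the same UTF-8 character.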

As I said, it will produce incorrect results on invalid inputs; for example, the sequence:

  0x60 0x83

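is not a valid utf-8 sequence (0x60 is a complete 7-bit character, so the continuation byte 0x83 has nothing to continue), but it would be returned as a single character by the pattern I presented.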
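
A small check along these lines should show the behaviour (a sketch only, using string.match from Lua 5.1 or later; the variable names are illustrative):

```lua
-- 0x60 0x83 written with Lua's decimal escapes: a 7-bit byte
-- followed by a stray continuation byte.
local bad = "\96\131"

-- The non-validating pattern accepts the pair as one "character".
local first = string.match(bad, "[^\128-\191][\128-\191]*")
print(#first)  -- 2
```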

If you didn't want the 7-bit characters to show up, say because you had no need to translate them, you could use a different one:

   [\194-\244][\128-\191]+

which will skip over some incorrect multibyte utf-8 sequences but will certainly produce all the valid multibyte ones if the string contains only valid sequences.
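
For instance, a translation pass that only touches multibyte characters could look roughly like this (a sketch assuming Lua 5.1 or later, where string.gsub accepts a table as the replacement; the translit table and its single entry are invented for illustration):

```lua
-- Hypothetical translation table: U+00EF (i with diaeresis) to plain "i".
local translit = {
  ["\195\175"] = "i",
}

-- "naive" spelled with U+00EF, a space, U+20AC (euro sign), a space, "5"
local input = "na\195\175ve \226\130\172 5"

-- Only multibyte sequences are looked up; 7-bit characters never match.
-- Matches with no entry in the table are left unchanged.
local output, count = string.gsub(input, "[\194-\244][\128-\191]+", translit)
print(output, count)
```

Here count is the number of multibyte sequences the pattern matched, whether or not the table had a replacement for them.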

If you want the full validating utf-8 sequencer in Lua (which I have tested against the reference you mentioned), email me.