lua-l archive

On 28-Oct-05, at 11:15 AM, PA wrote:
Assuming a long list of character substitutions (e.g. "Ä" -> "A", etc):

http://dev.alt.textdrive.com/file/lu/LUStringBasicLatin.txt

What would be a reasonable implementation to actually perform the substitutions?

I thought I'd posted this recently.

A pattern which works is "[^\128-\191][\128-\191]*"

This won't detect invalid UTF-8 sequences, but if the string is valid UTF-8, each match is exactly one UTF-8 sequence, so you can use it to iterate over a UTF-8 string with gsub or gfind.
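
For the substitution question, here is a minimal sketch of how that pattern might be used. It assumes Lua 5.1 or later (where gfind is called gmatch and gsub accepts a table as the replacement); the names UTF8_CHAR, map, to_basic_latin and utf8_chars are made up for illustration, and the two-entry table only stands in for the real substitution list.

```lua
-- Pattern from the post: one non-successor byte followed by any
-- number of successor bytes (0x80-0xBF, i.e. 128-191).
local UTF8_CHAR = "[^\128-\191][\128-\191]*"

-- Hypothetical substitution table keyed by UTF-8 byte sequence;
-- a real one would be generated from the full list.
local map = {
  ["\195\132"] = "A",  -- U+00C4, A with diaeresis
  ["\195\169"] = "e",  -- U+00E9, e with acute
}

-- Replace every mapped sequence. With a table as the replacement,
-- gsub keeps the original match when the lookup yields nil.
local function to_basic_latin(s)
  return (s:gsub(UTF8_CHAR, map))
end

-- Collect the sequences of a (valid) UTF-8 string into an array.
local function utf8_chars(s)
  local out = {}
  for ch in s:gmatch(UTF8_CHAR) do
    out[#out + 1] = ch
  end
  return out
end
```

Under Lua 5.0 the same idea should work with string.gfind and a function as the gsub replacement that returns the mapped string or the original match.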

If you look at the definition of UTF-8 you'll see why it works: a UTF-8 sequence is either a single 7-bit byte (i.e. < 128) or a first byte (actually in the range 0xC2-0xF4, but for simplicity we can say >= 0xC0) followed by a number of successor bytes determined by that first byte, each of which carries 6 bits and is in the range 0x80-0xBF, or 128-191. The key is that all but the first byte are 0x80-0xBF, and the first byte is not one of those, which is exactly what the above pattern says.

You can use an even simpler one to count the number of utf-8 sequences in a string; just count the number of non-successor bytes ([^\128-\191]).
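
A sketch of that count, assuming the input is already valid UTF-8 (utf8_len is just an illustrative name). It relies on the second return value of gsub, which is the number of matches:

```lua
-- Count UTF-8 sequences by counting the bytes that are NOT
-- successor bytes (128-191); each sequence has exactly one such byte.
local function utf8_len(s)
  local _, n = s:gsub("[^\128-\191]", "")
  return n
end
```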

It is quite easy to actually verify the string's utf-8 validity as well, but I'll leave that as an exercise for the reader. It is one of the use cases I have for the modification to Mike Pall's gsub patch I posted a few days ago, since it involves using the head character of the sequence to select a function which matches the remainder of the sequence against one of half a dozen or so validators.
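
For anyone who wants a starting point on the exercise, here is one possible sketch in stock Lua (5.1 or later). It does not use the patched gsub mentioned above; it is a plain loop that picks a tail pattern from the head byte, following the byte ranges in the UTF-8 definition (RFC 3629). The names CONT, tail_for and utf8_valid are invented for this example.

```lua
local CONT = "[\128-\191]"  -- any successor byte

-- Return an anchored pattern for the bytes that must follow head
-- byte b, or nil if b cannot start a sequence.
local function tail_for(b)
  if b <= 0x7F then return "^"                          -- single byte
  elseif b >= 0xC2 and b <= 0xDF then return "^" .. CONT
  elseif b == 0xE0 then return "^[\160-\191]" .. CONT   -- no overlongs
  elseif b == 0xED then return "^[\128-\159]" .. CONT   -- no surrogates
  elseif b >= 0xE1 and b <= 0xEF then return "^" .. CONT .. CONT
  elseif b == 0xF0 then return "^[\144-\191]" .. CONT .. CONT
  elseif b >= 0xF1 and b <= 0xF3 then return "^" .. CONT .. CONT .. CONT
  elseif b == 0xF4 then return "^[\128-\143]" .. CONT .. CONT  -- <= U+10FFFF
  end
  return nil  -- 0x80-0xC1 and 0xF5-0xFF never start a sequence
end

-- Return true if s is valid UTF-8, otherwise false plus the index
-- of the head byte of the first bad sequence.
local function utf8_valid(s)
  local i, n = 1, #s
  while i <= n do
    local pat = tail_for(s:byte(i))
    if not pat then return false, i end
    -- "^" anchors the match at the init position, i.e. right after
    -- the head byte.
    local _, e = s:find(pat, i + 1)
    if not e then return false, i end
    i = e + 1
  end
  return true
end
```

The eight branches play the role of the "half a dozen or so validators", one per class of head byte.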