On Friday 28 October 2005 18:33, Rici Lake wrote:
[...]
> A pattern which works is "[^\128-\191][\128-\191]*"
>
> This won't pick up invalid utf8 sequences, but if the string is valid
> utf8, that will pick up one utf8 sequence
[...]
> If you look at the definition of utf-8 you'll see why it works; a utf-8
> sequence is either a single 7-bit byte (i.e. < 128) or a first byte
> (actually in the range 0xC2-0xF4, but for simplicity we can say >=
> 0xC0) followed by a determined number of successor bytes, each of which
> carries 6 bits and is in the range 0x80-0xBF, or 128-191.

Yes, but the pattern won't actually pick up a valid UTF8 character, because 
it's missing the prefix byte; which means you can't compare it with a string 
constant containing a UTF8 character. (Unless you string slice your string 
constant to remove the first character, of course.)

I'd be more inclined to use "[\128-\255]*[^\128-\255]". This will pick up 
exactly one character, regardless of how long it is. If you're only 
interested in non-ASCII characters, then replace the * with a + (I think; 
disclaimer: untested).
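
Since the pattern above is untested, a small harness along these lines can be
used to check it against the quoted one. This is only a sketch, assuming Lua
5.1 or later; "pieces" is a hypothetical helper name, and the sample string is
spelled out as explicit UTF-8 bytes so the result does not depend on the source
file's encoding.

```lua
-- Hypothetical helper: collect every match of `pattern` in `s`.
local function pieces(pattern, s)
  local out = {}
  for m in string.gmatch(s, pattern) do
    out[#out + 1] = m
  end
  return out
end

-- "a", U+00E9 and U+20AC written as explicit UTF-8 bytes.
local sample = "a\195\169\226\130\172"

-- The pattern from the quoted message.
local quoted = pieces("[^\128-\191][\128-\191]*", sample)
-- The alternative suggested in this reply.
local alternative = pieces("[\128-\255]*[^\128-\255]", sample)

-- Print how many pieces each pattern produced.
print(#quoted, #alternative)
```

On this sample the quoted pattern yields three pieces, one per character,
because a lead byte such as \195 lies outside \128-\191 and so is matched by
[^\128-\191]. The alternative yields only "a": it needs an ASCII byte at the
end of each match, so a trailing run of non-ASCII bytes never matches.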

[...]
> You can use an even simpler one to count the number of utf-8 sequences
> in a string; just count the number of non-successor bytes
> ([^\128-\191]).

That's a neat trick that I hadn't thought of. I'll remember it --- ta.
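
A minimal sketch of that counting trick, assuming the input is already valid
UTF-8 (the count is meaningless otherwise); "utf8_count" is a hypothetical
name, not something from a library:

```lua
-- Count UTF-8 sequences by counting the bytes that are NOT successor
-- bytes (i.e. not in the range 128-191). Assumes `s` is valid UTF-8.
local function utf8_count(s)
  -- string.gsub returns the number of substitutions as its second result.
  local _, n = string.gsub(s, "[^\128-\191]", "")
  return n
end

-- "a", U+00E9 and U+20AC as explicit UTF-8 bytes: 6 bytes, 3 characters.
print(utf8_count("a\195\169\226\130\172"))  --> 3
```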

It's worth pointing out that the UTF8 standard actually specifies behaviour 
for all the nasty edge cases where you have invalid strings. If you want 
compliant string processing, it's really irritating to get them right. Here's 
a document that explains it all, and lets you test it --- just load it into a 
UTF8-aware text viewer:

http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt

What we do is run potentially-invalid UTF8 strings (like stuff that's read 
over the 'net) through an armoured, error-detecting UTF8 decoder, which 
turns them into guaranteed-correct UTF8. It's expensive, but it allows us 
to use a highly optimised, simplified UTF8 decoder to process the strings 
later.
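
As a rough illustration of the idea (not the decoder described above, whose
details aren't given here), a sanitising pass might look like the sketch
below. It assumes Lua 5.1 or later; "sequence_length" and "sanitize_utf8" are
hypothetical names. The policy of replacing every offending byte with U+FFFD
is just one possible choice, and the byte ranges follow the usual definition
of well-formed UTF-8 (no overlong forms, no surrogates, nothing above
U+10FFFF).

```lua
local REPLACEMENT = "\239\191\189"  -- U+FFFD encoded as UTF-8

-- Return the length of the well-formed sequence starting at byte i,
-- or nil if no well-formed sequence starts there.
local function sequence_length(s, i)
  local b1 = string.byte(s, i)
  if b1 < 0x80 then return 1 end          -- plain ASCII
  local n
  local lo, hi = 0x80, 0xBF               -- allowed range for the 2nd byte
  if b1 >= 0xC2 and b1 <= 0xDF then
    n = 2
  elseif b1 >= 0xE0 and b1 <= 0xEF then
    n = 3
    if b1 == 0xE0 then lo = 0xA0 end      -- reject overlong forms
    if b1 == 0xED then hi = 0x9F end      -- reject UTF-16 surrogates
  elseif b1 >= 0xF0 and b1 <= 0xF4 then
    n = 4
    if b1 == 0xF0 then lo = 0x90 end      -- reject overlong forms
    if b1 == 0xF4 then hi = 0x8F end      -- reject values above U+10FFFF
  else
    return nil                            -- stray successor, 0xC0, 0xC1, 0xF5+
  end
  for k = 1, n - 1 do
    local b = string.byte(s, i + k)       -- nil if the string is truncated
    if not b or b < lo or b > hi then return nil end
    lo, hi = 0x80, 0xBF                   -- later successors use the plain range
  end
  return n
end

-- Copy well-formed sequences through unchanged; replace every byte that
-- does not start one with U+FFFD.
local function sanitize_utf8(s)
  local out, i = {}, 1
  while i <= #s do
    local n = sequence_length(s, i)
    if n then
      out[#out + 1] = string.sub(s, i, i + n - 1)
      i = i + n
    else
      out[#out + 1] = REPLACEMENT
      i = i + 1
    end
  end
  return table.concat(out)
end
```

Once a string has been through a pass like this, a simple pattern that
assumes valid input can be applied to the result without further checks.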

-- 
+- David Given --McQ-+ 
|  dg@cowlark.com    | Become immortal or die!
| (dg@tao-group.com) | 
+- www.cowlark.com --+ 
