- Subject: Re: htmlentities table
- From: David Given <dg@...>
- Date: Fri, 28 Oct 2005 20:56:43 +0100
On Friday 28 October 2005 18:33, Rici Lake wrote:
[...]
> A pattern which works is "[^\128-\191][\128-\191]*"
>
> This won't pick up invalid utf8 sequences, but if the string is valid
> utf8, that will pick up one utf8 sequence.
[...]
> If you look at the definition of utf-8 you'll see why it works; a utf-8
> sequence is either a single 7-bit byte (i.e. < 128) or a first byte
> (actually in the range 0xC2-0xF4, but for simplicity we can say >=
> 0xC0) followed by a determined number of successor bytes, each of which
> carries 6 bits and is in the range 0x80-0xBF, or 128-191.
Yes, but the pattern won't actually pick up a valid UTF8 character, because
it's missing the prefix byte; which means you can't compare it with a string
constant containing a UTF8 character. (Unless you string slice your string
constant to remove the first character, of course.)
I'd be more inclined to use "[\128-\255]*[^\128-\255]". This will pick up
exactly one character, regardless of how long it is. If you're only
interested in non-ASCII characters, then replace the * with a + (I think;
disclaimer: untested).
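Since the alternative above is untested, one way to see what each pattern
actually returns is to run both over a short sample. This is only a minimal
sketch, assuming Lua 5.1 or later (for string.gmatch); the helper name
`pieces` and the sample string are made up for illustration.

```lua
-- Collect every match of `pattern` in `s` into a list.
-- (`pieces` is a hypothetical helper, not part of any library.)
local function pieces(s, pattern)
  local out = {}
  for m in string.gmatch(s, pattern) do
    out[#out + 1] = m
  end
  return out
end

-- "a", then U+00E9 encoded as the bytes 0xC3 0xA9, then "b".
local sample = "a\195\169b"

-- The pattern quoted from the earlier message.
local quoted = pieces(sample, "[^\128-\191][\128-\191]*")
-- The alternative suggested in this message.
local alternative = pieces(sample, "[\128-\255]*[^\128-\255]")

-- Print the byte length of each piece so the grouping is visible.
for i, m in ipairs(quoted) do print("quoted", i, #m) end
for i, m in ipairs(alternative) do print("alternative", i, #m) end
```

On this sample the quoted pattern returns pieces of 1, 2 and 1 bytes, the
middle one being the complete two-byte sequence including its lead byte. The
alternative returns pieces of 1 and 3 bytes, because the run of high bytes is
joined to the ASCII byte that follows it.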
[...]
> You can use an even simpler one to count the number of utf-8 sequences
> in a string; just count the number of non-successor bytes
> ([^\128-\191]).
That's a neat trick that I hadn't thought of. I'll remember it --- ta.
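For the record, the counting trick might look something like this. It is a
sketch only: the function name `utf8_count` is my own, and it assumes the
input is already valid UTF8, since it never checks.

```lua
-- Count UTF-8 sequences by counting the bytes that are NOT successor
-- (continuation) bytes, i.e. bytes outside the range 128-191.
-- Assumes `s` is valid UTF-8; on invalid input the count is meaningless.
local function utf8_count(s)
  -- string.gsub returns the new string and the number of substitutions;
  -- only the second value is of interest here.
  local _, n = string.gsub(s, "[^\128-\191]", "")
  return n
end

print(utf8_count("a\195\169b"))     -- 3
print(utf8_count("\226\130\172"))   -- 1 (U+20AC, three bytes)
```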
It's worth pointing out that the UTF8 standard actually specifies behaviour
for all the nasty edge cases where you have invalid strings. If you want
compliant string processing, it's really irritating to get them right. Here's
a document that explains it all, and lets you test it --- just load it into a
UTF8-aware text viewer:
http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt
What we do is run potentially invalid UTF8 strings (like stuff that's read
over the 'net) through an armoured, error-detecting UTF8 decoder that
processes them into guaranteed-correct UTF8. It's expensive, but it allows us
to use a highly optimised, simplified UTF8 decoder to process the strings
later.
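To give a rough idea of what such a cleaning pass involves, here is a minimal
sketch in Lua (5.1 or later, for the hex literals). It is not the decoder
described above. The names `sequence_length` and `sanitize_utf8` are
hypothetical, and replacing each offending byte with U+FFFD is just one
possible policy, assumed here for illustration. The byte ranges follow the
well-formed sequence table in RFC 3629, so overlong forms, surrogates and
anything above U+10FFFF are rejected.

```lua
local REPLACEMENT = "\239\191\189"  -- U+FFFD; using it is an assumption

-- Return the length (1-4) of the well-formed sequence starting at byte i,
-- or nil if the bytes at i do not start a well-formed sequence.
local function sequence_length(s, i)
  local b1 = string.byte(s, i)
  local len, lo, hi            -- lo..hi is the allowed range for byte 2
  if b1 < 0x80 then return 1
  elseif b1 < 0xC2 then return nil          -- stray successor, or C0/C1
  elseif b1 <= 0xDF then len, lo, hi = 2, 0x80, 0xBF
  elseif b1 == 0xE0 then len, lo, hi = 3, 0xA0, 0xBF   -- no overlongs
  elseif b1 == 0xED then len, lo, hi = 3, 0x80, 0x9F   -- no surrogates
  elseif b1 <= 0xEF then len, lo, hi = 3, 0x80, 0xBF
  elseif b1 == 0xF0 then len, lo, hi = 4, 0x90, 0xBF   -- no overlongs
  elseif b1 <= 0xF3 then len, lo, hi = 4, 0x80, 0xBF
  elseif b1 == 0xF4 then len, lo, hi = 4, 0x80, 0x8F   -- max U+10FFFF
  else return nil end                       -- F5..FF never appear
  for k = 1, len - 1 do
    local b = string.byte(s, i + k)         -- nil if the string is truncated
    local min, max = 0x80, 0xBF
    if k == 1 then min, max = lo, hi end
    if not b or b < min or b > max then return nil end
  end
  return len
end

-- Copy well-formed sequences through unchanged and substitute REPLACEMENT
-- for every byte that does not start one.
local function sanitize_utf8(s)
  local out, i, n = {}, 1, #s
  while i <= n do
    local len = sequence_length(s, i)
    if len then
      out[#out + 1] = string.sub(s, i, i + len - 1)
      i = i + len
    else
      out[#out + 1] = REPLACEMENT
      i = i + 1
    end
  end
  return table.concat(out)
end

print(sanitize_utf8("a\195\169b") == "a\195\169b")  -- true
```

Anything that comes out of a pass like this can then be handed to code that
assumes valid input and skips the checks.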
--
+- David Given --McQ-+
| dg@cowlark.com | Become immortal or die!
| (dg@tao-group.com) |
+- www.cowlark.com --+