Re: htmlentities table

lua-l archive

Subject: Re: htmlentities table
From: David Given <dg@...>
Date: Fri, 28 Oct 2005 20:56:43 +0100

On Friday 28 October 2005 18:33, Rici Lake wrote:
[...]
> A pattern which works is "[^\128-\191][\128-\191]*"
>
> This won't pickup invalid utf8 sequences, but if the string is valid
> utf8, that will pick up one utf8 sequence
[...]
> If you look at the definition of utf-8 you'll see why it works; a utf-8
> sequence is either a single 7-bit byte (i.e. < 128) or a first byte
> (actually in the range 0xC2-0xF4, but for simplicity we can say >=
> 0xC0) followed by a determined number of successor bytes, each of which
> carries 6 bits and is in the range 0x80-0xBF, or 128-191.

Yes, but the pattern won't actually pick up a valid UTF8 character, because 
it's missing the prefix byte; which means you can't compare it with a string 
constant containing a UTF8 character. (Unless you string slice your string 
constant to remove the first character, of course.)

I'd be more inclined to use "[\128-\255]*[^\128-\255]". This will pick up 
exactly one character, regardless of how long it is. If you're only 
interested in non-ASCII characters, then replace the * with a + (I think; 
disclaimer: untested).
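
Since the pattern above is flagged as untested, a quick check is cheap. The following is a minimal sketch, assuming Lua 5.1 or later (string.gmatch was called string.gfind in 5.0); the helper name collect and the sample string are made up for illustration. It runs both patterns over "a", U+00E9 (bytes 195 169) and "b" and gathers the matches:

```lua
-- Sample input: "a", then U+00E9 encoded as bytes 195 169, then "b".
local sample = "a\195\169b"

-- Hypothetical helper: collect every match of a pattern into a list.
local function collect(str, pattern)
  local out = {}
  for m in string.gmatch(str, pattern) do
    out[#out + 1] = m
  end
  return out
end

-- Pattern quoted at the top: one non-successor byte, then successor bytes.
local lead_then_tail = collect(sample, "[^\128-\191][\128-\191]*")

-- Pattern suggested in the reply: high bytes, then one ASCII byte.
local high_then_ascii = collect(sample, "[\128-\255]*[^\128-\255]")

for _, m in ipairs(lead_then_tail) do print("lead_then_tail", #m) end
for _, m in ipairs(high_then_ascii) do print("high_then_ascii", #m) end
```

On this input the first pattern yields three matches of 1, 2 and 1 bytes, so the two-byte match does include the prefix byte 195. The second pattern yields two matches of 1 and 3 bytes, because the run of high bytes is glued to the ASCII byte that follows it. That suggests the objection and the alternative above both deserve a second look before being relied on.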

[...]
> You can use an even simpler one to count the number of utf-8 sequences
> in a string; just count the number of non-successor bytes
> ([^\128-\191]).

That's a neat trick that I hadn't thought of. I'll remember it --- ta.
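
The counting trick can be sketched in a couple of lines, relying on the fact that string.gsub returns the number of substitutions as its second result. The function name utf8_count is an assumption for illustration, and the result is only meaningful if the input is already valid UTF8:

```lua
-- Count UTF8 sequences by counting bytes that are NOT successor bytes
-- (successor bytes are in the range 128-191).
local function utf8_count(s)
  local _, count = string.gsub(s, "[^\128-\191]", "")
  return count
end

print(utf8_count("a\195\169b"))    -- "a", U+00E9, "b": 3 sequences
print(utf8_count("\226\130\172"))  -- U+20AC alone: 1 sequence
```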

It's worth pointing out that the UTF8 standard actually specifies behaviour 
for all the nasty edge cases where you have invalid strings. If you want 
compliant string processing, it's really irritating to get them right. Here's 
a document that explains it all, and lets you test it --- just load it into a 
UTF8-aware text viewer:

http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt

What we do is to run potentially-invalid UTF8 strings (like stuff that's read 
over the 'net) through an armoured, error-detecting UTF8 decoder that turns 
them into guaranteed-correct UTF8. It's expensive, but it allows us to use a 
highly optimised, simplified UTF8 decoder to process the strings later.
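
To give a rough idea of what such a clean-up pass involves, here is a minimal sketch in Lua. It is not the decoder described above; the names sequence_length and sanitize are hypothetical, and replacing each offending byte with U+FFFD is just one possible policy, assumed here for simplicity. It rejects stray successor bytes, invalid prefix bytes, truncated sequences, overlong encodings, surrogates and code points above U+10FFFF:

```lua
local REPLACEMENT = "\239\191\189"  -- U+FFFD encoded as UTF8

-- Return the length (1-4) of a valid sequence starting at byte i, or nil.
local function sequence_length(s, i)
  local b1 = string.byte(s, i)
  if b1 < 128 then return 1 end
  local len, min, mod
  if b1 >= 194 and b1 <= 223 then len, min, mod = 2, 128, 32        -- 0xC2-0xDF
  elseif b1 >= 224 and b1 <= 239 then len, min, mod = 3, 2048, 16   -- 0xE0-0xEF
  elseif b1 >= 240 and b1 <= 244 then len, min, mod = 4, 65536, 8   -- 0xF0-0xF4
  else return nil end  -- stray successor byte or invalid prefix byte
  local cp = b1 % mod  -- payload bits of the prefix byte
  for k = 1, len - 1 do
    local b = string.byte(s, i + k)
    if not b or b < 128 or b > 191 then return nil end  -- truncated or bad successor
    cp = cp * 64 + (b - 128)
  end
  if cp < min then return nil end                     -- overlong encoding
  if cp >= 55296 and cp <= 57343 then return nil end  -- surrogates U+D800-U+DFFF
  if cp > 1114111 then return nil end                 -- beyond U+10FFFF
  return len
end

-- Copy valid sequences through; replace each offending byte with U+FFFD.
local function sanitize(s)
  local out, i = {}, 1
  while i <= #s do
    local len = sequence_length(s, i)
    if len then
      out[#out + 1] = string.sub(s, i, i + len - 1)
      i = i + len
    else
      out[#out + 1] = REPLACEMENT
      i = i + 1
    end
  end
  return table.concat(out)
end

print(sanitize("a\255b") == "a" .. REPLACEMENT .. "b")
```

Once a string has been through something like this, later code can treat every non-successor byte as the start of a well-formed sequence without rechecking.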

-- 
+- David Given --McQ-+ 
|  dg@cowlark.com    | Become immortal or die!
| (dg@tao-group.com) | 
+- www.cowlark.com --+
