

Robert Raschke wrote:
You can find most of the Plan 9 tools (including the UTF-8 support and
many, many, many uses of it) ported to unix at



Re: UTF-8 [was Re: LuaSocket http ftp smpt...]
Klaus Ripke <>
Thu, 3 Feb 2005 13:01:44 +0100
Lua list <>


On Thursday 03 February 2005 12:29, David Burgess wrote:
Philip Hazel's PCRE implementation does UTF-8 rather well. If you
are looking for a UTF-8 base, it may be worth a look.
thx, that's pretty much the right thing, but also kind of a biggy.

Their character property table amounts to 88K,
while the Plan 9 thingy is about 12K
(not 2K, dropped the 1 in the previous post).

In PCRE 6.7 ChangeLog:

Version 6.5 01-Feb-06
18. Changes to the handling of Unicode character properties:

    (a) Updated the table to Unicode 4.1.0.

    (b) Recognize characters that are not in the table as "Cn" (undefined).

    (c) I revised the way the table is implemented to a much improved format
        which includes recognition of ranges. It now supports the ranges that
        are defined in UnicodeData.txt, and it also amalgamates other
        characters into ranges. This has reduced the number of entries in the
        table from around 16,000 to around 3,000, thus reducing its size
        considerably. I realized I did not need to use a tree structure after
        all - a binary chop search is just as efficient. Having reduced the
        number of entries, I extended their size from 6 bytes to 8 bytes to
        allow for more data.

    (d) Added support for Unicode script names via properties such as \p{Han}.
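
To illustrate the idea in item (c), here is a minimal sketch of a range table searched with a binary chop. It is not PCRE's actual ucptable.c layout: the struct ucp_range, the function ucp_category and the handful of sample ranges are all made up for illustration, and a real table would cover every assigned code point. Anything not found falls through to "Cn", as in item (b).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical entry layout (NOT PCRE's real format): one record
   describes a whole run of code points sharing a general category. */
typedef struct {
    uint32_t first;        /* first code point of the range */
    uint32_t last;         /* last code point of the range (inclusive) */
    const char *category;  /* general category, e.g. "Lu" */
} ucp_range;

/* Tiny sample table, sorted by code point, ranges do not overlap.
   Deliberately incomplete: only a few well-known runs are listed. */
static const ucp_range ucp_table[] = {
    { 0x0030, 0x0039, "Nd" },  /* ASCII digits */
    { 0x0041, 0x005A, "Lu" },  /* ASCII uppercase letters */
    { 0x0061, 0x007A, "Ll" },  /* ASCII lowercase letters */
    { 0x4E00, 0x9FA5, "Lo" },  /* part of the CJK unified ideographs */
    { 0xAC00, 0xD7A3, "Lo" },  /* Hangul syllables */
};

/* Binary chop over the sorted ranges. Code points that fall in no
   range are reported as "Cn" (undefined). */
const char *ucp_category(uint32_t cp)
{
    size_t lo = 0;
    size_t hi = sizeof ucp_table / sizeof ucp_table[0];

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (cp < ucp_table[mid].first)
            hi = mid;                        /* look in the lower half */
        else if (cp > ucp_table[mid].last)
            lo = mid + 1;                    /* look in the upper half */
        else
            return ucp_table[mid].category;  /* cp lies inside this range */
    }
    return "Cn";
}
```

Going by the rough figures in the ChangeLog, about 16,000 entries of 6 bytes come to roughly 96,000 bytes, while about 3,000 entries of 8 bytes come to roughly 24,000 bytes, so the wider entries are more than paid for by the smaller count.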

PCRE 6.4's ucptable.c: 443KB
PCRE 6.7's ucptable.c: 87KB

This should slightly reduce the compiled size.

Philippe Lhoste
--  (near) Paris -- France