lua-l archive



I've written a multibyte string library for Lua 5.1. For UTF-8 it contains an interface to the utf8proc library (version 1.1.1 required). The library doesn't solve all Unicode issues, but it can be helpful if one wants to do things like:

- checking validity of a UTF-8 string

- Unicode normalization (NFC, NFD, NFKC or NFKD) (including Hangul)

- stripping of "default ignorable characters"

- accessing a byte-string by its character indices (in the meaning of
  grapheme clusters according to UAX#29, NOT UTF-32 "characters"),
  rather than by its byte indices, for example to get the first 30
  characters of a string

- case-folding for case-insensitive string comparison

Unicode might be the best available standard for international text, but in my opinion it has some REALLY ANNOYING design flaws, especially if you want to use it in a general way, instead of just using it for character set transformation or archival storage. Treating Unicode as the "only universal standard" at the expense of abstractness is a bad idea IMHO!

One fact some people are not aware of is that 16 bits are not sufficient to represent a complete character (grapheme cluster) in Unicode. To my knowledge, not even 32 bits are sufficient to represent a complete character (please correct me if I'm wrong). What you can achieve with 32 bits (or at least 21 bits) is to store a Unicode codepoint, but a character in the sense of a "grapheme cluster" may consist of multiple codepoints. The way to calculate grapheme cluster boundaries is quite complicated, needs huge tables, and can be extended in future Unicode versions to support new characters.

See UAX #29 (Unicode Text Segmentation) for further information.

The "mbstr" library stores a multi-byte string as userdata based on the following C struct:

typedef struct mbstr_lua {
  void (*free)(struct mbstr_lua *mbstr);  // frees 'offsets' and 'data'
                                          // arrays
  int nclusters;
  int nbytes;
  int *offsets;  // (nclusters + 1) elements, offsets[nclusters] = nbytes
  char *data;    // (nbytes + 1) elements, data[nbytes] = 0
} mbstr_lua_t;

The field 'data' points to the byte values used to represent the string, while the field 'offsets' points to an array of offsets indicating the starting position of each character (or "grapheme cluster" in Unicode). This way, C functions accessing the userdata do not need to be aware of the algorithm that was used to determine character boundaries, and the time to access a character at a particular index remains constant (independent of the total string length).

On the Lua side you can create a multi-byte string object by calling mbstr.from_singlebytestring(str) or mbstr.from_utf8(str). The latter function also accepts a table to select additional mappings to be performed. It can contain the following fields:

- "stable"    (if true, then follow unicodes versioning stability)
- "compat"    (if true, then replace compatibility characters)
- "compose"   (if true, then compose characters into one codepoint
               where possible)
- "decompose" (if true, then decompose characters)
- "ignore"    (if true, then strip default ignorable characters)
- "rejectna"  (if true, then return nil for strings containing
               non-assigned code points)
- "nlf"       (convert NLF chars to LS, PS or LF
               ("ls", "ps", "lf" as value))
- "stripcc"   (if true, then strip control characters)
- "casefold"  (if true, then do a case-folding)
- "lump"      (if true, perform replacements according to lump.txt)
- "stripmark" (if true, then strip away marks
               (like accents, diaeresis, ...))

Note: mbstr.from_utf8 will return nil for invalid input, e.g. invalid UTF-8 sequences, UTF-8-encoded UTF-16 surrogates, or incomplete grapheme clusters.

To easily include constant multi-byte string objects in your source, you could define a function like "u":

function u(str)
  return mbstr.from_utf8(str, { stable = true, compose = true })
end

hello_world = u"Hello World!"

The options { stable = true, compose = true } are used to select NFC normalization. In the example above the multi-byte string object is created at run-time; this becomes inefficient if you use such constructs in a loop.

As I said before, my approach doesn't solve all Unicode problems. Using userdata for storing strings is more of a hack than a solution. (For example, I experienced trouble when using table.concat to concatenate the multi-byte strings.) The library is also still incomplete: it doesn't support things like string.format, string.find, etc. Perhaps someone would like to extend it.

You can find the latest version of utf8proc (1.1.1) at:

The multibyte extension for lua is available at:

Please note that the Lua extension has not been tested much and depends on the latest(!) version of libutf8proc.

I would like to ask: what is planned for the next versions of Lua with regard to multi-byte charsets?

Jan Behrens

FlexiGuided GmbH
Johannistr. 12
10117 Berlin
Phone: [030] 9789.4550
Fax: [030] 9789.4551

Juergen Axel Kistner
Andreas Nitsche

Amtsgericht Berlin-Charlottenburg

HRB 97488 B