Hi,

I've written a multibyte string library for Lua 5.1. For UTF-8 it contains an interface to the utf8proc library (version 1.1.1 required). The library doesn't solve all Unicode issues, but it can be helpful if one wants to do things like:

- checking validity of a UTF-8 string

- Unicode normalization (NFC, NFD, NFKC or NFKD) (including Hangul)

- stripping of "default ignorable characters"
  like SOFT-HYPHEN or ZERO-WIDTH-SPACE

- accessing a byte-string by its character indices (in the meaning of
  grapheme clusters according to UAX#29, NOT UTF-32 "characters"),
  rather than by its byte indices, for example to get the first 30
  characters of a string

- case-folding for case-insensitive string comparison


Unicode might be the best available standard for international text, but in my opinion it has some REALLY ANNOYING design flaws, especially if you want to use it in a general way instead of just using it for character set transformation or archival storage. Treating Unicode as the "only universal standard" at the expense of abstractness is a bad idea IMHO!

One fact that some people are not aware of is that 16 bits are not sufficient to represent a complete character (grapheme cluster) in Unicode. To my knowledge, not even 32 bits are sufficient to represent a complete character (please correct me if I'm wrong). What you can achieve with 32 bits (or with at least 21 bits) is storing a single Unicode code point, but a character in the sense of a "grapheme cluster" may consist of multiple code points. The way to calculate grapheme cluster boundaries is quite complicated, needs huge tables, and can be extended in future Unicode versions to support new characters.

See http://www.unicode.org/reports/tr29/tr29-11.html for further information.
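
To make the difference between bytes, code points and grapheme clusters concrete, here is a minimal C sketch. It is an illustration only, not code from utf8proc or mbstr. The helper name count_codepoints and the sample constant are hypothetical, and the function assumes that its input is already valid UTF-8:

```c
#include <assert.h>
#include <stddef.h>

/* Counts code points in a NUL-terminated UTF-8 string by counting every
   byte that is not a continuation byte (10xxxxxx). Assumes valid UTF-8.
   count_codepoints is a hypothetical helper name. */
static size_t count_codepoints(const char *s) {
  size_t n = 0;
  assert(s != NULL);
  for (; *s != '\0'; s++) {
    if (((unsigned char)*s & 0xC0) != 0x80) n++;
  }
  return n;
}

/* "e" followed by U+0301 COMBINING ACUTE ACCENT (bytes 0xCC 0x81):
   one user-perceived character, but two code points and three bytes. */
static const char e_acute_decomposed[] = "e\xCC\x81";
```

Counting code points is the easy part. Finding out that these two code points form a single grapheme cluster requires the UAX#29 rules and the Unicode property tables, which is what the utf8proc library is used for here.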


The "mbstr" library stores a multi-byte string as userdata based on the following C struct:

typedef struct mbstr_lua {
  void (*free)(struct mbstr_lua *mbstr);  // frees 'offsets' and 'data'
                                          // arrays
  int nclusters;
  int nbytes;
  int *offsets;  // (nclusters + 1) elements, offsets[nclusters] = nbytes
  char *data;    // (nbytes + 1) elements, data[nbytes] = 0
} mbstr_lua_t;

The field 'data' points to the byte values used to represent the string, while the field 'offsets' points to an array of offsets, indicating the starting position of each character (or "grapheme cluster" in Unicode). By doing things this way, C functions accessing the userdata do not need to be aware of the actual algorithm, which was used to determine character boundaries, and the time for accessing a character at a particular index remains constant (not dependent on the total string length).
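
As a sketch of how C code might use this layout, the following two helpers look up a single cluster and compute the byte length of the first n clusters, both in constant time. The struct is repeated from above so that the example is self-contained. The helper names mbstr_cluster and mbstr_prefix_bytes are hypothetical and not part of the library:

```c
#include <assert.h>
#include <stddef.h>

/* Struct layout as given above. */
typedef struct mbstr_lua {
  void (*free)(struct mbstr_lua *mbstr);
  int nclusters;
  int nbytes;
  int *offsets;  /* (nclusters + 1) elements, offsets[nclusters] = nbytes */
  char *data;    /* (nbytes + 1) elements, data[nbytes] = 0 */
} mbstr_lua_t;

/* Hypothetical helper: returns a pointer to cluster i (0-based) and
   stores its length in bytes in *len. No scanning is needed. */
static const char *mbstr_cluster(const mbstr_lua_t *s, int i, int *len) {
  assert(s != NULL && len != NULL);
  assert(i >= 0 && i < s->nclusters);
  *len = s->offsets[i + 1] - s->offsets[i];
  return s->data + s->offsets[i];
}

/* Hypothetical helper: number of bytes occupied by the first n clusters,
   clamped to the string length (e.g. "the first 30 characters"). */
static int mbstr_prefix_bytes(const mbstr_lua_t *s, int n) {
  assert(s != NULL && n >= 0);
  if (n > s->nclusters) n = s->nclusters;
  return s->offsets[n];
}
```

Because offsets has one extra element holding nbytes, the length of the last cluster can be computed the same way as all the others, without a special case.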

On the Lua side you can create the multi-byte string object by calling mbstr.from_singlebytestring(str) or mbstr.from_utf8(str). The latter function also accepts a table as a second argument to select additional mappings to be done. It can contain the following fields:

- "stable"    (if true, then follow Unicode's versioning stability)
- "compat"    (if true, then replace compatibility characters)
- "compose"   (if true, then compose characters into one codepoint if
               possible)
- "decompose" (if true, then decompose characters)
- "ignore"    (if true, then strip default ignorable characters)
- "rejectna"  (if true, then return nil for strings containing
               non-assigned code points)
- "nlf"       (convert NLF chars to LS, PS or LF
               ("ls", "ps", "lf" as value))
- "stripcc"   (if true, then strip control characters)
- "casefold"  (if true, then do a case-folding)
- "lump"      (if true, perform replacements according to lump.txt)
- "stripmark" (if true, then strip away marks
               (like accents, diaeresis, ...))

Note: mbstr.from_utf8 will return "nil" as a consequence of invalid input, e.g. invalid UTF-8 sequences, UTF-8 encoded 16 bit surrogates or incomplete grapheme clusters.
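
To illustrate what "invalid input" means at the byte level, here is a minimal validity check in C. It is a sketch under the usual UTF-8 well-formedness rules, not the code utf8proc or mbstr actually use, and the name utf8_is_valid is hypothetical. It rejects overlong forms, UTF-8 encoded surrogates, truncated sequences and values above U+10FFFF. It does not check for incomplete grapheme clusters:

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if the n bytes at s are well-formed UTF-8, otherwise 0.
   Hypothetical helper for illustration; not part of utf8proc. */
static int utf8_is_valid(const unsigned char *s, size_t n) {
  size_t i = 0;
  assert(s != NULL || n == 0);
  while (i < n) {
    unsigned char c = s[i];
    unsigned long cp, min;
    size_t extra, k;
    if (c < 0x80) { i++; continue; }  /* ASCII byte */
    else if ((c & 0xE0) == 0xC0) { extra = 1; cp = c & 0x1F; min = 0x80; }
    else if ((c & 0xF0) == 0xE0) { extra = 2; cp = c & 0x0F; min = 0x800; }
    else if ((c & 0xF8) == 0xF0) { extra = 3; cp = c & 0x07; min = 0x10000; }
    else return 0;  /* stray continuation byte or invalid lead byte */
    if (n - i <= extra) return 0;  /* truncated sequence */
    for (k = 1; k <= extra; k++) {
      if ((s[i + k] & 0xC0) != 0x80) return 0;  /* not a continuation */
      cp = (cp << 6) | (s[i + k] & 0x3F);
    }
    if (cp < min) return 0;                      /* overlong encoding */
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;  /* encoded surrogate */
    if (cp > 0x10FFFF) return 0;                 /* beyond Unicode range */
    i += extra + 1;
  }
  return 1;
}
```

For example, the byte sequence 0xED 0xA0 0x80 decodes to the surrogate U+D800 and is rejected, while 0xCC 0x81 (U+0301) is accepted.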

To easily include constant multi-byte string objects in your source, you could define a function like "u":

function u(str)
  -- return the new object; without "return", u() would yield nil
  return mbstr.from_utf8(str, { stable = true, compose = true })
end
hello_world = u"Hello World!"

The options { stable = true, compose = true } select NFC normalization. In the example above, the multi-byte string object is created at run-time, so such constructs become inefficient if you use them in a loop.


As I said before, my approach doesn't solve all Unicode problems. Using "userdata" for storing strings is more of a hack than a solution. (For example, I experienced trouble when using table.concat to concatenate the multi-byte strings.) The library is also still incomplete: it doesn't support things like string.format, string.find, etc. Perhaps someone would like to extend it.

You can find the latest version of utf8proc (1.1.1) at:
http://www.flexiguided.de/pub/utf8proc-v1.1.1.tar.gz

The multibyte extension for Lua is available at:
http://www.flexiguided.de/pub/lua-mbstr-v0.1.tar.gz

Please note that the Lua extension has not been tested much and depends on the latest(!) version of libutf8proc.


I would also like to ask: what is planned for the next versions of Lua with regard to multi-byte character sets?


Jan Behrens
