

On 27/12/09 12:42, spir wrote:
> I'm building a unicode library. Basically, a UniString would be a real sequence of characters; which themselves mainly are defined by their code (point). Then, unistrings would have all typical string methods. (This is in contrast with common unicode string libraries that in fact provide --for me, useless-- methods on utf8 strings.)

The usual set of questions that need answering whenever you deal with
Unicode are:

- when you say 'character', what precisely do you mean? (Unicode doesn't
have characters!)
- are you going to support astral plane code points, or just BMP code
points?
- how are you going to index into strings? (The intuitive sense of
'character' doesn't actually match code points!)
- what do you plan to do if the user tries to split a string at a point
where it's not supposed to be split (such as inside a composite glyph
made up of multiple code points)?
- how are you going to compare two strings that contain the same text
but represented differently (e.g. 'letter y with umlaut' vs 'letter
y'+'umlaut accent')?
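
To make the last three questions concrete, here is a minimal sketch. It
assumes Lua 5.3 or later, whose standard utf8 library postdates this
thread, and the variable names are only illustrative. It builds 'letter y
with umlaut' both ways, then shows that the two forms differ in code point
count, compare unequal, and can be split between the base letter and its
accent:

```lua
-- Assumption: Lua 5.3+ (standard utf8 library).
local precomposed = utf8.char(0x00FF)          -- U+00FF, y with diaeresis
local decomposed  = utf8.char(0x0079, 0x0308)  -- 'y' + U+0308 combining diaeresis

-- Same text to a reader, different code point sequences.
local n_pre = utf8.len(precomposed)   -- code points in the precomposed form
local n_dec = utf8.len(decomposed)    -- code points in the decomposed form
local same  = (precomposed == decomposed)  -- plain byte-wise comparison

-- Splitting after the first code point strands the accent on its own.
local cut  = utf8.offset(decomposed, 2)  -- byte index where code point 2 starts
local head = decomposed:sub(1, cut - 1)  -- the bare letter 'y'
local tail = decomposed:sub(cut)         -- the combining mark alone

print(n_pre, n_dec, same, head, #tail)
```

Nothing here normalizes the strings; deciding that the two forms are
"the same text" needs Unicode normalization data, which the utf8 library
does not provide.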

The reason why most libraries deal with UTF-8 only is that it's much,
much easier to work with UTF-8 than it is to deal with 'raw' Unicode ---
most simple problems vanish completely in UTF-8, and the complex ones
are no harder in UTF-8 than they are in raw Unicode. i.e., still very hard!
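
As a rough illustration of the 'simple problems vanish' point, the sketch
below again assumes Lua 5.3+ (for the \u{...} escape and utf8.charpattern),
and the sample text is made up. Ordinary byte-oriented string functions
keep working on UTF-8 because lead bytes, continuation bytes and ASCII
bytes never overlap:

```lua
-- Assumption: Lua 5.3+; the sample string is hypothetical.
local s = "na\u{EF}ve caf\u{E9}"   -- "naïve café" stored as UTF-8 bytes

-- Plain substring search is byte-wise, yet a valid UTF-8 needle can only
-- match at code point boundaries of a valid UTF-8 haystack.
local first, last = s:find("caf\u{E9}", 1, true)

-- Splitting on an ASCII delimiter is safe: bytes below 0x80 never occur
-- inside a multi-byte sequence.
local words = {}
for w in s:gmatch("[^ ]+") do
  words[#words + 1] = w
end

-- Walking the string one code point at a time takes a single pattern.
local count = 0
for _ in s:gmatch(utf8.charpattern) do
  count = count + 1
end

print(first, last, #words, count, #s)
```

The hard questions (grapheme boundaries, normalization, collation) are
untouched by any of this, which is the point made above.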

-- 
┌─── ───── ─────
│ "Sufficiently advanced incompetence is indistinguishable from
│ malice." -- Vernon Schryver