(unicode) design questions

lua-l archive

Subject: (unicode) design questions
From: spir <denis.spir@...>
Date: Sun, 27 Dec 2009 13:42:21 +0100

Hello,

I'm building a Unicode library. Basically, a UniString would be a real sequence of characters, each of which is mainly defined by its code point. UniStrings would then have all the typical string methods. (This is in contrast with common Unicode string libraries, which in fact provide methods on UTF-8 strings; for me, those are useless.)

Coming from an OO background, I've written a UniChar type that does the job. Now I'm wondering whether this is overkill. Maybe implementing UniStrings as sequences of plain codes (ints) would do the job, at least in most cases? I don't know...
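
To make the "plain codes" alternative concrete, here is a minimal sketch in which a string is just a Lua array of integer code points and the operations are free functions. The names codes_equal and codes_view are hypothetical, chosen only for illustration; they are not part of the library described in this post.

```lua
-- Sketch only: a "unistring" as a plain array of code points (no UniChar objects).

-- Compare two code arrays element by element.
local function codes_equal(a, b)
  if #a ~= #b then return false end
  for i = 1, #a do
    if a[i] ~= b[i] then return false end
  end
  return true
end

-- Render a code array as space-separated hex codes, e.g. "61 20 e9 ff".
local function codes_view(codes)
  local parts = {}
  for i, code in ipairs(codes) do
    parts[i] = string.format("%x", code)
  end
  return table.concat(parts, " ")
end

local s = { 0x61, 0x20, 0xe9, 0xff }
print(codes_view(s))
```

With this representation, pairs, ipairs, and # work on the string without any extra machinery, at the cost of losing per-character methods and names.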

Advantages of UniChar type I can imagine:
* naming (eg "letter A")
* sensible output (eg "letter y with umlaut: #ff"; in particular, the code in hex!)
* information (eg isLetter), possibly retrieved from the Unicode database

Current methods of UniChar:
* clone (__call): new unichar from code
* view (__tostring), show (print view with name if available)
* equals (__eq)
* decode/encode (as of now, only from/to utf8 string representing a single char)
Actually, these methods could just as well be plain functions that take or return plain codes.
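
For comparison, the object-based variant listed above might look roughly like the following. This is a sketch under assumptions, not the actual UniChar code: the NAMES table, the "#ff" output format, and the label helper are made up for illustration, and decode/encode are left out.

```lua
-- Sketch only: a UniChar type built on a metatable.
local UniChar = {}
UniChar.__index = UniChar

-- Hypothetical name table; a real one would come from Unicode data.
local NAMES = { [0x41] = "letter A", [0xff] = "letter y with umlaut" }

function UniChar.new(code)
  return setmetatable({ code = code }, UniChar)
end

-- view: the code in hex, also used by tostring().
function UniChar:view()
  return string.format("#%x", self.code)
end
UniChar.__tostring = UniChar.view

-- equals: two UniChars are equal when their codes are equal.
function UniChar.__eq(a, b)
  return a.code == b.code
end
UniChar.equals = UniChar.__eq

-- label (hypothetical helper): "name: view" if a name is known, else the view.
function UniChar:label()
  local name = NAMES[self.code]
  if name then return name .. ": " .. self:view() end
  return self:view()
end

-- show: print the label.
function UniChar:show()
  print(self:label())
end

-- clone via __call: UniChar(code) creates a new character.
setmetatable(UniChar, { __call = function(_, code) return UniChar.new(code) end })

UniChar(0xff):show()
```

Every character then costs one table, which is the overhead the plain-code variant avoids.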

Also, in both cases, I wonder whether it's worth implementing characters/codes in C:
* With plain codes: to have a real 32-bit unsigned int
* With UniChar: to have a fast implementation of time-consuming tasks (esp. UTF-8 decode/encode).
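
As a baseline for judging whether C is needed, single-character UTF-8 encode/decode can be written in pure Lua using only arithmetic, so it runs on Lua 5.1 without a bit library. The sketch below is an assumption about what such functions could look like (utf8_encode and utf8_decode are hypothetical names); it does not reject overlong forms or surrogate code points.

```lua
-- Sketch only: UTF-8 encode/decode for one code point, arithmetic only.

-- Encode a code point (0 .. 0x10FFFF) as a UTF-8 Lua string.
local function utf8_encode(code)
  if code < 0x80 then
    return string.char(code)
  elseif code < 0x800 then
    return string.char(0xC0 + math.floor(code / 0x40),
                       0x80 + code % 0x40)
  elseif code < 0x10000 then
    return string.char(0xE0 + math.floor(code / 0x1000),
                       0x80 + math.floor(code / 0x40) % 0x40,
                       0x80 + code % 0x40)
  else
    return string.char(0xF0 + math.floor(code / 0x40000),
                       0x80 + math.floor(code / 0x1000) % 0x40,
                       0x80 + math.floor(code / 0x40) % 0x40,
                       0x80 + code % 0x40)
  end
end

-- Decode the character starting at byte i (default 1).
-- Returns the code point and the number of bytes consumed.
local function utf8_decode(s, i)
  i = i or 1
  local b1 = s:byte(i)
  if not b1 then error("no byte at position " .. i) end
  local len, code
  if b1 < 0x80 then
    return b1, 1
  elseif b1 >= 0xC0 and b1 < 0xE0 then
    len, code = 2, b1 - 0xC0
  elseif b1 >= 0xE0 and b1 < 0xF0 then
    len, code = 3, b1 - 0xE0
  elseif b1 >= 0xF0 and b1 < 0xF8 then
    len, code = 4, b1 - 0xF0
  else
    error("invalid UTF-8 lead byte at position " .. i)
  end
  for k = 1, len - 1 do
    local b = s:byte(i + k)
    if not b or b < 0x80 or b >= 0xC0 then
      error("invalid UTF-8 continuation byte at position " .. (i + k))
    end
    code = code * 0x40 + (b - 0x80)
  end
  return code, len
end

print(utf8_decode(utf8_encode(0xe9)))
```

Timing a loop over these two functions on realistic input would give a first indication of whether a C implementation is worth the extra build step.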

Hints welcome.

Also, comments are welcome on the list of UniString methods (half of them already written); see the file header below.

Denis

PS: decided to call the package 'lunistring' ;-)
________________________________

la vita e estrany

http://spir.wikidot.com/

================================================
--[[ type   U n i S t r i n g
    
    unicode character string
    basically a sequence of UniChar's, with string methods
    
    UniStrings show as a list of codes, eg "61 20 09 e9 ff 100 ffff 10ffff".
    TODO: UniStrings can be built from kinds of literals using
    hex codes '\xxxx' (up to 8 digits), like lua strings.
    
    content:
        ~ chars()               iterator on chars
        ~ char(i?)              --> char, last by default
        ~ size()                count?
        ~ holds(char/str)       --> logical
        ~ findfirst(char/str)   --> position/range
        ~ count(char/str)       --> positions/ranges
        ~ equals(unistring2)    __eq: --> logical
    --> pairs & ipairs also work!
    
    encode/decode to/from text:
        encode(encoding)                    --> lua string
        UniString.decode(string, encoding)  --> UniString
        --> use UniChar to/from UTF8/16/32 methods

    modification:
        ~ put(i?)               add char, end by default (=push)
        ~ change(i?)            last by default
        ~ remove(i?)            last by default
        ~ replace(char/str)     != change!
    
    new UniString:
        ~ clone(literal?)       __call: (TODO: from literal)
        ~ concat(c2)            __concat
        ~ slice(i1,i2)
        ~ multiply(n)           (<=> string.rep)
    
    higher order functions:
        ~ map(func)
        ~ filter(func)
    
    output:
        ~ view()                __tostring: list of codes
        ~ show()                write   name: view

    TODO:
        ~ clone from literal
        ~ findlast/findall
        ~ trim/lefttrim/righttrim

    possible TODO:
        ~ prototype --> subtype
        ~ startswith/endswith --> logical
        ~ sort (using custom sort func?) --> new UniString
        ~ find, count, & replace with regex?
        ~ replace using func?
-- ]]
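
To illustrate how a few of the listed methods could fit together, here is a minimal sketch of a UniString stored as an array of plain codes with a metatable. It is an assumption about one possible design, not the actual lunistring code: only size, view, equals, slice, concat, multiply, map, and filter are shown, and the constructor UniString.new is a hypothetical stand-in for clone.

```lua
-- Sketch only: UniString as an array of code points plus string-like methods.
local UniString = {}
UniString.__index = UniString

-- Build a UniString from an array of codes (or another UniString).
function UniString.new(codes)
  local self = setmetatable({}, UniString)
  for i, code in ipairs(codes or {}) do self[i] = code end
  return self
end

function UniString:size() return #self end

-- view: list of hex codes, e.g. "61 20 e9".
function UniString:view()
  local parts = {}
  for i = 1, #self do parts[i] = string.format("%x", self[i]) end
  return table.concat(parts, " ")
end
UniString.__tostring = UniString.view

function UniString.__eq(a, b)
  if #a ~= #b then return false end
  for i = 1, #a do
    if a[i] ~= b[i] then return false end
  end
  return true
end
UniString.equals = UniString.__eq

-- slice(i1, i2): new UniString with chars i1..i2 (i2 defaults to the end).
function UniString:slice(i1, i2)
  local out = UniString.new()
  for i = i1, i2 or #self do out[#out + 1] = self[i] end
  return out
end

-- concat: assumes both operands are UniStrings.
function UniString.__concat(a, b)
  local out = UniString.new(a)
  for i = 1, #b do out[#out + 1] = b[i] end
  return out
end
UniString.concat = UniString.__concat

-- multiply(n): n copies, like string.rep.
function UniString:multiply(n)
  local out = UniString.new()
  for _ = 1, n do
    for i = 1, #self do out[#out + 1] = self[i] end
  end
  return out
end

-- map(func): apply func to every code.
function UniString:map(func)
  local out = UniString.new()
  for i = 1, #self do out[i] = func(self[i]) end
  return out
end

-- filter(func): keep the codes for which func returns true.
function UniString:filter(func)
  local out = UniString.new()
  for i = 1, #self do
    if func(self[i]) then out[#out + 1] = self[i] end
  end
  return out
end

local s = UniString.new{ 0x61, 0x20, 0xe9 }
print(s, s:slice(2), s .. s)
```

Because the codes live in the array part of the table, ipairs and # work directly on a UniString, which matches the "pairs & ipairs also work" note in the header.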
