ICU4Lua 0.13A - Regular Expressions, Collation and StringPrep/IDNA

Subject: ICU4Lua 0.13A - Regular Expressions, Collation and StringPrep/IDNA
From: Duncan Cross <duncan.cross@...>
Date: Wed, 20 May 2009 19:25:25 +0100
Hi List,

Yet another new release of ICU4Lua. First of all, this one is based on
ICU 4.2, instead of 4.0 as in previous versions.

The LuaForge files:
<http://luaforge.net/frs/?group_id=460>

I have changed the matching engine used by icu.ustring.match,
icu.utf8.match et al. By default, the character classes still match ASCII
only: for example, %a matches only [a-zA-Z]. To match the full Unicode set of
letters, use %!a instead. The same applies to all the other character
classes: prepend an exclamation mark to the class letter to get the Unicode
version.
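
As a rough illustration of the difference, the sketch below runs the same
text through both forms of the letter class. It assumes that icu.utf8.match
takes (string, pattern) in the same way as Lua's own string.match. The icu
table is passed in as a parameter so that no particular require name is
assumed, and letters_ascii_vs_unicode is a made-up helper name.

```lua
-- Hypothetical helper: compare the ASCII-only and Unicode letter classes.
-- Assumes icu.utf8.match(s, pattern) mirrors string.match(s, pattern).
local function letters_ascii_vs_unicode(icu, s)
    local ascii_run   = icu.utf8.match(s, "%a+")   -- [a-zA-Z] only
    local unicode_run = icu.utf8.match(s, "%!a+")  -- any Unicode letter
    return ascii_run, unicode_run
end
```

For a UTF-8 string containing accented letters, the first result should stop
at the first non-ASCII letter while the second should cover the whole word,
if the engine behaves as described above.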

Below is a brief overview of the ICU functionality now wrapped by ICU4Lua. It
does not list all functions, just a subset of the most notable new ones. More
complete documentation is included in the release files.

====

icu.convert(string, current_encoding, new_encoding)
    Convert a Lua string encoded in one encoding to another.

icu.defaultencoding()
    The name of the default codepage as detected by ICU.
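
A minimal sketch of how the two functions might be combined, assuming the
encoding arguments are ordinary ICU converter names such as "UTF-8" or
"ISO-8859-1". The helper name to_utf8 is hypothetical, the icu table is
passed in rather than required, and the behaviour of icu.convert on invalid
input is not covered here.

```lua
-- Hypothetical helper: normalise a byte string to UTF-8.
local function to_utf8(icu, s, source_encoding)
    -- Fall back to the codepage ICU detected when none is given.
    source_encoding = source_encoding or icu.defaultencoding()
    return icu.convert(s, source_encoding, "UTF-8")
end
```

With the library loaded as icu, a call would look like
to_utf8(icu, bytes, "ISO-8859-1").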

=========
Collation
=========
icu.collator.open(locale)
    Open a collator for the given locale, which must be a Lua string
      (e.g. "de").
    If the collator could not be opened, returns nil and an error message.

icu.collator.strength(col[, new_value])
    Either sets the strength of the collator, or returns the current strength
      setting if no new value is given.
    Valid strength values are:
        icu.collator.PRIMARY
        icu.collator.SECONDARY
        icu.collator.TERTIARY
        icu.collator.QUATERNARY
        icu.collator.IDENTICAL
        icu.collator.DEFAULT_STRENGTH

icu.collator.lessthan(col, a, b)
icu.collator.lessorequal(col, a, b)
icu.collator.equals(col, a, b)
    Functions for comparing ustrings a and b with the given collator.
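
The collator functions above fit naturally into table.sort. The following is
only a sketch: sort_for_locale is a hypothetical helper, the icu table is
passed in as a parameter, and the array is assumed to contain ustrings
already.

```lua
-- Hypothetical helper: sort an array of ustrings in place for a locale.
-- Returns the array, or nil and an error message if no collator opens.
local function sort_for_locale(icu, locale, ustrings, strength)
    local col, err = icu.collator.open(locale)
    if not col then
        return nil, err
    end
    if strength then
        -- e.g. icu.collator.PRIMARY to compare base letters only
        icu.collator.strength(col, strength)
    end
    table.sort(ustrings, function(a, b)
        return icu.collator.lessthan(col, a, b)
    end)
    return ustrings
end
```

lessthan is used rather than lessorequal because table.sort requires a
strict ordering from its comparison function.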

===============
StringPrep/IDNA
===============
icu.stringprep.openbytype(type)
    Open a StringPrep profile object. type can be one of:
      icu.stringprep.RFC3491_NAMEPREP
      icu.stringprep.RFC3530_NFS4_CS_PREP
      icu.stringprep.RFC3530_NFS4_CS_PREP_CI
      icu.stringprep.RFC3530_NFS4_CIS_PREP
      icu.stringprep.RFC3530_NFS4_MIXED_PREP_PREFIX
      icu.stringprep.RFC3530_NFS4_MIXED_PREP_SUFFIX
      icu.stringprep.RFC3722_ISCSI
      icu.stringprep.RFC3920_NODEPREP
      icu.stringprep.RFC3920_RESOURCEPREP
      icu.stringprep.RFC4011_MIB
      icu.stringprep.RFC4013_SASLPREP
      icu.stringprep.RFC4505_TRACE
      icu.stringprep.RFC4518_LDAP
      icu.stringprep.RFC4518_LDAP_CI

icu.stringprep.prepare(profile, ustr)
    Prepare the given ustring according to the StringPrep profile.
    Returns either the prepared ustring, or nil and an error message.
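
A short sketch of the open-then-prepare sequence, using the Nameprep profile
as an example. The helper name nameprep is made up, the icu table is passed
in as a parameter, and how openbytype reports a failure is not described
above, so that case is left unhandled.

```lua
-- Hypothetical helper: apply the RFC 3491 Nameprep profile to a ustring.
local function nameprep(icu, ustr)
    local profile = icu.stringprep.openbytype(icu.stringprep.RFC3491_NAMEPREP)
    -- Yields the prepared ustring, or nil plus an error message.
    return icu.stringprep.prepare(profile, ustr)
end
```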

icu.idna.toascii(ustr)
icu.idna.tounicode(ustr)
icu.idna.idntoascii(ustr)
icu.idna.idntounicode(ustr)
    Internationalizing Domain Names in Applications (IDNA) transformations.
    All take and return a ustring. toascii and tounicode convert individual
    domain labels (e.g. "www", "lua" or "org"), while idntoascii and
    idntounicode convert full domain names ("www.lua.org").
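
To show how the full-name pair fits together, here is a sketch that converts
a domain name to its ASCII form and back. idn_round_trip is a hypothetical
helper, the icu table is passed in as a parameter, and error reporting for
invalid names is not described above, so it is not handled.

```lua
-- Hypothetical helper: convert a full domain name (a ustring) to its
-- ASCII-compatible form, then back to Unicode.
local function idn_round_trip(icu, domain)
    local ascii = icu.idna.idntoascii(domain)
    local unicode = icu.idna.idntounicode(ascii)
    return ascii, unicode
end
```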

===================
Regular Expressions
===================
icu.regex.compile(pattern[, flags])
    Creates a new compiled regex pattern object. For details on the syntax
      supported by ICU, see <http://userguide.icu-project.org/strings/regexp>
    pattern can be a ustring or a Lua string. If a Lua string, it is expected
      to be encoded in the default codepage.
    Supported flags include:
        i  (Case insensitive)
        x  (Comments mode)
        s  (The dot '.' matches all characters including newlines)
        m  (Multiline mode)

icu.regex.match(regex, text[, start_index])
    Find the first place where the regex matches the given text, optionally
      starting the search at the given start index (one-based).
    The returned value is either false if no match was found, or a match
      object that contains these named fields:
    * value: The matching substring, as a ustring
    * start, stop: Substring indices in the source text (one-based,
      inclusive)
    In addition, match[1] to match[n] are the captures within the match, each
      with the same named fields, and match[0] is the whole match again.
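
As an illustration of the match object, the sketch below extracts the three
capture groups of a date pattern. find_date is a hypothetical helper, the icu
table is passed in as a parameter, and text is assumed to be a ustring.

```lua
-- Hypothetical helper: pull year, month and day out of the first
-- ISO-style date in text, along with the position of the whole match.
local function find_date(icu, text)
    -- A Lua string pattern is allowed; it is read in the default codepage.
    local regex = icu.regex.compile([[(\d{4})-(\d{2})-(\d{2})]])
    local m = icu.regex.match(regex, text)
    if not m then
        return nil  -- icu.regex.match gives false when nothing is found
    end
    -- m[1] to m[3] are the capture groups; m.start and m.stop locate
    -- the whole match in the source text.
    return m[1].value, m[2].value, m[3].value, m.start, m.stop
end
```

The three values returned are ustrings, not Lua strings, since that is how
the value field is described above.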

icu.regex.gmatch(regex, text)
    Returns an iterator over all of the matches found for a compiled regular
      expression, designed to be used in a for-loop:

    for match in icu.regex.gmatch(myRegex, inputText) do
        -- the "match" object is the same as described in the documentation
        -- for icu.regex.match()
    end
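
A slightly fuller sketch of the same loop, collecting every matching
substring into an array. all_matches is a hypothetical helper and the icu
table is passed in as a parameter.

```lua
-- Hypothetical helper: gather the value field of every match.
local function all_matches(icu, regex, text)
    local values = {}
    for m in icu.regex.gmatch(regex, text) do
        values[#values + 1] = m.value  -- each value is a ustring
    end
    return values
end
```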

icu.regex.replace(regex, text, replacement)
    Find all places where the given regular expression matches in text,
      replace them with a new value, and return the result.
    text must be a ustring, and replacement must be one of the following:
    * A ustring. You can use $0, $1, $2 etc to use captured substrings from
      the match (and $$ for a literal dollar sign).
    * A table. It will be indexed with the entire matching substring (as a
      ustring), and the value found must be either a new ustring or nil/false.
    * A function. It will be called with a single parameter, a match object
      as described in the documentation for icu.regex.match(). It must
      return either a ustring or nil/false.
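
A sketch of the function form of replacement, which avoids building a new
ustring by returning one taken from the match object. strip_quotes is a
hypothetical helper and the icu table is passed in as a parameter.

```lua
-- Hypothetical helper: remove the double quotes around quoted runs,
-- keeping the text inside them.
local function strip_quotes(icu, text)
    local regex = icu.regex.compile([["([^"]*)"]])
    return icu.regex.replace(regex, text, function(m)
        -- m[1].value is already a ustring, so it can be returned directly.
        return m[1].value
    end)
end
```

According to the description above, a ustring replacement containing $1
should achieve the same result.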

icu.regex.split(regex, text[, maximum])
    Returns an array of the substrings found by splitting text (which must be
      a ustring) using the given regex, with an optional maximum number of
      splits.
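
For example, splitting on commas with optional surrounding whitespace might
look like the sketch below. split_csv is a hypothetical helper and the icu
table is passed in as a parameter.

```lua
-- Hypothetical helper: split a ustring on commas, ignoring any
-- whitespace around each comma.
local function split_csv(icu, text, maximum)
    local regex = icu.regex.compile([[\s*,\s*]])
    if maximum then
        return icu.regex.split(regex, text, maximum)
    end
    -- Leave the optional argument out entirely when no limit is wanted.
    return icu.regex.split(regex, text)
end
```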

====

That's it for now, thanks for reading.

-Duncan