- Subject: ICU4Lua 0.13A - Regular Expressions, Collation and StringPrep/IDNA
- From: Duncan Cross <duncan.cross@...>
- Date: Wed, 20 May 2009 19:25:25 +0100
Hi List,
Yet another new release of ICU4Lua. First of all, this one is based on
ICU 4.2, instead of 4.0 as in previous versions.
The LuaForge files:
<http://luaforge.net/frs/?group_id=460>
I made a change to the matching engine used by icu.ustring.match,
icu.utf8.match et al. By default, the character classes still match ASCII
only. For example, %a matches only [a-zA-Z]. To match the full Unicode set of
letters, use %!a instead. This applies to all the other character classes as
well: put an exclamation mark before the class letter to use the Unicode
version.
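As a rough illustration of the difference, the sketch below runs the same
text through both forms of the letter class. It assumes that icu.utf8.match
takes (text, pattern) in the same order as string.match, and it takes the
loaded icu table as a parameter because how the module gets loaded is not
covered in this post. first_letter_run is a hypothetical helper name.

```lua
-- Hypothetical helper: returns the first run of letters in a UTF-8 string,
-- once with the ASCII-only class and once with the Unicode class.
-- Assumption: icu.utf8.match(text, pattern) mirrors string.match's
-- argument order.
local function first_letter_run(icu, text)
  local ascii_run = icu.utf8.match(text, "%a+")     -- [a-zA-Z] only
  local unicode_run = icu.utf8.match(text, "%!a+")  -- full Unicode letters
  return ascii_run, unicode_run
end
```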
Below is a brief overview of the ICU functionality now wrapped by ICU4Lua. It
does not list all functions, just a subset of the most notable new ones. More
complete documentation is included in the release files.
====
icu.convert(string, current_encoding, new_encoding)
Convert a Lua string encoded in one encoding to another.
icu.defaultencoding()
The name of the default codepage as detected by ICU.
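The two functions combine naturally. A minimal sketch, assuming the loaded
icu table is passed in as a parameter and that "utf-8" is accepted as an
encoding name; default_to_utf8 is a hypothetical helper name:

```lua
-- Hypothetical helper: re-encode a Lua string from the default codepage
-- (as detected by ICU) to UTF-8.
local function default_to_utf8(icu, s)
  local from = icu.defaultencoding()
  return icu.convert(s, from, "utf-8")
end
```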
=========
Collation
=========
icu.collator.open(locale)
Open a collator for the given locale, which must be a Lua string
(e.g. "de").
If the collator could not be opened, returns nil and an error message.
icu.collator.strength(col[, new_value])
Either sets the strength of the collator, or returns the current strength
setting if no new value is given.
Valid strength values are:
icu.collator.PRIMARY
icu.collator.SECONDARY
icu.collator.TERTIARY
icu.collator.QUATERNARY
icu.collator.IDENTICAL
icu.collator.DEFAULT_STRENGTH
icu.collator.lessthan(col, a, b)
icu.collator.lessorequal(col, a, b)
icu.collator.equals(col, a, b)
Functions for comparing ustrings a and b with the given collator.
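One likely use for these is as a comparator for table.sort. The sketch below
is only an illustration: it takes the loaded icu table as a parameter (how
the module is loaded is not covered here), and sort_for_locale is a
hypothetical helper name.

```lua
-- Hypothetical helper: sort an array of ustrings in place using the
-- collation rules of the given locale.
-- Returns the sorted array, or nil and an error message if the collator
-- could not be opened.
local function sort_for_locale(icu, locale, items, strength)
  local col, err = icu.collator.open(locale)
  if not col then
    return nil, err
  end
  if strength then
    -- e.g. icu.collator.PRIMARY
    icu.collator.strength(col, strength)
  end
  table.sort(items, function(a, b)
    return icu.collator.lessthan(col, a, b)
  end)
  return items
end
```

lessthan is the right choice here rather than lessorequal, because
table.sort expects a strict ordering from its comparator.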
===============
StringPrep/IDNA
===============
icu.stringprep.openbytype(type)
Open a StringPrep profile object. type can be one of:
icu.stringprep.RFC3491_NAMEPREP
icu.stringprep.RFC3530_NFS4_CS_PREP
icu.stringprep.RFC3530_NFS4_CS_PREP_CI
icu.stringprep.RFC3530_NFS4_CIS_PREP
icu.stringprep.RFC3530_NFS4_MIXED_PREP_PREFIX
icu.stringprep.RFC3530_NFS4_MIXED_PREP_SUFFIX
icu.stringprep.RFC3722_ISCSI
icu.stringprep.RFC3920_NODEPREP
icu.stringprep.RFC3920_RESOURCEPREP
icu.stringprep.RFC4011_MIB
icu.stringprep.RFC4013_SASLPREP
icu.stringprep.RFC4505_TRACE
icu.stringprep.RFC4518_LDAP
icu.stringprep.RFC4518_LDAP_CI
icu.stringprep.prepare(profile, ustr)
Prepare the given ustring according to the StringPrep profile.
Returns either the prepared ustring, or nil and an error message.
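A minimal sketch of the open/prepare sequence with the error case handled.
It takes the loaded icu table as a parameter (how the module is loaded is not
covered here), and nameprep is a hypothetical helper name:

```lua
-- Hypothetical helper: run a ustring through the RFC 3491 Nameprep profile.
-- Returns the prepared ustring, or nil and an error message.
local function nameprep(icu, ustr)
  local profile = icu.stringprep.openbytype(icu.stringprep.RFC3491_NAMEPREP)
  local prepared, err = icu.stringprep.prepare(profile, ustr)
  if not prepared then
    return nil, "nameprep failed: " .. tostring(err)
  end
  return prepared
end
```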
icu.idna.toascii(ustr)
icu.idna.tounicode(ustr)
icu.idna.idntoascii(ustr)
icu.idna.idntounicode(ustr)
Internationalizing Domain Names in Applications (IDNA) transformations. All
take and return a ustring. toascii and tounicode are for converting
individual domain labels (e.g. "www", "lua" or "org") while
idntoascii and idntounicode are for full domain names ("www.lua.org").
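To make the label/full-name distinction concrete, here is a small sketch
that picks the matching function. It takes the loaded icu table as a
parameter, and to_ascii and its is_full_domain flag are hypothetical names
used only for illustration:

```lua
-- Hypothetical helper: convert a ustring to its ASCII form, choosing the
-- function that fits the input.
--   is_full_domain = true  -> whole name such as "www.lua.org"
--   is_full_domain = false -> single label such as "lua"
local function to_ascii(icu, name, is_full_domain)
  if is_full_domain then
    return icu.idna.idntoascii(name)
  end
  return icu.idna.toascii(name)
end
```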
===================
Regular Expressions
===================
icu.regex.compile(pattern[, flags])
Creates a new compiled regex pattern object. For details on the syntax
supported by ICU, see <http://userguide.icu-project.org/strings/regexp>
pattern can be a ustring or a Lua string. If a Lua string, it is expected
to be encoded in the default codepage.
Supported flags include:
i (Case insensitive)
x (Comments mode)
s (The dot '.' matches all characters including newlines)
m (Multiline mode)
icu.regex.match(regex, text[, start_index])
Find the first place where the regex matches the given text, optionally
starting the search at the given start index (one-based).
The returned value is either false if no match was found, or a match
object that contains these named fields:
* value: The matching substring, as a ustring
* start, stop: Substring indices in the source text (one-based, inclusive)
...also, match[1] to match[n] are captures within the match, which have
the same named fields. match[0] is the match itself again.
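As a sketch of how the match object might be walked, the helper below
collects the start/stop positions of the whole match and of each capture. It
takes the loaded icu table and an already compiled regex as parameters;
match_spans is a hypothetical name, and the loop assumes that captures are
stored contiguously from index 1.

```lua
-- Hypothetical helper: return an array of { start = ..., stop = ... }
-- tables, first for the whole match and then for each capture.
-- Returns nil if the regex does not match.
local function match_spans(icu, re, text)
  local m = icu.regex.match(re, text)
  if not m then
    return nil
  end
  local spans = { { start = m.start, stop = m.stop } }
  local i = 1
  -- Assumption: the loop stops at the first absent capture.
  while m[i] do
    spans[#spans + 1] = { start = m[i].start, stop = m[i].stop }
    i = i + 1
  end
  return spans
end
```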
icu.regex.gmatch(regex, text)
Returns an iterator over all of the matches found for a compiled regular
expression, designed to be used in a for-loop:
for match in icu.regex.gmatch(myRegex, inputText) do
  -- the "match" object is the same as described in the documentation
  -- for icu.regex.match()
end
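For instance, a loop like that could gather every matching substring into an
array. This is only a sketch: it takes the loaded icu table and a compiled
regex as parameters, and all_match_values is a hypothetical helper name.

```lua
-- Hypothetical helper: collect the "value" field (a ustring) of every
-- match found in text, in order.
local function all_match_values(icu, re, text)
  local values = {}
  for m in icu.regex.gmatch(re, text) do
    values[#values + 1] = m.value
  end
  return values
end
```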
icu.regex.replace(regex, text, replacement)
Find all places where the given regular expression matches in text,
replace them with a new value, and return the result.
text must be a ustring, and replacement must be one of the following:
* A ustring. You can use $0, $1, $2 etc to use captured substrings from
the match (and $$ for a literal dollar sign).
* A table. It will be indexed with the entire matching substring (as a
ustring), and the value found must be either a new ustring or nil/false.
* A function. It will be called with a single parameter, a match object
as described in the documentation for icu.regex.match(). It must
return either a ustring or nil/false.
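The function form is handy when the decision depends on where the match is.
The sketch below replaces only matches that begin at or after a given index.
It takes the loaded icu table and a compiled regex as parameters,
replace_from is a hypothetical helper name, and it assumes (by analogy with
string.gsub) that returning nil leaves the match unchanged.

```lua
-- Hypothetical helper: replace matches of re in text with new_value
-- (a ustring), but only those starting at index min_start or later.
local function replace_from(icu, re, text, new_value, min_start)
  return icu.regex.replace(re, text, function(m)
    if m.start >= min_start then
      return new_value
    end
    -- Assumption: nil means "keep the original text for this match".
    return nil
  end)
end
```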
icu.regex.split(regex, text[, maximum])
Returns an array of the substrings found by splitting text (which must be
a ustring) using the given regex, with an optional maximum number of
splits.
====
That's it for now, thanks for reading.
-Duncan