I've written the attached. Any comments that will lead to the last sentence in it becoming less painful will be greatly appreciated.
LPEG is a library for operating on patterns. It is not an alternative to the Lua string library, but it can be used to implement libraries rather like the string library instead of doing it directly in C.
Some words mean very specific things to LPEG.
string: A Lua string, consisting of an arbitrary sequence of bytes. On many systems, one could say "characters" instead.
pattern: A userdata type, containing enough information to characterize a certain property that a string might have. The properties that can be described are rather like the questions solvers of crossword puzzles tend to ask. For example: "seven letters, starts with 'p', has a double letter in it somewhere".
subject: A string that is presented for examination to an LPEG function. Its bytes are referred to as input.
match: Not the same as string.match. The central function of LPEG.
succeed: A pattern succeeds when it matches a substring of the subject at the point where it is applied.
fail: A pattern fails when it does not match any substring at that point.
capture: Individually accessible portions of a match.
consume: A pattern consumes its match if that portion of the subject is not available to any follow-up pattern except a backspace pattern.
The most spectacular feature of LPEG is the way complicated patterns are built up from simpler ones: patterns can be used instead of numbers as the values that the variables in an arithmetic expression may take. For example, "x^2+3*x-13" is a perfectly valid LPEG expression when x is a pattern.
Roberto Ierusalimschy put a lot of thought into allocating useful meanings to the arithmetic operations and constants of Lua. In particular, the priority of operations is such that one often needs no parentheses. Some words of warning, though:
When constants such as 3 or true appear in a pattern expression, they cannot be combined with each other, only with existing patterns. To make sure, convert them to patterns first by applying lpeg.P (the lpeg. prefix is used only here; it will be just P later). This goes without saying, but I am nevertheless saying it.
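To make the warning concrete, here is a minimal sketch, assuming LPeg is installed and loadable with require "lpeg". The names six and five are only illustrative. Two bare numbers are combined by ordinary Lua arithmetic before LPEG ever sees them, whereas a number combined with an existing pattern is treated as a pattern.

```lua
local lpeg = require "lpeg"

-- 2*3 is plain Lua arithmetic: the result is the number 6,
-- which lpeg.P then turns into "exactly six bytes".
local six = lpeg.P(2*3)

-- lpeg.P(2)*3 is a pattern expression: two bytes followed by three bytes.
local five = lpeg.P(2)*3

print(lpeg.match(six, "abcdefgh"))   -- 7
print(lpeg.match(five, "abcdefgh"))  -- 6
```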
true stands for a pattern that always succeeds, false for a pattern that always fails.
A string stands for a pattern that matches only that exact string.
0 stands for a pattern that matches the empty string, positive n for a pattern that matches exactly n bytes, -n for the negation of n (see below).
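The constants can be tried out one at a time. This is only a sketch, assuming LPeg is available as require "lpeg"; the subject "abc" is an arbitrary choice. Recall that a successful match without captures returns the index of the first byte after the match.

```lua
local lpeg = require "lpeg"
local P = lpeg.P

print(P(true):match("abc"))   -- 1   (succeeds, consumes nothing)
print(P(false):match("abc"))  -- nil (always fails)
print(P("ab"):match("abc"))   -- 3   (matched the two bytes "ab")
print(P(0):match("abc"))      -- 1   (the empty string)
print(P(2):match("abc"))      -- 3   (any two bytes)
print(P(4):match("abc"))      -- nil (fewer than four bytes available)
```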
-p succeeds when p fails. It consumes no input. The idiom -1 matches only the end of the subject.
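A short sketch of negation, assuming LPeg is available as require "lpeg". The name whole is illustrative; it uses the -1 idiom to insist that nothing follows "abc".

```lua
local lpeg = require "lpeg"
local P = lpeg.P

print((-P"a"):match("xyz"))  -- 1   (P"a" fails, so -P"a" succeeds, consuming nothing)
print((-P"a"):match("abc"))  -- nil (P"a" succeeds, so -P"a" fails)
print(P(-4):match("abc"))    -- 1   (P(4) fails on a three-byte subject)

-- "abc" followed by the end of the subject
local whole = P"abc" * -1
print(whole:match("abc"))    -- 4
print(whole:match("abcd"))   -- nil
```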
Suppose p and q respectively match a and b; then p*q matches a..b. Note that multiplication is not commutative.
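A minimal sketch of sequencing, assuming LPeg is available as require "lpeg"; p and q are arbitrary two-byte patterns chosen for illustration.

```lua
local lpeg = require "lpeg"
local P = lpeg.P

local p, q = P"ab", P"cd"

print((p*q):match("abcd"))  -- 5   ("ab" then "cd")
print((q*p):match("abcd"))  -- nil (order matters)
print((q*p):match("cdab"))  -- 5
```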
p+q acts as "or": it matches what p matches, except when p fails; then it matches what q matches. Note that p+q succeeds if and only if q+p succeeds, but if both p and q would succeed, the match is that of the first pattern. So addition is not quite commutative.
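The "not quite commutative" remark shows up as soon as both alternatives can succeed. A sketch, assuming LPeg is available as require "lpeg"; the patterns are illustrative.

```lua
local lpeg = require "lpeg"
local P = lpeg.P

local p, q = P"a", P"ab"

-- Both orders succeed on "ab", but the first alternative that succeeds wins.
print((p+q):match("ab"))  -- 2 (p matched "a")
print((q+p):match("ab"))  -- 3 (q matched "ab")

-- When the first alternative fails, the second one is tried.
print((q+p):match("ac"))  -- 2
```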
p-q acts as "and not": it fails if q succeeds, otherwise matches what p matches. Note that 0-p does the same as -p, but p-p does not do the same as 0; it does the same as false.
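The identities for subtraction can be checked directly. A sketch, assuming LPeg is available as require "lpeg"; any and x are illustrative names.

```lua
local lpeg = require "lpeg"
local P = lpeg.P

local any, x = P(1), P"x"

print((any - x):match("a"))    -- 2   (one byte, provided it is not "x")
print((any - x):match("x"))    -- nil
print((0 - x):match("a"))      -- 1   (same as -x: succeeds, consumes nothing)
print((-x):match("a"))         -- 1
print((any - any):match("a"))  -- nil (same as false, not the same as 0)
```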
p/s matches what p does, but processes the captures of p as specified by s. There are many variations; for example, if p itself contains no captures, p/1 creates a capture consisting of the substring matched by p.
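Two of the variations, sketched under the assumption that LPeg is available as require "lpeg": s may be a function, which receives the captures of p (or the whole match when p makes none), or a string, in which %0 stands for the whole match. The names digits, number and bracketed are illustrative.

```lua
local lpeg = require "lpeg"
local P, R = lpeg.P, lpeg.R

local digits = R"09"^1

-- p / function: the function is applied to what p matched.
local number = digits / tonumber
print(number:match("42 apples"))  -- 42 (a Lua number, not a string)

-- p / string: %0 is replaced by the whole match.
local bracketed = P"abc" / "<%0>"
print(bracketed:match("abc"))     -- <abc>
```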
p^n is not quite exponentiation in the usual sense: p*p means exactly two repetitions of p, which is not the same as p^2.
For non-negative n, p^n matches n or more repetitions of p.
p^0 matches any number of repetitions of p, including the empty string.
p^-n matches not more than n repetitions of p.
#p matches what p matches, but consumes no input. A common idiom: #p*q matches what q matches, provided that p succeeds.
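The difference between p*p and p^2, and the effect of #, can be seen on a subject with three repetitions. A sketch, assuming LPeg is available as require "lpeg"; p is an arbitrary pattern.

```lua
local lpeg = require "lpeg"
local P = lpeg.P

local p = P"ab"

print((p*p):match("ababab"))    -- 5   (exactly two repetitions)
print((p^2):match("ababab"))    -- 7   (two or more: takes all three)
print((p^2):match("ab"))        -- nil (only one repetition available)
print((p^0):match("xyz"))       -- 1   (zero repetitions is fine)
print((p^-2):match("ababab"))   -- 5   (at most two)

print((#p):match("ab"))         -- 1   (succeeds, consumes nothing)
print((#p * P(1)):match("ab"))  -- 2   (one byte, provided "ab" starts here)
print((#p * P(1)):match("xb"))  -- nil
```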
For example, suppose x=P"abc". Then x^2+3*x-13 means "two or more copies of abc, or any three bytes followed by abc, but only if fewer than 13 bytes of the subject remain at that point".
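The example expression can be run as written. A sketch, assuming LPeg is available as require "lpeg"; the subjects are made up to exercise each branch.

```lua
local lpeg = require "lpeg"
local P = lpeg.P

local x = P"abc"
local pat = x^2 + 3*x - 13   -- parsed by Lua as ((x^2) + (3*x)) - 13

print(pat:match("abcabc"))           -- 7   (two copies of abc)
print(pat:match("xyzabc"))           -- 7   (any three bytes, then abc)
print(pat:match("abc"))              -- nil (neither alternative fits)
print(pat:match("abcabcabcabcabc"))  -- nil (15 bytes remain, so "- 13" rejects it)
```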
The introductory lpeg. prefix has been omitted here.
P (for Pattern): Apart from nil, boolean, string and number, discussed above, functions and tables can also be converted to patterns, but these are too advanced to discuss here. Existing patterns are unchanged.
R (for Range): R(r), where r is a two-byte string, matches any byte whose internal numerical code is in the range r:byte(1,2). You could use characters in r, e.g. R"az" on most systems matches the range of lowercase letters, but see locale for a more portable alternative.
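A small sketch of ranges, assuming LPeg is available as require "lpeg" and an ASCII-compatible character set; lower and digits are illustrative names.

```lua
local lpeg = require "lpeg"
local R = lpeg.R

local lower = R"az"
print(lower:match("q"))         -- 2
print(lower:match("Q"))         -- nil

-- Ranges are ordinary patterns, so they combine like any other.
local digits = R"09"^1
print(digits:match("2024-01"))  -- 5 (stops at the "-")
```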
S (for Set): S"()[]" matches any one of the bytes (, ), [ or ].
lpeg.match is the name of a function that, for a given pattern and subject, determines whether there is a substring subject:sub(init,stop-1) matched by the pattern. It returns the value stop, or nil if the pattern does not match.
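A sketch of a set used with lpeg.match, assuming LPeg is available as require "lpeg"; bracket and subject are illustrative names. The last lines check the relation between the returned stop value and the matched substring.

```lua
local lpeg = require "lpeg"

local bracket = lpeg.S"()[]"

print(lpeg.match(bracket, "[x]"))  -- 2
print(lpeg.match(bracket, "x"))    -- nil

-- With init = 1, the matched substring is subject:sub(1, stop-1).
local subject = "(("
local stop = lpeg.match(bracket^1, subject)
print(stop, subject:sub(1, stop-1))
```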
locale: locale() returns a table of patterns that match character classes. The recommended method is to examine the keys of the returned table, e.g. locale().lower matches all lower-case letters.
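Examining the keys takes only a loop. A sketch, assuming LPeg is available as require "lpeg" and the default C locale; classes is an illustrative name, and the order in which pairs lists the keys is unspecified.

```lua
local lpeg = require "lpeg"

local classes = lpeg.locale()

-- List the available character classes.
for name in pairs(classes) do
  print(name)
end

print(classes.lower:match("q"))  -- 2
print(classes.digit:match("q"))  -- nil
```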
When called with p as first argument, these functions will replace p by P(p) before proceeding. When p is already a pattern, they can also be called the object-oriented way shown below.
match(p,subject[,init]), p:match(subject[,init]): Tries to match p to subject:sub(init). init defaults to 1. Returns the captures of the match, or if none specified, the index of the first byte in the subject after the match.
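A sketch of the calling conventions, assuming LPeg is available as require "lpeg"; p and the subject are illustrative. Note that a match is always attempted at init only, never further along the subject.

```lua
local lpeg = require "lpeg"

local p = lpeg.P"cd"

print(lpeg.match(p, "abcd"))       -- nil (tried at position 1 only)
print(lpeg.match(p, "abcd", 3))    -- 5
print(p:match("abcd", 3))          -- 5   (object-oriented form)

-- A plain string as first argument is converted with P first.
print(lpeg.match("cd", "abcd", 3)) -- 5

-- With a capture, the capture is returned instead of the index.
print(lpeg.match(lpeg.C(p), "abcd", 3))  -- cd
```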
B(p), p:B() (for Backspace): Matches what p does, but matches just before the current position in the subject instead of at it, and consumes no input. p is restricted to patterns of fixed length that make no captures.
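A sketch of B, assuming an LPeg version that provides lpeg.B, loadable as require "lpeg". The pattern after_ab is illustrative: it matches a "c" only when the two bytes just before it are "ab".

```lua
local lpeg = require "lpeg"
local B, P = lpeg.B, lpeg.P

local after_ab = B(P"ab") * P"c"

print(after_ab:match("abc", 3))  -- 4   ("ab" precedes position 3)
print(after_ab:match("xxc", 3))  -- nil ("xx" precedes position 3)
print(after_ab:match("c"))       -- nil (nothing precedes position 1)
```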
C(p), p:C() (for Capture): Matches what p does, and returns a capture of the match.
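A sketch of simple captures, assuming LPeg is available as require "lpeg"; word and pair are illustrative names. Each C in a pattern contributes one returned value.

```lua
local lpeg = require "lpeg"
local C, R = lpeg.C, lpeg.R

local word = R"az"^1

print(word:match("hello world"))      -- 6     (no capture: index after the match)
print(C(word):match("hello world"))   -- hello (the captured substring)
print(word:C():match("hello world"))  -- hello (object-oriented form)

-- Two captures give two return values.
local pair = C(word) * " " * C(word)
local first, second = pair:match("hello world")
print(first, second)
```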
There are several specialized constructors and methods dealing with captures, many of which involve variations of the division operator.
P acting on tables and functions.
The LPEG distribution also contains re.lua, an application demonstrating the feasibility of writing a regular expression handler using LPEG.
The present author is not yet qualified to write about these.
About this document: Dirk Laurie wrote it in order to teach himself LPEG. All errors in it can be blamed on his inexperience.