Lpeg Recipes

Lua recipes for LPeg, a pattern-matching library for Lua based on Parsing Expression Grammars.

See LpegTutorial for an introduction.

This page collects examples of Lua code that use LPeg for parsing, to help further the understanding of how to use parsing expression grammars.

Number Patterns

Written by Caleb Place of Gymbyl Coding

A table of number patterns to use for matching.

local lpeg = require "lpeg"
local P, R, S = lpeg.P, lpeg.R, lpeg.S

local number = {}

local digit = R("09")

-- Matches: 10, -10, 0
number.integer =
	(S("+-") ^ -1) *
	(digit   ^  1)

-- Matches: .6, .899, .9999873
number.fractional =
	(P(".")   ) *
	(digit ^ 1)

-- Matches: 55.97, -90.8, .9
number.decimal =
	(number.integer *                    -- Integer part
	(number.fractional ^ -1)) +          -- Optional fractional part
	(S("+-") ^ -1 * number.fractional)   -- Purely fractional number (optional sign)

-- Matches: 60.9e07, 9e-4, 681E09 
number.scientific = 
	number.decimal * -- Decimal number
	S("Ee") *        -- E or e
	number.integer   -- Exponent

-- Matches all of the above
number.number =
	number.scientific + number.decimal -- Scientific must be tried first, or decimal would stop before the exponent
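
A quick usage sketch, assuming the number table above is in scope; appending * -1 ensures the whole subject is consumed:

local full = number.number * -1   -- -1 matches only at the end of the subject
print(full:match("-90.8"))        --> 6   (index just past the match)
print(full:match("9e-4"))         --> 5
print(full:match("abc"))          --> nil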

C Comment Parser

local lpeg = require "lpeg"

local BEGIN_COMMENT = lpeg.P("/*")
local END_COMMENT = lpeg.P("*/")
local NOT_BEGIN = (1 - BEGIN_COMMENT)^0
local NOT_END = (1 - END_COMMENT)^0
local FULL_COMMENT_CONTENTS = BEGIN_COMMENT * NOT_END * END_COMMENT

-- Parser to find comments from a string
local searchParser = (NOT_BEGIN * lpeg.C(FULL_COMMENT_CONTENTS))^0
-- Parser to find non-comments from a string
local filterParser = (lpeg.C(NOT_BEGIN) * FULL_COMMENT_CONTENTS)^0 * lpeg.C(NOT_BEGIN)

-- A simpler version of the search parser, although empirically it is slower (why?).
-- Optimization suggestions are welcome, as are extensions to handle C++ comments
-- and other syntax elements.
local searchParser = (lpeg.C(FULL_COMMENT_CONTENTS) + 1)^0
-- Suggestion by Roberto to make the search faster
-- Works because it loops fast over all non-slashes, then it begins the slower match phase
local searchParser = ((1 - lpeg.P"/")^0 * (lpeg.C(FULL_COMMENT_CONTENTS) + 1))^0
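
A usage sketch (the last searchParser definition above is the one in effect; the sample string is hypothetical):

local src = "int x; /* first */ int y; /* second */"
print(searchParser:match(src))  --> /* first */    /* second */
print(filterParser:match(src))  --> the non-comment pieces: "int x; ", " int y; ", and a final empty capture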

Evaluate Standard Roman Numerals

The numeral is given in the variable text. (This recipe uses the old lpeg.Ca accumulator capture and Lua 5.1's getfenv/setfenv; a version for current LPeg follows below.)
do
local add = function (x,y) return x+y end
local P,Ca,Cc= lpeg.P,lpeg.Ca,lpeg.Cc
local symbols = { I=1,V=5,X=10,L=50,C=100,D=500,M=1000,
   IV=4,IX=9,XL=40,XC=90,CD=400,CM=900}
local env = getfenv(1)
for s,n in pairs(symbols) do env[s:lower()] = P(s)*Cc(n)/add end
setfenv(1,env)
local MS = m^0
local CS = (d*c^(-4)+cd+cm+c^(-4))^(-1)
local XS = (l*x^(-4)+xl+xc+x^(-4))^(-1)
local IS = (v*i^(-4)+ix+iv+i^(-4))^(-1)
local p = Ca(Cc(0)*MS*CS*XS*IS)
local result = p:match(text:upper())
print(result or "?")
end
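
A minimal sketch of the same evaluator for current LPeg and Lua 5.2+, replacing the removed Ca accumulator with a fold capture (lpeg.Cf) and avoiding setfenv; the repetition bounds mirror the original, and the numeral is again given in the variable text:

do
local lpeg = require "lpeg"
local P, Cc, Cf = lpeg.P, lpeg.Cc, lpeg.Cf
local add = function (x,y) return x+y end
local symbols = { I=1,V=5,X=10,L=50,C=100,D=500,M=1000,
   IV=4,IX=9,XL=40,XC=90,CD=400,CM=900}
-- each symbol pattern produces its value as a capture
local t = {}
for s,n in pairs(symbols) do t[s:lower()] = P(s)*Cc(n) end
local MS = t.m^0
local CS = (t.d*t.c^(-4)+t.cd+t.cm+t.c^(-4))^(-1)
local XS = (t.l*t.x^(-4)+t.xl+t.xc+t.x^(-4))^(-1)
local IS = (t.v*t.i^(-4)+t.ix+t.iv+t.i^(-4))^(-1)
-- fold all symbol values with add, starting from 0
local p = Cf(Cc(0)*MS*CS*XS*IS, add)
local result = p:match(text:upper())
print(result or "?")
end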

Match Sequences of Consecutive Integers

Needs LPeg 0.8. The string of integers is given in the variable text.
do
local C,Cb,Cmt,R,S = lpeg.C,lpeg.Cb,lpeg.Cmt,lpeg.R,lpeg.S
local some = function (p) return (p+1)^1 end
local digit,space = R "09",S " "
local num = C(digit^1)/tonumber
local check = Cmt(Cb(1)*num,function (s,i,x,y)
      if y == x+1 then return i,y end end)
local monotone = some(C(num*(space^1*check)^0))
local m = monotone:match(text)
print (m or "?")
end

Match a list of integers or ranges

Recognise a list of integer values or ranges of integer values, for example "1,5-8,10".

Return a table with one entry per list item: a one-element table for a single integer, or a two-element table holding the start and end of a range (values are captured as strings). Return nil if no integer values or ranges are found. An example of the result follows the code.

local re = require 're'

local list_parser = re.compile [[
   list <- ( singleint_or_range ( ',' singleint_or_range ) * ) -> {}
   singleint_or_range <- range / singleint
   singleint <- { int } -> {}
   range <- ( { int } '-' { int } ) -> {}
   int <- %d+
]]

local function parse_list(list_string)
   local t = list_parser:match(list_string)
   -- further processing to remove overlaps, duplicates, sort into ascending order, etc
   return t
end
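
For instance (a hypothetical input, to show the shape of the result):

local t = parse_list("1,5-8,10")
-- t is { { "1" }, { "5", "8" }, { "10" } }; convert the strings with tonumber as needed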

Match a fixed number of repetitions of a pattern

Matching a precise number of occurrences of a certain pattern. -- ValeriuPalos?
local lpeg = require "lpeg"

function multiply_pattern(item, count)
    return lpeg.Cmt(lpeg.P(true),
               function(s, i)
                   local set, offset = {}, i
                   for j = 1, count do
                       set[j], offset = lpeg.match(item * lpeg.Cp(), s, offset)
                       if not offset then
                           return false
                       end
                   end
                   return offset, set
               end)
end
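
A usage sketch, assuming multiply_pattern and lpeg are in scope as above. Note that item is expected to produce exactly one capture per repetition, since each lpeg.match call inside returns that capture followed by the lpeg.Cp() position:

local three_digits = multiply_pattern(lpeg.C(lpeg.R("09")), 3)
local t = lpeg.match(three_digits, "12345")
print(t[1], t[2], t[3])                --> 1   2   3
print(lpeg.match(three_digits, "12"))  --> nil (fewer than three matches)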

A detailed explanation [is described here] along with a method to match between a minimum and a maximum number of pattern occurrences.

Lua Lexer

This is a Lua lexer in LPeg. The original author is PeterOdding. This lexer eventually became [LXSH] which includes Lua and C lexers and syntax highlighters.

--[[

= ABOUT
This module uses Roberto Ierusalimschy's powerful new pattern matching library
LPeg[1] to tokenize Lua source-code into a table of tokens. I think it handles
all of Lua's syntax, but if you find anything missing I would appreciate a mail
at peter@peterodding.com. This lexer is based on the BNF[2] from the Lua manual.

= USAGE
I've saved my copy of this module under [$LUA_PATH/lexers/lua.lua], which means
I can use it as in the following interactive session:

   Lua 5.1.1  Copyright (C) 1994-2006 Lua.org, PUC-Rio
   > require 'lexers.lua'
   > tokens = lexers.lua [=[
   >> 42 or 0
   >> -- some Lua source-code in a string]=]
   > = tokens
   table: 00422E40
   > lexers.lua.print(tokens)
   line 1, number: `42`
   line 1, whitespace: ` `
   line 1, keyword: `or`
   line 1, whitespace: ` `
   line 1, number: `0`
   line 1, whitespace: `
   `
   line 2, comment: `-- some Lua source-code in a string`
   total of 7 tokens, 2 lines

The returned table [tokens] looks like this:

{
   -- type       , text, line
   { 'number'    , '42', 1 },
   { 'whitespace', ' ' , 1 },
   { 'keyword'   , 'or', 1 },
   { 'whitespace', ' ' , 1 },
   { 'number'    , '0' , 1 },
   { 'whitespace', '\n', 1 },
   { 'comment'   , '-- some Lua source-code in a string', 2 },
}

= CREDITS
Written by Peter Odding, 2007/04/04

= THANKS TO
- the Lua authors for a wonderful language;
- Roberto for LPeg;
- caffeine for keeping me awake :)

= LICENSE
Shamelessly ripped from the SQLite[3] project:

   The author disclaims copyright to this source code.  In place of a legal
   notice, here is a blessing:

      May you do good and not evil.
      May you find forgiveness for yourself and forgive others.
      May you share freely, never taking more than you give.

[1] http://www.inf.puc-rio.br/~roberto/lpeg.html
[2] http://lua.org/manual/5.1/manual.html#8
[3] http://sqlite.org

--]]

-- since this module is intended to be loaded with require() we receive the
-- name used to load us in ... and pass it on to module()
module(..., package.seeall)

-- written for LPeg 0.5, by the way
local lpeg = require 'lpeg'
local P, R, S, C, Cc, Ct = lpeg.P, lpeg.R, lpeg.S, lpeg.C, lpeg.Cc, lpeg.Ct

-- create a pattern which captures the lua value [id] and the input matching
-- [patt] in a table
local function token(id, patt) return Ct(Cc(id) * C(patt)) end

local digit = R('09')

-- range of valid characters after first character of identifier
local idsafe = R('AZ', 'az', '\127\255') + P '_'

-- operators
local operator = token('operator', P '==' + P '~=' + P '<=' + P '>=' + P '...'
                                          + P '..' + S '+-*/%^#=<>;:,.{}[]()')
-- identifiers
local ident = token('identifier', idsafe * (idsafe + digit + P '.') ^ 0)

-- keywords
local keyword = token('keyword', (P 'and' + P 'break' + P 'do' + P 'else' +
   P 'elseif' + P 'end' + P 'false' + P 'for' + P 'function' + P 'if' +
   P 'in' + P 'local' + P 'nil' + P 'not' + P 'or' + P 'repeat' + P 'return' +
   P 'then' + P 'true' + P 'until' + P 'while') * -(idsafe + digit))

-- numbers
local number_sign = S'+-'^-1
local number_decimal = digit ^ 1
local number_hexadecimal = P '0' * S 'xX' * R('09', 'AF', 'af') ^ 1
local number_float = (digit^1 * P'.' * digit^0 + P'.' * digit^1) *
                     (S'eE' * number_sign * digit^1)^-1
local number = token('number', number_hexadecimal +
                               number_float +
                               number_decimal)

-- callback for [=[ long strings ]=]
-- ps. LPeg is for Lua what regex is for Perl, which makes me smile :)
local longstring = #(P '[[' + (P '[' * P '=' ^ 0 * P '['))
local longstring = longstring * P(function(input, index)
   local level = input:match('^%[(=*)%[', index)
   if level then
      local _, stop = input:find(']' .. level .. ']', index, true)
      if stop then return stop + 1 end
   end
end)

-- strings
local singlequoted_string = P "'" * ((1 - S "'\r\n\f\\") + (P '\\' * 1)) ^ 0 * "'"
local doublequoted_string = P '"' * ((1 - S '"\r\n\f\\') + (P '\\' * 1)) ^ 0 * '"'
local string = token('string', singlequoted_string +
                               doublequoted_string +
                               longstring)

-- comments
local singleline_comment = P '--' * (1 - S '\r\n\f') ^ 0
local multiline_comment = P '--' * longstring
local comment = token('comment', multiline_comment + singleline_comment)

-- whitespace
local whitespace = token('whitespace', S('\r\n\f\t ')^1)

-- ordered choice of all tokens and last-resort error which consumes one character
local any_token = whitespace + number + keyword + ident +
                  string + comment + operator + token('error', 1)

-- private interface
local table_of_tokens = Ct(any_token ^ 0)

-- increment [line] by the number of line-ends in [text]
local function sync(line, text)
   local index, limit = 1, #text
   while index <= limit do
      local start, stop = text:find('\r\n', index, true)
      if not start then
         start, stop = text:find('[\r\n\f]', index)
         if not start then break end
      end
      index = stop + 1
      line = line + 1
   end
   return line
end

-- we only need to synchronize the line-counter for these token types
local multiline_tokens = { comment = true, string = true, whitespace = true }

-- public interface
getmetatable(getfenv(1)).__call = function(self, input)
   assert(type(input) == 'string', 'bad argument #1 (expected string)')
   local line = 1
   local tokens = lpeg.match(table_of_tokens, input)
   for i, token in pairs(tokens) do
      token[3] = line
      if multiline_tokens[token[1]] then line = sync(line, token[2]) end
   end
   return tokens
end

-- if you really want to try it out before writing any code :P
function print(tokens)
   local print, format = _G.print, _G.string.format
   for _, token in pairs(tokens) do
      print(format('line %i, %s: `%s`', token[3], token[1], token[2]))
   end
   print(format('total of %i tokens, %i lines', #tokens, tokens[#tokens][3]))
end

Lua Parser

A Lua 5.1 parser in LPeg. Improvements welcome. -- Patrick Donnelly (batrick)

local lpeg = require "lpeg";

local locale = lpeg.locale();

local P, S, V = lpeg.P, lpeg.S, lpeg.V;

local C, Cb, Cc, Cg, Cs, Cmt =
    lpeg.C, lpeg.Cb, lpeg.Cc, lpeg.Cg, lpeg.Cs, lpeg.Cmt;

local shebang = P "#" * (P(1) - P "\n")^0 * P "\n";

local function K (k) -- keyword
  return P(k) * -(locale.alnum + P "_");
end

local lua = P {
  (shebang)^-1 * V "space" * V "chunk" * V "space" * -P(1);

  -- keywords

  keywords = K "and" + K "break" + K "do" + K "else" + K "elseif" +
             K "end" + K "false" + K "for" + K "function" + K "if" +
             K "in" + K "local" + K "nil" + K "not" + K "or" + K "repeat" +
             K "return" + K "then" + K "true" + K "until" + K "while";

  -- longstrings

  longstring = P { -- from Roberto Ierusalimschy's lpeg examples
    V "open" * C((P(1) - V "closeeq")^0) *
        V "close" / function (o, s) return s end;

    open = "[" * Cg((P "=")^0, "init") * P "[" * (P "\n")^-1;
    close = "]" * C((P "=")^0) * "]";
    closeeq = Cmt(V "close" * Cb "init", function (s, i, a, b) return a == b end)
  };

  -- comments & whitespace

  comment = P "--" * V "longstring" +
            P "--" * (P(1) - P "\n")^0 * (P "\n" + -P(1));

  space = (locale.space + V "comment")^0;

  -- Names, numbers and strings

  Name = (locale.alpha + P "_") * (locale.alnum + P "_")^0 - V "keywords";
  Number = (P "-")^-1 * V "space" * P "0x" * locale.xdigit^1 *
               -(locale.alnum + P "_") +
           (P "-")^-1 * V "space" * locale.digit^1 *
               (P "." * locale.digit^1)^-1 * (S "eE" * (P "-")^-1 *
                   locale.digit^1)^-1 * -(locale.alnum + P "_") +
           (P "-")^-1 * V "space" * P "." * locale.digit^1 *
               (S "eE" * (P "-")^-1 * locale.digit^1)^-1 *
               -(locale.alnum + P "_");
  String = P "\"" * (P "\\" * P(1) + (1 - P "\""))^0 * P "\"" +
           P "'" * (P "\\" * P(1) + (1 - P "'"))^0 * P "'" +
           V "longstring";

  -- Lua Complete Syntax

  chunk = (V "space" * V "stat" * (V "space" * P ";")^-1)^0 *
              (V "space" * V "laststat" * (V "space" * P ";")^-1)^-1;

  block = V "chunk";

  stat = K "do" * V "space" * V "block" * V "space" * K "end" +
         K "while" * V "space" * V "exp" * V "space" * K "do" * V "space" *
             V "block" * V "space" * K "end" +
         K "repeat" * V "space" * V "block" * V "space" * K "until" *
             V "space" * V "exp" +
         K "if" * V "space" * V "exp" * V "space" * K "then" *
             V "space" * V "block" * V "space" *
             (K "elseif" * V "space" * V "exp" * V "space" * K "then" *
              V "space" * V "block" * V "space"
             )^0 *
             (K "else" * V "space" * V "block" * V "space")^-1 * K "end" +
         K "for" * V "space" * V "Name" * V "space" * P "=" * V "space" *
             V "exp" * V "space" * P "," * V "space" * V "exp" *
             (V "space" * P "," * V "space" * V "exp")^-1 * V "space" *
             K "do" * V "space" * V "block" * V "space" * K "end" +
         K "for" * V "space" * V "namelist" * V "space" * K "in" * V "space" *
             V "explist" * V "space" * K "do" * V "space" * V "block" *
             V "space" * K "end" +
         K "function" * V "space" * V "funcname" * V "space" *  V "funcbody" +
         K "local" * V "space" * K "function" * V "space" * V "Name" *
             V "space" * V "funcbody" +
         K "local" * V "space" * V "namelist" *
             (V "space" * P "=" * V "space" * V "explist")^-1 +
         V "varlist" * V "space" * P "=" * V "space" * V "explist" +
         V "functioncall";

  laststat = K "return" * (V "space" * V "explist")^-1 + K "break";

  funcname = V "Name" * (V "space" * P "." * V "space" * V "Name")^0 *
      (V "space" * P ":" * V "space" * V "Name")^-1;

  namelist = V "Name" * (V "space" * P "," * V "space" * V "Name")^0;

  varlist = V "var" * (V "space" * P "," * V "space" * V "var")^0;

  -- Let's come up with a syntax that does not use left recursion
  -- (only listing changes to Lua 5.1 extended BNF syntax)
  -- value ::= nil | false | true | Number | String | '...' | function |
  --           tableconstructor | functioncall | var | '(' exp ')'
  -- exp ::= unop exp | value [binop exp]
  -- prefix ::= '(' exp ')' | Name
  -- index ::= '[' exp ']' | '.' Name
  -- call ::= args | ':' Name args
  -- suffix ::= call | index
  -- var ::= prefix {suffix} index | Name
  -- functioncall ::= prefix {suffix} call

  -- Something that represents a value (or many values)
  value = K "nil" +
          K "false" +
          K "true" +
          V "Number" +
          V "String" +
          P "..." +
          V "function" +
          V "tableconstructor" +
          V "functioncall" +
          V "var" +
          P "(" * V "space" * V "exp" * V "space" * P ")";

  -- An expression operates on values to produce a new value or is a value
  exp = V "unop" * V "space" * V "exp" +
        V "value" * (V "space" * V "binop" * V "space" * V "exp")^-1;

  -- Index and Call
  index = P "[" * V "space" * V "exp" * V "space" * P "]" +
          P "." * V "space" * V "Name";
  call = V "args" +
         P ":" * V "space" * V "Name" * V "space" * V "args";

  -- A Prefix is the leftmost side of a var(iable) or functioncall
  prefix = P "(" * V "space" * V "exp" * V "space" * P ")" +
           V "Name";
  -- A Suffix is a Call or Index
  suffix = V "call" +
           V "index";

  var = V "prefix" * (V "space" * V "suffix" * #(V "space" * V "suffix"))^0 *
            V "space" * V "index" +
        V "Name";
  functioncall = V "prefix" *
                     (V "space" * V "suffix" * #(V "space" * V "suffix"))^0 *
                 V "space" * V "call";

  explist = V "exp" * (V "space" * P "," * V "space" * V "exp")^0;

  args = P "(" * V "space" * (V "explist" * V "space")^-1 * P ")" +
         V "tableconstructor" +
         V "String";

  ["function"] = K "function" * V "space" * V "funcbody";

  funcbody = P "(" * V "space" * (V "parlist" * V "space")^-1 * P ")" *
                 V "space" *  V "block" * V "space" * K "end";

  parlist = V "namelist" * (V "space" * P "," * V "space" * P "...")^-1 +
            P "...";

  tableconstructor = P "{" * V "space" * (V "fieldlist" * V "space")^-1 * P "}";

  fieldlist = V "field" * (V "space" * V "fieldsep" * V "space" * V "field")^0
                  * (V "space" * V "fieldsep")^-1;

  field = P "[" * V "space" * V "exp" * V "space" * P "]" * V "space" * P "=" *
              V "space" * V "exp" +
          V "Name" * V "space" * P "=" * V "space" * V "exp" +
          V "exp";

  fieldsep = P "," +
             P ";";

  binop = K "and" + -- match longest token sequences first
          K "or" +
          P ".." +
          P "<=" +
          P ">=" +
          P "==" +
          P "~=" +
          P "+" +
          P "-" +
          P "*" +
          P "/" +
          P "^" +
          P "%" +
          P "<" +
          P ">";

  unop = P "-" +
         P "#" +
         K "not";
};
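
A quick usage sketch: lua is a plain LPeg pattern, so matching returns a truthy value when the source parses as Lua 5.1 and nil on a syntax error (the sample chunks are hypothetical):

assert(lua:match("local x = 1 + 2\nprint(x)\n"))   -- parses
assert(lua:match("local = 1") == nil)              -- syntax error, no match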

Also see LuaFish, Leg[1], or the [Lua parser in trolledit].

C Lexer

This lexes ANSI C. Improvements welcome. --DavidManura

-- Lua LPeg lexer for C.
-- Note:
--   Does not handle C preprocessing macros.
--   Not well tested.
-- 
-- David Manura, 2007, public domain.  Based on ANSI C Lex
--   specification in http://www.quut.com/c/ANSI-C-grammar-l-1998.html
--   (Jutta Degener, 2006; Tom Stockfisch, 1987, Jeff Lee, 1985)

local lpeg = require 'lpeg'

local P, R, S, C =
  lpeg.P, lpeg.R, lpeg.S, lpeg.C

local whitespace = S' \t\v\n\f'

local digit = R'09'
local letter = R('az', 'AZ') + P'_'
local alphanum = letter + digit
local hex = R('af', 'AF', '09')
local exp = S'eE' * S'+-'^-1 * digit^1
local fs = S'fFlL'
local is = S'uUlL'^0

local hexnum = P'0' * S'xX' * hex^1 * is^-1
local octnum = P'0' * digit^1 * is^-1
local decnum = digit^1 * is^-1
local floatnum = digit^1 * exp * fs^-1 +
                 digit^0 * P'.' * digit^1 * exp^-1 * fs^-1 +
                 digit^1 * P'.' * digit^0 * exp^-1 * fs^-1
local numlit = hexnum + octnum + floatnum + decnum

local charlit =
  P'L'^-1 * P"'" * (P'\\' * P(1) + (1 - S"\\'"))^1 * P"'"

local stringlit =
  P'L'^-1 * P'"' * (P'\\' * P(1) + (1 - S'\\"'))^0 * P'"'

local ccomment = P'/*' * (1 - P'*/')^0 * P'*/'
local newcomment = P'//' * (1 - P'\n')^0
local comment = (ccomment + newcomment)
              / function(...) print('COMMENT', ...) end

local literal = (numlit + charlit + stringlit)
              / function(...) print('LITERAL', ...) end

local keyword = C(
  P"auto" + 
  P"_Bool" +
  P"break" +
  P"case" +
  P"char" +
  P"_Complex" +
  P"const" +
  P"continue" +
  P"default" +
  P"do" +
  P"double" +
  P"else" +
  P"enum" +
  P"extern" +
  P"float" +
  P"for" +
  P"goto" +
  P"if" +
  P"_Imaginary" +
  P"inline" +
  P"int" +
  P"long" +
  P"register" +
  P"restrict" +
  P"return" +
  P"short" +
  P"signed" +
  P"sizeof" +
  P"static" +
  P"struct" +
  P"switch" +
  P"typedef" +
  P"union" +
  P"unsigned" +
  P"void" +
  P"volatile" +
  P"while"
) / function(...) print('KEYWORD', ...) end

local identifier = (letter * alphanum^0 - keyword * (-alphanum))
                 / function(...) print('ID',...) end

local op = C(
  P"..." +
  P">>=" +
  P"<<=" +
  P"+=" +
  P"-=" +
  P"*=" +
  P"/=" +
  P"%=" +
  P"&=" +
  P"^=" +
  P"|=" +
  P">>" +
  P"<<" +
  P"++" +
  P"--" +
  P"->" +
  P"&&" +
  P"||" +
  P"<=" +
  P">=" +
  P"==" +
  P"!=" +
  P";" +
  P"{" + P"<%" +
  P"}" + P"%>" +
  P"," +
  P":" +
  P"=" +
  P"(" +
  P")" +
  P"[" + P"<:" +
  P"]" + P":>" +
  P"." +
  P"&" +
  P"!" +
  P"~" +
  P"-" +
  P"+" +
  P"*" +
  P"/" +
  P"%" +
  P"<" +
  P">" +
  P"^" +
  P"|" +
  P"?"
) / function(...) print('OP', ...) end

local tokens = (comment + identifier + keyword +
                literal + op + whitespace)^0

-- frontend
local filename = arg[1]
local fh = assert(io.open(filename))
local input = fh:read'*a'
fh:close()
print(lpeg.match(tokens, input))

Suggestion by ThomasHarningJr for optimizing the 'op' matcher in the C lexer above: using character sets instead of many single-character string comparisons should be faster, though how much faster has not been measured.

local shiftOps = P">>" + P"<<"
local digraphs = P"<%" + P"%>" + P"<:" + P":>" -- {, }, [, ]
local op = C(
-- First match the multi-char items
  P"..." +
  ((shiftOps + S("+-*/%&^|<>=!")) * P"=") +
  shiftOps +
  P"++" +
  P"--" +
  P"&&" +
  P"||" +
  P"->" +
  digraphs +
  S(";{},:=()[].&!~-+*/%<>^|?")
) / function(...) print('OP', ...) end

See also Peter "Corsix" Cawley's http://code.google.com/p/corsix-th/source/browse/trunk/LDocGen/c_tokenise.lua and the [C parser in trolledit].

C Parser

[ceg] - Wesley Smith's C99 parser

XML Parser

See the [XML parser in trolledit].

SciTE Lexers

[Scintillua] supports LPeg lexers. A number of [examples] are included.

Parsing UTF-8

Like Lua itself, LPeg works on single bytes, not on the potentially multi-byte characters of UTF-8. Here are some tricks that help you parse UTF-8 text.

lpeg.S()

The set function assumes that every byte is a character, so you can't use it to match UTF-8 characters. However, you can emulate it with the + operator.

local currency_symbol = lpeg.P('$') + lpeg.P('£') + lpeg.P('¥') + lpeg.P('¢')

lpeg.R()

Likewise, the range operator works on single bytes only, so it cannot be used to match UTF-8 characters outside ASCII.
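
One workaround, sketched below, is to express a codepoint range as a product of single-byte ranges over the UTF-8 encoding (a range that spans several lead bytes needs one such alternative per lead byte):

local lpeg = require 'lpeg'
-- U+00C0..U+00FF are encoded as lead byte 0xC3 followed by a byte in 0x80..0xBF
local latin1_letter = lpeg.P('\195') * lpeg.R('\128\191')
print(latin1_letter:match('é'))   --> 3 (when this source file is saved as UTF-8)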

Character classes

The character classes provided by lpeg.locale() only work on single bytes, even under a UTF-8 locale. By using [ICU4Lua], you can create equivalent character classes which will match UTF-8 characters (regardless of the current locale):

-- lpeg_unicode_locale.lua

local lpeg = require 'lpeg'
local U    = require 'icu.ustring'
local re   = require 'icu.regex'

local utf8_codepoint
do
  -- decode a two-byte UTF-8 sequence
  local function f2 (s)
    local c1, c2 = string.byte(s, 1, 2)
    return c1 * 64 + c2 - 12416
  end

  -- decode a three-byte UTF-8 sequence
  local function f3 (s)
    local c1, c2, c3 = string.byte(s, 1, 3)
    return (c1 * 64 + c2) * 64 + c3 - 925824
  end

  -- decode a four-byte UTF-8 sequence
  local function f4 (s)
    local c1, c2, c3, c4 = string.byte(s, 1, 4)
    return ((c1 * 64 + c2) * 64 + c3) * 64 + c4 - 63447168
  end

  local cont = lpeg.R("\128\191")   -- continuation byte

  utf8_codepoint = lpeg.R("\0\127") / string.byte
    + lpeg.R("\194\223") * cont / f2
    + lpeg.R("\224\239") * cont * cont / f3
    + lpeg.R("\240\244") * cont * cont * cont / f4
end

local alnum = re.compile('^\\p{alnum}$')
local alpha = re.compile('^\\p{alpha}$')
local cntrl = re.compile('^\\p{cntrl}$')
local digit = re.compile('^\\p{digit}$')
local graph = re.compile('^\\p{graph}$')
local lower = re.compile('^\\p{lower}$')
local print = re.compile('^\\p{print}$')
local punct = re.compile('^\\p{punct}$')
local space = re.compile('^\\p{space}$')
local upper = re.compile('^\\p{upper}$')
local xdigit = re.compile('^\\p{xdigit}$')

return {
  alnum = lpeg.Cmt ( utf8_codepoint , function (s,i,c) return not not re.match(alnum, U.char(c)) end ) ;
  alpha = lpeg.Cmt ( utf8_codepoint , function (s,i,c) return not not re.match(alpha, U.char(c)) end ) ;
  cntrl = lpeg.Cmt ( utf8_codepoint , function (s,i,c) return not not re.match(cntrl, U.char(c)) end ) ;
  digit = lpeg.Cmt ( utf8_codepoint , function (s,i,c) return not not re.match(digit, U.char(c)) end ) ;
  graph = lpeg.Cmt ( utf8_codepoint , function (s,i,c) return not not re.match(graph, U.char(c)) end ) ;
  lower = lpeg.Cmt ( utf8_codepoint , function (s,i,c) return not not re.match(lower, U.char(c)) end ) ;
  print = lpeg.Cmt ( utf8_codepoint , function (s,i,c) return not not re.match(print, U.char(c)) end ) ;
  punct = lpeg.Cmt ( utf8_codepoint , function (s,i,c) return not not re.match(punct, U.char(c)) end ) ;
  space = lpeg.Cmt ( utf8_codepoint , function (s,i,c) return not not re.match(space, U.char(c)) end ) ;
  upper = lpeg.Cmt ( utf8_codepoint , function (s,i,c) return not not re.match(upper, U.char(c)) end ) ;
  xdigit = lpeg.Cmt ( utf8_codepoint , function (s,i,c) return not not re.match(xdigit, U.char(c)) end ) ;
}

In your code, you might use it like this:

local lpeg = require 'lpeg'
local utf8 = require 'lpeg_unicode_locale'
local EOF = lpeg.P(-1)
local word = lpeg.C(utf8.alnum^1)
local tokenise = ( word * (utf8.space^1 + EOF ) )^0 * EOF
print(tokenise:match('þetta eru æðisleg orð'))

Date/Time

https://github.com/mozilla-services/lua_sandbox/blob/dev/modules/date_time.lua

LPeg Grammar Tester: http://lpeg.trink.com/share/date_time

Common Log Format

https://github.com/mozilla-services/lua_sandbox/blob/dev/modules/common_log_format.lua

Nginx meta grammar generator: http://lpeg.trink.com/share/clf

Rsyslog

https://github.com/mozilla-services/lua_sandbox/blob/dev/modules/syslog.lua

Rsyslog meta grammar generator: http://lpeg.trink.com/share/syslog

IP Address

https://github.com/mozilla-services/lua_sandbox/blob/dev/modules/ip_address.lua

