Some of you may be familiar with Rob Pike's work on 'structural
regular expressions'.  It's a great, simple idea: instead of looking
at a file as a sequence of lines, let's generalize to allow arbitrary
regular expressions to mark boundaries between units.  You can read
more about it at http://c2.com/cgi/wiki?StructuralRegularExpressions;
there's a link to Rob's paper from there.  But I'll try to summarize
the idea.

For example, a boundary marker of '\n' reduces to the familiar
sequence-of-lines model, but a boundary marker of '\n\n+' would allow
you to process a file as a sequence of paragraphs.  If you have a MIME
multipart message, you can similarly match on the boundary strings.
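To make the boundary-marker idea concrete, here is a minimal sketch in
plain Lua 5.1.  The helper name `units` is made up for illustration,
and it takes a Lua pattern rather than a full regular expression, so
treat it as an approximation of the idea and not as anything from
Rob's paper.

```lua
-- Hypothetical helper: split `text` into the units separated by
-- `boundary`, which is a Lua pattern (not a full regular expression).
local function units(text, boundary)
  local result = {}
  local pos = 1
  while true do
    local first, last = string.find(text, boundary, pos)
    -- Stop when nothing matches; also stop on an empty match so the
    -- loop cannot spin forever.
    if not first or last < first then break end
    result[#result + 1] = string.sub(text, pos, first - 1)
    pos = last + 1
  end
  result[#result + 1] = string.sub(text, pos)  -- text after the last boundary
  return result
end

local text = "first paragraph,\nsecond line\n\nsecond paragraph\n\n\nthird"
print(#units(text, "\n"))     -- number of lines
print(#units(text, "\n\n+"))  -- number of paragraphs
```

The same text comes out as seven lines or three paragraphs, depending
only on the boundary pattern.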

Rob implemented this idea in his text editor 'sam', in which the 'x'
command is like Lua's string.gmatch, in that it iterates over all
matches to a regular expression.  But there is also a 'y' command,
which iterates over the text *between* matches of a regular
expression.  (It appears in Rob's paper but is not very well
explained there.)  I'll quote the explanation from Rob's lengthy paper
on sam:

  Of course, the text extracted by x [think string.gmatch] may be
  selected by a regular expression, which complicates deciding what
  set of matches is chosen --- matches may overlap. This is resolved
  by generating the matches starting from the beginning of dot using
  the leftmost-longest rule, and searching for each match starting
  from the end of the previous one. Regular expressions may also match
  null strings, but a null match adjacent to a non-null match is never
  selected; at least one character must intervene. For example,

     , c/AAA/           -- change the file to AAA
     x/B*/ c/-/         -- for each string matching B*, change it to -
     , p                -- print the file

  produces as output

     -A-A-A-

  because the pattern B* matches the null strings separating the A's.
  The x command has a complement, y, with similar syntax, that
  executes the command with dot set to the text between the matches
  of the expression. For example, 

    , c/AAA/
    y/A/ c/-/
    , p

  produces the same result as the example above.

Several times I have needed something like sam's y operator in Lua,
and I have always implemented it horribly using string.find and
iterators, and it's always been difficult to work the bugs out.

Recently, rather than do this for the nth time, I thought:
string.gmatch already has all the information I need---why can't *it*
give the text I want?  Here's an example:

  : nr@homedog 10444 ; lua5.1 -lstringx
  Lua 5.1.2  Copyright (C) 1994-2007 Lua.org, PUC-Rio
  > for before, match in stringx.gmatch('Hello world', 'l', true) do
      print(string.format('%q', before),
            match and string.format('%q', match) or match)
    end
  "He"    "l"
  ""      "l"
  "o wor" "l"
  "d"     nil

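The idea is to extend gmatch with a third argument.  If it is true,
each iteration returns as its first result the text between the end of
the previous match and the start of the current one.  At the very end,
when there are no more matches, one extra iteration returns the
unmatched tail of the string followed by nil.  (That's why the
intervening text has to be the first result rather than the last: a
generic for loop stops as soon as the first result is nil.)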
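For readers who cannot rebuild the string library, the behaviour can
be approximated in plain Lua.  This is only a sketch under
assumptions: the name `between` is hypothetical, captures are ignored
(the whole match is returned), and an empty match is treated as the
end of the matches, which is simpler than what the patch does.

```lua
-- Hypothetical pure-Lua emulation; `between` is not part of any library.
-- Yields (before, match) for each match of `pattern`, then (tail, nil) once.
local function between(s, pattern)
  local pos = 1        -- where the next search starts
  local done = false   -- set once the tail has been returned
  return function()
    if done then return nil end
    local first, last = string.find(s, pattern, pos)
    if first and last >= first then            -- non-empty match
      local before = string.sub(s, pos, first - 1)
      pos = last + 1
      return before, string.sub(s, first, last)
    end
    done = true
    return string.sub(s, pos), nil             -- unmatched tail
  end
end

for before, match in between("Hello world", "l") do
  print(string.format("%q", before),
        match and string.format("%q", match) or match)
end
```

On 'Hello world' with pattern 'l' this yields the same four pairs as
the session above: "He"/"l", ""/"l", "o wor"/"l", and finally "d"/nil.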

It turns out this is pretty easy to do: gmatch *does* have all the
information, and I had to add or change only 24 lines of code.  I
would love to see this extension in the next release of Lua, so that
everyone can use this kind of functionality.  I would also welcome
suggestions about a similar extension to gsub, so that we can
substitute for text *between* regular expressions, as in Rob's example
above.  (Here the problem is that the interface to string.gsub is
already far too complicated, so what is really needed is another
function.)
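As one possible shape for such a separate function, here is a sketch
in plain Lua.  The name `gsub_between` and its signature are purely
hypothetical; it accepts only a function as the replacement, ignores
captures, and stops at an empty match.

```lua
-- Hypothetical function; the name and signature are illustrative only.
-- Replaces each stretch of text *between* matches of `pattern` with
-- repl(stretch), leaving the matches themselves untouched.
local function gsub_between(s, pattern, repl)
  local out = {}
  local pos = 1
  while true do
    local first, last = string.find(s, pattern, pos)
    if not first or last < first then break end   -- no match, or empty match
    out[#out + 1] = repl(string.sub(s, pos, first - 1))
    out[#out + 1] = string.sub(s, first, last)
    pos = last + 1
  end
  out[#out + 1] = repl(string.sub(s, pos))         -- tail after the last match
  return table.concat(out)
end

print(gsub_between("AAA", "A", function() return "-" end))  --> -A-A-A-
print(gsub_between("a, b,c", ",%s*", string.upper))          --> A, B,C
```

The first call reproduces the result of Rob's y/A/ c/-/ example on a
file containing AAA.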

I attach a patch against Lua 5.1.2.  Next release, anyone?


Norman

--- /home/nr/net/lua-5.1.2/src/lstrlib.c	2007-03-23 13:06:34.000000000 -0400
+++ strlibx.c	2008-01-01 12:00:48.000000000 -0500
@@ -549,33 +549,53 @@
   size_t ls;
   const char *s = lua_tolstring(L, lua_upvalueindex(1), &ls);
   const char *p = lua_tostring(L, lua_upvalueindex(2));
+  size_t offset = (size_t)lua_tointeger(L, lua_upvalueindex(3));
+  int return_intercalations = lua_toboolean(L, lua_upvalueindex(4));
   const char *src;
   ms.L = L;
   ms.src_init = s;
   ms.src_end = s+ls;
-  for (src = s + (size_t)lua_tointeger(L, lua_upvalueindex(3));
+  for (src = s + offset;
        src <= ms.src_end;
        src++) {
     const char *e;
     ms.level = 0;
     if ((e = match(&ms, src, p)) != NULL) {
+      unsigned results;
       lua_Integer newstart = e-s;
       if (e == src) newstart++;  /* empty match? go at least one position */
       lua_pushinteger(L, newstart);
       lua_replace(L, lua_upvalueindex(3));
-      return push_captures(&ms, src, e);
+      results = push_captures(&ms, src, e);
+      if (return_intercalations) {
+        lua_pushlstring(L, s+offset, src-(s+offset));
+        results++;
+        lua_insert(L, -results);  /* move intercalation below all captures */
+      }
+      return results;
     }
   }
+  if (return_intercalations) {
+    lua_pushboolean(L, 0);
+    lua_replace(L, lua_upvalueindex(4));
+    lua_pushlstring(L, s+offset, ls-offset);
+    lua_pushnil(L);
+    return 2;
+  } else {
   return 0;  /* not found */
+  }
 }
 
 
 static int gmatch (lua_State *L) {
   luaL_checkstring(L, 1);
   luaL_checkstring(L, 2);
-  lua_settop(L, 2);
+  if (lua_gettop(L) == 2)
+    lua_pushboolean(L, 0);
+  lua_settop(L, 3);
   lua_pushinteger(L, 0);
-  lua_pushcclosure(L, gmatch_aux, 3);
+  lua_insert(L, -2);  /* maximize compatibility with existing code */
+  lua_pushcclosure(L, gmatch_aux, 4);  /* string pat pos intercalate */
   return 1;
 }
 
@@ -856,8 +876,8 @@
 /*
 ** Open string library
 */
-LUALIB_API int luaopen_string (lua_State *L) {
-  luaL_register(L, LUA_STRLIBNAME, strlib);
+LUALIB_API int luaopen_stringx (lua_State *L) {
+  luaL_register(L, "stringx", strlib);
 #if defined(LUA_COMPAT_GFIND)
   lua_getfield(L, -1, "gmatch");
   lua_setfield(L, -2, "gfind");