lua-users home
lua-l archive



It was thus said that the Great Cosmin Apreutesei once stated:
> Hi Sean, thanks for responding. My comments inline.
> 
> > I would do this as:
> >
> >         list            <- element (COMMA element)*
> >         element         <- length
> >                         /  name
> >
> >         length          <- 'length' EQ length_value
> >         name            <- 'name'   EQ name_value
> >
> >         length_value    <- %d+
> >         name_value      <- [a-z]+
> >
> >         COMMA           <- ','
> >         EQ              <- '='
> >
> >   Yes, both the length and name fields have a similar structure, but since
> > logically, they're of different semantic types, it makes sense (to me) to
> > separate them.  The reason I broke out the ',' and '=' sign as their own
> > productions is to provide a bit of documentation, and make it easier to add
> > whitespace:
> >
> >         COMMA           <- %s* ',' %s*
> >         EQ              <- %s* '=' %s*
> >
> 
> Problem here is the keywords are case-insensitive. But in practice
> it's even more complicated: consider that you can also escape
> characters with \code in the middle of the keyword. But you can only
> use escape codes when the keyword is between double-quotes. Ha.

  Case insensitivity is easy:

	length	<- LENGTH EQ length_value

	LENGTH	<- [Ll][Ee][Nn][Gg][Tt][Hh]
	EQ	<- '='

Having escapes in the middle of keywords ... not so easy, if you want error
checking on the keyword names.
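Something along these lines (my own helper, not from any real grammar) builds
a case-insensitive pattern for a keyword one letter at a time, instead of
spelling out `[Ll][Ee]...` by hand:

```lua
local lpeg = require "lpeg"

-- Build a case-insensitive LPeg pattern for a keyword: for each letter,
-- accept either the lower- or upper-case form.
local function ci(word)
  local patt = lpeg.P(true)            -- empty pattern; always succeeds
  for ch in word:gmatch(".") do
    patt = patt * (lpeg.P(ch:lower()) + lpeg.P(ch:upper()))
  end
  return patt
end

local LENGTH = ci("length")            -- matches "length", "LENGTH", "LeNgTh", ...
```

Matching "LeNgTh" with `LENGTH:match` succeeds (returns position 7), while a
misspelled keyword fails (returns nil).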

> These kinds of rules make me think that the parsing should be done in
> multiple stages, each stage parsing on the captures of the last one.
> Seems to me that http was designed to be parsed like this: first find
> out where headers stop (CRLF + CRLF), then separate the headers from
> one another (CRLF + non-space), then separate keywords from values
> (':'), then fold any duplicate headers, then convert all whitespace to
> a single space, then tokenize the values with a recursive parser
> (because of the damn quoted-strings).

  Well, HTTP headers share a structure with email headers, and the code (which
I'm sending you directly in another email) does all the parsing in one go.
The file is long (nearly 600 lines) but it covers all documented headers for
email, MIME and Usenet (except for the Resent-* headers---the semantics of
those are pretty nasty, and in all the email I have going back some twenty
years, fewer than 5 messages have such headers, so I'm not terribly worried
about them) and returns all the data in a Lua table.
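As a toy illustration of the one-pass approach (nothing like the full
600-line grammar), a single re grammar can unfold continuation lines and
split names from values at the same time; here a header's value is taken
to continue onto any following line that starts with whitespace:

```lua
local lpeg = require "lpeg"
local re   = require "re"

-- One-pass header parsing sketch: each header becomes a table with
-- .name and .value; folded (whitespace-continued) lines stay part of
-- the previous header's value.
local headers = re.compile([[
  headers <- {| header* |} !.
  header  <- {| {:name: [^:%nl]+ :} ':' [ \t]* {:value: value :} |} %nl
  value   <- (!%nl .)* (%nl [ \t]+ (!%nl .)*)*
]], { nl = lpeg.P"\n" })
```

So `headers:match("Host: example.com\nX-Folded: a\n b\n")` yields two
tables, the second with a value of "a\n b"; a real grammar would also
unfold the embedded newline, handle CRLF, and so on.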

> So what I want is the ability to apply a pattern on a capture, i.e. is
> match on the capture some more and give back some other captures in
> return, and continue from there (so it's all done in-context). Either
> that, or a different way of thinking about parsing that doesn't need
> a feature like that.

  I think it can be done, but you'll need to add some calls out to Lua
functions at key points to check things.  Something like (this is untested):

	length	<- {token} => check EQ {:length: length_value :}
	name	<- {token} => check EQ {:name:   name_value   :}
	token	<- [a-z]+
		/  '"' char* '"'

	char	<- [a-z] / [A-Z]
		/  '\n' => nl
		/  '\b' => bs

And for the defs variable for re.compile()

	{
	  -- accept only the known keywords; returning nil fails the match
	  check = function(subject,position,capture)
	    if capture == 'length' or capture == 'name' then
	      return position
	    else
	      return nil
	    end
	  end,

	  -- translate the escape sequences into the characters they denote
	  nl = function(subject,position,capture)
	    return position,"\n"
	  end,

	  bs = function(subject,position,capture)
	    return position,"\b"
	  end
	}
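Tying the pieces together (this assembly is mine, and simplified---the
keyword is captured, validated in the match-time `check` function, and
stored under a named group), a minimal runnable version looks like:

```lua
local re = require "re"

-- Keywords are validated at match time: an unknown keyword makes
-- check() return nil, which fails the whole match.
local grammar = re.compile([[
  list <- {| elem (',' elem)* |} !.
  elem <- {| {:key: {[a-z]+} => check :} '=' {:value: [a-z0-9]+ :} |}
]], {
  check = function(subject, position, capture)
    if capture == 'length' or capture == 'name' then
      return position, capture     -- accept, and keep the keyword
    end
    return nil                     -- unknown keyword: fail
  end
})
```

With this, `grammar:match("length=10,name=foo")` returns a list of
key/value tables, while `grammar:match("width=3")` returns nil because
the keyword check fails.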

> > (What I really need to do is post my LPeg grammer for parsing email
> > headers---it really showcases nearly all the features of the re module, but
> > until I get around to that, if anyone is interested, I can mail them a copy;
> > I should note that HTTP headers are pretty much the same format as email
> > headers)
> 
> Please do, that could help me a lot. Thanks.

  Will do.  

  -spc