lua-users home
lua-l archive

[Date Prev][Date Next][Thread Prev][Thread Next] [Date Index] [Thread Index]




On 2019-03-27 10:10 p.m., Coda Highland wrote:
On Wed, Mar 27, 2019 at 6:47 PM Soni "They/Them" L. <fakedme@gmail.com <mailto:fakedme@gmail.com>> wrote:

    I'm rolling with this weird idea. is it possible to make a simple
    parser
    out of tables? I tried some stuff but ran into a few issues. I don't
    wanna use LPeg or anything more complicated than string.find(foo,
    bar,
    1, true).

    e.g.:

    local ltk = {} -- lua_tokens
    ltk.string = {}
    ltk['"'] = "string"
    ltk["'"] = "string" -- how to handle end of string correctly?
    ltk.string['\\'] = "escape"
    ltk.string.escape = {}
    ltk.string.escape['z'] = {}
    ltk.string.escape['z'].next = ltk.string.escape['z']
    for i, v in ipairs({" ", "\t", "\n", etc}) do
       ltk.string.escape['z'][v] = "next"
    end
    ltk.string.escape['z'][''] = ltk.string -- not sure if there'd be a
    better way to do this
    -- etc


Not a bad start for someone who's never done this formally before.

There are a zillion different ways to define a parser. The Dragon Book recommendation is well-given.

I think the insight you're missing is that the table really represents the structure of a state machine. Your top-level table shouldn't contain things like ["'"] = "string" directly; rather, you'd want something like ltk.base["'"]. Then you wouldn't have ltk.string.escape, you'd just have ltk.string and ltk.escape. The nesting doesn't belong in the table itself; instead, your parser will maintain a stack of states, and you push a state when you start into a parsing rule and you pop it off when you've finished that rule.

Yeah but then I wouldn't have self-referential tables. Besides I wasn't planning on having a stack. FSMs (same class as regex) don't have a stack.


The place where this is suddenly going to get complicated is if your grammar contains any prefix ambiguities (e.g. "fork" vs "forward"), because then you have to implement backtracking. For most simpler grammars it's best to just break it into two passes -- one that makes simple lexical tokens out of a string of characters, and one that evaluates the syntax out of a list of lexical tokens.

"fork" vs "forward" is unambiguous and doesn't require backtracking.


I don't mind offering assistance with this. I've actually built a complete standards-compliant parser for C++ from first principles before. I think it's pretty fun.

/s/ Adam