lua-users home
lua-l archive

[Date Prev][Date Next][Thread Prev][Thread Next] [Date Index] [Thread Index]


On Fri, Sep 13, 2013 at 11:01 PM,  <lua.greatwolf@mamber.net> wrote:
> Hi all,
>
> I'm attempting to add JIT functionality(Will be using DynASM to help with
> this) to the reference LPeg 0.12 implementation but don't really have any
> prior experience doing this so not sure how far I'll get on this project.
> Maybe I'll learn something at the end.
>
> As a starting point I'm currently studying the implementation for vm parsing
> machine in lpvm.c. Additionally, I'm using the published peg documentation
> here by Roberto:
>
> http://www.inf.puc-rio.br/~roberto/docs/peg.pdf
>
> Anyways, onto my questions.
>
> I'm looking at the 'ISet' and 'ITestSet' opcodes but I'm not really
> understanding how it checks whether the current character is part of the set
> or not. Here's a code except for 'ISet':
>
>       case ISet: {
>         int c = (byte)*s;
>         if (testchar((p+1)->buff, c) && s < e)
>           { p += CHARSETINSTSIZE; s++; }
>
> "testchar" is a macro defined in lptypes.h:143 as:
>
>     #define testchar(st,c)    (((int)(st)[((c) >> 3)] & (1 << ((c) & 7))))
>
> Now according to the peg.pdf:
>
>> ... Sets are represented as bit sets, with one bit for each possible value
>> of a character. Each such instruction
>> uses 256 extra bits, or 16 bytes, to represent its set.
>
>
> There's probably a mistake in there because 16 bytes is actually 128-bits.
> Also LPeg had a rewrite since this doc was published so I don't know if all
> the info here is still accurate.
>
> So my question is, when there's a "ISet", "ITestSet" or possibly "ISpan"
> opcode in the compiled pattern, what would the opcode listing for that
> actually look like? Is the set of characters being checked for also encoded
> into this listing -- eg. similar to a line offset? If so, how many bytes
> would that take?
>
> Here's the union definition for opcode instructions in lpvm.h:
>
>   typedef union Instruction {
>     struct Inst {
>       byte code;
>       byte aux;
>       short key;
>     } i;
>     int offset;
>     byte buff[1];
>   } Instruction;
>
> So from this I can infer that each opcode instruction entry in the listing
> can be either i(an actual opcode instruction), an offset(for the opcode
> above it) or a byte buffer with 1 element in it.
>
> The two fields I'm unclear about is the purpose of "key" and "buff". "key"
> doesn't seem to be used during the interpreting phase so I'm guessing maybe
> it's used during pattern construction? In which case I can probably ignore
> that?
>
> Now the "buff" field is being used above in "testchar". Why is buff an array
> of byte with one element in it? Even if say this decayed into a pointer so
> it takes 4 bytes I still can't see how it fits in. Let say the pattern had a
> really big set. How would that translate into the opcode and what would the
> listing look like?
>
> Can anyone shed some light and help clear this up?
>
> Thanks!
>
>

The documentation is correct, except for the 16 bytes, which should read 32.

A set is thus represented in the opcode stream by an ISet instruction
followed by 256 consecutive bits to be addressed as an array of 32
bytes. Each bit acting as the presence or absence of a given character
in the set.

>     #define testchar(st,c)    (((int)(st)[((c) >> 3)] & (1 << ((c) & 7))))

This tests looks for the presence of the "c"-th bit in "st", the byte
array that starts at the buff address.

    (st)[((c) >> 3)] // select the correct byte ([0-31])
    & (1 << ((c) & 7)) bitwise and with the correct bit in this byte [0-7].

It can be done safely because the compiler always allocates the full
256 bits set.

Not sure why the end result is cast as an int, nor why the buffer is
defined as a one element array rather than a pointer... I suppose it
is meant to keep the compiler happy, I'm not a C expert at all.

Hope it helps.
-- Pierre-Yves