Hi,
I'm the author of unluac, a
popular Lua decompiler.
I noticed that after posting. Kudos! I see it is still actively worked on.
I did not find much in the way of comments in the Java code, or of command-line options, but, as you report, the decompiler does seem to work very well, and the classes and functions seem to be small, modular, and easy to understand.
As best as I can tell, it works pretty much the way the other decompiler works, and like others I have seen for Python. You need a disassembly of the bytecode instructions (which differs from raw bytecode only in that it is a list of instruction structures rather than a grab bag of bytes from which you carve instructions out). From that, expression trees are built, and then there is the tree side, which has code to print out the source construct for each tree node.
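To make that pipeline concrete, here is a toy sketch in Python. The 3-byte instruction format, the opcode names, and every class and function name are made up for illustration. This is not real Lua bytecode, and it is not how unluac is implemented; it only shows the three stages (bytes to instruction structures, instructions to an expression tree, tree to source text).

```python
from dataclasses import dataclass

# Hypothetical toy format: each instruction is 3 bytes (opcode, A, B).
OPCODES = {0: "LOADK", 1: "ADD", 2: "RETURN"}


@dataclass
class Instruction:
    name: str
    a: int
    b: int


def disassemble(code: bytes) -> list[Instruction]:
    """Stage 1: carve fixed-width instruction structures out of raw bytes."""
    return [Instruction(OPCODES[code[i]], code[i + 1], code[i + 2])
            for i in range(0, len(code), 3)]


@dataclass
class Const:
    value: int


@dataclass
class BinOp:
    op: str
    left: object
    right: object


@dataclass
class Return:
    value: object


def build_tree(instructions: list[Instruction]) -> Return:
    """Stage 2: track one expression per register, stop at RETURN."""
    regs = {}
    for ins in instructions:
        if ins.name == "LOADK":      # R[a] = constant b
            regs[ins.a] = Const(ins.b)
        elif ins.name == "ADD":      # R[a] = R[a] + R[b]
            regs[ins.a] = BinOp("+", regs[ins.a], regs[ins.b])
        elif ins.name == "RETURN":   # return R[a]
            return Return(regs[ins.a])
    raise ValueError("no RETURN instruction found")


def to_source(node) -> str:
    """Stage 3: each tree node type knows how to print its construct.

    Only '+' exists in this toy, so operator precedence is ignored.
    """
    if isinstance(node, Const):
        return str(node.value)
    if isinstance(node, BinOp):
        return f"{to_source(node.left)} {node.op} {to_source(node.right)}"
    if isinstance(node, Return):
        return f"return {to_source(node.value)}"
    raise TypeError(f"unknown node type: {type(node).__name__}")


raw = bytes([0, 0, 2,   0, 1, 3,   1, 0, 1,   2, 0, 0])
print(to_source(build_tree(disassemble(raw))))  # prints: return 2 + 3
```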
One thing I have found useful on the tree side is to use the same names as the nonterminal names in
https://www.lua.org/manual/5.4/manual.html#9. So, for example, since the grammar calls a statement "stat", I use that same name for the equivalent thing in my code (however funny the names "stat" or "retstat" might seem). Also, since the decompiler builds an entire tree for a function or a module before doing any printing, I can print out the tree produced, and that helps both in my debugging and in showing users how to think about the way instructions are grouped.
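As a rough illustration of that naming idea, here is a small Python sketch. The node kinds ("block", "stat", "varlist", "var", "explist", "exp", "retstat") are nonterminal names from the Lua 5.4 grammar; the Node class and the format_tree function are hypothetical names invented for this example and are not taken from any existing decompiler.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    kind: str                                  # a Lua grammar nonterminal name
    children: list = field(default_factory=list)
    text: str = ""                             # token text, for leaf nodes only


def format_tree(node: Node, depth: int = 0) -> str:
    """Render the tree as an indented outline, one node per line."""
    label = node.kind + (f" {node.text!r}" if node.text else "")
    lines = ["  " * depth + label]
    for child in node.children:
        lines.append(format_tree(child, depth + 1))
    return "\n".join(lines)


# Hand-built tree for the Lua chunk:  x = 1  return x
tree = Node("block", [
    Node("stat", [
        Node("varlist", [Node("var", text="x")]),
        Node("explist", [Node("exp", text="1")]),
    ]),
    Node("retstat", [
        Node("explist", [Node("exp", text="x")]),
    ]),
])

print(format_tree(tree))
```

Because the outline uses the grammar's own names, someone reading the dump can look each line up directly in the reference manual.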
I tried running unluac on the issues from the other, C-based decompiler, and the logic bugs there were fixed. However,
https://github.com/viruscamp/luadec/issues/47 seems to show the same problem as reported in that bug. Whether this is really a semantic difference (which in my opinion would be a real bug) or a "possibly nice to have" that is not really a bug because the two are semantically the same, I don't know enough about Lua to say.
If the latter, in the decompilation framework I use, it would be something added to the higher-level tree transformation layer.
In short, what is the state of decompilers
for Lua as such and for Lua version 5.4?
Unluac supports Lua 5.4. I consider unluac to be very good for
all supported versions (5.0+) as long as it has debug info (and
the bytecode is output of the normal PUC-Rio compiler). Most files
(especially "real" files) can be decompiled perfectly (meaning
re-compiling will give back the exact original file except for
line numbers). Probably the main deficiency in this context is
with "goto" which I think most people don't use in a way that
unluac doesn't support. (This is not really a problem that should
be hard to solve; I just haven't gotten around to it.)
To get a sense of how good it is (and, as I wrote above, it looks good), I tried compiling and decompiling "all.lua" from the Lua 5.4.4 test suite, and then ran the resulting decompiled Lua. The decompiled program seemed to work the same way and produce the same results as the original. A more rigorous test would have been to do this on all of the 40 or so Lua test programs. I also noticed that unluac comes with about 260 tests of its own.
There is something very neat about decompilers like this that decompile back to Lua (as opposed to an invented, common decompiler language): you get an automated process both for finding bugs and for testing, without having to write any custom tests. You just take all of the Lua test suites that anyone has written, compile them, decompile them, and then run the results. Of course, custom tests are still desirable for testing isolated parts of the decompiler or for narrowing down bugs found by the automated process. For Python I can easily find 100K test programs on my disk alone. The numeric, scientific, natural-language processing, and machine-learning packages account for that many in their test programs in and of themselves.
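The compile, decompile, and run loop can be sketched as a small harness. This is a minimal Python sketch under stated assumptions: the function names are hypothetical, and the compiler, decompiler, and runner are passed in as plain callables. In real use they would presumably be thin wrappers that invoke luac, unluac, and lua; to keep the sketch self-contained and runnable, the demo below substitutes trivial Python stand-ins, including a deliberately broken "decompiler".

```python
import contextlib
import io
from typing import Callable


def round_trip_matches(source: str,
                       compile_fn: Callable[[str], bytes],
                       decompile_fn: Callable[[bytes], str],
                       run_fn: Callable[[str], str]) -> bool:
    """True if the decompiled program produces the same output as the original."""
    expected = run_fn(source)
    decompiled = decompile_fn(compile_fn(source))
    return run_fn(decompiled) == expected


def failing_programs(programs: dict[str, str],
                     compile_fn, decompile_fn, run_fn) -> list[str]:
    """Names of the programs whose round trip changes observable behavior."""
    return [name for name, src in programs.items()
            if not round_trip_matches(src, compile_fn, decompile_fn, run_fn)]


# --- Stand-ins for the demo only (not real compiler or decompiler calls) ---

def run_python(source: str) -> str:
    """Run a Python snippet and capture what it prints."""
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        exec(source, {})
    return buf.getvalue()


def fake_compile(source: str) -> bytes:
    return source.encode("utf-8")


def good_decompile(code: bytes) -> str:
    return code.decode("utf-8")


def buggy_decompile(code: bytes) -> str:
    # Simulates a decompiler bug: every '+' comes back as '-'.
    return code.decode("utf-8").replace("+", "-")


programs = {"add": "print(1 + 2)", "hello": "print('hi')"}
print(failing_programs(programs, fake_compile, good_decompile, run_python))   # []
print(failing_programs(programs, fake_compile, buggy_decompile, run_python))  # ['add']
```

Comparing only printed output is the weakest useful check. A stricter variant would recompile the decompiled source and compare the bytecode against the original.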
Has Lua bytecode or its compilation
changed significantly?
It was overall pretty easy to add 5.4 support to unluac. There are
a lot of new and changed opcodes in Lua 5.4, but hardly any of
them really do anything new. TBC is maybe the main new thing in
the bytecode, and it didn't seem that hard to deal with. Other
issues were probably pretty specific to unluac's implementation,
or esoteric.
complex branching
I have no basis to compare with Python, but I looked at some of
your stuff. Using a parser to decompile control structures is
interesting, and I can see some appeal, although I would have to
dig in a lot more about it to figure out if it would actually solve
any of the really difficult issues.
The thing that I am pretty sure solves everything is to generate a control-flow graph and from that build a dominator tree. I don't have that code publicly visible and have not gotten through a full Python version using it. However, so far it has been very helpful. What is weird is that although it is rarely needed, in the cases where it is needed it is invaluable. And in the cases where it is not strictly needed, it has been very informative about how to write the rules, even if the information doesn't need to be consulted explicitly.
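For readers unfamiliar with dominators, here is a minimal Python sketch of the textbook iterative algorithm on a hand-written control-flow graph. It is not the unpublished code mentioned above. The block names and the example graph (an if/else inside a loop) are invented for illustration, and the sketch assumes every successor also appears as a key in the graph.

```python
def dominators(cfg: dict[str, list[str]], entry: str) -> dict[str, set[str]]:
    """Map each block to the set of blocks that dominate it (itself included).

    Assumes every successor named in cfg is also a key of cfg.
    """
    nodes = set(cfg)
    preds = {n: set() for n in nodes}
    for n, succs in cfg.items():
        for s in succs:
            preds[s].add(n)

    # Start from "everything dominates everything" and shrink to a fixed point.
    dom = {n: set(nodes) for n in nodes}
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for n in nodes - {entry}:
            if preds[n]:
                new = set.intersection(*(dom[p] for p in preds[n])) | {n}
            else:
                new = {n}  # unreachable block: dominated only by itself
            if new != dom[n]:
                dom[n] = new
                changed = True
    return dom


def immediate_dominators(dom: dict[str, set[str]]) -> dict[str, str]:
    """Pick each block's closest strict dominator (its dominator-tree parent)."""
    idom = {}
    for n, ds in dom.items():
        strict = ds - {n}
        if strict:
            # Strict dominators form a chain; the closest one has the most
            # dominators of its own.
            idom[n] = max(strict, key=lambda d: len(dom[d]))
    return idom


# Hypothetical CFG: an if/else nested inside a loop.
cfg = {
    "entry": ["loop_head"],
    "loop_head": ["body", "exit"],
    "body": ["then", "else"],
    "then": ["join"],
    "else": ["join"],
    "join": ["loop_head"],   # back edge
    "exit": [],
}

dom = dominators(cfg, "entry")
idom = immediate_dominators(dom)
print(idom["join"])  # body
print(idom["exit"])  # loop_head
```

In this example "join" is dominated by "body" but by neither branch, which is the kind of fact that identifies where an if/else ends, and the edge from "join" back to "loop_head" (a block that dominates it) marks the loop.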
Lua does have some "branch optimization," and this is indeed one
of the things that causes the most trouble in decompiling control
flow. This can interact with "break" and other unconditional jumps
(like "else," "goto" (when it is supported), and also jumps that
are evidence of loops or conditionals that have otherwise been
optimized out), and this is probably the largest case of
complexity.
Short-circuit logic and its interaction with loops can be difficult to sort out too. For example, consider a loop containing some sort of logic, or an if statement that can't be folded into the loop condition because the if statement doesn't reach the end of the block.