tl;dr: JSON and XML are relatively easy to standardize because they have strong specifications. We don't have to design as much. That being said, there are still a lot of design problems even in those simple cases.

I warned you this was long.

> On Apr 23, 2017, at 2:11 AM, Dirk Laurie <dirk.laurie@gmail.com> wrote:
> 
> Briefly, I think we do not need a Python-like standard library as much
> as we need standards for those common tasks that tend to entice module
> writers into inventing ever-better wheels.
> 
> The standard should be as sparse as possible, so that
>  (a) There is not much that the user needs to know.
>  (b) The quirks of a particular existing implementation do not get
> so fondly described as 'standard' that no other implementation
> qualifies.

I agree. Another possibility is automated testing: a test suite could define the first cut at interface compatibility. Then we can argue about specific test IDs, which tends to focus the mind more than generalities.
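
To make "test IDs" concrete, here is a minimal sketch of a conformance check for the json-style interface Dirk describes in the quote that follows (decode, encode, and a null sentinel). The function name check_json_interface and the IDs J1..J5 are invented for illustration, and the assumption that decode("null") returns json.null is mine, not something any spec says yet.

```lua
-- Hypothetical conformance checks for a json-style module.
-- The test IDs and this function's name are made up for the sketch.
local function check_json_interface(json)
  local results = {}

  -- Run one check; an error inside the check counts as a failure.
  local function test(id, fn)
    local ok, passed = pcall(fn)
    results[id] = ok and passed == true
  end

  test("J1-has-decode", function() return type(json.decode) == "function" end)
  test("J2-has-encode", function() return type(json.encode) == "function" end)
  test("J3-has-null", function() return json.null ~= nil end)
  test("J4-roundtrip-string", function()
    return json.decode(json.encode("abc")) == "abc"
  end)
  -- Assumption: JSON null decodes to the module's unique null value.
  test("J5-null-roundtrip", function()
    return json.decode("null") == json.null
  end)

  return results
end
```

The result is a table from test ID to true/false, so two implementations can be compared ID by ID, and the argument can be about whether J5 belongs in the standard rather than about whose module is "better".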

> 
> For example, a module conforming to the 'json' standard should
> be loadable as `json = require "bestjson"` and provide methods
> json.decode (string to table), encode (table to string), and a unique
> immutable value json.null. In this case it is obvious how the table
> should look, but the standard should document it. Anything else is extra.
> 
> When it comes to XML, it is no longer so obvious (does the attribute
> table go into [0] or a member named 'attr'?

My first attempt at documenting XML trees dates from the Lua 4.0 days: http://lua-users.org/wiki/XmlTree . And let's not forget: is the name of the tag called "name" or "tag"? One of them is *wrong*. :-)

I gave up on being right. LuaExpat presented lxp.lom as the API, and LOM trees were effectively standardized. So I switched to that. Have I mentioned semi-blessing?

It is not obvious to me whether LOM elements must be pure tables (i.e., no metatables, no proxies/userdata). That this is not obvious is probably because lxp.lom is intended as an implementation, and most people have the source to their tools; you can just read the code to answer questions like that. But if there's really an ecosystem around a data type, everybody may not be clear on which tacit requirements everybody else has. For example, here's something I've written and complained about too many times:

  -- drop the ordered list of attribute names; keep the name=value pairs
  for i=1,#t.attr do
    t.attr[i] = nil
  end

since I don't want pairs() to get the annoying numeric attributes. (This happens to be explicitly allowed by Matthew's spec.) 

My typical style is to put all kinds of non-XML annotation on LOM nodes. That is, I expect to be able to write t.foo = "bar", and have t.foo stay with the table, but not affect its XML content. A pattern for XML processing is tearing down the DOM tree to build your own tree, and the ability to scribble notes on the LOM nodes is very helpful--for starters, you can make a .parent link if you really need it. Is this allowed by the LOM?
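
As a sketch of that kind of scribbling, the walker below adds a parent link to every element. It assumes LOM-style nodes (a table with tag, attr, and children at 1..n, where text children are plain strings); the function name link_parents and the field name "parent" are my choices for illustration, not part of any spec.

```lua
-- Assumes LOM-style nodes: { tag = "name", attr = {...}, child1, child2, ... }
-- Children are either strings (text) or element tables.
-- "parent" is this sketch's field name, not something LOM defines.
local function link_parents(node, parent)
  node.parent = parent
  for i = 1, #node do
    local child = node[i]
    if type(child) == "table" then
      link_parents(child, node)
    end
  end
  return node
end
```

Note that the parent link makes the tree cyclic. Code that only looks at tag, attr, and the numeric children never notices, but anything that walks nodes with pairs() and recurses into every table value will loop, which is exactly the sort of tacit requirement a spec would have to spell out.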

Anybody formalizing LOM further should take a look at my ancient http://lua-users.org/wiki/LazyTree for a weird implementation of XML parse trees. [1] I think it's *probably* easily interoperable with everybody--well, everybody in the, uh, Lua 4.0 XML ecosystem. I should clean up some 5.x versions. Didn't everybody switch from XML to JSON already though? Maybe that makes this discussion easier...

One thing's still true. In case there are any pure table-centric implementations (and there are), we can't call methods on LOM nodes. So nearly everybody is going to need external functions for tree walkers, iterating over tags with certain names, etc. (Since Lua is the reimplement-it-yourself language, you probably end up with a good chunk of DOM/XPath/XFoo functionality replicated in your private nop.lom.* library. I bet it's super-fast though.) 
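
For example, here is roughly what one of those external functions tends to look like: an iterator over the direct child elements with a given tag name. It assumes the same LOM-style layout (tag field, children at 1..n, text as strings); children_named is a hypothetical name, not a function from lxp.lom.

```lua
-- Iterate over direct child elements of `node` whose tag equals `name`.
-- Text children (strings) and elements with other tags are skipped.
local function children_named(node, name)
  local i = 0
  return function()
    while true do
      i = i + 1
      local child = node[i]
      if child == nil then return nil end
      if type(child) == "table" and child.tag == name then
        return child
      end
    end
  end
end
```

Usage is the obvious generic for: `for item in children_named(doc, "item") do ... end`. Because it is a plain function and not a method, it works on pure-table trees from any parser that produces this layout.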

One of the functions nearly everybody has to use is lom.unparse, and it is almost as critical to get right as parsing.[2] Can I mix and match parsers and unparsers from different projects?

If interface definitions are in our future, I agree it would be very educational for us to take the LOM spec and see if we can document it enough such that interesting alternative implementations are possible. I could hack up a 5.3 LazyTree if anybody wants to try tests against it.

One appealing test is a tree walker for consolidation of text runs, so you will always have {"ab"} instead of {"a", "b"}. This can be done inside a LOM parser relatively easily, but since implementations are free to break up text runs, many apps need to clean the text runs themselves in some way. [3] Another possibility is a tree walker that metatable-izes a tree in place, or makes a metatable-ized copy. Hey, at least the copy will share strings...
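
A minimal sketch of the text-run consolidation walker, again assuming LOM-style nodes with string text children. The name merge_text_runs is made up, and this version modifies the tree in place; a copying version would be just as reasonable.

```lua
-- Merge adjacent string children so {"a", "b"} becomes {"ab"}, recursively.
-- Works in place on a LOM-style node and returns the same node.
local function merge_text_runs(node)
  local out, n = {}, 0
  for i = 1, #node do
    local child = node[i]
    if type(child) == "string" and type(out[n]) == "string" then
      -- Extend the current text run.
      out[n] = out[n] .. child
    else
      if type(child) == "table" then merge_text_runs(child) end
      n = n + 1
      out[n] = child
    end
  end
  -- Write the merged children back; slots past n become nil.
  for i = 1, #node do
    node[i] = out[i]
  end
  return node
end
```

Repeated concatenation is quadratic for very long runs of tiny strings; collecting the pieces and using table.concat would be the fix if that ever mattered.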

> must the code/decode be
> deterministically reversible?

XML applications must treat attribute order as insignificant, so any order is equally (non-)deterministic at the XML level. Only humans hand-editing XML have a case that the surface syntax should be left alone. [4]

Canonicalization of XML is a whole separate topic, but if you needed to know about it, you probably already do. Anyway, I tend to sort attribute names, but I often have controllable degrees of perversity in XML unparsing. :-)
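
Here is a rough sketch of the unperverse setting: an unparser that sorts attribute names and escapes markup characters. It assumes LOM-style nodes with string attribute values, ignores any numeric entries in attr, and does no character-set checking, no namespace handling, and no canonicalization in the formal sense. The names escape_text, escape_attr, and unparse are mine.

```lua
-- Escape character data for element content.
local function escape_text(s)
  return (s:gsub("[&<>]", { ["&"] = "&amp;", ["<"] = "&lt;", [">"] = "&gt;" }))
end

-- Escape an attribute value that will be wrapped in double quotes.
local function escape_attr(s)
  return (s:gsub('[&<>"]', {
    ["&"] = "&amp;", ["<"] = "&lt;", [">"] = "&gt;", ['"'] = "&quot;",
  }))
end

-- Serialize a LOM-style node; attributes are emitted in sorted name order.
local function unparse(node, buf)
  local top = buf == nil
  buf = buf or {}
  if type(node) == "string" then
    buf[#buf + 1] = escape_text(node)
  else
    -- Collect only the string keys; numeric entries (the ordered
    -- name list some parsers keep) are ignored here.
    local names = {}
    for k in pairs(node.attr or {}) do
      if type(k) == "string" then names[#names + 1] = k end
    end
    table.sort(names)
    buf[#buf + 1] = "<" .. node.tag
    for _, k in ipairs(names) do
      buf[#buf + 1] = " " .. k .. '="' .. escape_attr(node.attr[k]) .. '"'
    end
    if #node == 0 then
      buf[#buf + 1] = "/>"
    else
      buf[#buf + 1] = ">"
      for i = 1, #node do unparse(node[i], buf) end
      buf[#buf + 1] = "</" .. node.tag .. ">"
    end
  end
  if top then return table.concat(buf) end
end
```

Sorting gives the same output for the same tree no matter which parser built it or in what order pairs() happens to visit the attributes, which is handy for diffing test results across implementations.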

> We would not need subjective evaluations (this json/xml codec
> is fast/reliable etc) but only objective ones (it conforms to the standard
> interface, it passes this test).

Many of the claimed XML parsers for Lua are nowhere near XML parsers--they fail to implement basic requirements. I'm not talking about DTDs, I'm talking character sets and escaping. Is that a quality-of-implementation issue? Note that many of the pure-Lua parsers explicitly say they don't conform, but people still like them. Do we knock them out in the test suite?

Jay

[1]: A lazy-loading tree makes sense for large JSON documents too, and the same consumption techniques apply for user-friendly high efficiency. Maybe I should extend LazyKit to JSON. Then I can find out how implementation-independent Dirk's hypothetical Lua json module interface spec is....

[2]: Matthew Wild pointed out that using a strong parser like expat solves many problems when you are just cutting&pasting opaque trees. You should still see "HOWTO Avoid Being Called a Bozo When Producing XML" at https://hsivonen.fi/producing-xml/#serializer for the full argument. The whole document is great, and relevant to some parts of JSON processing too, so none of you are allowed to skip it.

[3]: Again, a lot of XML processing involves passing LOM subtrees opaquely; in that case, only the unparser would notice unconsolidated text runs, and it probably doesn't care.

[4]: And sometimes humans like the pretty-printed version better than the "untouched" version. But I already said that on the list recently.