lua-users home
lua-l archive

On Tue, Jan 24, 2012 at 1:34 PM, Tony Finch <dot@dotat.at> wrote:
> Jay Carlson <nop@nop.com> wrote:
>
> Thanks for your interesting response.

It is more interesting when I am not just shouting at clouds! :-)

> You might design a system with more precisely typed data, where the type
> of a string corresponds to its formal language, i.e. the syntax of the
> string. So strings aren't just strings, they are SQL strings or JSON
> strings or passwords, etc. Then each slot in a template needs to know the
> type of string that it accepts. So a SQL query template might have a slot
> that only accepts a quoted SQL string literal. You can't just interpolate
> a username into that slot, you need to do a type conversion, and part of
> that conversion includes putting quotes around the string and escaping
> metacharacters correctly.

Yeah. That's better than I could describe it, especially for program
analysis. In the statically- and explicitly-typed world, this seems
like the way to deal with string manipulation--at minimum resulting
strings are in the formal language L, and restrictions on the
languages L1,L2,etc of the placeholders can be made strongly enough
that they keep unwanted non-terminals in L from appearing. I am not
sure the escaping can be made automatic merely from Ln.
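
A minimal sketch of the typed-slot idea, in Python for illustration
only. The names SqlLiteral, to_sql_literal and fill_user_query are
hypothetical, and the quote-doubling rule is an assumption about the
target SQL dialect; bound parameters from a database driver would be
the safer choice in practice.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SqlLiteral:
    """A string already known to be a quoted SQL string literal."""
    text: str


def to_sql_literal(raw: str) -> SqlLiteral:
    # The type conversion is where quoting and escaping happen.
    # Doubling single quotes is assumed to be the dialect's escape rule.
    return SqlLiteral("'" + raw.replace("'", "''") + "'")


def fill_user_query(name: SqlLiteral) -> str:
    # The slot accepts only SqlLiteral, so a raw str is rejected outright.
    if not isinstance(name, SqlLiteral):
        raise TypeError("slot requires SqlLiteral, got " + type(name).__name__)
    return "SELECT * FROM users WHERE name = " + name.text
```

Passing a plain username raises TypeError; the caller has to go
through to_sql_literal, which is the only place the escaping lives.
Note that the conversion here is written by hand, not derived from
the slot's language.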

There is a similar problem in type systems: is struct{int a=0} the
same as another struct{int a=0}? Or struct{int b=0}? Modula-3 solved
this by adding a BRANDED "ABCDEF" qualifier to type declarations,
making all struct {int a=0} the same type, while struct BRANDED "foo"
was distinct, as was BRANDED "bar". (You can omit the string, in
which case the compiler makes a unique one for you.) Most interface
chains bottomed out in a REVEAL block where type T was a concrete
BRANDED OBJECT subtyping the partially revealed type(s) in the public
interface(s).[1]

Besides, there are still many languages where there simply is no way
of expressing an arbitrary string as a single literal because the
language lacks adequate escapes--or even prohibits characters. Anyone
writing an ASCII 12 aka Form Feed is going to be unhappy when trying
to communicate it in XML. (I see pain in your future. Plus perhaps
external unparsed entities. "What the heck are those?" is a likely
reaction from both programmers and software.)
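
The Form Feed case can be checked with any conforming XML 1.0
parser. A small sketch using Python's standard library (the helper
name parses_as_xml is made up for this example):

```python
import xml.etree.ElementTree as ET


def parses_as_xml(document: str) -> bool:
    """Return True if the XML 1.0 parser accepts the document."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


FORM_FEED = chr(12)

raw_attempt = "<s>" + FORM_FEED + "</s>"   # the character itself
escaped_attempt = "<s>&#12;</s>"           # a numeric character reference
control = "<s>&#10;</s>"                   # newline is a permitted character

for doc in (raw_attempt, escaped_attempt, control):
    print(repr(doc), parses_as_xml(doc))
```

Both the raw character and the numeric reference are rejected, so
there is no literal form to escape into; the newline case is there
only to show the helper accepts a legal reference.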

> There have been at least two Google templating libraries that sort of
> works along these lines, but they use parsers with a lot of built-in
> knowledge of web languages to derive the typing requirements.

> http://googleonlinesecurity.blogspot.com/2009/03/reducing-xss-by-way-of-automatic.html

That's quite cool, given its goals. I'd recommend reading that page to
anybody thinking about writing or using template systems because of
this example:

====
In this template, four variables are used (not in this order):

    USER_NAME is inserted into regular HTML text and hence can be
escaped safely by HTML-escape.
    USER_ACCOUNT_URL is inserted into an HTML attribute that expects a
URL and therefore in addition to HTML-escape, also requires validation
that the URL scheme is safe. By allowing only a safe white-list of
schemes, we can prevent (say) javascript: pseudo-URLs, which
HTML-escape alone does not prevent.
    USER_COLOR is inserted into a Cascading Style Sheets (CSS) context
and therefore requires an escaping that also prevents scripting and
other dangerous constructs in CSS such as those possible in
expression() or url(). For more information on concerns with harmful
content in CSS, refer to the CSS section of the Browser Security
Handbook.
    USER_ID is inserted into a Javascript variable that expects a
number as it is not enclosed in quotes. As such, it requires an
escaping that coerces it to a number (which a typical
Javascript-escape function does not do), otherwise it can lead to
arbitrary javascript execution. More variants may be developed to
coerce content to other data types, including arrays and objects.

Each of these variable insertions requires a different escaping method
or risks introducing XSS. To keep the example small, we excluded
several contexts of interest, particularly style tags, HTML attributes
that expect Javascript (such as onmouseover), and considerations of
whether attribute values are enclosed within quotes or not (which also
affects escaping).
====
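
To make the point concrete, here is a rough sketch of three of those
contexts as separate functions. This is not the Google library's API;
the function names and the SAFE_SCHEMES whitelist are assumptions for
illustration, and the CSS context is left out because it needs far
more than a few lines.

```python
import html
from urllib.parse import urlsplit

SAFE_SCHEMES = {"http", "https", "mailto"}  # assumed whitelist


def for_html_text(value: str) -> str:
    # Regular HTML text: plain HTML-escape is enough.
    return html.escape(value)


def for_url_attribute(value: str) -> str:
    # URL attribute: validate the scheme first, then HTML-escape.
    # Relative URLs have no scheme and are rejected by this sketch.
    scheme = urlsplit(value).scheme.lower()
    if scheme not in SAFE_SCHEMES:
        raise ValueError("unsafe URL scheme: " + repr(scheme))
    return html.escape(value, quote=True)


def for_js_number(value: str) -> str:
    # Unquoted Javascript number: coerce to an integer or fail.
    # int() raises ValueError for anything that is not an integer.
    return str(int(value))
```

The same input string is safe in one slot and dangerous in another:
"javascript:alert(1)" passes through for_html_text unchanged, while
for_url_attribute refuses it.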

> http://google-caja.googlecode.com/svn/changes/mikesamuel/string-interpolation-29-Jan-2008/trunk/src/js/com/google/caja/interp/index.html

That's closer to the level I'm working at; I think I can steal from
it. Bonus points for citing Steve Christey's *2007* paper
"Unforgivable Vulnerabilities".

> And they are rather heavy-weight, so difficult to imitate.

The first is trying to cheaply clean up after twenty years of ad-hoc
language design implemented in systems excessively liberal in what
they accept. I think it's entitled to a little complexity.

> A more principled approach is to parse everything as it comes in off the
> wire into a tree-structured internal representation. Instead of
> interpolating strings into templates you graft tree nodes into slots.
> You then serialize the parse tree back into its external representation
> when sending it out, and the serializer will naturally either do the right
> thing or fail clearly.

This is my position, certainly for Lua. If writing code to produce
correct results when compositing implicitly structured strings is
difficult (formal language problem) and it's not well supported, we
should try to work with the language features designed for structured
composite data instead. Push knowledge of string languages to the
edges of the system, where they can be written once and carefully
examined.
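
A minimal sketch of that approach, in Python for illustration. The
node shape (a str for text, or a (tag, attrs, children) tuple) is an
assumption made for this example, and tag and attribute names are
assumed to be valid already; only text and attribute values are
escaped.

```python
from xml.sax.saxutils import escape, quoteattr


def serialize(node) -> str:
    """Serialize a str (text) or a (tag, attrs, children) tuple."""
    if isinstance(node, str):
        return escape(node)  # escapes &, < and >
    tag, attrs, children = node
    attr_text = "".join(
        " " + name + "=" + quoteattr(value) for name, value in attrs.items()
    )
    body = "".join(serialize(child) for child in children)
    return "<" + tag + attr_text + ">" + body + "</" + tag + ">"


# Untrusted text is grafted in as a node, never pasted into markup.
user_input = "Tom & Jerry <script>"
tree = ("p", {"title": "A & B"}, ["Hello, ", ("b", {}, [user_input])])
print(serialize(tree))
```

The rest of the program only builds tuples and strings; the
serializer is the one place that knows about "&" and "<".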

In a beautiful irony, a thread below this was asking about how to
use strings containing "&" in LuaSOAP. The ur-LOM marshaller in
http://www.place.org/~nop/luaxmlrpc-0.0.tar.gz : xmlgen.lua, dated
2001-11-27, got this right.[2] Given valid tag and attribute names,
the only thing it doesn't do is validate Unicode on the way out.

Jay

[1]: Chains of public interfaces of types in Modula-3? Yeah, Java was
heavily influenced by it.

[2]: I wrote the expat binding that week because I was freaking out
about the correctness of Lua XML processing; I needed it for XML-RPC.
It turns out that the XML-RPC spec (which is "frozen forever")
requires strings to express arbitrary octets like Form Feed. So
because XML-RPC implementations must be XML applications, there can be
no XML-RPC implementations because they cannot simultaneously satisfy
both requirements. Dave Winer clarified--that is, unfroze the spec--to
say the XML conformance took precedence; he had not encountered this
or the Unicode issues at the time of writing the spec because he did
not have an actual XML parser in his implementation in Userland.
Userland didn't have timezone-aware date/time handling either, and
guess what's missing in XML-RPC too. As revealed by later events, the
lessons that "it's OK to say other people know stuff too" and
"standardization is not like software design" did not seem to take.