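For UCS-4 (UTF-32), make sure every code point is in range (no higher than U+10FFFF) and isn't a surrogate (U+D800 through U+DFFF).

For UTF-16, make sure that all the surrogates are paired: each high surrogate (U+D800 through U+DBFF) must be immediately followed by a low surrogate (U+DC00 through U+DFFF).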
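
The two checks above can be sketched roughly as follows. This is only an illustration in Python, not code from utf8_validator.lua; the function names are made up for the example, and it assumes the caller already knows the byte order (BOM detection is left out).

```python
def _units(data: bytes, width: int, big_endian: bool) -> list:
    """Split data into fixed-width unsigned integers (hypothetical helper)."""
    order = "big" if big_endian else "little"
    return [int.from_bytes(data[i:i + width], order)
            for i in range(0, len(data), width)]


def valid_utf32(data: bytes, big_endian: bool = True) -> bool:
    """True if every 4-byte unit is a code point in range and not a surrogate."""
    if len(data) % 4 != 0:
        return False
    for cp in _units(data, 4, big_endian):
        if cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
            return False
    return True


def valid_utf16(data: bytes, big_endian: bool = True) -> bool:
    """True if every surrogate code unit is part of a high/low pair."""
    if len(data) % 2 != 0:
        return False
    units = _units(data, 2, big_endian)
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # High surrogate: the next unit must be a low surrogate.
            if i + 1 >= len(units) or not (0xDC00 <= units[i + 1] <= 0xDFFF):
                return False
            i += 2
        elif 0xDC00 <= u <= 0xDFFF:
            # Low surrogate with no preceding high surrogate.
            return False
        else:
            i += 1
    return True
```

For example, `valid_utf16(b"\xd8\x3d")` is false because the high surrogate has no partner, while the same two bytes followed by `b"\xde\x00"` form a valid pair.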

On 1 Oct 2013 14:55, "Enrique Garcia Cota" <kikito@gmail.com> wrote:
So far I've been able to validate utf-8 strings:

https://github.com/kikito/utf8_validator.lua

It parses all the bytes in the string and makes sure they all conform to UTF-8.

I'm now looking for ways to validate UTF-16 & UTF-32 (as well as their LE & BE versions). So far I have not found how to do that in the Unicode standard.

If someone has prior knowledge on that, please give a shout here.

Regards!


On Mon, Sep 30, 2013 at 11:44 AM, Enrique Garcia Cota <kikito@gmail.com> wrote:
Thanks for your answers guys. I will manually iterate over the whole string and look for inconsistencies to detect "binarity". Will report with my findings here.


On Fri, Sep 27, 2013 at 8:43 PM, William Ahern <william@25thandclement.com> wrote:
On Fri, Sep 27, 2013 at 12:47:30PM +0000, D. Matt Placek wrote:
> On Fri, Sep 27, 2013 at 10:55 AM, Enrique Garcia Cota <kikito@gmail.com> wrote:
>
> > Hello there,
> >
> > In my current setup I'm treating some strings in Lua and then storing them
> > in JSON.
> >
> > JSON expects strings in UTF-8, UTF-16, or UTF-32, in big-endian or
> > little-endian byte order. Binary blobs outside of those are considered invalid.
> >
> > Unfortunately, some of the data I'm receiving can be binary. I need to
> > detect those cases and escape the binary data somehow (probably with Base64
> > encoding).
> >
>
> The JSON spec (RFC 4627) says:  "All Unicode characters may be placed within
> the quotation marks except for the characters that must be escaped: quotation
> mark, reverse solidus, and the control characters (U+0000 through U+001F)."
>
> I use a very simple JSON encoder that just scans the string character by
> character and substitutes the correct escape sequence whenever one of these
> characters is encountered.  I don't think you need to resort to Base64 or
> other binary encodings unless you really want to.

But if the string is non-ASCII (e.g. an ISO-8859 encoding) then the string
won't be valid Unicode. Escaping quotation marks, backslash, and control
characters won't fix that. A JSON parser could rightfully panic when
encountering invalid multibyte sequences, or drop them or transform them.
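
To illustrate the point, here is a small sketch (in Python, purely as an example; `is_valid_utf8` is a name invented here) showing that the same text is valid UTF-8 in one encoding and an invalid byte sequence in ISO-8859-1. It relies on Python's strict UTF-8 decoder to do the validation.

```python
def is_valid_utf8(data: bytes) -> bool:
    """True if data decodes cleanly with the strict UTF-8 decoder."""
    try:
        data.decode("utf-8")  # error handling is "strict" by default
        return True
    except UnicodeDecodeError:
        return False


latin1_bytes = "caf\u00e9".encode("iso-8859-1")  # the e-acute is the single byte 0xE9
utf8_bytes = "caf\u00e9".encode("utf-8")         # the e-acute is the two bytes 0xC3 0xA9

print(is_valid_utf8(latin1_bytes))
print(is_valid_utf8(utf8_bytes))
```

In the ISO-8859-1 version the byte 0xE9 looks like the start of a three-byte UTF-8 sequence with nothing after it, which is exactly the kind of input a strict parser may reject.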

In practice, though, many JSON parsers/composers just do as you do. They
don't try to validate unescaped multi-byte sequences, either coming in or
going out.

But other implementations will cause trouble. That's why it's very common to
Base64-encode binary strings.
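
One way this fallback might look, as a rough Python sketch: pass the string through if it is valid UTF-8, otherwise Base64-encode it. The `json_safe` function and the `"encoding"`/`"value"` wrapper keys are hypothetical conventions chosen for this example, not part of JSON or of any library.

```python
import base64
import json


def json_safe(data: bytes) -> dict:
    """Wrap bytes for JSON: text if valid UTF-8, otherwise Base64 (hypothetical layout)."""
    try:
        return {"encoding": "utf-8", "value": data.decode("utf-8")}
    except UnicodeDecodeError:
        # Not valid UTF-8: treat it as binary and encode it as ASCII-only Base64.
        encoded = base64.b64encode(data).decode("ascii")
        return {"encoding": "base64", "value": encoded}


print(json.dumps(json_safe(b"hello")))
print(json.dumps(json_safe(b"\xff\xfe\x00")))
```

The receiver needs the tag to know whether to Base64-decode the value, which is why the sketch wraps the string instead of emitting it bare.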