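For UCS-4 (UTF-32), make sure the code points are in range (no higher than U+10FFFF) and aren't surrogates (U+D800 through U+DFFF).
For UTF-16, make sure that all the surrogates are paired: each high surrogate (0xD800-0xDBFF) must be immediately followed by a low surrogate (0xDC00-0xDFFF), and a low surrogate must never appear on its own.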
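A rough sketch of those two checks, in case it helps. It is written in Python only to keep it short; the same loops should port to Lua without much trouble. The function names `is_valid_utf16` and `is_valid_utf32` and the `byteorder` parameter are made up for illustration, and a leading BOM gets no special treatment (it is simply checked like any other code point).

```python
def is_valid_utf32(data: bytes, byteorder: str = "big") -> bool:
    """Check that every 4-byte unit is a code point in range and not a surrogate."""
    if len(data) % 4 != 0:
        return False  # truncated code unit
    for i in range(0, len(data), 4):
        cp = int.from_bytes(data[i:i + 4], byteorder)
        if cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
            return False
    return True


def is_valid_utf16(data: bytes, byteorder: str = "big") -> bool:
    """Check that every surrogate in the 2-byte units is correctly paired."""
    if len(data) % 2 != 0:
        return False  # truncated code unit
    expecting_low = False
    for i in range(0, len(data), 2):
        unit = int.from_bytes(data[i:i + 2], byteorder)
        if expecting_low:
            # The previous unit was a high surrogate, so this one must be low.
            if not 0xDC00 <= unit <= 0xDFFF:
                return False
            expecting_low = False
        elif 0xD800 <= unit <= 0xDBFF:
            expecting_low = True  # high surrogate opens a pair
        elif 0xDC00 <= unit <= 0xDFFF:
            return False  # low surrogate with no high surrogate before it
    return not expecting_low  # input must not end in the middle of a pair
```

Pass `"little"` as `byteorder` for the LE variants and `"big"` for the BE ones.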
So far I've been able to validate UTF-8 strings: It parses all the bytes in the string and makes sure they all conform to UTF-8.
I'm now looking for ways to validate UTF-16 & UTF-32 (as well as their LE & BE versions). I have so far not found how to do that in the Unicode standard.
If someone has prior knowledge on that, please give a shout here.
Regards!
On Mon, Sep 30, 2013 at 11:44 AM, Enrique Garcia Cota <kikito@gmail.com> wrote:
Thanks for your answers guys. I will manually iterate over the whole string and look for inconsistencies to detect "binarity". Will report with my findings here.
On Fri, Sep 27, 2013 at 8:43 PM, William Ahern <william@25thandclement.com> wrote:
On Fri, Sep 27, 2013 at 12:47:30PM +0000, D. Matt Placek wrote:
> On Fri, Sep 27, 2013 at 10:55 AM, Enrique Garcia Cota <kikito@gmail.com> wrote:
>
> > Hello there,
> >
> > In my current setup I'm treating some strings in Lua and then storing them
> > in JSON.
> >
> > JSON expects strings in either UTF-8, UTF-16, or UTF-32, big-endian or
> > little-endian. Binary blobs outside of that are considered invalid.
> >
> > Unfortunately, some of the data I'm receiving can be binary. I need to
> > detect those cases and escape the binary data somehow (probably with Base64
> > encoding).
> >
>
> The JSON spec (RFC 4627) says: "All Unicode characters may be placed within
> the quotation marks except for the characters that must be escaped: quotation
> mark, reverse solidus, and the control characters (U+0000 through U+001F)."
>
> I use a very simple JSON encoder that just scans the string character by
> character and substitutes the correct escape sequence whenever one of these
> characters is encountered. I don't think you need to resort to Base64 or
> other binary encodings unless you really want to.
But if the string is non-ASCII (e.g. an ISO-8859 encoding) then the string
won't be valid Unicode. Escaping quotation marks, backslash, and control
characters won't fix that. A JSON parser could rightfully panic when
encountering invalid multibyte sequences, or drop them or transform them.
In practice, though, many JSON parsers/composers just do as you do. They
don't try to validate unescaped multi-byte sequences, either coming in or
going out.
But other implementations will cause trouble. That's why it's very common to
Base64-encode binary strings.
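For what it's worth, the usual pattern looks roughly like the sketch below: try a strict UTF-8 decode, and fall back to Base64 when that fails. It is in Python purely for brevity. The function name `to_json_value` and the `"text"` / `"base64"` wrapper keys are hypothetical, chosen only so the receiving side can tell the two cases apart; nothing in the JSON spec prescribes them.

```python
import base64
import json


def to_json_value(raw: bytes) -> str:
    """Serialize raw bytes as JSON, Base64-encoding anything that is not valid UTF-8."""
    try:
        # A strict decode raises on any malformed multi-byte sequence.
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8: treat it as a binary blob.
        encoded = base64.b64encode(raw).decode("ascii")
        return json.dumps({"base64": encoded})
    # Valid text: json.dumps escapes quotes, backslashes, and control characters.
    return json.dumps({"text": text})
```

An ISO-8859-1 string such as `b"h\xe9"` takes the Base64 branch here, because a lone 0xE9 byte is not a complete UTF-8 sequence.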