- Subject: Re: How do I make sure that a string is compatible with JSON (utf-8/16/32)?
- From: Coda Highland <chighland@...>
- Date: Fri, 27 Sep 2013 10:23:04 -0700
On Fri, Sep 27, 2013 at 9:42 AM, D. Matt Placek <atomicsuntan@gmail.com> wrote:
>
> On Fri, Sep 27, 2013 at 11:22 AM, Coda Highland <chighland@gmail.com> wrote:
>>
>> > I use a very simple JSON encoder that just scans the string character
>> > by character and substitutes the correct escape sequence whenever one
>> > of these characters is encountered. I don't think you need to resort
>> > to Base64 or other binary encodings unless you really want to.
>>
>> Two major problems here:
>>
>> (1) Not every value is a valid Unicode character. There are several
>> ranges defined as illegal, for various reasons.
>>
>> (2) Whether 8, 16, or 32 bit, not every byte sequence is a legal UTF
>> representation.
>
>
> Sorry, you are absolutely right. Getting back to the OP's question, the
> only issue seems to be how to determine when base64 encoding is needed.
> I think the OP is correct that you have to scan the string to check
> whether it contains any sequences that are invalid in the chosen UTF
> representation (either that, or just base64 encode the data all the
> time regardless). I couldn't easily tell from
> http://lua-users.org/wiki/LuaUnicode whether any of the Unicode
> packages already provide a function to test whether or not a string is
> a valid UTF encoding.
In my opinion, a full validity check isn't even worth it unless you KNOW
you're dealing with international text data. Just scan for bytes outside
of [8..11,13,32..126], and if there are any, base64 the whole thing.
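A minimal sketch of that scan in Lua (the function name is mine, and the
base64 step itself is left to whatever library you already use, e.g.
mime.b64 from LuaSocket):

```lua
-- Decide whether a string can be embedded directly in a JSON string,
-- or should be base64-encoded first.  The "safe" set is the one
-- suggested above: bytes 8..11, 13, and 32..126.
local function needs_base64(s)
  for i = 1, #s do
    local b = string.byte(s, i)
    local ok = (b >= 8 and b <= 11) or b == 13 or (b >= 32 and b <= 126)
    if not ok then
      return true
    end
  end
  return false
end

print(needs_base64("hello, world"))  --> false
print(needs_base64("caf\195\169"))   --> true  (UTF-8 bytes outside the set)
print(needs_base64("bin\0ary"))      --> true  (embedded NUL)
```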
What makes it not worth the effort is that, according to the spec, JSON
can't represent non-Unicode strings EVEN IF YOU USE ESCAPES -- the only
escape sequence available is \uXXXX, which only works for legal Unicode
code points, not arbitrary bytes (and even \u0000, while the JSON
grammar technically allows it, is rejected or mishandled by many
real-world consumers). And certain byte sequences look like they MIGHT
be part of a legal UTF-8 sequence but are actually ill-formed (or at
the very least non-shortest-form, which is actually worse because it
means the string won't round-trip through a system that uses UTF-16 as
its internal representation format).
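If you do want to check, here is a sketch of a strict UTF-8 validity
test in plain Lua (the function name is mine, not from any package on
the LuaUnicode page). It rejects the ill-formed cases mentioned above:
truncated sequences, bad continuation bytes, overlong
(non-shortest-form) encodings, surrogate code points, and values above
U+10FFFF:

```lua
-- Return true if s is a well-formed UTF-8 byte sequence.
local function is_valid_utf8(s)
  local i, n = 1, #s
  while i <= n do
    local b = string.byte(s, i)
    if b < 0x80 then
      i = i + 1                      -- plain ASCII byte
    else
      local len, cp, min
      if b >= 0xC2 and b <= 0xDF then len, cp, min = 2, b % 0x20, 0x80
      elseif b >= 0xE0 and b <= 0xEF then len, cp, min = 3, b % 0x10, 0x800
      elseif b >= 0xF0 and b <= 0xF4 then len, cp, min = 4, b % 0x08, 0x10000
      else return false end          -- 0x80..0xC1, 0xF5..0xFF can't start a sequence
      if i + len - 1 > n then return false end   -- truncated at end of string
      for j = i + 1, i + len - 1 do
        local c = string.byte(s, j)
        if c < 0x80 or c > 0xBF then return false end  -- bad continuation byte
        cp = cp * 0x40 + (c % 0x40)
      end
      if cp < min then return false end                       -- overlong
      if cp >= 0xD800 and cp <= 0xDFFF then return false end  -- surrogate
      if cp > 0x10FFFF then return false end                  -- out of range
      i = i + len
    end
  end
  return true
end

print(is_valid_utf8("caf\195\169"))   --> true   (U+00E9)
print(is_valid_utf8("\192\175"))      --> false  (overlong encoding of '/')
print(is_valid_utf8("\237\160\128"))  --> false  (lone surrogate U+D800)
```

Note this tells you the string is structurally valid UTF-8, not that it
contains sensible text -- which is why I'd still just base64 anything
that isn't obviously plain ASCII.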
/s/ Adam