Let me back up a second. I believe:

  o  Lua should be able to work with UTF-8 in some useful way;

  o  The support is for UTF-8, not Unicode;

  o  UTF-8 is an encoding for data exchange between systems, and is currently defined by RFC 3629;

  o  The core can't provide any support like string.lower outside of US-ASCII;

  o  Single-byte character sets like ISO-8859-1 may happen to work with your C locale functions, but that’s not a promise.

Please let me know if these beliefs are wrong.

On Jun 29, 2017, at 7:45 PM, Duane Leslie <parakleta@darkreality.org> wrote:

> Also, I have noticed that the `utf8_decode` function passes the UTF-16 surrogates which are illegal codepoints, so this might also need to be fixed.

Ahh, now I remember why I kept my own UTF-8 validator. Lua’s behavior seems out of conformance with RFC 3629, and this isn’t just a SHOULD in the RFC, it’s a MUST.

Quoting https://tools.ietf.org/html/rfc3629#section-3 :

> The definition of UTF-8 prohibits encoding character numbers between U+D800 and U+DFFF [...]
> 
> Implementations of the decoding algorithm above MUST protect against decoding invalid sequences.  For instance, a naive implementation may decode the overlong UTF-8 sequence C0 80 into the character U+0000, or the surrogate pair ED A1 8C ED BE B4 into U+233B4.
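
The arithmetic behind the RFC's second example can be checked in a few lines. This is only a sketch, assuming Lua 5.3 or later (it uses the integer bitwise operators); decode3 is a made-up helper, not anything in the utf8 library. It decodes each three-byte half with no surrogate check, then joins the halves the way UTF-16 does.

```lua
-- Naively decode the 3-byte UTF-8 sequence starting at byte i, with no
-- check for surrogates (which is what RFC 3629 tells decoders not to do).
local function decode3(s, i)
  local b1, b2, b3 = s:byte(i, i + 2)
  return ((b1 & 0x0F) << 12) | ((b2 & 0x3F) << 6) | (b3 & 0x3F)
end

local s = "\xED\xA1\x8C\xED\xBE\xB4"
local hi, lo = decode3(s, 1), decode3(s, 4)
print(string.format("U+%04X U+%04X", hi, lo))   --> U+D84C U+DFB4

-- Join the two halves the way UTF-16 combines a surrogate pair.
local cp = 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)
print(string.format("U+%X", cp))                 --> U+233B4

-- The only valid UTF-8 encoding of that character is four bytes long.
print(utf8.char(cp) == "\xF0\xA3\x8E\xB4")       --> true
```

So the six-byte form and F0 A3 8E B4 name the same character, and only the four-byte form is UTF-8.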

Let’s see how we’re doing.

-- Print what utf8.len returns for t[2] next to the values the test expects.
function u(t)
  if t.rfc then print("MUST from RFC 3629:") end
  print(t[1], utf8.len(t[2]))
  print("expected", table.unpack(t.expect))
  print()
end

u{"zero, encoded", "\xC0\x80",
  expect={nil, 1}, rfc=true}

u{"bad CESU pair", "\xED\xA1\x8C\xED\xBE\xB4",
  expect={nil, 1}, rfc=true}

u{"half pair", "\xED\xA1\x8C",
  expect={nil, 1}}

u{"half plus A", "\xED\xA1\x8C" .. "A",
  expect={nil, 1}}

u{"astral char", "\xEF\xBB\xBF\xF0\xA3\x8E\xB4",
  expect={1}, rfc=true}

===

MUST from RFC 3629:
zero, encoded	nil	1
expected	nil	1

MUST from RFC 3629:
bad CESU pair	2
expected	nil	1

half pair	1
expected	nil	1

half plus A	2
expected	nil	1

MUST from RFC 3629:
astral char	2
expected	1

===

Note that the surrogate behavior is explicitly called out in RFC 3629’s “Security Considerations”, https://tools.ietf.org/html/rfc3629#section-10 .
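
For comparison, here is a minimal sketch of a length check that rejects the sequences above. It assumes Lua 5.3 or later for the bitwise operators, and strict_len is a hypothetical name, not part of Lua's utf8 library and not the validator mentioned earlier. It mirrors utf8.len's convention of returning nil plus the byte position of the first bad sequence. It does not treat a leading U+FEFF specially, so it returns 2 for the last test string rather than the 1 that test expects.

```lua
-- strict_len: hypothetical helper. Returns the number of characters in s,
-- or nil plus the byte position of the first sequence RFC 3629 forbids.
local function strict_len(s)
  local i, n, size = 1, 0, #s
  while i <= size do
    local b = s:byte(i)
    local extra, min, cp   -- continuation bytes, smallest legal value, code point
    if b < 0x80 then extra, min, cp = 0, 0, b
    elseif b >= 0xC2 and b <= 0xDF then extra, min, cp = 1, 0x80, b & 0x1F
    elseif b >= 0xE0 and b <= 0xEF then extra, min, cp = 2, 0x800, b & 0x0F
    elseif b >= 0xF0 and b <= 0xF4 then extra, min, cp = 3, 0x10000, b & 0x07
    else return nil, i end   -- stray continuation byte, C0, C1, or F5..FF
    for k = 1, extra do
      local c = s:byte(i + k)
      -- A missing byte or a non-continuation byte ends the sequence early.
      if not c or (c & 0xC0) ~= 0x80 then return nil, i end
      cp = (cp << 6) | (c & 0x3F)
    end
    -- Reject overlong forms, values past U+10FFFF, and surrogates.
    if cp < min or cp > 0x10FFFF or (cp >= 0xD800 and cp <= 0xDFFF) then
      return nil, i
    end
    i, n = i + extra + 1, n + 1
  end
  return n
end

print(strict_len("\xC0\x80"))                   --> nil  1
print(strict_len("\xED\xA1\x8C\xED\xBE\xB4"))   --> nil  1
print(strict_len("\xED\xA1\x8C" .. "A"))        --> nil  1
print(strict_len("\xF0\xA3\x8E\xB4"))           --> 1
```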

-- 
Jay Carlson
nop@nop.com