lua-l archive

On 09/09/15 05:19 PM, Coda Highland wrote:
On Wed, Sep 9, 2015 at 2:50 AM, Dirk Laurie wrote:
2015-09-08 22:51 GMT+02:00 Ross Berteig:

UTF-8 is at least normalizable in a way that would stabilize and
be immune to further normalization.
I think the intention of the disclaimer "Any operation that needs
the meaning of a character, such as character classification,
is outside its scope." is that the utf8 library does not claim to
provide the full Monty. This discussion has amply proved that
it is a nontrivial task to provide such a library.

In the documentation of the utf8 library there are provisos like
"assuming that the subject is a valid UTF-8 string". The scope
of the manual does not include spelling out what happens
when something is out of spec. For example, it is nowhere stated
what #tbl returns when the table is not a sequence.

I'm happy that the manual says enough to warn people that the
utf8 library is not an implementation of a standard.

A logician, a mathematician and a salesman visited Namibia
for the first time. From the window of their bus, a karakul
sheep could be seen.

"Amazing", said the salesman. "The sheep in Namibia are black".

"No", corrected the mathematician. "At least one sheep in
Namibia is black."

The logician pursed his lips and slowly brought the forefinger
and thumb of his right hand together. "There is at least one
sheep in Namibia, and the side of it that we can see is black."

The normalization to which I refer would be in scope for the limited
subset that the utf8 library supports -- simply converting all code
points in the string to a non-variable-width encoding (UCS-4),
collapsing paired surrogates in the process, and then re-encoding the
result into UTF-8. This process operates only on the byte-level
representation of the string and not upon the semantic meaning of any
codepoint therein except for surrogate pairs, which can be identified
by a straightforward range check.
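
As an illustration of the range check and of collapsing a pair, here is a
minimal sketch. The helper names (isHighSurrogate, isLowSurrogate,
combineSurrogates) are hypothetical and not part of the utf8 library; the
arithmetic is the standard UTF-16 surrogate-pair formula.

```lua
-- Hypothetical helpers, not part of Lua's utf8 library.
local function isHighSurrogate(c) return c >= 0xD800 and c <= 0xDBFF end
local function isLowSurrogate(c)  return c >= 0xDC00 and c <= 0xDFFF end

-- Collapse a high/low surrogate pair into one supplementary code point.
local function combineSurrogates(hi, lo)
  assert(isHighSurrogate(hi) and isLowSurrogate(lo), "not a surrogate pair")
  return 0x10000 + (hi - 0xD800) * 0x400 + (lo - 0xDC00)
end

print(("U+%X"):format(combineSurrogates(0xD83D, 0xDE00)))  --> U+1F600
```

Both tests are plain integer comparisons, so no knowledge of what the code
point means is needed.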

Given that Lua strings are length-tagged instead of null-terminated,
and given that the input string should always be consumed one byte at
a time (that is, don't assume that a codepoint's initial bits
accurately indicate its length, but consume continuation bytes until
you reach a non-continuing byte or the end of string) it is not
possible to construct a string that will cause such a normalization
pass to crash or run indefinitely unless that string would cause that
to happen anyway (i.e. if you could crash Lua without needing utf8).
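
One way to read the byte-at-a-time rule above is the following sketch. It
assumes Lua 5.3 or later (bitwise operators, utf8.char). The names
lenientCodes and lenientNormalize are hypothetical, and substituting U+FFFD
for stray continuation bytes and out-of-range values is a choice made for
this example, not something stated above.

```lua
-- Decode without trusting the length implied by the lead byte: after a lead
-- byte, keep consuming continuation bytes (10xxxxxx) until a
-- non-continuation byte or the end of the string.
local function lenientCodes(s)
  local out, i, n = {}, 1, #s
  while i <= n do
    local b = s:byte(i)
    i = i + 1
    local c
    if b < 0x80 then
      c = b                        -- ASCII byte
    elseif b < 0xC0 then
      c = 0xFFFD                   -- stray continuation byte (assumed policy)
    else
      -- strip the marker bits of the lead byte
      if b < 0xE0 then c = b & 0x1F
      elseif b < 0xF0 then c = b & 0x0F
      else c = b & 0x07 end
      local bad = false
      while i <= n do
        local cb = s:byte(i)
        if cb < 0x80 or cb >= 0xC0 then break end
        -- stop accumulating once out of range, but keep consuming bytes
        if c > 0x10FFFF then bad = true else c = (c << 6) | (cb & 0x3F) end
        i = i + 1
      end
      if bad or c > 0x10FFFF then c = 0xFFFD end
    end
    out[#out + 1] = c
  end
  return out
end

-- Re-encode every decoded code point in its shortest UTF-8 form.
local function lenientNormalize(s)
  local codes = lenientCodes(s)
  for k = 1, #codes do codes[k] = utf8.char(codes[k]) end
  return table.concat(codes)
end
```

Every pass through the outer loop consumes at least one byte and the string
length is known up front, so the loop terminates on any input, including a
sequence truncated at the end of the string.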

/s/ Adam

local function normalizeOverlong(s)
  local t = {}
  local i = 0
  -- decode each code point, then re-encode it in its shortest form
  for p,c in utf8.codes(s) do
    i = i + 1
    t[i] = utf8.char(c)
  end
  return table.concat(t, "", 1, i)
end

Adding surrogate pairs handling isn't too hard, either!

local function normalizeOverlongAndSurrogates(s)
  local t = {}
  local i = 0
  for p,c in utf8.codes(s) do
    i = i + 1
    if c >= 0xD800 and c <= 0xDBFF then
      -- high surrogate: park it in a table until its partner arrives
      t[i] = {c-0xD800} -- if left unpaired, errors on the table.concat
    elseif c >= 0xDC00 and c <= 0xDFFF then
      -- low surrogate: merge into the slot of the preceding high surrogate
      i = i - 1
      if type(t[i]) ~= "table" then error("Invalid surrogate sequence") end
      local x = t[i][1]
      t[i] = utf8.char(0x10000 + x * 0x400 + (c - 0xDC00))
    else
      t[i] = utf8.char(c)
    end
  end
  return table.concat(t, "", 1, i)
end

Disclaimer: these emails are public and can be accessed from <TODO: get a non-DHCP IP and put it here>. If you do not agree with this, DO NOT REPLY.