It was thus said that the Great pygy79@gmail.com once stated:
> On Wed, Jan 30, 2019 at 4:58 AM Sean Conner <sean@conman.org> wrote:
> >
> >
> >   I managed to generate a segfault with LPEG and I can reproduce the issue
> > with this code [1]:
> >
> > local lpeg = require "lpeg"
> > local Cg = lpeg.Cg
> > local Cc = lpeg.Cc
> > local Cb = lpeg.Cb
> > local P  = lpeg.P
> >
> > local cnt = Cg(Cc(0),'count')
> >           * (P(1) * Cg(Cb'count' / function(c) return c + 1 end,'count'))^0
> >           * Cb'count'
> >
> > print(cnt:match(string.rep("x",512+128))) -- CRASH at some point past this line
> 
> LuLPeg also crashes, but for a larger string (between 512 * 51 and 512
> * 52 with Lua 5.3).
> 
> Just in case, your problem can be solved with a folding capture and no
> temp Lua variable:
> 
>     local cnt = Cf(
>       Cp() * P(1)^0 * Cp(),
>       function(first, last) return last - first end
>     )

  Cool solution, but it won't work for my use case.  Sigh.

  So here's the actual issue I was trying to solve.  I'm dealing with UTF-8
text in a terminal (xterm).  I want to trim a line of text to fit a line. 
utf8.len() won't work because it counts code points and not the actual
number of characters that will be drawn.  For example, for the string

	x = "Spin̈al Tap"

string.len(x) returns 12, utf8.len(x) returns 11, but it takes 10 character
positions (that's a "Combining Diaeresis" over the 'n' character).  So there
are certain Unicode codepoints I want to skip counting---I want a "display
length", not a "codepoint length".  The example I gave was a bad example in
this case.
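
  To make the three lengths concrete, here is a minimal sketch in plain Lua
5.3 (no LPEG).  The names zero_width and display_len are made up for this
illustration, and the range table is deliberately tiny and incomplete; it
only shows the idea of skipping code points that take no column.

```lua
-- Requires Lua 5.3+ (utf8 library and \u{...} escapes).
local x = "Spin\u{0308}al Tap"   -- U+0308 is COMBINING DIAERESIS

-- Illustrative, incomplete list of code point ranges that occupy no column.
local zero_width = {
  { 0x0300, 0x036F },  -- combining diacritical marks
  { 0x200B, 0x200D },  -- zero width space, non-joiner, joiner
}

-- Count code points, skipping those in the zero_width ranges.
local function display_len(s)
  local n = 0
  for _, cp in utf8.codes(s) do
    local skip = false
    for _, r in ipairs(zero_width) do
      if cp >= r[1] and cp <= r[2] then
        skip = true
        break
      end
    end
    if not skip then n = n + 1 end
  end
  return n
end

print(#x, utf8.len(x), display_len(x))  --> 12  11  10
```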

  Anyway, I do have code that works using lpeg.Carg():

local lpeg = require "lpeg"
local P, R, Carg = lpeg.P, lpeg.R, lpeg.Carg

local cutf8 = R" ~"                               -- ASCII minus C0 control set
            + lpeg.P"\194"     * lpeg.R"\160\191" -- UTF minus C1 control set [1]
            + lpeg.R"\195\223" * lpeg.R"\128\191"
            + lpeg.P"\224"     * lpeg.R"\160\191" * lpeg.R"\128\191"
            + lpeg.R"\225\236" * lpeg.R"\128\191" * lpeg.R"\128\191"
            + lpeg.P"\237"     * lpeg.R"\128\159" * lpeg.R"\128\191"
            + lpeg.R"\238\239" * lpeg.R"\128\191" * lpeg.R"\128\191"
            + lpeg.P"\240"     * lpeg.R"\144\191" * lpeg.R"\128\191" * lpeg.R"\128\191"
            + lpeg.R"\241\243" * lpeg.R"\128\191" * lpeg.R"\128\191" * lpeg.R"\128\191"
            + lpeg.P"\244"     * lpeg.R"\128\143" * lpeg.R"\128\191" * lpeg.R"\128\191"

local nc   = P"\204"     * R"\128\191" -- combining chars
           + P"\205"     * R"\128\175" -- combining chars
           + P"\225\170" * R"\176\190" -- combining chars
           + P"\225\183" * R"\128\191" -- combining chars
           + P"\226\131" * R"\144\176" -- combining chars
           + P"\239\184" * R"\160\175" -- combining chars
           + P"\u{00AD}"               -- shy hyphen
           + P"\u{1806}"               -- Mongolian TODO soft hyphen
           + P"\u{200B}"               -- zero width space
           + P"\u{200C}"               -- zero width non-joiner
           + P"\u{200D}"               -- zero width joiner
local cnt  = (nc + cutf8 * Carg(1) / function(s) s.cnt = s.cnt + 1 end)^0
           * Carg(1) / function(s) return s.cnt end

  It's not 100% perfect [2][3] but for what I'm doing, it works.
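
  The Carg(1) capture reads the first extra argument passed to match(), so
the counter table has to be supplied on every call.  A minimal sketch of the
calling convention, assuming LPEG is installed; the char and nc patterns
here are reduced stand-ins (printable ASCII plus U+0300..U+036F only), not
the full definitions:

```lua
local lpeg = require "lpeg"
local P, R, Carg = lpeg.P, lpeg.R, lpeg.Carg

-- Reduced stand-ins, only to show how the counter is threaded through.
local char = R" ~"
local nc   = P"\204" * R"\128\191" + P"\205" * R"\128\175"

local cnt  = (nc + char * Carg(1) / function(s) s.cnt = s.cnt + 1 end)^0
           * Carg(1) / function(s) return s.cnt end

-- Extra arguments follow the init position; pass a fresh table each call.
print(cnt:match("Spin\204\136al Tap", 1, { cnt = 0 }))  --> 10
```

  Reusing one table across calls would carry the count over, so a new
{ cnt = 0 } per match is the safer habit.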

  -spc (I should mention that the strings have had all control codes and
	sequences removed, so I need not concern myself with that ...)

[1]	Definition of UTF-8 I'm using comes from RFC-3629

[2]	Doesn't handle RtL text; and there are other 0-width characters I'm
	missing.

[3]	I could replace the \u{hhh} construction with something that works
	for Lua 5.1.
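
	One possible shape for footnote [3], as a sketch only: a
	hypothetical helper named utf8char (not part of Lua or LPEG) that
	encodes a code point with nothing newer than math.floor and
	string.char.  It does no validation of surrogates or out-of-range
	values.

```lua
-- Hypothetical helper: encode one Unicode code point as UTF-8.
-- Uses only features available in Lua 5.1.
local function utf8char(cp)
  local floor, char = math.floor, string.char
  if cp < 0x80 then
    return char(cp)
  elseif cp < 0x800 then
    return char(0xC0 + floor(cp / 0x40),
                0x80 + cp % 0x40)
  elseif cp < 0x10000 then
    return char(0xE0 + floor(cp / 0x1000),
                0x80 + floor(cp / 0x40) % 0x40,
                0x80 + cp % 0x40)
  else
    return char(0xF0 + floor(cp / 0x40000),
                0x80 + floor(cp / 0x1000) % 0x40,
                0x80 + floor(cp / 0x40) % 0x40,
                0x80 + cp % 0x40)
  end
end

-- Byte values of the soft hyphen, U+00AD
print(utf8char(0x00AD):byte(1, -1))  --> 194  173
```

	With that, P"\u{200B}" could be written as P(utf8char(0x200B)).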