Re: LPEG documentation needs more clarification

lua-l archive

Subject: Re: LPEG documentation needs more clarification
From: Adrian Perez de Castro <aperez@...>
Date: Fri, 15 Feb 2019 15:56:43 +0200

Hello!

On Fri, 15 Feb 2019 21:27:19 +0800, Sam Atman <atmanistan@gmail.com> wrote:

> It would seem there is a pure Lua wcwidth already: https://github.com/aperezdc/lua-wcwidth

Author of lua-wcwidth here! Reading this thread I just remembered that the
module needed an update to the latest Unicode version, so I just pushed
version 0.3 earlier today :)

If you end up using it, feel free to ask about it. I mostly lurk
on lua-l nowadays, but I do read most of the threads anyway and I am
happy to help out.

Cheers,

-Adrián

> Sam Atman
> Principal Agent, Special Circumstances
> ~wep
> 
> > On Feb 15, 2019, at 10:24 AM, Xavier Wang <weasley.wx@gmail.com> wrote:
> > 
> > 
> > 
> > On Fri, Feb 15, 2019 at 03:37, Sean Conner <sean@conman.org> wrote:
> >> It was thus said that the Great pygy79@gmail.com once stated:
> >> > On Wed, Jan 30, 2019 at 4:58 AM Sean Conner <sean@conman.org> wrote:
> >> > >
> >> > >
> >> > >   I managed to generate a segfault with LPEG and I can reproduce the issue
> >> > > with this code [1]:
> >> > >
> >> > > local lpeg = require "lpeg"
> >> > > local Cg = lpeg.Cg
> >> > > local Cc = lpeg.Cc
> >> > > local Cb = lpeg.Cb
> >> > > local P  = lpeg.P
> >> > >
> >> > > local cnt = Cg(Cc(0),'count')
> >> > >           * (P(1) * Cg(Cb'count' / function(c) return c + 1 end,'count'))^0
> >> > >           * Cb'count'
> >> > >
> >> > > print(cnt:match(string.rep("x",512+128))) -- CRASH at some point past this line
> >> > 
> >> > LuLPeg also crashes, but for a larger string (between 512 * 51 and 512
> >> > * 52 with Lua 5.3).
> >> > 
> >> > Just in case, your problem can be solved with a folding capture and no
> >> > temp Lua variable:
> >> > 
> >> >     local cnt = Cf(
> >> >       Cp() * P(1)^0 * Cp(),
> >> >       function(first, last) return last - first end
> >> >     )
> >> 
> >>   Cool solution, but it won't work for my use case.  Sigh.
> >> 
> >>   So here's the actual issue I was trying to solve.  I'm dealing with UTF-8
> >> text in a terminal (xterm).  I want to trim a line of text to fit on one
> >> terminal line.  utf8.len() won't work because it counts code points and not
> >> the actual number of characters that will be drawn.  For example, for the
> >> string
> >> 
> >>         x = "Spin̈al Tap"
> >> 
> >> string.len(x) returns 12, utf8.len(x) returns 11, but it takes 10 character
> >> positions (that's a "Combining Diaeresis" over the 'n' character).  So there
> >> are certain Unicode codepoints I want to skip counting---I want a "display
> >> length", not a "codepoint length".  The example I gave was a bad example in
> >> this case.
> > 
> > Maybe you just need a wcwidth routine in luautf8 module 😃
> >> 
> >>   Anyway, I do have code that works using lpeg.Carg():
> >> 
> >> local lpeg = require "lpeg"
> >> local P, R, Carg = lpeg.P, lpeg.R, lpeg.Carg
> >> 
> >> local cutf8 = R" ~"                               -- ASCII minus C0 control set
> >>             + lpeg.P"\194"     * lpeg.R"\160\191" -- UTF minus C1 control set [1]
> >>             + lpeg.R"\195\223" * lpeg.R"\128\191"
> >>             + lpeg.P"\224"     * lpeg.R"\160\191" * lpeg.R"\128\191"
> >>             + lpeg.R"\225\236" * lpeg.R"\128\191" * lpeg.R"\128\191"
> >>             + lpeg.P"\237"     * lpeg.R"\128\159" * lpeg.R"\128\191"
> >>             + lpeg.R"\238\239" * lpeg.R"\128\191" * lpeg.R"\128\191"
> >>             + lpeg.P"\240"     * lpeg.R"\144\191" * lpeg.R"\128\191" * lpeg.R"\128\191"
> >>             + lpeg.R"\241\243" * lpeg.R"\128\191" * lpeg.R"\128\191" * lpeg.R"\128\191"
> >>             + lpeg.P"\244"     * lpeg.R"\128\143" * lpeg.R"\128\191" * lpeg.R"\128\191"
> >> 
> >> local nc   = P"\204"     * R"\128\191" -- combining chars
> >>            + P"\205"     * R"\128\175" -- combining chars
> >>            + P"\225\170" * R"\176\190" -- combining chars
> >>            + P"\225\183" * R"\128\191" -- combining chars
> >>            + P"\226\131" * R"\144\176" -- combining chars
> >>            + P"\239\184" * R"\160\175" -- combining chars
> >>            + P"\u{00AD}"               -- soft hyphen (SHY)
> >>            + P"\u{1806}"               -- Mongolian Todo soft hyphen
> >>            + P"\u{200B}"               -- zero-width space
> >>            + P"\u{200C}"               -- zero-width non-joiner
> >>            + P"\u{200D}"               -- zero-width joiner
> >> local cnt  = (nc + cutf8 * Carg(1) / function(s) s.cnt = s.cnt + 1 end)^0
> >>            * Carg(1) / function(s) return s.cnt end
> >> 
> >>   It's not 100% perfect [2][3] but for what I'm doing, it works.
> >> 
> >>   -spc (I should mention that the strings have had all control codes and
> >>         sequences removed, so I do not need concern myself with that ...)
> >> 
> >> [1]     Definition of UTF-8 I'm using comes from RFC-3629
> >> 
> >> [2]     Doesn't handle RtL text; and there are other 0-width characters I'm
> >>         missing.
> >> 
> >> [3]     I could replace the \u{hhh} construction with something that works
> >>         for Lua 5.1.
> >> 
> >> 
> > -- 
> > regards,
> > Xavier Wang.
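
For illustration only, the "display length" idea from the quoted message can
also be sketched without LPeg. The version below is a minimal sketch, not Sean's
code and not the lua-wcwidth API: the names zero_width, is_zero_width and
display_len are hypothetical, the ranges are transcribed from the quoted `nc`
pattern, and it assumes valid UTF-8 input where every remaining code point
takes exactly one column (so no wide CJK characters and no RtL handling). It
avoids \u{...} escapes and the utf8 library, so it should also run on Lua 5.1.

```lua
-- Zero-width code point ranges, transcribed from the quoted `nc` pattern.
local zero_width = {
  {0x00AD, 0x00AD}, -- soft hyphen
  {0x0300, 0x036F}, -- combining diacritical marks
  {0x1806, 0x1806}, -- Mongolian Todo soft hyphen
  {0x1AB0, 0x1ABE}, -- combining diacritical marks extended
  {0x1DC0, 0x1DFF}, -- combining diacritical marks supplement
  {0x200B, 0x200D}, -- zero-width space, non-joiner, joiner
  {0x20D0, 0x20F0}, -- combining marks for symbols
  {0xFE20, 0xFE2F}, -- combining half marks
}

local function is_zero_width(cp)
  for _, r in ipairs(zero_width) do
    if cp >= r[1] and cp <= r[2] then return true end
  end
  return false
end

-- Hypothetical helper: counts display columns in s.
-- Assumes s is valid UTF-8 with control codes already stripped.
local function display_len(s)
  local i, n, cols = 1, #s, 0
  while i <= n do
    local b = s:byte(i)
    local cp, len
    if b < 0x80 then                       -- 1-byte sequence (ASCII)
      cp, len = b, 1
    elseif b < 0xE0 then                   -- 2-byte sequence
      cp = (b - 0xC0) * 64 + (s:byte(i + 1) - 0x80)
      len = 2
    elseif b < 0xF0 then                   -- 3-byte sequence
      cp = (b - 0xE0) * 4096
         + (s:byte(i + 1) - 0x80) * 64
         + (s:byte(i + 2) - 0x80)
      len = 3
    else                                   -- 4-byte sequence
      cp = (b - 0xF0) * 262144
         + (s:byte(i + 1) - 0x80) * 4096
         + (s:byte(i + 2) - 0x80) * 64
         + (s:byte(i + 3) - 0x80)
      len = 4
    end
    if not is_zero_width(cp) then cols = cols + 1 end
    i = i + len
  end
  return cols
end

-- "Spin" + U+0308 (combining diaeresis, bytes \204\136) + "al Tap"
print(display_len("Spin\204\136al Tap")) --> 10
```

For the "Spin̈al Tap" example this gives 10, against 12 bytes and 11 code
points. A real wcwidth implementation covers far more ranges, including
double-width characters, which this sketch deliberately ignores.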