lua-users home
lua-l archive

It was thus said that the Great Steve Litt once stated:
> On Thu, 21 Mar 2019 18:08:15 -0400
> Sean Conner <sean@conman.org> wrote:
> 
> > It was thus said that the Great Steve Litt once stated:
> > > On Tue, 19 Mar 2019 12:07:19 +0000
> > > Geoff Smith <spammealot1@live.co.uk> wrote:
> > > 
> > >   
> > > > Of course I had forgotten about not splitting on decimal points in
> > > > numbers.  How can I adapt this to ignore the full stop character
> > > > if surrounded by numbers?
> > > > 
> > > > Thanks for any solutions.  
> > > 
> > > The problem is in the specification. It's not easy to describe
> > > what's a sentence ender and what's a decimal point. I'd split on a
> > > dot followed immediately by whitespace: Space, Tab, Newline or
> > > Formfeed.  
> > 
> >   Mr. Litt would break on a dot followed by whitespace. Mr. Conner
> > would disagree, as he thinks e. e. cummings would also disagree.
> > What constitutes a sentence?  Is this a sentence?
> > 
> >   -spc (He who took the No. 9 train.)
> 
> I have no idea whether the preceding is a sentence. Maybe the Chicago
> Manual of Style would help?
> 
> You bring up an interesting point. No matter how wonderful our
> sentence detection algorithm, there will always be exceptions. Maybe
> the key is to go as far as possible with the general algorithm, and
> then use a blacklist and a whitelist, both for specific text in the
> document and for specific phrases.
> 
> Also, as Dirk pointed out, it might be better to split on a dot, one
> or two spaces, and a capital letter. 

	Mr.
	Litt would break on a dot followed by whitespace.
	Mr.
	Conner would disagree, as he thinks e.
	e.
	cummings would also disagree.
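
  For anyone who wants to reproduce the list above, here is a minimal
  sketch of the "dot followed by whitespace" rule. It uses Python's
  standard `re` module rather than Lua patterns or LPEG, and the name
  `split_on_dot_whitespace` is made up for illustration.

```python
import re

text = ("Mr. Litt would break on a dot followed by whitespace. "
        "Mr. Conner would disagree, as he thinks e. e. cummings "
        "would also disagree.")

def split_on_dot_whitespace(s):
    # Split on any run of whitespace that directly follows a full stop.
    # The lookbehind keeps the dot attached to the text before it.
    return [piece for piece in re.split(r"(?<=\.)\s+", s) if piece]

for piece in split_on_dot_whitespace(text):
    print(piece)
```

  A decimal point such as the one in 3.14 is left alone, because it is
  not followed by whitespace, but every abbreviation and initial still
  ends a "sentence".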

> Unless, of course, you're
> beginning the sentence with "systemd", 

  e. e. cummings is also an exception here.

> whose producers insist on
> spelling it with all small characters. Also, I forgot that sentences
> can end with an exclamation point or a question mark.

  You forgot the interrobang‽
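
  One way the "general algorithm plus exception list" idea could look,
  as a rough sketch only. It is again Python's `re` module, not LPEG.
  The names `NON_ENDERS`, `BOUNDARY` and `split_sentences` are
  hypothetical, and the exception list is an assumption that would have
  to be tuned per document.

```python
import re

# Hypothetical exception list: words that end in a dot but do not
# end a sentence. A real document would need its own list.
NON_ENDERS = {"Mr.", "Mrs.", "Dr.", "No.", "e."}

# Whitespace that follows a terminator: . ! ? or the interrobang.
BOUNDARY = re.compile(r"(?<=[.!?‽])\s+")

def split_sentences(s):
    sentences, current = [], []
    for piece in BOUNDARY.split(s):
        if not piece.strip():
            continue
        current.append(piece)
        # If the piece ends in a listed word, glue it to the next piece.
        if piece.split()[-1] not in NON_ENDERS:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences
```

  This keeps "Mr. Conner" and "e. e. cummings" in one piece, but it
  will also refuse to end a sentence whose last word really is "e.",
  so the exceptions only move the problem around.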

> Is regex the best way, or might this better be done with callback
> routines?

  LPEG.  

  -spc (Definitely LPEG)