On Thu, 21 Mar 2019 18:08:15 -0400
Sean Conner <sean@conman.org> wrote:

> It was thus said that the Great Steve Litt once stated:
> > On Tue, 19 Mar 2019 12:07:19 +0000
> > Geoff Smith <spammealot1@live.co.uk> wrote:
> > 
> >   
> > > Of course I had forgotten about not splitting on decimal points in
> > > numbers.  How can I adapt this to ignore the full stop character
> > > if surrounded by numbers?
> > > 
> > > Thanks for any solutions.  
> > 
> > The problem is in the specification. It's not easy to describe
> > what's a sentence ender and what's a decimal point. I'd split on a
> > dot followed immediately by whitespace: Space, Tab, Newline or
> > Formfeed.  
> 
>   Mr. Litt would break on a dot followed by whitespace. Mr. Conner
> would disagree, as he thinks e. e. cummings would also disagree.
> What constitutes a sentence?  Is this a sentence?
> 
>   -spc (He who took the No. 9 train.)

I have no idea whether the preceding is a sentence. Maybe the Chicago
Manual of Style would help?

You bring up an interesting point. No matter how good our
sentence-detection algorithm is, there will always be exceptions.
Maybe the key is to go as far as possible with the general algorithm,
and then handle the rest with a blacklist and a whitelist, each
covering both specific passages in the document and specific phrases.
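
One way to picture "general algorithm plus exception lists" is the rough sketch below. It is only an illustration under assumptions: the function name split_with_exceptions and the two tables no_break_after and break_before are hypothetical, and their entries are examples rather than a real abbreviation list. The general rule used here is "terminator, whitespace, then a word starting with a capital letter"; one list suppresses breaks after known abbreviations, and the other forces breaks before known lowercase sentence starters.

```lua
-- Hypothetical exception lists; the entries are illustrative only.
local no_break_after = { ["Mr."] = true, ["Dr."] = true, ["No."] = true }
local break_before = { systemd = true }

local function split_with_exceptions(text)
  local sentences = {}
  local start, pos = 1, 1
  while true do
    -- Candidate boundary: a terminator followed by whitespace.
    local s, e = text:find("[.!?]%s+", pos)
    if not s then break end
    -- Last word before the boundary, terminator included (e.g. "Mr.").
    local prev_word = text:sub(start, s):match("(%S+)$")
    -- First word after the boundary, or "" at the end of the text.
    local next_word = text:match("^[%w_]+", e + 1) or ""
    -- General rule: the next word starts with an uppercase letter.
    local general = next_word:match("^%u") ~= nil
    if (general and not no_break_after[prev_word]) or break_before[next_word] then
      sentences[#sentences + 1] = text:sub(start, s)
      start = e + 1
    end
    pos = e + 1
  end
  -- Whatever remains after the last boundary is the final sentence.
  local tail = text:sub(start)
  if tail:match("%S") then sentences[#sentences + 1] = tail end
  return sentences
end
```

With these example lists, "Mr. Litt uses systemd. systemd starts services." comes back as two sentences: the break after "Mr." is suppressed, and the break before the lowercase "systemd" is forced.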

Also, as Dirk pointed out, it might be better to split on a dot, one
or two spaces, and a capital letter. Unless, of course, you're
beginning the sentence with "systemd", whose producers insist on
spelling it in all lowercase. Also, I forgot that sentences can end
with an exclamation point or a question mark.
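
As a minimal sketch of that rule in plain Lua patterns (the function name split_on_capital is made up for illustration): a '.', '!' or '?' followed by one or two whitespace characters and an uppercase letter is treated as a boundary. Note that %s matches any whitespace, not only the space character, and %u depends on the current locale.

```lua
-- Split at '.', '!' or '?' followed by one or two whitespace
-- characters and an uppercase letter.
local function split_on_capital(text)
  local sentences = {}
  local start = 1
  while true do
    -- s is the terminator; e is the capital letter that follows it.
    local s, e = text:find("[.!?]%s%s?%u", start)
    if not s then break end
    sentences[#sentences + 1] = text:sub(start, s)
    start = e  -- the capital letter begins the next sentence
  end
  -- Whatever remains after the last boundary is the final sentence.
  local tail = text:sub(start)
  if tail:match("%S") then sentences[#sentences + 1] = tail end
  return sentences
end
```

A decimal point such as the one in "3.14" is left alone because no whitespace follows it, and "e. e. cummings" and "No. 9" survive because the next character is not a capital. "Mr. Litt" would still be split, and a sentence starting with "systemd" would be missed.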

Is regex the best way, or might this better be done with callback
routines?
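
For what it's worth, the two approaches can be combined in Lua, because string.gsub accepts a function as its replacement argument. The sketch below is one possible shape, not a recommendation: mark_boundaries and the is_boundary callback are hypothetical names. The pattern only finds candidates (terminator plus whitespace), and the callback decides whether each candidate really ends a sentence. The empty capture () yields the position after the whitespace, so the callback can look ahead without the pattern consuming the next word.

```lua
-- Replace the whitespace after each accepted boundary with a newline.
-- is_boundary(term, next_word) is a caller-supplied callback.
local function mark_boundaries(text, is_boundary)
  local marked = text:gsub("([.!?])%s+()", function(term, next_pos)
    -- Look ahead at the word that follows the whitespace.
    local next_word = text:match("^%S+", next_pos) or ""
    if is_boundary(term, next_word) then
      return term .. "\n"
    end
    -- Returning nothing makes gsub keep the matched text unchanged.
  end)
  return marked
end

-- Example callback: break only before a word starting with a capital.
local function starts_with_capital(term, next_word)
  return next_word:match("^%u") ~= nil
end
```

Any policy (abbreviation lists, special cases like "systemd") can then live in the callback while the pattern stays simple.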

SteveT