It was thus said that the Great Geoff Smith once stated:
> This one has got me stuck for the moment. Can anyone come up with an
> elegant solution for this without needing an external library, please?
> 
> I have a long string of text that I need to split into sentences. Here is
> a sort-of-working attempt:
> 
> local text = "This is one sentence. This is another but with a number in it like 0.47 need to ignore it. This is the third. Fourth sentence"
> 
> local sentences = {}
> for i in string.gmatch(text, "[^%.]+") do
>   sentences[#sentences+1] = i
> end
> 
> for i = 1, #sentences do
>   print(i, sentences[i])
> end
> 
> Of course I had forgotten about not splitting on decimal points in
> numbers.  How can I adapt this to ignore the full stop character if
> surrounded by numbers?

  I had a similar issue back in 2014 [1], where I used LPeg to do the
parsing.  What I found is that just breaking on a period wasn't enough, so
I had to special-case the following [2]:

	MR.
	Mrs.
	MRS.
	Dr.
	DR.
	P. S.
	P.S.
	T. E.
	T.E.
	Gen.
	N. B.
	N.B.
	H.
	M.
	O.
	Z.

  The nice thing about LPeg was not only how easy it was to add exceptions
(like the list above), but also that I could transform the input into a
canonical format (like converting N.B. to N. B.).

  So yes, I do have a solution, but it does violate your constraint.
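
  That said, here is a rough sketch in plain Lua that stays within the
constraint.  It is not the code from [1]; the names split_sentences and
abbreviations are made up for illustration, and the abbreviation table is
only a sample.  The idea is to treat a period as a sentence end only when
whitespace follows it (so the period in 0.47 is skipped automatically) and
the word it ends is not a known exception:

```lua
-- Words that end in a period without ending a sentence.
-- Sample list for illustration only; extend it for your own documents.
local abbreviations = {
  ["Mr."] = true, ["Mrs."] = true, ["Dr."] = true, ["Gen."] = true,
}

local function split_sentences(text)
  local sentences = {}
  local start = 1   -- where the current sentence begins
  local init  = 1   -- where to resume searching for a period

  while true do
    -- A period followed by whitespace.  "0.47" never matches here,
    -- because a digit (not whitespace) follows its period.
    local s, e = text:find("%.%s+", init)
    if not s then break end

    -- The last run of non-space characters, including the period.
    local word = text:sub(start, s):match("(%S+)$")
    if abbreviations[word] then
      init = e + 1                 -- exception: keep scanning
    else
      sentences[#sentences + 1] = text:sub(start, s)
      start = e + 1
      init  = e + 1
    end
  end

  -- Whatever is left is the final sentence (it may lack a period).
  local rest = text:sub(start):match("^%s*(.-)%s*$")
  if rest ~= "" then
    sentences[#sentences + 1] = rest
  end
  return sentences
end

local text = "This is one sentence. This is another but with a number in it like 0.47 need to ignore it. This is the third. Fourth sentence"
for i, sentence in ipairs(split_sentences(text)) do
  print(i, sentence)
end
```

  This only covers the simple cases.  It will not cope with quotes, "?" or
"!", or initials that are not in the table, which is roughly where I ended
up reaching for a real parser.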

  -spc (My use case was breaking the input into words, but it's similar
	enough ... )

[1]	https://github.com/spc476/NaNoGenMo-2014

	Code I used:

	https://github.com/spc476/NaNoGenMo-2014/blob/master/word.lua

[2]	Some, like Mrs., are generic, while others, like T. E., were
	initials specific to the document.