On May 22, 2013, at 8:47 PM, Geoff Leyland <geoff_leyland@fastmail.fm> wrote:

> This is only a slight improvement, but it seems to work.
> 
>> print( ( 'aaaa aaa aa a aaaaa' ):gsub( '([aeiou])[aeiou]', '%1' ))

Perfect. Combined with ASCII transliteration, and, voilà, a very simple word normalization of sorts:

print( 1, Unidecode( 'blåbærsyltetøj' ):lower():gsub( '([aeiou])[aeiou]', '%1' ) )
print( 2, Unidecode( 'blåbärsyltetöj' ):lower():gsub( '([aeiou])[aeiou]', '%1' ) )
print( 3, Unidecode( 'blaabaarsyltetoej' ):lower():gsub( '([aeiou])[aeiou]', '%1' ) )
print( 4, Unidecode( 'blabarsyltetoj' ):lower():gsub( '([aeiou])[aeiou]', '%1' ) )
print( 5, Unidecode( 'Räksmörgås' ):lower():gsub( '([aeiou])[aeiou]', '%1' ) )
print( 6, Unidecode( 'Göteborg' ):lower():gsub( '([aeiou])[aeiou]', '%1' ) )
print( 7, Unidecode( 'Gøteborg' ):lower():gsub( '([aeiou])[aeiou]', '%1' ) )
print( 8, Unidecode( 'Über' ):lower():gsub( '([aeiou])[aeiou]', '%1' ) )
print( 9, Unidecode( 'ueber' ):lower():gsub( '([aeiou])[aeiou]', '%1' ) )
print( 10, Unidecode( 'uber' ):lower():gsub( '([aeiou])[aeiou]', '%1' ) )
print( 11, Unidecode( 'uuber' ):lower():gsub( '([aeiou])[aeiou]', '%1' ) )

1	blabarsyltetoj	1
2	blabarsyltetoj	0
3	blabarsyltetoj	3
4	blabarsyltetoj	0
5	raksmorgas	0
6	goteborg	0
7	goteborg	0
8	uber	0
9	uber	1
10	uber	0
11	uber	1
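
For anyone who wants to try this without a Unidecode module at hand, here is a rough, self-contained sketch of the same chain. The `translit` table, `to_ascii` and `normalize` are made-up names for illustration, not part of any library; the table covers only the characters used in the examples above and is no substitute for a real transliteration, and the source file is assumed to be saved as UTF-8.

```lua
-- Hypothetical stand-in for Unidecode: handles only the characters
-- that appear in the examples above (assumes a UTF-8 source file).
local translit = {
  ['å'] = 'a', ['ä'] = 'a', ['æ'] = 'ae',
  ['ø'] = 'o', ['ö'] = 'o', ['ü'] = 'u',
  ['Å'] = 'A', ['Ä'] = 'A', ['Æ'] = 'AE',
  ['Ø'] = 'O', ['Ö'] = 'O', ['Ü'] = 'U',
}

local function to_ascii( s )
  -- Match one UTF-8 encoded character at a time and look it up in the table;
  -- characters without an entry are left unchanged.
  return ( s:gsub( '[\1-\127\194-\244][\128-\191]*', translit ) )
end

local function normalize( word )
  -- The outer parentheses drop gsub's second return value (the substitution count).
  return ( to_ascii( word ):lower():gsub( '([aeiou])[aeiou]', '%1' ) )
end

print( normalize( 'blåbærsyltetøj' ) )
print( normalize( 'blaabaarsyltetoej' ) )
print( normalize( 'Über' ) )
```

Wrapping the chain in a function that returns a single value makes it easier to use as a lookup key, since the bare `gsub` call also returns the count shown in the second column above.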

Examples courtesy of https://issues.apache.org/jira/browse/LUCENE-5013