Trying to learn LPeg using the Swedish Chef

Subject: Trying to learn LPeg using the Swedish Chef
From: Eric Wing <ewmailing@...>
Date: Mon, 30 Jan 2012 02:18:13 -0800
Hi all,
I have been trying to learn LPeg in my so-called free time. I've done
a few simple things with it so far, so I wanted to try something
slightly more challenging (though not too much) and something
interesting/fun. So I decided I would try to port the famous
'Encheferizer' lex grammar by John Hagerman which translates text into
a phonetic output that the Swedish Chef from the Muppets would speak.
Maybe this was too hard, but I was hoping some of you LPeg experts out
there might give me some help/tips/insights. (I am still using the re
module because I thought it was more appropriate for a beginner
skill set, which is where I still am.)

After a few weekends, I have something that is pretty close. But it's
still not quite right and I think I may have some fundamental flaws.


local lpeg = require "lpeg"
local re = require "re"

bork = re.compile[[
	text <- {~ item* ~}
	WordChar <- [A-Za-z']
	NotWord <- [^A-Za-z']
	item <- ProcessedWord / NotWord

	ExemptWord <- 'bork'
		/ 'Bork'

	EndOfParagraphPunctuation <- [.!?]%nl -> '
Bork Bork Bork!
'
	AccentSyllable <- 'an' -> 'un'
		/ 'An' -> 'Un'
		/ 'au' -> 'oo'
		/ 'Au' -> 'Oo'
		/ 'the' -> 'zee'
		/ 'The' -> 'Zee'
		/ 'v' -> 'f'
		/ 'V' -> 'F'
		/ 'w' -> 'v'
		/ 'W' -> 'V'

	AccentPrefixSubstitution <- 'e' -> 'i'
		/ 'E' -> 'I'
		/ 'o' -> 'oo'
		/ 'O' -> 'Oo'

	AccentSuffixSubstitution <- 'en' -> 'ee'
		/ 'th' -> 't'

	InWordAccentSuffixSubstitution  <-  'e' -> 'e-a'

	InWordAccentSubstituion <- 'ew' -> 'oo'
		/ 'f' -> 'ff'
		/ 'ir' -> 'ur'
		/ 'ow' -> 'oo'
		/ 'o' -> 'u'
		/ 'tion' -> 'shun'
		/ 'u' -> 'oo'
		/ 'U' -> 'Oo'
		/ 'i' -> 'ee'		
	
	AccentFollowedBySyllableSubstitution <-  'an' -> 'un'
		/ 'An' -> 'Un'
		/ 'au' -> 'oo'
		/ 'Au' -> 'Oo'
		/ 'a' -> 'e'
		/ 'A' -> 'E'

	CombinedInWordAccent <- InWordAccentSubstituion / AccentSyllable

	CombinedInWordAccentThatHasNoSuffixOrChar <- CombinedInWordAccent / WordChar

	CombinedInWordAccentThatHasSuffix <-
AccentFollowedBySyllableSubstitution / CombinedInWordAccent
	CombinedInWordAccentThatHasSuffixOrChar <-
CombinedInWordAccentThatHasSuffix / WordChar

	CombinedAnyTimeSuffix <-  AccentSuffixSubstitution /
InWordAccentSuffixSubstitution

	CombinedAccentSyllableOrChar <- AccentSyllable / WordChar

	CombinedAccentSyllableOrCharThatWillBeFollowedByAnotherCharacter <-
AccentFollowedBySyllableSubstitution / CombinedAccentSyllableOrChar

	ProcessedInWordAccentThatEndsWithSuffix <- CombinedAnyTimeSuffix NotWord
		/ CombinedInWordAccentThatHasSuffix
		/ WordChar
	
	ProcessedInWordAccentThatEndsWithNoSuffix <-
CombinedInWordAccentThatHasSuffixOrChar CombinedInWordAccent NotWord
		/ CombinedInWordAccentThatHasSuffix
		/ WordChar	
	
	ProcessedWord <- ExemptWord NotWord
		/ EndOfParagraphPunctuation
		/ AccentPrefixSubstitution ProcessedInWordAccentThatEndsWithSuffix+
		/ AccentPrefixSubstitution NotWord		
		/ CombinedAccentSyllableOrCharThatWillBeFollowedByAnotherCharacter
ProcessedInWordAccentThatEndsWithSuffix+
		/ CombinedAccentSyllableOrCharThatWillBeFollowedByAnotherCharacter
ProcessedInWordAccentThatEndsWithNoSuffix+
		/ CombinedAccentSyllableOrChar NotWord
		/ AccentSuffixSubstitution NotWord

]]

local welcomestring = [[
Welcome to the wonderful world of the Sweedish Chef! Enclosed
in this archive are four files:
]]

result = lpeg.match (bork, welcomestring)

-- Some Comments:
-- CombinedInWordAccent: Inside a word, evaluates substitutions in a
-- word (InWordAccentSubstituion) or any general substitution
-- (AccentSyllable).
-- CombinedInWordAccentThatHasNoSuffixOrChar: Inside a word. Like
-- CombinedInWordAccent except it also matches a WordChar as a last
-- match. Intended to be used for matches inside a word, but those words
-- won't have a well-known suffix.
-- CombinedInWordAccentThatHasSuffix: Inside a word that is assumed to
-- end with a suffix. Evaluates CombinedInWordAccent plus
-- AccentFollowedBySyllableSubstitution (because I know there will be
-- following characters due to the suffix).
-- CombinedInWordAccentThatHasSuffixOrChar: Inside a word that is
-- assumed to end with a suffix. Evaluates
-- CombinedInWordAccentThatHasSuffix plus matches WordChar as the last
-- match.
-- CombinedAnyTimeSuffix: Any suffix (two types of suffixes: those
-- that are in a word, and those that are the entire word by themselves).
-- CombinedAccentSyllableOrChar: Evaluates general substitutions that
-- may occur anywhere (AccentSyllable) or matches the WordChar.
-- CombinedAccentSyllableOrCharThatWillBeFollowedByAnotherCharacter:
-- Assumes not at the end of a word. Like CombinedAccentSyllableOrChar
-- but also evaluates AccentFollowedBySyllableSubstitution because there
-- are following characters which allow this pattern to be matched.

-- ProcessedInWordAccentThatEndsWithSuffix: Intended to get all
-- patterns inside a word that end with a suffix. (Use of repetitions
-- expected.)
-- ProcessedInWordAccentThatEndsWithNoSuffix: Intended to get all
-- patterns inside a word that don't end with a suffix. (Use of
-- repetitions expected.) The pattern terminology used inside here is a
-- little screwed up. Even though internally it uses 'HasSuffix', its
-- intent is merely to make sure AccentFollowedBySyllableSubstitution
-- patterns are evaluated. This is done by explicitly having
-- 'CombinedInWordAccent NotWord' at the end of the first pattern which
-- ensures there will be following characters.

-- ProcessedWord:
-- 1) exempt full words 'bork' and 'Bork' from being transformed.
-- 2) Transform all end of paragraphs with .?! to
-- \nBork Bork Bork!\n
-- 3) Handle words with both a prefix and suffix
-- 4) Handle words that are just a prefix by itself
-- 5) Handle words with no prefix but with a suffix
-- 6) Handle words with no prefix and no suffix
-- 7) Handle words that are just a special syllable by itself
-- 8) Handle words that are just a suffix by itself




So my result is very close to what Hagerman's original program
produces, but I have a few problems.

Hagerman Lex:
Velcume-a tu zee vunderffool vurld ooff zee Sveedeesh Cheff! Inclused
in thees ercheefe-a ere-a fuoor feeles:

My LPeg:
Velcume-a tu zee vunderffool vurld ooff zee Sveedeesh Cheff! Inclused
in thees ercheefe-a ere-a ffuoor feeles:


1) I have a very hard time dealing with suffixes. Suffixes are special
transforms for certain syllables that only happen when they come at
the very end of the word. One example where I have problems is words
that end with 'e'. The rule is that it should transform into 'e-a'.

So at the end of the above sentence: 'are four files', Hagerman gets
'ere-a fuoor feeles', but I get 'ere-a ffuoor feeles'.
I am correctly getting the 'e-a' transformation, but my processing for
the next word is incorrect. The 'f' to 'ff' transformation is not
supposed to happen when at the beginning of a word.

The grammar in question is:
	ProcessedWord <-
CombinedAccentSyllableOrCharThatWillBeFollowedByAnotherCharacter
ProcessedInWordAccentThatEndsWithSuffix+

	ProcessedInWordAccentThatEndsWithSuffix <- CombinedAnyTimeSuffix NotWord
		/ CombinedInWordAccentThatHasSuffix
		/ WordChar

My belief is that because I put the NotWord in
'ProcessedInWordAccentThatEndsWithSuffix <- CombinedAnyTimeSuffix
NotWord', the space is being greedily eaten, so my processing continues
onto the next word and my grammar doesn't see that the 'f' is at the
beginning of a new word.
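
A minimal sketch of one way to test for the end of a word without
consuming the separator, assuming a much reduced grammar that keeps only
the 'e' -> 'e-a' suffix rule and the in-word 'f' -> 'ff' rule. The rule
names `word` and `inword` and the variable `demo` are made up for this
illustration and are not part of the grammar above. It relies on the re
not-predicate `!p`, which succeeds or fails without advancing the
subject:

```lua
local re = require "re"

-- Hypothetical reduced grammar: the first character of a word is copied
-- as-is, so in-word rules such as 'f' -> 'ff' never see it.
local demo = re.compile[[
  text     <- {~ (word / NotWord)* ~}
  WordChar <- [A-Za-z']
  NotWord  <- [^A-Za-z']
  word     <- WordChar inword*
  inword   <- ('e' -> 'e-a') !WordChar   -- suffix: only at the end of a word
            / 'f' -> 'ff'
            / WordChar
]]

print(demo:match("are four files"))  --> are-a four files
```

Because `!WordChar` only looks ahead, the space after 'are' is still
available to `NotWord`, and 'four' then starts a fresh `word`.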


In fact, in an earlier, more broken version, I had a grammar like:

	ProcessedWord <- CombinedAccentSyllableOrChar
CombinedInWordAccentThatHasSuffixOrChar+ CombinedAnyTimeSuffix

The 'CombinedInWordAccentThatHasSuffixOrChar+' didn't do what I wanted
because I believe the repetitions went greedily too far before I got
to the Suffix at the end. What I have now is my attempt to avoid that
problem, but I think I still have that same basic problem with
'NotWord'.
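
The greedy behaviour can be reproduced in isolation. This is only a
sketch on a toy pattern (the names `greedy` and `guarded` and the test
word are assumptions for illustration): a PEG repetition takes as much
as it can and is not re-tried with fewer iterations when what follows
fails, so the suffix has to be excluded from the loop with a predicate:

```lua
local re = require "re"

-- [a-z]+ consumes the whole word, including the trailing 'en', and the
-- repetition is never shortened afterwards, so the match fails.
local greedy = re.compile[[ [a-z]+ 'en' !. ]]
print(greedy:match("garden"))   --> nil

-- Guard each iteration: stop as soon as a final 'en' is ahead.
local guarded = re.compile[[ (!('en' !.) [a-z])+ 'en' !. ]]
print(guarded:match("garden"))  --> 7
```

The second pattern returns 7, the position just past the end of the
six-letter subject, which means the whole word matched.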


Is there a (better) way to solve this?


2) I am still confused by captures. At the end of a paragraph ending
in a period, question mark, or exclamation point, I am supposed to
inject 'Bork Bork Bork!'.

This seems to (mostly) work, though I am a little surprised it does:

	EndOfParagraphPunctuation <- [.!?]%nl -> '
Bork Bork Bork!
'
I thought I would need to capture the period/exclamation
point/question mark and substitute it back in, but I don't. However,
if I try to use a capture, I get a Lua error.


	EndOfParagraphPunctuation <- [.!?]%nl -> '%1
Bork Bork Bork!
'

bork.lua:268: invalid capture index (1)
stack traceback:
	[C]: in function 'match'
	bork.lua:268: in main chunk
	[C]: ?
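
A small sketch that may explain both observations, assuming (as the re
grammar appears to define it) that `->` is a suffix operator applying
only to the primary immediately before it, here `%nl`, and that `%1` in
a string capture refers to the first capture inside the captured
pattern. The variable names and the single-line replacement text are
made up for this illustration:

```lua
local re = require "re"

-- Same shape as the rule in the post: '->' attaches only to %nl, so only
-- the newline is replaced and the punctuation is copied through as-is.
local implicit = re.compile[[
  {~ ( [.!?] %nl -> ' Bork Bork Bork! ' / . )* ~}
]]

-- Grouping the sequence and capturing the punctuation with {} gives the
-- string capture a first capture for %1 to refer to.
local grouped = re.compile[[
  {~ ( ({[.!?]} %nl) -> '%1 Bork Bork Bork! ' / . )* ~}
]]

print(implicit:match("Hello!\nBye"))  --> Hello! Bork Bork Bork! Bye
print(grouped:match("Hello!\nBye"))   --> Hello! Bork Bork Bork! Bye
```

In the failing rule, `%nl` contains no capture, so there is nothing for
`%1` to refer to.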



3) Maybe this is asking too much of LPeg, but the Lex grammar has a
rule which keeps track of state. For the letter 'i' inside a word (not
at the beginning), if it has not yet appeared in the word, change it to
'ee'. But if it has already appeared in the word, leave it alone as
'i'.

So 'Encheferizing' should become 'Incheffereezing' and not 'Incheffereezeeng'.

I was curious if there were any ideas if/how this could be done with
LPeg. (I'm thinking maybe not... as a very hand-wavy argument, if PEGs
are related to Context Free Grammars, this looks like state/context,
so this is not context free, so it can't be done?)
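
One possible way to get the "only the first in-word 'i'" behaviour,
sketched on a reduced grammar that contains no other substitutions. The
idea is to encode the "have I seen an 'i' yet" flag in the position
within the rule rather than in a variable; the rule name `word` and the
variable `demo` are assumptions for this illustration, and folding this
into the full grammar above is not attempted here:

```lua
local re = require "re"

-- A word is: its first character (never transformed), then characters
-- that are not 'i', then optionally the first in-word 'i' (rewritten to
-- 'ee') followed by the rest of the word copied unchanged.
local demo = re.compile[[
  text     <- {~ (word / NotWord)* ~}
  WordChar <- [A-Za-z']
  NotWord  <- [^A-Za-z']
  word     <- WordChar (!'i' WordChar)* (('i' -> 'ee') WordChar*)?
]]

print(demo:match("Encheferizing this in"))  --> Enchefereezing thees in
```

Only two states are needed (before and after the first 'i'), so they
can be written out as two consecutive parts of the `word` rule.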



Thanks,
Eric
-- 
Beginning iPhone Games Development
http://playcontrol.net/iphonegamebook/