To build a parser you must provide a tokenizer and a grammar with associated semantic actions. After a parser is generated it can be used to parse strings of the language.
We assume familiarity with some basic notions of formal languages, such as regular expressions, context-free grammars, LR grammars and LR parsing [ASU86,HMU00,GJ90]. Some knowledge of the Python language [Lut96] is also required.
The class Lexer implements a lexical analyser based on Python regular expressions. An instance must be initialised with a tokenizer, that is, a list of tuples:
(re,funct,op?)
where:
Restriction: if a keyword is a substring of another, its rule must appear after the rule for the larger keyword, for obvious reasons. The following list presents a tokenizer for regular expressions:
l = [("\s+", ""),
     ("@epsilon", lambda x: (x, x)),
     ("@empty_set", lambda x: (x, x)),
     ("[A-Za-z0-9]", lambda x: ("id", x)),
     ("[+]", lambda x: ("+", x), ("+", 100, 'left')),
     ("[*]", lambda x: (x, x), ("*", 300, 'left')),
     ("\(|\)", lambda x: (x, x))]
A lexical analyser is created by instantiating a Lexer class:
>>> from yappy.parser import *
>>> a = Lexer(l)
>>> a.scan("(a + b)* a a b*")
[('(', '('), ('id', 'a'), ('+', '+'), ('id', 'b'), (')', ')'), ('*', '*'),
 ('id', 'a'), ('id', 'a'), ('id', 'b'), ('*', '*')]
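The Lexer class itself belongs to Yappy, but the rule-by-rule scanning it performs can be sketched with the standard re module alone. The function below is an illustration of the idea, not Yappy's actual implementation: rules are tried in list order at the current position, and an empty action string discards the matched lexeme.

```python
import re

def scan(tokenizer, s):
    """Sketch of a regex-driven scanner: try each (pattern, action, ...)
    rule at the current position; a rule whose action is "" discards
    the match (used here for whitespace)."""
    tokens, i = [], 0
    while i < len(s):
        for rule in tokenizer:
            m = re.compile(rule[0]).match(s, i)
            if m:
                if rule[1] != "":          # "" means: skip this lexeme
                    tokens.append(rule[1](m.group()))
                i = m.end()
                break
        else:
            raise ValueError("no rule matches at position %d" % i)
    return tokens

l = [(r"\s+", ""),
     ("[A-Za-z0-9]", lambda x: ("id", x)),
     ("[+]", lambda x: ("+", x)),
     ("[*]", lambda x: (x, x)),
     (r"\(|\)", lambda x: (x, x))]

print(scan(l, "(a + b)*"))
# [('(', '('), ('id', 'a'), ('+', '+'), ('id', 'b'), (')', ')'), ('*', '*')]
```

The optional third element of a rule (the operator precedence triple) is simply ignored by this sketch, since it only matters to the parser, not to the scanner.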
See Yappy documentation for more details.
The class LRparser implements an LR parser generator. An instance must be initialised with:
(LeftHandSide, RightHandSide, SemFunc, Prec)
with:
Restriction: The first production is for the start symbol.
Here is an unambiguous grammar for regular expressions:
grammar = [("r", ["r", "|", "c"], self.OrSemRule),
           ("r", ["c"], self.DefaultSemRule),
           ("c", ["c", "s"], self.ConcatSemRule),
           ("c", ["s"], self.DefaultSemRule),
           ("s", ["s", "*"], self.StarSemRule),
           ("s", ["f"], self.DefaultSemRule),
           ("f", ["b"], self.DefaultSemRule),
           ("f", ["(", "r", ")"], self.ParSemRule),
           ("b", ["id"], self.BaseSemRule),
           ("b", ["@empty_set"], self.BaseSemRule),
           ("b", ["@epsilon"], self.BaseSemRule)]
The previous description can easily be rephrased in a more user-friendly manner. We provide two ways. A rule is written as

Word -> Word1 ... Word2

or Word -> [] for an empty rule. The rule symbol and the separator of the RHS words can be specified; the default values are -> and whitespace (i.e. the \s Python regular expression). If no semantic rule is given, DefaultSemRule is assumed.
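The rewriting that this user-friendly form implies can be imitated with a few lines of string processing. The sketch below (not Yappy's code) shows how a rule string maps to the (LeftHandSide, RightHandSide, SemFunc) tuples above, using -> and whitespace as the default separators; the string "DefaultSemRule" stands in for the default semantic function:

```python
def parse_rule(rule, sem=None, rulesym="->", rhssep=None):
    """Split 'LHS -> W1 W2 ...' into the tuple form used above.
    rhssep=None makes str.split() break on any run of whitespace
    (the \\s default)."""
    lhs, rhs = rule.split(rulesym)
    words = rhs.split(rhssep)
    if words == ["[]"]:        # empty rule: Word -> []
        words = []
    return (lhs.strip(), words, sem or "DefaultSemRule")

print(parse_rule("r -> r | c", "OrSemRule"))
# ('r', ['r', '|', 'c'], 'OrSemRule')
print(parse_rule("Word -> []"))
# ('Word', [], 'DefaultSemRule')
```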
The previous grammar can be rewritten as:
grammar = grules([("r -> r | c", self.OrSemRule),
                  ("r -> c", self.DefaultSemRule),
                  ("c -> c s", self.ConcatSemRule),
                  ("c -> s", self.DefaultSemRule),
                  ("s -> s *", self.StarSemRule),
                  ("s -> f", self.DefaultSemRule),
                  ("f -> b", self.DefaultSemRule),
                  ("f -> ( r )", self.ParSemRule),
                  ("b -> id", self.BaseSemRule),
                  ("b -> @empty_set", self.BaseSemRule),
                  ("b -> @epsilon", self.BaseSemRule)])
We can also write an ambiguous grammar, provided we supply precedence information that allows shift-reduce conflicts to be resolved:
grammar = grules([("reg -> reg | reg", self.OrSemRule),
                  ("reg -> reg reg", self.ConcatSemRule, (200, 'left')),
                  ("reg -> reg *", self.StarSemRule),
                  ("reg -> ( reg )", self.ParSemRule),
                  ("reg -> id", self.BaseSemRule),
                  ("reg -> @empty_set", self.BaseSemRule),
                  ("reg -> @epsilon", self.BaseSemRule)])
grammar = """
reg -> reg + reg {{ self.OrSemRule }} |
       reg reg {{ self.ConcatSemRule }} // 200 left |
       reg * {{ self.StarSemRule }} |
       ( reg ) {{ self.ParSemRule }} |
       id {{ self.BaseSemRule }};
"""
where:
The separators can be redefined in the tokenizer of the Yappy_grammar class. An empty rule can be written reg -> ;. If no semantic rule is given, DefaultSemRule is assumed.
See Yappy documentation for more details.
As usual, the semantic value of an expression will be a function of the semantic values of its parts. The semantics of a token is defined by the tokenizer (Section 2.1). The semantic actions for grammar rules are specified by Python functions that can be evaluated in a given context. Our approach is essentially borrowed from the kjParsing package [rs00]: a semantic function takes as arguments a list with the semantic values of the RightHandSide of a rule and a context; it returns a value that represents the meaning of the LeftHandSide and performs any side effects on the context.
For instance, by default the semantic value of a rule can be the semantic value of the first element of the RightHandSide:
def DefaultSemRule(list, context=None):
    """Default semantic rule"""
    return list[0]
Assuming the definition of some objects for regular expressions, trivial semantic rules for printing regular expressions can be:
def OrSemRule(self, list, context):
    return "%s+%s" % (list[0], list[2])

def ConcatSemRule(self, list, context):
    return list[0] + list[2]

def ParSemRule(self, list, context):
    return "(%s)" % list[1]

def BaseSemRule(self, list, context):
    return list[0]

def StarSemRule(self, list, context):
    return list[0] + '*'
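These rules can be exercised without running the parser at all: on each reduction the parser simply calls the rule with the list of semantic values of the handle. A hand-driven reduction sequence for a + b (a hypothetical driver, not Yappy's machinery; self and context are dropped for brevity):

```python
# Standalone versions of the rules above, for illustration.
def OrSemRule(list, context=None):
    return "%s+%s" % (list[0], list[2])

def BaseSemRule(list, context=None):
    return list[0]

def StarSemRule(list, context=None):
    return list[0] + '*'

# Reduce b -> id twice, then r -> r | c on the collected values.
a = BaseSemRule(["a"])           # semantic value of the left operand
b = BaseSemRule(["b"])           # semantic value of the right operand
print(OrSemRule([a, "|", b]))    # a+b
print(StarSemRule([a]))          # a*
```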
Semantic actions can also be more Bison-like, if they are given as a string in which $n represents the semantic value of the n-th symbol of the RightHandSide. For instance: E -> E + T {{ "Sum($0,$2)" }};.
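The $n form can be pictured as text substitution followed by evaluation in a context: each $n is replaced by the n-th semantic value before the string is evaluated. The following is an illustration of the idea, not Yappy's implementation; the Sum function is a made-up example supplied through the context:

```python
import re

def eval_action(action, values, context=None):
    """Replace each $n in the action string with repr(values[n])
    and evaluate the resulting expression in the given context."""
    src = re.sub(r"\$(\d+)",
                 lambda m: repr(values[int(m.group(1))]),
                 action)
    return eval(src, context or {})

# Rule E -> E + T with action "Sum($0,$2)"; Sum comes from the context.
env = {"Sum": lambda x, y: x + y}
print(eval_action("Sum($0,$2)", [1, "+", 2], env))  # 3
```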
No error recovery is currently implemented. Errors are reported with rudimentary information, see the exception error classes in Yappy documentation.
>>> from yappy.parser import *
>>> parse = LRparser(grammar, table, no_table, tabletype, operators)
Some information about LR table generated can be retrieved, by printing some attributes:
>>> print parse.cfgr
0 | ('reg', ['reg', '+', 'reg'], RegExp2.OrSemRule, ('100', 'left'))
1 | ('reg', ['reg', 'reg'], RegExp2.ConcatSemRule, ('200', 'left'))
2 | ('reg', ['reg', '*'], RegExp2.StarSemRule)
3 | ('reg', ['(', 'reg', ')'], RegExp2.ParSemRule)
4 | ('reg', ['id'], RegExp2.BaseSemRule)
5 | ('@S', ['reg'], DefaultSemRule)
>>> print parse
Action table:
State   +    *    (    )    id   $    #
0                 s1        s2
1                 s1        s2
2       r4   r4   r4   r4   r4   r4
3       s5   s6   s1        s2        r[]
4       s5   s6   s1   s8   s2
5                 s1        s2
6       r2   r2   r2   r2   r2   r2
7       r1   s6   r1   r1   s2   r1
8       r3   r3   r3   r3   r3   r3
9       r0   s6   r0   r0   s2   r0
Goto table:
State   reg  @S
0       3
1       4
3       7
4       7
5       9
7       7
9       7
If _DEBUG is set, several comments are printed during the table construction, in particular the collection of LR items.
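The collection of LR items that _DEBUG prints is built by the standard closure computation over items; the following is a generic sketch of LR(0) closure, independent of Yappy's internals, using a small fragment of the regular-expression grammar:

```python
def closure(items, grammar):
    """LR(0) closure: an item is (rule_index, dot_position); for every
    item with the dot before a nonterminal, add that nonterminal's
    rules with the dot at position 0, until nothing changes."""
    nonterminals = {lhs for lhs, _ in grammar}
    result = set(items)
    changed = True
    while changed:
        changed = False
        for (r, dot) in list(result):
            rhs = grammar[r][1]
            if dot < len(rhs) and rhs[dot] in nonterminals:
                for i, (lhs, _) in enumerate(grammar):
                    if lhs == rhs[dot] and (i, 0) not in result:
                        result.add((i, 0))
                        changed = True
    return result

# A fragment of the grammar above, as (LHS, RHS) pairs.
g = [("@S", ["reg"]),
     ("reg", ["reg", "+", "reg"]),
     ("reg", ["(", "reg", ")"]),
     ("reg", ["id"])]

print(sorted(closure({(0, 0)}, g)))
# [(0, 0), (1, 0), (2, 0), (3, 0)]
```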
If any of these conflicts occurs, a list of the resolved conflicts is printed and more information can be found in the Log attribute. The Log has the following attributes:
Currently no prettyprinting is available for these values.
>>> parse.parsing(a.scan("(a+b)*aab*"))
The attribute output records the grammar rules that were applied for parsing the string:
>>> parse.output
[4, 4, 0, 3, 2, 4, 1, 4, 1, 4, 1, 2]
If _DEBUG is set, it is possible to see each application of a table action and the values in the stack.
The Yappy class is a wrapper for defining a parser and for parsing. Basically it creates the lexical analyser and the parser. This class is a subclass of LRparser and can also define the directories where the parsing tables are stored:
It defines the following I/O functions:
Here is a complete parser for regular expressions:
from yappy.parser import *

class ParseReg(Yappy):
    def __init__(self, no_table=0, table='tablereg'):
        grammar = """
        reg -> reg + reg {{ self.OrSemRule }} |
               reg reg {{ self.ConcatSemRule }} // 200 left |
               reg * {{ self.StarSemRule }} |
               ( reg ) {{ self.ParSemRule }} |
               id {{ self.BaseSemRule }} |
               @empty_set {{ self.BaseSemRule }} |
               @epsilon {{ self.BaseSemRule }} | ;
        """
        tokenize = [
            ("\s+", ""),
            ("@epsilon", lambda x: ("id", x)),
            ("@empty_set", lambda x: ("id", x)),
            ("[A-Za-z0-9]", lambda x: ("id", x)),
            ("[+]", lambda x: ("+", x), ("+", 100, 'left')),
            ("[*]", lambda x: (x, x), ("*", 300, 'left')),
            ("\(|\)", lambda x: (x, x))]
        Yappy.__init__(self, tokenize, grammar, table, no_table)

    ## Semantic rules build a parse tree...
    def OrSemRule(self, list, context):
        return "(%s+%s)" % (list[0], list[2])

    def ConcatSemRule(self, list, context):
        return "(%s%s)" % (list[0], list[2])

    def ParSemRule(self, list, context):
        return "(%s)" % list[1]

    def BaseSemRule(self, list, context):
        return list[0]

    def StarSemRule(self, list, context):
        return "(%s*)" % list[0]
An instance is used as:
>>> d = ParseReg()
>>> d.input("(a+b)*aab*")
(a+b)*aab*
See Yappy documentation for more details.
Nelma Moreira, Rogério Reis 2006-07-19