Syntax Analysis
In the previous article we saw how regular expressions can turn a piece of text (code) into a token list. But how do we decide whether that token list follows valid grammar rules? That is the problem syntax analysis solves.
Syntactic Analysis
In 1955 Noam Chomsky proposed the idea of Syntactic Structure: utterances have rules (Formal Grammars). A formal grammar divides tokens into non-terminals and terminals. A terminal is a token that cannot be replaced any further, while a non-terminal can be further expanded by some grammar rule, as in the following example
Sentence -> Subject Verb
Subject -> 'Teachers'
Subject -> 'Students'
Verb -> 'Write'
Verb -> 'Think'
The quoted words are terminals. We can rewrite any sentence using the rules above until it consists entirely of terminals. Here is a simple example
Sentence -> Subject Verb -> 'Students' Verb -> 'Students' 'Write'
Sentence -> Subject Verb -> Subject 'Think' -> 'Teachers' 'Think'
We call the result of such a rewrite a Derivation. If we add one more rule
Subject -> Subject 'and' Subject
then the example above yields a new Derivation
Sentence -> Subject Verb -> Subject 'and' Subject Verb ->
'Students' 'and' Subject Verb -> 'Students' 'and' Subject 'Think' ->
'Students' 'and' 'Teachers' 'Think'
As you can see, the rewriting process is recursive, and rules like this are called Recursive Grammar Rules. With a finite set of grammar rules plus recursion we can produce infinitely many Derivations.
Let's look at a more practical example. Suppose we have the following Grammar Rules
stmt -> 'identifier' = exp
exp -> exp '+' exp
exp -> exp '-' exp
exp -> 'number'
We can use these rules to decide whether the following statements are valid
x = 1
y = 2500 - 1
z = z+1
Let's look at the derivations for y and z
y = exp ->
y = exp '-' exp ->
y = 2500 '-' 1
So y is a valid statement, whereas z+1 does not match any grammar rule (exp can only expand to 'number', not an identifier), so z = z+1 is not valid.
Grammar Rules are also more powerful than regular expressions. Take parsing a number: the regular expression is [0-9]+, and expressed as grammar rules it becomes
number -> 'digit' more_digits
more_digits -> 'digit' more_digits
more_digits -> ϵ
Using these rules to parse the number 42, the derivation is
number -> 'digit' more_digits -> 'digit' 'digit' more_digits
-> 'digit' 'digit' ϵ -> 'digit' 2 -> 42
Context-Free Grammar
In fact, any regular expression can be replaced by a "grammar". Let's look at a slightly more complex example: the regular expression r'p+i?' can be replaced by the following grammar
regex -> pplus iopt
pplus -> 'p' pplus
pplus -> 'p'
iopt -> 'i'
iopt -> ϵ
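To see the equivalence concretely, here is a small sketch of my own (the helper names match_regex and match_grammar are mine, not from the original text) that checks the same strings with Python's re module and with a recognizer hand-written from the grammar above:

import re

def match_regex(s):
    return re.fullmatch(r"p+i?", s) is not None

def match_grammar(s):
    # pplus -> 'p' pplus | 'p' : consume one or more 'p'
    i = 0
    while i < len(s) and s[i] == "p":
        i += 1
    if i == 0:
        return False
    # iopt -> 'i' | ϵ : optionally consume a single 'i'
    if i < len(s) and s[i] == "i":
        i += 1
    # regex -> pplus iopt : the whole string must be consumed
    return i == len(s)

for s in ["p", "ppp", "ppi", "i", "pip"]:
    assert match_regex(s) == match_grammar(s)

Both recognizers accept exactly the same strings, which is what we mean by the grammar "replacing" the regular expression.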
Languages that can be parsed with regular expressions are called Regular Languages, while languages that can be parsed with grammar rules like these are called Context-Free Languages. "Context-free" means that a rewrite rule applies regardless of the surrounding context.
Here are the grammars corresponding to a few common regular expressions
r'ab'  = g -> 'ab'
r'a*'  = g -> ϵ
         g -> 'a' g
r'a|b' = g -> 'a'
         g -> 'b'
In practice there are cases where regular expressions simply cannot serve as a parser, such as parsing the HTML text mentioned earlier. That is because a regular expression cannot detect mismatched tags, for example
<p>abc<b>def</b>gh</p>
<p>abc<b>def</p>gh</b>
An HTML parser obviously has to detect whether tags are properly closed, so the second case must be reported as an error, and regular expressions cannot do that. Going back to the parenthesis-matching example from the beginning of the article, we can define the following grammar
exp -> (exp)
exp -> ϵ
Thanks to recursion, these two rules are enough to check whether parentheses are balanced.
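To make this concrete, here is a minimal sketch of my own (the helpers match_exp and is_balanced are hypothetical names, and the grammar only covers fully nested parentheses, exactly as the two rules above do) of a recursive checker that follows those rules directly:

def match_exp(s, pos=0):
    """Try to match exp -> '(' exp ')' | ϵ starting at pos.
    Returns the position right after the matched exp (ϵ always succeeds)."""
    if pos < len(s) and s[pos] == '(':
        inner = match_exp(s, pos + 1)          # exp -> '(' exp ')'
        if inner < len(s) and s[inner] == ')':
            return inner + 1
    return pos                                  # exp -> ϵ

def is_balanced(s):
    # the whole string must be consumed by a single exp
    return match_exp(s, 0) == len(s)

print(is_balanced("((()))"))   # True
print(is_balanced("(()"))      # False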
Parse Tree
Given a set of Grammar Rules, a derivation of an input can be drawn as a Parse Tree (or AST). Take the earlier Grammar Rules again
exp -> exp + exp
exp -> exp - exp
exp -> num
Suppose the expression is 1+2-3, with token list [1, +, 2, -, 3]. One Parse Tree that matches this token list looks like this
            exp
          /  |  \
       exp   -   exp
      / | \       |
   exp  +  exp   num
    |       |     |
   num     num    3
    |       |
    1       2
We fill the tokens into the leaf nodes from left to right to obtain the complete parse tree.
But for the same token sequence, the grammar rules above can also produce another legal Parse Tree
      exp
    /  |  \
 exp   +   exp
  |       /  |  \
 num   exp   -   exp
  |     |         |
  1    num       num
        |         |
        2         3
This tree evaluates the expression as 1+(2-3), which is clearly not what we want. It shows that our grammar rules are ambiguous: a single expression can produce more than one Parse Tree. In effect, the parser knows nothing about the left-to-right evaluation order of arithmetic. One workaround is to add a parenthesis rule to the grammar
exp -> (exp)
HTML Grammars
As the previous section showed, regular expressions cannot help us parse HTML text, so let's define a grammar for HTML
html -> element html
html -> ϵ
element -> 'word'
element -> tag-open html tag-close
tag-open -> '<word>'
tag-close -> '</word>'
Given the HTML text <p>welcome to <b>xta0</b> site</p>, the generated Parse Tree is (in the tree, ele, to, and tc abbreviate element, tag-open, and tag-close)
                                        html
                                 /                \
                         ele                            html
               /        |              \                  |
     to              html                          tc     ϵ
      |           /         \                      |
    '<p>'   ele                 html             '</p>'
             |              /           \
         'welcome'   ele                     html
                      |                  /         \
                     'to'            ele               html
                                 /    |    \           /   \
                            to      html       tc    ele   html
                             |     /    \      |      |     |
                           '<b>'  ele  html  '</b>' 'site'  ϵ
                                   |    |
                                'xta0'  ϵ
This tree structure looks complicated, but it is a very effective data structure for a computer to work with. We will see later how to build a Parse Tree in code.
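As a preview, one simple way to hold such a tree in memory (a sketch of my own, not the representation used later in this series) is a nested Python tuple of (label, children), with terminals as plain strings:

# (label, children); terminals are plain strings
parse_tree = (
    "html", [
        ("element", [
            ("tag-open", ["<p>"]),
            ("html", [
                ("element", ["welcome"]),
                ("html", []),          # remaining branches elided
            ]),
            ("tag-close", ["</p>"]),
        ]),
        ("html", []),                  # html -> ϵ
    ]
)

def leaves(node):
    """Collect the terminal strings of a tree in left-to-right order."""
    if isinstance(node, str):
        return [node]
    _, children = node
    return [leaf for child in children for leaf in leaves(child)]

print(leaves(parse_tree))   # ['<p>', 'welcome', '</p>']

Reading the leaves from left to right recovers the original token order, which is exactly the property a parse tree must preserve.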
Javascript Grammar
Next, let's define a Grammar for JavaScript. We will focus on a simple function first
function up_to_ten(x){
    if(x < 10){
        return x;
    } else {
        return 10;
    }
}
To parse syntax like this, let's first define the following Grammar Rules
exp -> 'identifier'
exp -> 'number'
exp -> 'string'
exp -> 'TRUE'
exp -> 'FALSE'
exp -> exp '+' exp
exp -> exp '-' exp
exp -> exp '*' exp
exp -> exp '/' exp
exp -> exp '==' exp
exp -> exp '<' exp
exp -> exp '&&' exp
stmt -> 'identifier' = exp
stmt -> 'return' exp
stmt -> 'if' exp compound_stmt
stmt -> 'if' exp compound_stmt 'else' compound_stmt
compound_stmt -> '{' stmts '}'
stmts -> stmt ';' stmts
stmts -> ϵ
With these rules we can parse expressions, but high-level languages like JavaScript or Python also have functions in addition to expressions, so we need Grammar Rules for parsing function definitions and function calls as well
js -> element js
js -> ϵ
# function definition
element -> 'function identifier(' opt_params ')' compound_stmt
element -> stmt;
opt_params -> params
opt_params -> ϵ
params -> 'identifier' ',' params
params -> 'identifier'
# function call
exp -> ...
exp -> 'identifier(' opt_args ')'
opt_args -> args
opt_args -> ϵ
args -> exp ',' args
args -> exp
Grammar in Python
How do we represent these Grammar Rules in code? Taking Python as an example, consider the same grammar again
exp -> exp + exp
exp -> exp - exp
exp -> (exp)
exp -> num
A grammar rule can be encoded as a Python tuple, for example
A -> B C
# python tuple
("A", ["B", "C"])
The grammar above then corresponds to the following Python code
grammar = [
("exp", ["exp", "+", "exp"]),
("exp", ["exp", "-", "exp"]),
("exp", ["(", "exp", ")"]),
("exp", ["num"]),
]
Now suppose we have the token list ['print', 'exp', ';'], and we want to rewrite 'exp' using one of the Grammar Rules. If we use the first one, exp -> exp + exp, the result of the substitution is
['print', 'exp', '+', 'exp', ';']
We can keep rewriting 'exp' recursively until every element in the list is a terminal, i.e. a leaf node.
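As a rough illustration (my own sketch, with the hypothetical helper expand_once; this is not code from the original course), a single rewrite step over such a token list might look like this:

grammar = [
    ("exp", ["exp", "+", "exp"]),
    ("exp", ["exp", "-", "exp"]),
    ("exp", ["(", "exp", ")"]),
    ("exp", ["num"]),
]

def expand_once(tokens, rule):
    """Replace the leftmost occurrence of the rule's left-hand side
    with its right-hand side."""
    lhs, rhs = rule
    if lhs not in tokens:
        return tokens
    i = tokens.index(lhs)
    return tokens[:i] + rhs + tokens[i + 1:]

print(expand_once(['print', 'exp', ';'], grammar[0]))
# ['print', 'exp', '+', 'exp', ';']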
Earley Parser (Shift-Reduce Parser)
Next, let's look at a chart-based parsing technique, the Earley Parser.
We need to define a chart dictionary whose entries are lists of parsing states. A list must not contain duplicate states, so it effectively behaves as an ordered set.
# chart[index] returns a list that contains state exactly
# once. The chart is a Python dictionary and index is a number. addtochart
# should return True if something was actually added, False otherwise. You may
# assume that chart[index] is a list.
def addtochart(chart, index, state):
    if state not in chart[index]:
        chart[index] = [state] + chart[index]
        return True
    else:
        return False
We use a Python tuple to represent a single parsing state
# x -> ab . cd from j
state = ("x", ["a", "b"], ["c", "d"], j)
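For instance, here is a minimal usage sketch of my own (the concrete state and chart below are illustrative, not from the original text):

chart = {0: [], 1: []}        # one entry per position in the token list

# the state exp -> . exp '+' exp, started at position 0
start_state = ("exp", [], ["exp", "+", "exp"], 0)

print(addtochart(chart, 0, start_state))   # True: the state was new
print(addtochart(chart, 0, start_state))   # False: already in chart[0]
print(chart[0])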
Closure
We traverse the token list from left to right. Whenever the symbol we are about to parse is a non-terminal, we rewrite it using the grammar rules. This step is called computing the closure, or predicting. Let's look at an example.
Suppose the current grammar contains the following rewrite rules
E -> E - E
E -> E + E
E -> (E)
E -> 'num'
T -> 'I like t'
T -> ϵ
Suppose the input tokens are ['a','b','c','d']. Once we have parsed up to ab, the next symbol is c; we need to check whether c can be expanded by some rule, and if so, rewrite it according to that rewrite rule.
# We are currently looking at chart[i] and we see x => ab . cd from j
# Write the Python procedure, closure, that takes five parameters:
# grammar: the grammar using the previously described structure
# i: a number representing the chart state that we are currently looking at
# x: a single nonterminal
# ab and cd: lists of many things
# The closure function should return all the new parsing states that we want to
# add to chart position i
grammar = [
("exp", ["exp", "+", "exp"]),
("exp", ["exp", "-", "exp"]),
("exp", ["(", "exp", ")"]),
("exp", ["num"]),
("t",["I","like","t"]),
("t",[""])
]
def closure(grammar, i, x, ab, cd):
    # predict: for every rule whose left-hand side is the non-terminal
    # right after the dot, add a fresh state starting at position i
    return [ (rule[0], [], rule[1], i)
             for rule in grammar
             if cd != [] and rule[0] == cd[0] ]

# inside the main loop, for the state x -> ab . cd from j in chart[i]:
next_states = closure(grammar, i, x, ab, cd)
for next_state in next_states:
    any_changes = addtochart(chart, i, next_state) or any_changes
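For example (my own illustration, reusing the grammar and addtochart defined above), if chart[0] contains the state exp -> . exp '+' exp from 0, the symbol after the dot is the non-terminal exp, so closure predicts one new state per exp rule:

chart = {0: [("exp", [], ["exp", "+", "exp"], 0)]}

any_changes = False
for state in closure(grammar, 0, "exp", [], ["exp", "+", "exp"]):
    any_changes = addtochart(chart, 0, state) or any_changes

# the four exp rules are predicted at position 0 (duplicates are skipped):
#   ("exp", [], ["exp", "+", "exp"], 0)
#   ("exp", [], ["exp", "-", "exp"], 0)
#   ("exp", [], ["(", "exp", ")"], 0)
#   ("exp", [], ["num"], 0)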
Shift
Shift means that if the symbol right after the dot is a terminal and it matches the current input token, we move the dot past it and keep parsing to the right.
# Writing Shift
# We are currently looking at chart[i] and we see x => ab . cd from j. The input is tokens.
# The procedure, shift, should either return None, at which point there is
# nothing to do or will return a single new parsing state that presumably
# involved shifting over the c if c matches the ith token.
def shift(tokens, i, x, ab, cd, j):
    # the symbol after the dot is a terminal matching the current token:
    # move the dot over it
    if cd != [] and tokens[i] == cd[0]:
        return (x, ab + [cd[0]], cd[1:], j)
    else:
        return None

# inside the main loop, for the state x -> ab . cd from j in chart[i]:
next_state = shift(tokens, i, x, ab, cd, j)
if next_state is not None:
    any_changes = addtochart(chart, i + 1, next_state) or any_changes
Reduction
Reduction (also called completion) applies when a state x -> ab . from j is finished, i.e. cd is empty: we go back to chart[j] and advance every existing state there that was waiting for an x right after its dot.
# Writing Reductions
# We are looking at chart[i] and we see x => ab . cd from j.
# you only want to do reductions if cd == []
def reductions(chart, i, x, ab, cd, j):
    # advance every state in chart[j] that was waiting for an x after its dot
    return [ (jstate[0], jstate[1] + [x], jstate[2][1:], jstate[3])
             for jstate in chart[j]
             if cd == [] and jstate[2] != [] and jstate[2][0] == x ]

next_states = reductions(chart, i, x, ab, cd, j)
for next_state in next_states:
    any_changes = addtochart(chart, i, next_state) or any_changes
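Putting the pieces together, here is a sketch of how a main Earley loop might drive addtochart, closure, shift, and reductions. The wiring follows the usual structure of this parser, but the function name parse, the end_of_input_marker, and the choice of the first grammar rule as the start rule are my own assumptions and may differ from the original course code.

def parse(tokens, grammar):
    # pad the input so the last real token can still be reduced over
    tokens = tokens + ["end_of_input_marker"]
    chart = {i: [] for i in range(len(tokens) + 1)}
    start_rule = grammar[0]
    # start state: S -> . rhs from 0
    addtochart(chart, 0, (start_rule[0], [], start_rule[1], 0))

    for i in range(len(tokens)):
        while True:
            any_changes = False
            for state in chart[i]:
                x, ab, cd, j = state

                # predict: expand a non-terminal right after the dot
                for next_state in closure(grammar, i, x, ab, cd):
                    any_changes = addtochart(chart, i, next_state) or any_changes

                # shift: move the dot over a matching terminal
                next_state = shift(tokens, i, x, ab, cd, j)
                if next_state is not None:
                    any_changes = addtochart(chart, i + 1, next_state) or any_changes

                # reduce: a finished state advances its waiting parents
                for next_state in reductions(chart, i, x, ab, cd, j):
                    any_changes = addtochart(chart, i, next_state) or any_changes

            if not any_changes:
                break

    # accepted if the completed start state spanning all real tokens is present
    accepting_state = (start_rule[0], start_rule[1], [], 0)
    return accepting_state in chart[len(tokens) - 1]

print(parse(["num", "+", "num"], grammar))   # True: 'num' '+' 'num' is a valid exp

The outer while loop keeps re-scanning chart[i] until a full pass adds nothing new; since addtochart rejects duplicates, the loop always terminates.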