Syntax and Semantics
Introduction
Who must use PL definitions?
- Language designers
- Language implementers
- Programmers (users of the PL)
Syntax is the form or structure of the expressions,
statements and program units.
Semantics is the meaning of the expressions, statements
and program units.
Languages - terminology
- A sentence is a string of characters over some
alphabet, Σ
- A language is a set of sentences
- A lexicon is a set of all grammatical categories
in a language
- A token is a particular syntactic unit from a
particular grammatical category (e.g. a specific identifier, operator,
keyword)
- the lexicon defines the legal tokens
- A lexeme is the lowest level syntactic unit of
a language (e.g. total, + , begin)
- it is the character string that matches a specific token
Example: index = 2 * count + 17; (sentence)
lexeme | token
index  | identifier
=      | equal_sign
2      | int_literal
*      | mult_op
count  | identifier
+      | plus_op
17     | int_literal
;      | semicolon
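The lexeme/token pairing above can be reproduced mechanically. The following is a minimal Python sketch, not a production lexer: the token names are taken from the example, while the regular expressions, the `TOKEN_SPEC` table and the `tokenize` function are assumptions made for illustration.

```python
import re

# Hypothetical token categories; the names follow the example above,
# the patterns are assumptions for this sketch.
TOKEN_SPEC = [
    ("int_literal", r"\d+"),
    ("identifier",  r"[A-Za-z_]\w*"),
    ("equal_sign",  r"="),
    ("mult_op",     r"\*"),
    ("plus_op",     r"\+"),
    ("semicolon",   r";"),
    ("skip",        r"\s+"),   # whitespace separates lexemes but is not a token
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def tokenize(sentence):
    """Return a list of (lexeme, token) pairs for the given sentence."""
    pairs = []
    pos = 0
    while pos < len(sentence):
        match = MASTER.match(sentence, pos)
        if match is None:
            raise ValueError(f"illegal character {sentence[pos]!r} at position {pos}")
        if match.lastgroup != "skip":
            # match.group() is the lexeme, match.lastgroup names its token
            pairs.append((match.group(), match.lastgroup))
        pos = match.end()
    return pairs

for lexeme, token in tokenize("index = 2 * count + 17;"):
    print(f"{lexeme:<6} | {token}")
```

Each lexeme is the concrete character string; the token is the category it was matched against.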
Languages - formal definition
There are two formal ways to define a language: by a recognizer
or by a generator.
- Recognizers, R
- Given a language L and an alphabet Σ
- R reads in a string over Σ
- And determines whether the string is in L or not
- Used in compilers. The syntax analysis part of a compiler
is a recognizer for the language the compiler translates.
- Generators, G
- G generates sentences of L
- Specified as grammars
- Close relationship between recognizers and generators
We will be concerned with generators.
Formal Methods of Specifying Language Generators or Grammars
- Context-free Grammar (CFG)
- Developed by Noam Chomsky in the mid-1950s
- Four types of languages
Generator (grammar)
Recognizer
- Backus-Naur Form (BNF)
- Invented by John Backus to describe Algol 58 (1959)
- Equivalent to context-free grammars
- Extended Backus-Naur Form
- Modifications to BNF to aid conciseness of expression
- Syntax Diagrams
Context-Free Grammar
Formally, a language L is defined in terms of a
quadruple L(T,N,P,S) where:
- T stands for terminal symbols
- N for nonterminals
- P for the productions
- S (one of the symbols in N) the start symbol
L(T,N,P,S) is then the set of all sequences of terminal
symbols, ν, which can be generated from S according to a specific
set of rules.
Mathematically: L = { ν : S ⇒* ν and ν ∈ T* }
where S ⇒* ν means that ν can be derived from S by applying
zero or more productions.
At about the same time that BNF was being
devised, the linguist Noam Chomsky was developing the theory of grammars. One
special type of grammar he identified is called a context-free
grammar, or CFG.
A language is said to be context-free iff it can be defined in terms of
a context-free set of productions. The productions are themselves
context-free iff each LHS consists of a single nonterminal symbol, X,
which can be replaced by the RHS regardless of the symbols which
immediately precede or follow X.
It turns out that this concept is equivalent to
languages defined by BNF.
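The context-free condition is easy to check mechanically. Below is a minimal Python sketch, assuming a grammar is stored as sets of symbols plus a list of (LHS, RHS) pairs of tuples; this representation, the function name `is_context_free` and the two toy rule sets are illustrative assumptions, not part of the formal definition.

```python
# Hypothetical representation: a grammar is (T, N, P, S), where each
# production in P is a pair (lhs, rhs) of tuples of symbols.

def is_context_free(terminals, nonterminals, productions, start):
    """True if every LHS is exactly one nonterminal and all symbols are declared."""
    if start not in nonterminals:
        return False
    symbols = terminals | nonterminals
    for lhs, rhs in productions:
        if len(lhs) != 1 or lhs[0] not in nonterminals:
            return False          # LHS carries context or is not a nonterminal
        if any(symbol not in symbols for symbol in rhs):
            return False          # RHS uses an undeclared symbol
    return True

T = {"a", "b"}
N = {"S", "X"}

# S -> a X b, X -> a X b, X -> (empty): every LHS is a single nonterminal.
cfg_rules = [(("S",), ("a", "X", "b")), (("X",), ("a", "X", "b")), (("X",), ())]

# a X -> a b rewrites X only when it follows a, so the LHS depends on context.
contextual_rules = [(("S",), ("a", "X")), (("a", "X"), ("a", "b"))]

print(is_context_free(T, N, cfg_rules, "S"))         # True
print(is_context_free(T, N, contextual_rules, "S"))  # False
```

The second rule set fails only because of the two-symbol LHS, which is exactly the restriction stated above.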
Backus-Naur Form (BNF)
BNF is an example of a metalanguage, i.e. a language
used to describe another language.
As discussed earlier, BNF defines a language,
L, in terms of a quadruple:
- a set of production rules, R (written P in the context-free grammar definition)
- a set of terminal symbols, T
- a set of non-terminal symbols, N
- a start symbol, S ∈ N
Each production in R has the following form:
- A ::= ω
- where A ∈ N and ω ∈ (N ∪ T)*
- where * means 0 or more occurrences
In BNF, nonterminals are abstractions used to represent classes of
syntactic structures; they act like syntactic variables.
- For example:
<assign> ::= <var> = <expression>
is a rule which describes the structure of an
assignment statement
- A rule has a left-hand side (LHS) and a
right-hand side (RHS), and consists of
terminal and nonterminal symbols
- Lexemes are terminals
- An abstraction (or nonterminal symbol) can have more than one RHS
<if_stmt> ::= if <logic_expr> then <stmt>
| if <logic_expr> then <stmt> else <stmt>
- Syntactic lists are described using recursion
<ident_list> ::= identifier | identifier, <ident_list>
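The recursive list rule maps directly onto a recursive function. The sketch below is a minimal Python illustration, assuming the input has already been reduced to a list of token names in which every identifier appears as the string "identifier"; the function name `is_ident_list` is hypothetical.

```python
def is_ident_list(tokens):
    """Recognize <ident_list> ::= identifier | identifier, <ident_list>."""
    if not tokens or tokens[0] != "identifier":
        return False
    if len(tokens) == 1:
        return True   # first alternative: a lone identifier
    # second alternative: identifier , <ident_list>
    return tokens[1] == "," and is_ident_list(tokens[2:])

print(is_ident_list(["identifier", ",", "identifier", ",", "identifier"]))
print(is_ident_list(["identifier", ","]))
```

The recursive call on the rest of the token list plays the same role as the `<ident_list>` on the right-hand side of the rule.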
Simple BNF Example 1 - Definition of a binary number
- T = { 0, 1 }
- N = { binaryDigit, binaryNumber }
- S = binaryNumber
- R = { binaryDigit ::= 0
binaryDigit ::= 1
binaryNumber ::= binaryDigit
binaryNumber ::= binaryDigit binaryNumber
}
- or,
<binaryNumber> ::= <binaryDigit> | <binaryDigit> <binaryNumber>
<binaryDigit> ::= 0 | 1
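As a rough illustration of how this grammar acts as a recognizer, here is a minimal Python sketch with one function per nonterminal. The function names are assumptions chosen to mirror the rule names; each function body follows the alternatives of its rule.

```python
def is_binary_digit(s):
    # <binaryDigit> ::= 0 | 1
    return s in ("0", "1")

def is_binary_number(s):
    # <binaryNumber> ::= <binaryDigit> | <binaryDigit> <binaryNumber>
    if len(s) == 0:
        return False                      # the grammar has no empty sentence
    if len(s) == 1:
        return is_binary_digit(s)         # first alternative
    # second alternative: one digit followed by a shorter binary number
    return is_binary_digit(s[0]) and is_binary_number(s[1:])

print(is_binary_number("010"))
print(is_binary_number("012"))
```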
Simple BNF Example 2 - Grammar for a Small Language
- <program> ::= begin <stmt_list> end
- <stmt_list> ::= <stmt>
| <stmt>; <stmt_list>
- <stmt> ::= <var> = <expression>
- <var> ::= A | B | C
- <expression> ::= <var> + <var>
| <var> - <var>
| <var>
Uses of BNF
- Generation
- In theory, BNF allows generation of all
legal expressions
- Problem: Most languages are infinite
- Recognition
- Allows expressions to be recognized
- Clearly, this is the most common use
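Because the language is usually infinite, a generator can only ever list a bounded part of it. The sketch below, in Python, enumerates every sentence of the binary number grammar from Example 1 up to a chosen length by expanding the leftmost nonterminal breadth-first. The dictionary layout of `RULES` and the `generate` function are assumptions made for this illustration, and the length-based pruning relies on the fact that no rule in this grammar shortens a sentential form.

```python
from collections import deque

# Grammar from Example 1; the dict layout is an assumption for this sketch.
RULES = {
    "<binaryNumber>": [["<binaryDigit>"], ["<binaryDigit>", "<binaryNumber>"]],
    "<binaryDigit>": [["0"], ["1"]],
}

def generate(start, max_len):
    """Yield every sentence of at most max_len symbols, breadth-first."""
    queue = deque([[start]])
    while queue:
        form = queue.popleft()
        # Find the leftmost nonterminal, if any remains.
        index = next((i for i, sym in enumerate(form) if sym in RULES), None)
        if index is None:
            yield "".join(form)           # only terminals left: a sentence
            continue
        for rhs in RULES[form[index]]:
            new_form = form[:index] + rhs + form[index + 1:]
            if len(new_form) <= max_len:  # prune forms that are already too long
                queue.append(new_form)

print(sorted(generate("<binaryNumber>", 2)))
# ['0', '00', '01', '1', '10', '11']
```

Raising `max_len` by one doubles the number of new sentences, which is why generation is mainly of theoretical interest.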
Recognition Using BNF
- How can we prove that a grammar accepts a particular sentence?
- Derivation - find a sequence of rule applications leading from the start symbol to a string of terminals only
- Parse tree - graphical representation of a derivation
Derivation
- A derivation is a repeated application of rules, starting with the start symbol
and ending with a sentence (all terminal symbols)
- Example 1 - Binary Number:
Recognition of "010"
<binaryNumber> => <binaryDigit> <binaryNumber>
=> 0 <binaryNumber>
=> 0 <binaryDigit> <binaryNumber>
=> 0 1 <binaryNumber>
=> 0 1 <binaryDigit>
=> 0 1 0
- Example 2 - Program Recognition: A=B+C; B=C
<program> => begin <stmt_list> end
=> begin <stmt>; <stmt_list> end
=> begin <var> = <expression>; <stmt_list> end
=> begin A = <expression>; <stmt_list> end
=> begin A = <var> + <var>; <stmt_list> end
=> begin A = B + <var>; <stmt_list> end
=> begin A = B + C; <stmt_list> end
=> begin A = B + C; <stmt> end
=> begin A = B + C; <var> = <expression> end
=> begin A = B + C; B = <expression> end
=> begin A = B + C; B = <var> end
=> begin A = B + C; B = C end
- Every string of symbols in the derivation is a sentential form
- A sentence is a sentential form that has only terminal symbols
- A leftmost (or rightmost) derivation is one in which the leftmost (or
rightmost) nonterminal is replaced at each step
- A derivation may be neither leftmost nor rightmost
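To make the leftmost/rightmost distinction concrete, here is a minimal Python sketch that replays a derivation from a list of rule choices. It uses the `<stmt>`, `<var>` and `<expression>` rules of the small language from Example 2; the `RULES` dictionary, the `derive` function and the encoding of a derivation as a list of alternative indices are assumptions made for this illustration.

```python
# Fragment of the small-language grammar; alternatives are indexed from 0.
RULES = {
    "<stmt>": [["<var>", "=", "<expression>"]],
    "<var>": [["A"], ["B"], ["C"]],
    "<expression>": [["<var>", "+", "<var>"], ["<var>", "-", "<var>"], ["<var>"]],
}

def derive(start, choices, leftmost=True):
    """Return all sentential forms, replacing one nonterminal per step."""
    form = [start]
    forms = [form]
    for rhs_index in choices:
        positions = [i for i, sym in enumerate(form) if sym in RULES]
        pos = positions[0] if leftmost else positions[-1]
        form = form[:pos] + RULES[form[pos]][rhs_index] + form[pos + 1:]
        forms.append(form)
    return forms

# Both derivations produce the sentence A = B + C, in different orders.
left = derive("<stmt>", [0, 0, 0, 1, 2], leftmost=True)
right = derive("<stmt>", [0, 0, 2, 1, 0], leftmost=False)

for form in left:
    print("=>", " ".join(form))
print()
for form in right:
    print("=>", " ".join(form))
```

The two runs pass through different sentential forms but end in the same sentence, so the order of replacement does not change which sentences the grammar generates.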
Parse Tree
- A grammar naturally describes the hierarchical syntactic structure of
sentences of the language
- This hierarchical syntactic structure is called a parse tree
- A parse tree is a tree such that
- Each leaf node is labeled with a terminal
- Each non-leaf is labeled with a nonterminal
- Label of parent is LHS of production and
label of each child is, left to right, the RHS of production
- Root is labeled with starting nonterminal
- Example 1 - Parse Tree for "010" using BNF Grammar specified
above
- Example 2 - Simple Assignment Statement Grammar
<assign> ::= <id> = <expr>
<id> ::= A | B | C
<expr> ::= <id> + <expr>
| <id> * <expr>
| (<expr>)
| <id>
- Example 2 - Leftmost Derivation of A = B * (A + C)
<assign> => <id> = <expr>
=> A = <expr>
=> A = <id> * <expr>
=> A = B * <expr>
=> A = B * (<expr>)
=> A = B * (<id> + <expr>)
=> A = B * (A + <expr>)
=> A = B * (A + <id>)
=> A = B * (A + C)
- Example 2 - Parse Tree
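One way to obtain the parse tree for this example is to let a small recursive-descent parser build it. The Python sketch below is an illustration under stated assumptions, not the original figure: the function name `parse_assign`, the nested-tuple tree shape (nonterminal label first, then children left to right) and the one-character-per-token scanning are all choices made for this sketch.

```python
def parse_assign(text):
    """Parse text with the Example 2 grammar; return a nested-tuple parse tree."""
    # Every lexeme in this grammar is one character, so scanning is trivial.
    tokens = [ch for ch in text if not ch.isspace()]
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def expect(symbol):
        nonlocal pos
        if peek() != symbol:
            raise SyntaxError(f"expected {symbol!r} at token {pos}, found {peek()!r}")
        pos += 1
        return symbol

    def parse_id():
        # <id> ::= A | B | C
        nonlocal pos
        if peek() not in ("A", "B", "C"):
            raise SyntaxError(f"expected an identifier at token {pos}, found {peek()!r}")
        name = tokens[pos]
        pos += 1
        return ("<id>", name)

    def parse_expr():
        # <expr> ::= <id> + <expr> | <id> * <expr> | (<expr>) | <id>
        if peek() == "(":
            expect("(")
            inner = parse_expr()
            expect(")")
            return ("<expr>", "(", inner, ")")
        left = parse_id()
        if peek() in ("+", "*"):
            op = expect(peek())
            return ("<expr>", left, op, parse_expr())
        return ("<expr>", left)

    # <assign> ::= <id> = <expr>
    target = parse_id()
    expect("=")
    tree = ("<assign>", target, "=", parse_expr())
    if peek() is not None:
        raise SyntaxError(f"unexpected {peek()!r} at token {pos}")
    return tree

print(parse_assign("A = B * (A + C)"))
```

Reading the leaves of the resulting tree from left to right gives back the sentence A = B * (A + C). Note that, under this grammar, a parenthesized expression cannot be followed by an operator, so an input such as A = (B) + C is rejected.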
Extended Backus-Naur Form (EBNF)
- EBNF makes BNF more convenient to use
- Notation used in EBNF
- Basic BNF
- Curly Brackets { } - used to show grouping
- Kleene Star * - Zero or more instances of previous element
- Kleene Plus + - One or more instances of previous element
- Vertical bar | - Alternatives
- Brackets [ ] - optional
- Example
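As one possible illustration (not necessarily the example these notes originally showed), the recursive BNF rule for `<ident_list>` given earlier can be written in the notation above as `<ident_list> ::= identifier { , identifier }*`. The repetition then corresponds to a loop rather than recursion, as in this minimal Python sketch; the function name `is_ident_list_ebnf` and the token-name input format are assumptions.

```python
def is_ident_list_ebnf(tokens):
    """Recognize <ident_list> ::= identifier { , identifier }* with a loop."""
    if not tokens or tokens[0] != "identifier":
        return False
    pos = 1
    # Each pass through the loop consumes one "{ , identifier }" group.
    while pos < len(tokens):
        if tokens[pos] != ",":
            return False
        if pos + 1 >= len(tokens) or tokens[pos + 1] != "identifier":
            return False
        pos += 2
    return True

print(is_ident_list_ebnf(["identifier", ",", "identifier"]))
print(is_ident_list_ebnf(["identifier", ","]))
```

The Kleene star replaces the self-reference in the BNF rule, which is the kind of conciseness EBNF is meant to provide.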
Syntax Diagrams
- Example