Syntax and Semantics

Introduction

Who must use PL definitions?

Language designers
Language implementers
Programmers (users of the PL)

Syntax is the form or structure of the expressions, statements and program units.

Semantics is the meaning of the expressions, statements and program units.

Languages - terminology

A sentence is a string of characters over some alphabet, Σ
A language is a set of sentences
A lexicon is a set of all grammatical categories in a language
A token is a particular syntactic unit from a particular grammatical category (e.g. a specific identifier, operator, keyword)
- the lexicon defines the legal tokens
A lexeme is the lowest level syntactic unit of a language (e.g. total, + , begin)
- it is the character string that matches a specific token

Example: index = 2 * count + 17; (sentence)

lexeme	token
index	identifier
=	equal_sign
2	int_literal
*	mult_op
count	identifier
+	plus_op
17	int_literal
;	semicolon
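
The lexeme/token table above can be reproduced with a small scanner. This is only a sketch: the token names come from the table, but the regular expressions attached to them and the function name tokenize are assumptions made for illustration.

```python
import re

# Token categories from the table, each paired with a regular expression.
# The patterns themselves are assumptions made for this sketch.
TOKEN_SPEC = [
    ("int_literal", r"\d+"),
    ("identifier", r"[A-Za-z_]\w*"),
    ("equal_sign", r"="),
    ("mult_op", r"\*"),
    ("plus_op", r"\+"),
    ("semicolon", r";"),
    ("skip", r"\s+"),  # whitespace separates lexemes but is not a token
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(sentence):
    """Return the list of (lexeme, token) pairs found in the sentence."""
    pairs = []
    pos = 0
    while pos < len(sentence):
        match = MASTER.match(sentence, pos)
        if match is None:
            raise ValueError(f"illegal character {sentence[pos]!r} at position {pos}")
        if match.lastgroup != "skip":
            pairs.append((match.group(), match.lastgroup))
        pos = match.end()
    return pairs


for lexeme, token in tokenize("index = 2 * count + 17;"):
    print(f"{lexeme}\t{token}")
```

Each lexeme is the matched character string; the token is the name of the category whose pattern matched it.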

Languages - formal definition

There are two formal ways to define a language: by a recognizer or by a generator.

Recognizers, R
- Given a language L and an alphabet Σ
- R reads in a string over Σ
- And determines whether the string is in L or not
- Used in compilers. The syntax analysis part of a compiler
  is a recognizer for the language the compiler translates.
Generators, G
- G generates sentences of L
- Specified as grammars
- Close relationship between recognizers and generators

We will be concerned with generators.
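
To make the two views concrete, here is a minimal sketch for one small language. The language L = { a^n b^n : n >= 1 } and the function names recognize and generate are hypothetical choices for illustration; they do not come from these notes.

```python
from itertools import count, islice


def recognize(string):
    """Recognizer R: decide whether the string is in L = { a^n b^n : n >= 1 }."""
    n = len(string) // 2
    return n >= 1 and string == "a" * n + "b" * n


def generate():
    """Generator G: yield the sentences of L one at a time, shortest first."""
    for n in count(1):
        yield "a" * n + "b" * n


print(list(islice(generate(), 3)))
print(recognize("aabb"), recognize("abab"))
```

The recognizer answers a yes/no question about one string, while the generator never finishes on its own because L is infinite; islice takes only the first few sentences.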

Formal Methods of Specifying Language Generators or Grammars

Context-free Grammar (CFG)
- Developed by Noam Chomsky in the mid-1950s
- Four types of languages
  
  Type	Generator (grammar)	Recognizer
  Type 0	unrestricted grammar	TM (Turing Machine)
  Type 1	context-sensitive grammar	LBA (Linear-bounded automata)
  Type 2	context-free grammar	PDA (Pushdown automata)
  Type 3	regular grammar	FA (Finite Automata)
Backus-Naur Form (BNF)
- Invented by John Backus to describe Algol 58 (1959)
- Equivalent to context-free grammars
Extended Backus-Naur Form
- Modifications to BNF to aid conciseness of expression
Syntax Diagrams
- traversal based

Context-Free Grammar

Formally, a language L is defined in terms of a quadruple L(T,N,P,S) where:

T stands for terminal symbols
N for nonterminals
P for the productions
S (one of the symbols in N) the start symbol

L(T,N,P,S) is then the set of all sequences of terminal symbols, ν_i , which can be generated from S according to a specific set of rules.

Mathematically: L = { ν : S ⇒* ν and ν ∈ T* }, where S ⇒* ν means that ν can be derived from S by applying productions zero or more times.

At about the same time that BNF was being devised, the linguist Noam Chomsky was developing the theory of grammars. One special type of grammar he identified is called a context-free grammar, or CFG.

A language is said to be context-free iff it can be defined in terms of a context-free set of productions. The productions are themselves context free iff each LHS consists of a single nonterminal symbol, X, which can be replaced by the RHS, regardless of the symbols which immediately precede or follow X.

It turns out that this concept is equivalent to languages defined by BNF.
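
The quadruple and the context-free condition can be written down directly as data. The following is a sketch only: the class name Grammar, its field names, and the helper is_context_free are assumptions for illustration, and the second grammar is a made-up counterexample.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grammar:
    terminals: frozenset     # T
    nonterminals: frozenset  # N
    productions: tuple       # P, as (lhs, rhs) pairs of symbol tuples
    start: str               # S, one of the symbols in N


def is_context_free(grammar):
    """True iff every LHS consists of exactly one nonterminal symbol."""
    return all(
        len(lhs) == 1 and lhs[0] in grammar.nonterminals
        for lhs, _rhs in grammar.productions
    )


# The binary-number grammar used as an example later in these notes.
binary = Grammar(
    terminals=frozenset({"0", "1"}),
    nonterminals=frozenset({"binaryDigit", "binaryNumber"}),
    productions=(
        (("binaryDigit",), ("0",)),
        (("binaryDigit",), ("1",)),
        (("binaryNumber",), ("binaryDigit",)),
        (("binaryNumber",), ("binaryDigit", "binaryNumber")),
    ),
    start="binaryNumber",
)

# A hypothetical grammar with a two-symbol LHS, so it is not context free.
swap = Grammar(
    terminals=frozenset({"a"}),
    nonterminals=frozenset({"X", "Y"}),
    productions=((("X", "Y"), ("Y", "X")), (("X",), ("a",))),
    start="X",
)

print(is_context_free(binary), is_context_free(swap))
```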

Backus-Naur Form (BNF)

BNF is an example of a metalanguage, i.e. a language used to describe another language

As discussed earlier, BNF defines a language L in terms of a quadruple:

a set of production rules, R
a set of terminal symbols, T
a set of non-terminal symbols, N
a start symbol, S ∈ N

Each production in R has the following form:

A ::= ω
where A ∈ N and ω ∈ (N ∪ T)*
where * means 0 or more occurrences

In BNF, the non-terminals are abstractions used to represent classes of syntactic structures;
they act like syntactic variables.

For example:

<assign> ::= <var> = <expression>

is a rule which describes the structure of an assignment statement
A rule has a left-hand side (LHS) and a right-hand side (RHS), and consists of
terminal and nonterminal symbols
Lexemes are terminals
An abstraction (or nonterminal symbol) can have more than one RHS

<if_stmt> ::= if <logic_expr> then <stmt> | if <logic_expr> then <stmt> else <stmt>
Syntactic lists are described using recursion

<ident_list> ::= identifier | identifier, <ident_list>

Simple BNF Example 1 - Definition of a binary number

T = { 0, 1 }
N = { binaryDigit, binaryNumber }
S = binaryNumber
R = { binaryDigit ::= 0
          binaryDigit ::= 1
          binaryNumber ::= binaryDigit
          binaryNumber ::= binaryDigit binaryNumber
       }
or,
<binaryNumber> ::= <binaryDigit> | <binaryDigit> <binaryNumber>
<binaryDigit> ::= 0 | 1

Simple BNF Example 2 - Grammar for a Small Language

<program> ::= begin <stmt_list> end
<stmt_list> ::= <stmt>
| <stmt>; <stmt_list>
<stmt> ::= <var> = <expression>
<var> ::= A | B | C
<expression> ::= <var> + <var>
| <var> - <var>
| <var>
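
One way to check strings against this small grammar is a recursive-descent recognizer with one function per nonterminal. The sketch below is an illustration under assumptions: the function name recognize_program and the simple tokenizing rule (a run of letters, or any single non-space character) are not part of the grammar itself.

```python
import re


def recognize_program(text):
    """Return True iff text is a sentence of the small language above."""
    tokens = re.findall(r"[A-Za-z]+|\S", text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def expect(symbol):
        nonlocal pos
        if peek() != symbol:
            raise SyntaxError(f"expected {symbol!r}, found {peek()!r}")
        pos += 1

    def var():  # <var> ::= A | B | C
        nonlocal pos
        if peek() not in ("A", "B", "C"):
            raise SyntaxError(f"expected a variable, found {peek()!r}")
        pos += 1

    def expression():  # <expression> ::= <var> + <var> | <var> - <var> | <var>
        var()
        if peek() in ("+", "-"):
            expect(peek())
            var()

    def stmt():  # <stmt> ::= <var> = <expression>
        var()
        expect("=")
        expression()

    def stmt_list():  # <stmt_list> ::= <stmt> | <stmt>; <stmt_list>
        stmt()
        if peek() == ";":
            expect(";")
            stmt_list()

    try:
        expect("begin")  # <program> ::= begin <stmt_list> end
        stmt_list()
        expect("end")
        return pos == len(tokens)
    except SyntaxError:
        return False


print(recognize_program("begin A = B + C; B = C end"))
print(recognize_program("begin A = B + end"))
```

The recursion in stmt_list mirrors the recursive rule for <stmt_list>.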

Uses of BNF

Generation
- In theory, BNF allows generation of all legal expressions
- Problem: Most languages are infinite
Recognition
- Allows expressions to be recognized
- Clearly, this is the most common use
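
Because most languages are infinite, generation in practice has to be lazy: sentences are produced one at a time and the caller decides when to stop. A minimal sketch using the binary-number grammar from Example 1 follows; the dictionary layout RULES and the function name sentences are assumptions for illustration.

```python
from collections import deque
from itertools import islice

# The binary-number grammar: each nonterminal maps to its alternative RHSs.
RULES = {
    "<binaryNumber>": [["<binaryDigit>"], ["<binaryDigit>", "<binaryNumber>"]],
    "<binaryDigit>": [["0"], ["1"]],
}


def sentences(start="<binaryNumber>"):
    """Yield sentences breadth-first, always expanding the leftmost nonterminal."""
    queue = deque([[start]])
    while queue:
        form = queue.popleft()
        # Position of the leftmost nonterminal, or None if the form is a sentence.
        index = next((i for i, sym in enumerate(form) if sym in RULES), None)
        if index is None:
            yield "".join(form)
            continue
        for rhs in RULES[form[index]]:
            queue.append(form[:index] + rhs + form[index + 1:])


print(list(islice(sentences(), 6)))
```

The generator itself never terminates; islice cuts it off after six sentences.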

Recognition Using BNF

How can we prove that a grammar accepts a particular sentence?
- Derivation - find a sequence of rule applications that leads from the start symbol to a string of terminals only
- Parse tree - graphical representation of a derivation

Derivation

A derivation is a repeated application of rules, starting with the start symbol
and ending with a sentence (all terminal symbols)

Example 1 - Binary Number: Recognition of "010"

<binaryNumber>	::=	<binaryDigit> <binaryNumber>
	::=	0 <binaryNumber>
	::=	0 <binaryDigit> <binaryNumber>
	::=	0 1 <binaryNumber>
	::=	0 1 <binaryDigit>
	::=	0 1 0

Example 2 - Program Recognition: begin A = B + C; B = C end

<program>	::=	begin <stmt_list> end
	::=	begin <stmt>; <stmt_list> end
	::=	begin <var> = <expression>; <stmt_list> end
	::=	begin A = <expression>; <stmt_list> end
	::=	begin A = <var> + <var>; <stmt_list> end
	::=	begin A = B + <var>; <stmt_list> end
	::=	begin A = B + C; <stmt_list> end
	::=	begin A = B + C; <stmt> end
	::=	begin A = B + C; <var> = <expression> end
	::=	begin A = B + C; B = <expression> end
	::=	begin A = B + C; B = <var> end
	::=	begin A = B + C; B = C end

Every string of symbols in the derivation is a sentential form
A sentence is a sentential form that has only terminal symbols
A leftmost (or rightmost) derivation is one in which the leftmost (or rightmost)
nonterminal is replaced
A derivation may be neither leftmost nor rightmost
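
A leftmost derivation can be replayed mechanically: at each step, find the leftmost nonterminal and replace it with one of its alternatives. The sketch below reproduces the derivation of "010" from Example 1; the names RULES and leftmost_derivation, and the encoding of each step as an alternative index, are assumptions for illustration.

```python
# The binary-number grammar: each nonterminal maps to its alternative RHSs.
RULES = {
    "<binaryNumber>": [["<binaryDigit>"], ["<binaryDigit>", "<binaryNumber>"]],
    "<binaryDigit>": [["0"], ["1"]],
}


def leftmost_derivation(start, choices):
    """Apply the chosen alternative to the leftmost nonterminal at each step.

    Returns every sentential form, beginning with the start symbol.
    Assumes each choice is made while a nonterminal is still present.
    """
    form = [start]
    forms = [form]
    for choice in choices:
        index = next(i for i, sym in enumerate(form) if sym in RULES)
        form = form[:index] + RULES[form[index]][choice] + form[index + 1:]
        forms.append(form)
    return forms


# Alternative indices that replay the derivation of "010" from Example 1.
for sentential_form in leftmost_derivation("<binaryNumber>", [1, 0, 1, 1, 0, 0]):
    print(" ".join(sentential_form))
```

Every printed line is a sentential form; the last one contains only terminals and is therefore a sentence.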

Parse Tree

A grammar naturally describes the hierarchical syntactic structure of sentences of the language
This hierarchical syntactic structure is called a parse tree
A parse tree is a tree such that
- Each leaf node is labeled with a terminal
- Each non-leaf is labeled with a nonterminal
- Label of parent is LHS of production and
  label of each child is, left to right, the RHS of production
- Root is labeled with starting nonterminal
Example 1 - Parse Tree for "010" using BNF Grammar specified above
Example 2 - Simple Assignment Statement Grammar

<assign> ::= <id> = <expr>
<id> ::= A | B | C
<expr> ::= <id> + <expr>
                | <id> * <expr>
                | (<expr>)
                | <id>

Example 2 - Leftmost Derivation of A = B * (A + C)

<assign>	::=	<id> = <expr>
	::=	A = <expr>
	::=	A = <id> * <expr>
	::=	A = B * <expr>
	::=	A = B * (<expr>)
	::=	A = B * (<id> + <expr>)
	::=	A = B * (A + <expr>)
	::=	A = B * (A + <id>)
	::=	A = B * (A + C)

Example 2 - Parse Tree
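
The parse tree for A = B * (A + C) can be built and printed with a short recursive-descent parser for the assignment grammar. This is a sketch under assumptions: the function names parse_assign, render and leaves, the tuple representation of nodes, and the indented text layout are all choices made here for illustration.

```python
def parse_assign(text):
    """Build a parse tree for the Example 2 grammar.

    Interior nodes are (nonterminal, children) tuples; leaves are terminal strings.
    """
    tokens = [ch for ch in text if not ch.isspace()]  # every lexeme is one character
    pos = 0

    def take(expected=None):
        nonlocal pos
        if pos >= len(tokens) or (expected is not None and tokens[pos] != expected):
            raise SyntaxError(f"unexpected input at token {pos}")
        pos += 1
        return tokens[pos - 1]

    def ident():  # <id> ::= A | B | C
        if pos < len(tokens) and tokens[pos] in ("A", "B", "C"):
            return ("<id>", [take()])
        raise SyntaxError(f"expected an identifier at token {pos}")

    def expr():  # <expr> ::= <id> + <expr> | <id> * <expr> | (<expr>) | <id>
        if pos < len(tokens) and tokens[pos] == "(":
            return ("<expr>", [take("("), expr(), take(")")])
        left = ident()
        if pos < len(tokens) and tokens[pos] in ("+", "*"):
            return ("<expr>", [left, take(), expr()])
        return ("<expr>", [left])

    tree = ("<assign>", [ident(), take("="), expr()])  # <assign> ::= <id> = <expr>
    if pos != len(tokens):
        raise SyntaxError("trailing input")
    return tree


def render(node, depth=0):
    """Return the tree as indented lines, one node per line."""
    if isinstance(node, str):
        return ["  " * depth + node]
    label, children = node
    lines = ["  " * depth + label]
    for child in children:
        lines.extend(render(child, depth + 1))
    return lines


def leaves(node):
    """Read the terminals off the leaves, left to right."""
    if isinstance(node, str):
        return [node]
    return [leaf for child in node[1] for leaf in leaves(child)]


tree = parse_assign("A = B * (A + C)")
print("\n".join(render(tree)))
print(" ".join(leaves(tree)))
```

Reading the leaves from left to right gives back the sentence, and each interior node with its children corresponds to one production used in the derivation.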


Extended Backus-Naur Form (EBNF)

EBNF makes BNF more convenient to use
Notation used in EBNF
- Basic BNF
- Curly Brackets { } - used to show grouping
- Kleene Star * - Zero or more instances of previous element
- Kleene Plus + - One or more instances of previous element
- Vertical bar | - Alternatives
- Brackets [ ] - optional
Example
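
As an illustration of the notation, the recursive BNF rule for <ident_list> given earlier could plausibly be rewritten in EBNF as <ident_list> ::= identifier {, identifier}*. That EBNF rule is an assumption, not taken from these notes. In a recognizer, the Kleene star becomes a loop rather than a recursive call, as this sketch shows; the identifier pattern and the function name is_ident_list are also assumptions.

```python
import re

IDENTIFIER = re.compile(r"[A-Za-z_]\w*")  # assumed shape of an identifier


def is_ident_list(text):
    """Recognize  <ident_list> ::= identifier {, identifier}*  (assumed EBNF rule)."""
    tokens = re.findall(r"\w+|\S", text)
    pos = 0

    def identifier():
        nonlocal pos
        if pos < len(tokens) and IDENTIFIER.fullmatch(tokens[pos]):
            pos += 1
            return True
        return False

    if not identifier():
        return False
    # The group { , identifier }* : zero or more repetitions, written as a loop.
    while pos < len(tokens) and tokens[pos] == ",":
        pos += 1
        if not identifier():
            return False
    return pos == len(tokens)


print(is_ident_list("total, count, index"))
print(is_ident_list("total,"))
```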

Syntax Diagrams

Example