Chapter 10 - Compilers and Language Translation
(10.1) - Introduction
Before it can be executed, a program written in a high-level language
must be translated into machine language by a system software module
called a compiler. Compilers are more complex than the
assemblers we examined in section 6.3.3, but a number of techniques
have been developed to simplify the compiler-writing problem. The
Structure of Compilers course, CS 43311, covers these
techniques; here we only describe the basic ideas.
(10.2) - The Compilation Process
The typical compiler has four phases, which are described below.
- Phase I: Lexical Analysis. Scan the characters of
the input source program, group them into meaningful units
called tokens, and classify the tokens. The process is akin to
reading the characters of an English sentence, grouping them into
words, and classifying each word as a noun, verb, adjective, etc.
- Phase II: Parsing. Check the tokens produced by Phase I
to see if they satisfy the syntax of the programming language.
This is akin to checking that the words in English form a
grammatically-correct sentence.
- Phase III: Semantic Analysis and Code Generation.
Analyze the meaning of the syntactically-correct input and
generate the machine code to perform the necessary actions.
- Phase IV: Code Optimization. Check the generated
code to see if it can be made to run faster.
(10.2.1) Phase I: Lexical Analysis
The scanner or lexical analyzer scans the characters
in the source program, groups them into single, indivisible objects
called tokens and classifies the tokens. As an example we
consider the following statement in C++:
Omega = 60 * (Beta + Gamma) ;
This statement has 29 characters, including the seven space characters.
The scanner ignores the spaces, groups the other characters
into 10 tokens, and classifies them as shown in the following table:
| Lexeme | Classification |
| Omega | variable |
| = | assign-op |
| 60 | number |
| * | mul-op |
| ( | open-parenthesis |
| Beta | variable |
| + | add-op |
| Gamma | variable |
| ) | close-parenthesis |
| ; | semicolon |
Another action of the scanner is to build a symbol table
containing the lexemes of every variable it finds - in this example
the symbol table contains Omega, Beta, and Gamma.
The scanner also builds a number table which in this example
contains 60.
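The scanning step described above can be sketched in Python. This is a minimal illustration, not the scanner of any real C++ compiler: the function name `scan`, the regular expressions, and the decision to classify `/` as a mul-op and `-` as an add-op are assumptions made for the example.

```python
import re

# (classification, pattern) pairs; the classification names follow the table above.
# Treating "/" as a mul-op and "-" as an add-op is an assumption of this sketch.
TOKEN_PATTERNS = [
    ("number", re.compile(r"\d+")),
    ("variable", re.compile(r"[A-Za-z_]\w*")),
    ("assign-op", re.compile(r"=")),
    ("mul-op", re.compile(r"[*/]")),
    ("add-op", re.compile(r"[+-]")),
    ("open-parenthesis", re.compile(r"\(")),
    ("close-parenthesis", re.compile(r"\)")),
    ("semicolon", re.compile(r";")),
]

def scan(source):
    """Return (tokens, symbol_table, number_table) for one source line."""
    tokens, symbol_table, number_table = [], [], []
    pos = 0
    while pos < len(source):
        if source[pos].isspace():      # the scanner ignores spaces
            pos += 1
            continue
        for kind, pattern in TOKEN_PATTERNS:
            match = pattern.match(source, pos)
            if match:
                break
        else:                          # no pattern matched at this position
            raise SyntaxError(f"unexpected character {source[pos]!r}")
        lexeme = match.group()
        tokens.append((lexeme, kind))
        if kind == "variable" and lexeme not in symbol_table:
            symbol_table.append(lexeme)
        if kind == "number" and lexeme not in number_table:
            number_table.append(lexeme)
        pos = match.end()
    return tokens, symbol_table, number_table

tokens, symbols, numbers = scan("Omega = 60 * (Beta + Gamma) ;")
for lexeme, kind in tokens:
    print(f"{lexeme:6} {kind}")
```

Running the sketch on the example statement yields the 10 tokens of the table, with `symbols` holding Omega, Beta, and Gamma and `numbers` holding 60.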
(10.2.2) Phase II: Parsing
The parser assembles the tokens found by the scanner in a way that
satisfies the grammar of the source language. The grammar of the source
language is specified by a set of productions expressed in
Backus-Naur Form (BNF). The left side of each production is the name of
a grammatical construct, followed by a ::= symbol which means "is
defined as", followed by a string of grammatical constructs on the right side.
For example, the following table shows 10 productions that are part of the
grammar for C++:
| (1) | <statement> | ::= | <variable> = <expression> ; |
| (2) | <expression> | ::= | <term> |
| (3) | <expression> | ::= | <expression> + <term> |
| (4) | <expression> | ::= | <expression> - <term> |
| (5) | <term> | ::= | <factor> |
| (6) | <term> | ::= | <term> * <factor> |
| (7) | <term> | ::= | <term> / <factor> |
| (8) | <factor> | ::= | ( <expression> ) |
| (9) | <factor> | ::= | <variable> |
| (10) | <factor> | ::= | <number> |
Production (1) in the table says that a <statement> is a <variable> followed by the assignment operator, =,
followed by an <expression>, followed by a semicolon.
Productions (2), (3), and (4) in the table say that an
<expression> is:
- a single <term> or
- an <expression> followed by a plus-sign
followed by a <term> or
- an <expression> followed by a minus-sign
followed by a <term>.
Productions (5), (6), and (7) in the table say that a <term> is:
- a single <factor> or
- a <term> followed by a times-sign
followed by a <factor> or
- a <term> followed by a divide-sign
followed by a <factor>.
Productions (8), (9), and (10) in the table say that a <factor>
is:
- an <expression> enclosed in parentheses or
- a <variable> or
- a <number>.
The parser uses the productions in the grammar to
build a parse-tree from the tokens supplied by the scanner.
For example, the parse tree for:
Omega = 60 * (Beta + Gamma) ;
is shown below with each step showing the number of the production
used in that step.
The parser gets the correct parse-tree for this sequence of
tokens as follows:
- The addition enclosed by the parentheses must be done
first and only the right-side of production (3) contains the
plus-sign. In order to use the right-side of production (3) the
Beta variable must become an expression and the
Gamma variable must become a term, so productions
(9), (5), and (2) are performed on Beta and productions (9) and (5)
on Gamma before production (3).
- Only the right-side of production (8) contains
parentheses so it is performed next.
- Only the right-side of production (6) contains the
times-sign.
In order to use the right-side of production (6) the number
60 must become a term so productions
(10) and (5) are performed before production (6).
- Production (2) turns the resulting term into an expression,
so production (1) can now be performed.
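The parsing steps above can be sketched as a small recursive-descent parser in Python. This is an illustration under stated assumptions, not necessarily how the parser in these notes works: a recursive-descent parser cannot apply the left-recursive productions (3), (4), (6), and (7) directly, so the sketch replaces them with loops that build the same left-leaning tree. The names `tokenize`, `Parser`, and `productions_used` are hypothetical, and each tree node is a tuple whose first element is the production number.

```python
import re

def tokenize(source):
    # Simplified stand-in for the scanner: returns the lexemes only.
    return re.findall(r"\d+|[A-Za-z_]\w*|[=*/+\-();]", source)

class Parser:
    def __init__(self, tokens):
        self.tokens, self.pos = tokens, 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self):
        tok = self.peek()
        self.pos += 1
        return tok

    def expect(self, lexeme):
        if self.peek() != lexeme:
            raise SyntaxError(f"expected {lexeme!r}, found {self.peek()!r}")
        self.pos += 1

    def variable(self):
        tok = self.peek()
        if tok is None or not tok.isidentifier():
            raise SyntaxError(f"expected a variable, found {tok!r}")
        return self.take()

    def statement(self):                       # production (1)
        var = self.variable()
        self.expect("=")
        expr = self.expression()
        self.expect(";")
        if self.peek() is not None:
            raise SyntaxError("unexpected tokens after ';'")
        return (1, var, "=", expr, ";")

    def expression(self):                      # productions (2), (3), (4)
        node = (2, self.term())
        while self.peek() in ("+", "-"):
            op = self.take()
            node = (3 if op == "+" else 4, node, op, self.term())
        return node

    def term(self):                            # productions (5), (6), (7)
        node = (5, self.factor())
        while self.peek() in ("*", "/"):
            op = self.take()
            node = (6 if op == "*" else 7, node, op, self.factor())
        return node

    def factor(self):                          # productions (8), (9), (10)
        tok = self.peek()
        if tok == "(":
            self.take()
            inner = self.expression()
            self.expect(")")
            return (8, "(", inner, ")")
        if tok is not None and tok.isdigit():
            return (10, self.take())
        return (9, self.variable())

def productions_used(node):
    """List production numbers bottom-up, left to right (post-order)."""
    if isinstance(node, str):
        return []
    used = []
    for child in node[1:]:
        used += productions_used(child)
    return used + [node[0]]

tree = Parser(tokenize("Omega = 60 * (Beta + Gamma) ;")).statement()
print(productions_used(tree))
```

For the example statement the sketch lists the productions in the order 10, 5, 9, 5, 2, 9, 5, 3, 8, 6, 2, 1, which agrees with the steps above: (9), (5), (2) for Beta and (9), (5) for Gamma come before (3), then (8), then (6), and (1) comes last.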
Ambiguity: The grammar in the table
is unambiguous - there is only one parse-tree for
any expression.
- All multiplications and divisions must be performed
before any additions and subtractions can be performed
because:
- an expression can only be a single term, a sum of terms,
and/or a difference of terms; and
- a term can only be a single factor, a
product of factors, and/or a quotient of factors.
- But Production (8) says that any operations enclosed inside
parentheses must be performed before any operations outside
the parentheses can be performed.
- Additions and subtractions must be performed in left-to-right
order since an expression can't be a term followed by a
plus-sign or minus-sign followed by an expression. For example,
the subtraction in 9 - 5 + 2 must be performed before
the addition to get the correct value of 6; performing the
addition before the subtraction gives the wrong value of 2.
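The effect of the grouping on 9 - 5 + 2 can be checked with a short Python sketch. The two function names are hypothetical; the first groups operations the way the grammar requires, and the second shows the grouping the grammar rules out.

```python
def evaluate_left_to_right(numbers, operators):
    # Grouping forced by the grammar: ((9 - 5) + 2)
    result = numbers[0]
    for op, value in zip(operators, numbers[1:]):
        result = result + value if op == "+" else result - value
    return result

def evaluate_right_to_left(numbers, operators):
    # Grouping the grammar does not allow: (9 - (5 + 2))
    result = numbers[-1]
    for op, value in zip(reversed(operators), reversed(numbers[:-1])):
        result = value + result if op == "+" else value - result
    return result

left = evaluate_left_to_right([9, 5, 2], ["-", "+"])
right = evaluate_right_to_left([9, 5, 2], ["-", "+"])
print(left, right)  # prints "6 2"
```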
(10.2.3) Phase III: Semantic Analysis and Code Generation
Semantic Analysis:
The parser only examines the syntax of the source language, i.e.,
whether the sequence of tokens looks correct. Semantic
analysis makes sure that the meaning is correct. For
example, the English sentence "The man bit the dog." has
the correct syntax, as shown by its parse tree,
but the semantics is questionable - the man bit the dog
instead of the dog biting the man.
Semantic analysis for programming languages makes sure:
- that all variables have the correct data types;
- that the argument-list and return value of every function-call
agrees with the definition of the called function; etc.
Code Generation: Code can be generated as the productions
of the grammar are performed. The code generated for the statement:
Omega = 60 * (Beta + Gamma) ;
is shown in the following table along with the
number of the production that generated it; the other productions
used in the parse-tree don't generate any code.
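| Generated Code | Generator |
| LOAD Beta | Production (3) |
| ADD Gamma | Production (3) |
| STORE Temp1 | Production (3) |
| LOAD Temp1 | Production (6) |
| MULT N60 | Production (6) |
| STORE Temp2 | Production (6) |
| LOAD Temp2 | Production (1) |
| STORE Omega | Production (1) |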
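The code-generation scheme in the table can be sketched in Python as a walk over a simplified tree of the statement. Everything here is an assumption made for illustration: the nested-tuple tree, the function name `generate`, and in particular the rule that loads a temporary operand first for a commutative operator, which is a guess added only so that the output matches the table (LOAD Temp1 then MULT N60). Only the ADD and MULT instructions that appear in the table are covered.

```python
# Only the two operations that appear in the table are handled.
OPCODES = {"+": "ADD", "*": "MULT"}

def generate(statement):
    """statement is ("=", target, expression); expressions are nested tuples."""
    code = []
    temp_count = 0

    def operand(node):
        # Return the name holding the node's value, emitting code if needed.
        nonlocal temp_count
        if isinstance(node, int):
            return f"N{node}"          # numbers live under labels such as N60
        if isinstance(node, str):
            return node                # variables are referenced by name
        op, left, right = node
        left_name, right_name = operand(left), operand(right)
        # Assumption: + and * are commutative, so when only the right operand
        # was just computed into a temporary, load that temporary first.
        if isinstance(right, tuple) and not isinstance(left, tuple):
            left_name, right_name = right_name, left_name
        temp_count += 1
        temp = f"Temp{temp_count}"
        code.extend([f"LOAD {left_name}",
                     f"{OPCODES[op]} {right_name}",
                     f"STORE {temp}"])
        return temp

    _, target, expression = statement
    source = operand(expression)
    code.extend([f"LOAD {source}", f"STORE {target}"])   # production (1)
    return code

generated = generate(("=", "Omega", ("*", 60, ("+", "Beta", "Gamma"))))
print("\n".join(generated))
```

The sketch emits three instructions per arithmetic operation and two for the assignment, which gives the eight instructions of the table.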
(10.2.4) Phase IV: Code Optimization
Code optimization simplifies the code by eliminating
wasteful instructions. In our example, the code generated
by Phase III is simplified to be:
LOAD Beta
ADD Gamma
MULT N60
STORE Omega
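One way to obtain this simplification is a peephole pass that deletes a STORE to a temporary when it is immediately followed by a LOAD of the same temporary and the temporary is not used anywhere else. The Python sketch below is an assumption about how such a pass might look, not the optimizer described in these notes; the function name `remove_redundant_temps` is hypothetical.

```python
def remove_redundant_temps(code):
    """Drop 'STORE TempN' / 'LOAD TempN' pairs when TempN is unused elsewhere."""
    optimized = []
    i = 0
    while i < len(code):
        if i + 1 < len(code):
            op1, name1 = code[i].split()
            op2, name2 = code[i + 1].split()
            # Is the temporary referenced by any instruction outside this pair?
            used_elsewhere = any(
                line.split()[1] == name1
                for j, line in enumerate(code)
                if j not in (i, i + 1)
            )
            if (op1 == "STORE" and op2 == "LOAD" and name1 == name2
                    and name1.startswith("Temp") and not used_elsewhere):
                i += 2                 # skip both instructions of the pair
                continue
        optimized.append(code[i])
        i += 1
    return optimized

phase3 = [
    "LOAD Beta", "ADD Gamma", "STORE Temp1",
    "LOAD Temp1", "MULT N60", "STORE Temp2",
    "LOAD Temp2", "STORE Omega",
]
optimized = remove_redundant_temps(phase3)
print("\n".join(optimized))
```

The check that the temporary is unused elsewhere matters: if a later instruction still read Temp1, deleting the STORE would change the program's result.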
Then the symbol table and number table are used to
allocate memory for all variables and numbers:
Omega: .DATA 0
N60: .DATA 60
Beta: .DATA 0
Gamma: .DATA 0
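The allocation step can be sketched in Python as follows. The function name `allocate` is hypothetical, and the sketch lists all variables before all numbers, whereas the listing above interleaves them.

```python
def allocate(symbol_table, number_table):
    # Each variable gets a word initialized to 0; each number gets a
    # word holding its value under a label of the form N<value>.
    lines = [f"{name}: .DATA 0" for name in symbol_table]
    lines += [f"N{value}: .DATA {value}" for value in number_table]
    return lines

data_lines = allocate(["Omega", "Beta", "Gamma"], ["60"])
print("\n".join(data_lines))
```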
Kenneth E. Batcher - 11/8/2006