Chapter 10 - Compilers and Language Translation

(10.1) - Introduction

Before it can be executed, a program written in a high-level language must be translated into machine language by a system software module called a compiler. Compilers are more complex than the assemblers we examined in section 6.3.3, but a number of techniques have been developed to simplify the compiler-writing problem. The Structure of Compilers course, CS 43311, covers these techniques; here we describe only the basic ideas.

(10.2) - The Compilation Process

The typical compiler has four phases, which are described below.

(10.2.1) Phase I: Lexical Analysis

The scanner or lexical analyzer scans the characters in the source program, groups them into single, indivisible objects called tokens and classifies the tokens. As an example we consider the following statement in C++:

        Omega = 60 * (Beta + Gamma) ;
This statement has 29 characters, including the seven space characters. The scanner ignores the spaces, groups the other characters into 10 tokens, and classifies them as shown in the following table:

Lexème    Classification
Omega     variable
=         assign-op
60        number
*         mul-op
(         open-parenthesis
Beta      variable
+         add-op
Gamma     variable
)         close-parenthesis
;         semicolon

Another action of the scanner is to build a symbol table containing the lexemes of every variable it finds - in this example the symbol table contains Omega, Beta, and Gamma. The scanner also builds a number table which in this example contains 60.
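
The scanner's work on the example statement can be sketched in a few lines of Python. This is only an illustration, not the scanner of any real compiler: the function name scan, the regular expressions, and the use of Python lists for the symbol table and number table are assumptions made for the example. Only the classification names come from the table above.

```python
import re

# Classification names follow the token table; the patterns are assumptions.
TOKEN_PATTERNS = [
    ("number", re.compile(r"[0-9]+")),
    ("variable", re.compile(r"[A-Za-z_][A-Za-z_0-9]*")),
    ("assign-op", re.compile(r"=")),
    ("mul-op", re.compile(r"\*")),
    ("add-op", re.compile(r"\+")),
    ("open-parenthesis", re.compile(r"\(")),
    ("close-parenthesis", re.compile(r"\)")),
    ("semicolon", re.compile(r";")),
]


def scan(source):
    """Return (tokens, symbol_table, number_table) for one source string."""
    tokens, symbol_table, number_table = [], [], []
    pos = 0
    while pos < len(source):
        if source[pos].isspace():  # the scanner ignores spaces
            pos += 1
            continue
        # Try each pattern at the current position; stop at the first match.
        for classification, pattern in TOKEN_PATTERNS:
            match = pattern.match(source, pos)
            if match:
                break
        else:
            raise SyntaxError(f"unexpected character {source[pos]!r}")
        lexeme = match.group()
        tokens.append((lexeme, classification))
        # Record each variable and number once, in order of first appearance.
        if classification == "variable" and lexeme not in symbol_table:
            symbol_table.append(lexeme)
        if classification == "number" and lexeme not in number_table:
            number_table.append(lexeme)
        pos = match.end()
    return tokens, symbol_table, number_table


tokens, symbols, numbers = scan("Omega = 60 * (Beta + Gamma) ;")
for lexeme, classification in tokens:
    print(f"{lexeme:<6} {classification}")
print("symbol table:", symbols)
print("number table:", numbers)
```

Trying the patterns in a fixed order keeps the sketch short. A production scanner would more likely be generated from a single combined specification.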

(10.2.2) Phase II: Parsing

The parser assembles the tokens found by the scanner in a way that satisfies the grammar of the source language. The grammar of the source language is specified by a set of productions expressed in Backus-Naur Form (BNF). The left side of each production is the name of a grammatical construct, followed by the ::= symbol, which means "is defined as", followed by a string of grammatical constructs on the right side. For example, the following table shows 10 productions that are part of the grammar for C++:


(1)   <statement>  ::= <variable> = <expression> ;
(2)   <expression> ::= <term>
(3)   <expression> ::= <expression> + <term>
(4)   <expression> ::= <expression> - <term>
(5)   <term>       ::= <factor>
(6)   <term>       ::= <term> * <factor>
(7)   <term>       ::= <term> / <factor>
(8)   <factor>     ::= ( <expression> )
(9)   <factor>     ::= <variable>
(10)  <factor>     ::= <number>

Production (1) in the table says that a <statement> is a <variable> followed by the assignment operator, =, followed by an <expression>, followed by a semicolon.

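Productions (2), (3), and (4) in the table say that an <expression> is a <term>, or an <expression> followed by + and a <term>, or an <expression> followed by - and a <term>.

Productions (5), (6), and (7) in the table say that a <term> is a <factor>, or a <term> followed by * and a <factor>, or a <term> followed by / and a <factor>.

Productions (8), (9), and (10) in the table say that a <factor> is an <expression> enclosed in parentheses, or a <variable>, or a <number>.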

The parser uses the productions in the grammar to build a parse-tree from the tokens supplied by the scanner. For example, the parse tree for:

        Omega = 60 * (Beta + Gamma) ;
is shown below with each step showing the number of the production used in that step.

The parser gets the correct parse-tree for this sequence of tokens as follows:

Ambiguity: The grammar in the table is unambiguous - there is only one parse-tree for any expression.
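
One way to see how the ten productions fit the example statement is a small recursive-descent parser in Python. This is a sketch under stated assumptions, not the method these notes prescribe: the class Parser, the function parse, and the nested-tuple tree are hypothetical names and representations. Productions (3), (4), (6), and (7) are left-recursive, so the sketch handles them with loops instead of direct recursion. The operators - and / are recognized by their lexemes, because the token table does not classify them.

```python
class Parser:
    """Parses one <statement>; tokens are (lexeme, classification) pairs."""

    def __init__(self, tokens):
        self.tokens = list(tokens)
        self.pos = 0
        self.used = []  # production numbers, in the order constructs complete

    def peek(self):
        if self.pos < len(self.tokens):
            return self.tokens[self.pos]
        return (None, "end-of-input")

    def take(self, classification):
        lexeme, found = self.peek()
        if found != classification:
            raise SyntaxError(f"expected {classification}, found {found}")
        self.pos += 1
        return lexeme

    def statement(self):  # production (1)
        target = self.take("variable")
        self.take("assign-op")
        value = self.expression()
        self.take("semicolon")
        self.used.append(1)
        return ("statement", target, "=", value, ";")

    def expression(self):  # productions (2), (3), (4)
        node = ("expression", self.term())
        self.used.append(2)
        while self.peek()[0] in ("+", "-"):
            operator = self.peek()[0]
            self.pos += 1
            right = self.term()
            self.used.append(3 if operator == "+" else 4)
            node = ("expression", node, operator, right)
        return node

    def term(self):  # productions (5), (6), (7)
        node = ("term", self.factor())
        self.used.append(5)
        while self.peek()[0] in ("*", "/"):
            operator = self.peek()[0]
            self.pos += 1
            right = self.factor()
            self.used.append(6 if operator == "*" else 7)
            node = ("term", node, operator, right)
        return node

    def factor(self):  # productions (8), (9), (10)
        lexeme, classification = self.peek()
        if classification == "open-parenthesis":
            self.pos += 1
            inner = self.expression()
            self.take("close-parenthesis")
            self.used.append(8)
            return ("factor", "(", inner, ")")
        if classification in ("variable", "number"):
            self.pos += 1
            self.used.append(9 if classification == "variable" else 10)
            return ("factor", lexeme)
        raise SyntaxError(f"unexpected {classification}")


def parse(tokens):
    parser = Parser(tokens)
    tree = parser.statement()
    if parser.pos != len(parser.tokens):
        raise SyntaxError("extra tokens after the statement")
    return tree, parser.used


EXAMPLE = [
    ("Omega", "variable"), ("=", "assign-op"), ("60", "number"),
    ("*", "mul-op"), ("(", "open-parenthesis"), ("Beta", "variable"),
    ("+", "add-op"), ("Gamma", "variable"), (")", "close-parenthesis"),
    (";", "semicolon"),
]
tree, used = parse(EXAMPLE)
print(used)
```

For the example statement the recorded sequence is 10, 5, 9, 5, 2, 9, 5, 3, 8, 6, 2, 1. Productions (3), (6), and (1) each appear exactly once.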

(10.2.3) Phase III: Semantic Analysis and Code Generation

Semantic Analysis: The parser examines only the syntax of the source language, i.e., whether the sequence of tokens is well formed. Semantic analysis makes sure that the meaning is correct. For example, the English sentence "The man bit the dog." has the correct syntax, as shown by the following parse tree:

but the semantics is questionable - the man bit the dog instead of the dog biting the man.

Semantic analysis for programming languages makes sure:

Code Generation: Code can be generated as the productions of the grammar are applied. The code generated when the statement:

        Omega = 60 * (Beta + Gamma) ;
is compiled is shown in the following table along with the number of the production that generated it - the other productions used in the parse-tree don't generate any code.

Generated Code      Generator
LOAD   Beta         Production (3)
ADD    Gamma
STORE  Temp1
LOAD   Temp1        Production (6)
MULT   N60
STORE  Temp2
LOAD   Temp2        Production (1)
STORE  Omega
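
The idea of emitting code as productions (3), (6), and (1) are applied can be sketched in Python. Several things here are assumptions made for illustration: the function generate, the simplified nested-tuple tree it takes in place of the full parse-tree, and the rule that operands are always loaded left to right. Because of that last rule, the sketch emits LOAD N60 then MULT Temp1 for production (6), where the table shows LOAD Temp1 then MULT N60. The two are equivalent because multiplication is commutative. Only + and * are handled, since they are the only operators whose opcodes the notes show.

```python
# Each operator maps to (opcode, production number), following the table.
OPERATORS = {"+": ("ADD", 3), "*": ("MULT", 6)}


def generate(target, expression):
    """Return a list of (opcode, operand, production) triples."""
    code = []
    temps = 0

    def emit_expression(node):
        nonlocal temps
        if isinstance(node, str):
            # A leaf: numbers are named N<value>, variables keep their name.
            return "N" + node if node.isdigit() else node
        left, operator, right = node
        left_name = emit_expression(left)
        right_name = emit_expression(right)
        opcode, production = OPERATORS[operator]
        temps += 1
        temp = f"Temp{temps}"
        code.append(("LOAD", left_name, production))
        code.append((opcode, right_name, production))
        code.append(("STORE", temp, production))
        return temp

    result = emit_expression(expression)
    # Production (1): copy the value of the expression into the target.
    code.append(("LOAD", result, 1))
    code.append(("STORE", target, 1))
    return code


code = generate("Omega", ("60", "*", ("Beta", "+", "Gamma")))
for opcode, operand, production in code:
    print(f"{opcode:<6} {operand:<8} Production ({production})")
```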

(10.2.4) Phase IV: Code Optimization

Code optimization simplifies the code by eliminating wasteful instructions. In our example, the code generated by Phase III is simplified to be:

        LOAD   Beta
        ADD    Gamma
        MULT   N60
        STORE  Omega
Then the symbol table and number table are used to allocate memory for all variables and numbers:
Omega:  .DATA  0
N60:    .DATA  60
Beta:   .DATA  0
Gamma:  .DATA  0
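
The simplification of the Phase III code can be sketched as a peephole pass in Python. The notes do not say which optimization technique is used, so this is an assumption: the hypothetical function optimize removes a STORE to a temporary that is immediately followed by a LOAD of the same temporary, provided the temporary appears nowhere else. The pair is wasteful because the value is still in the accumulator after the STORE.

```python
def optimize(code):
    """Drop 'STORE t' / 'LOAD t' pairs when temporary t is used nowhere else."""
    result = []
    i = 0
    while i < len(code):
        opcode, operand = code[i]
        is_redundant_pair = (
            opcode == "STORE"
            and operand.startswith("Temp")
            and i + 1 < len(code)
            and code[i + 1] == ("LOAD", operand)
            # The temporary must occur only in this STORE and this LOAD.
            and sum(1 for _, other in code if other == operand) == 2
        )
        if is_redundant_pair:
            i += 2  # skip both instructions
        else:
            result.append(code[i])
            i += 1
    return result


# The eight instructions generated by Phase III, as (opcode, operand) pairs.
PHASE_III = [
    ("LOAD", "Beta"), ("ADD", "Gamma"), ("STORE", "Temp1"),
    ("LOAD", "Temp1"), ("MULT", "N60"), ("STORE", "Temp2"),
    ("LOAD", "Temp2"), ("STORE", "Omega"),
]
optimized = optimize(PHASE_III)
for opcode, operand in optimized:
    print(f"{opcode:<6} {operand}")
```

Applied to the Phase III code, the pass leaves the four instructions shown above and no longer needs Temp1 or Temp2.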

Kenneth E. Batcher - 11/8/2006