3. Arithmetic for Computers

3.1 - Introduction

The natural numbers: 0, 1, 2, 3, ..., can be represented either in decimal or binary form but a computer must also work with negative integers and real numbers. How does a computer add, subtract, multiply, and divide numbers in general?


3.2 - Signed and Unsigned Numbers

Negative Numbers: The odometer in my 11-year-old car reads 12133. Does this mean the car was driven only 1103 miles per year? No. The car was really driven 112,133 miles but the odometer only has 5 digits so it went from 99999 to 00000 when the car passed the 100,000-mile point. An odometer reading of 12133 only indicates that the car was really driven 12,133 miles, or 112,133 miles, or 212,133 miles, etc. The equation for the actual mileage is:

(actual mileage) = (odometer reading) + (100,000) * (an integer)

A d-digit odometer can only hold 10^d different integers (from 0 to 10^d - 1) so every time it overflows from 99...99 to 00...00 it commits an error of 10^d; i.e., it counts miles modulo 10^d. A b-bit computer register can only hold 2^b different natural numbers (from 0 to 2^b - 1) so every time it overflows from 11...11 to 00...00 it commits an error of 2^b; i.e., it performs integer addition, subtraction, and multiplication modulo 2^b.

Suppose I put 99,999 more miles on my 11-year-old car. The odometer will then read one mile less (12132 instead of 12133.) If M is the modulus of a register then M-1 acts just like -1. In general, M-n acts just like -n in modulo M arithmetic.

This fact is used to represent negative integers in a computer. If the modulus of a register is M then use the natural number M-n to represent the negative integer -n. All integer addition, subtraction, and multiplication is performed modulo M so the computer will generate the correct answer as long as the correct answer can be represented in the register.

As an example, we assume a 5-digit decimal register with a modulus of 100,000. The negative integer, -99, is represented as 100,000 - 99 = 99901. To compute the square of -99 the computer will try to square 99901 to get 9980209801 but the register only stores the low-order five digits (09801) which is the correct square of -99.
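The 5-digit example above can be checked directly; the following is an illustrative sketch in which the % operator plays the role of the register discarding its high-order digits:

```python
M = 100_000                       # modulus of a 5-digit decimal register
neg_99 = M - 99                   # natural number representing -99
square = (neg_99 * neg_99) % M    # register keeps only the low-order five digits
print(square)                     # 9801, the correct square of -99
```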

A register with a modulus of M can store M different natural numbers. How many of these natural numbers represent negative integers and how many represent positive integers? Since -0 = 0 it's best to represent -0 with the same natural number as 0. This leaves M-1 natural numbers to represent the non-zero integers. It's best if the negative of any integer can be represented so let's split the difference and use (M-1)/2 natural numbers for positive integers and (M-1)/2 natural numbers for negative integers. Unfortunately, a b-bit binary register (or a d-digit decimal register) has an even modulus, M, so M-1 is not divisible by two. One natural number, M/2, is left over.

For example, consider a 5-digit decimal register. The natural number 00000 can represent integer 0 (or -0). Natural numbers 00001 through 49999 can represent positive integers 1 through 49999, respectively, and natural numbers 50001 through 99999 can represent negative integers -49999 through -1, respectively. What does natural number 50000 represent? We could choose it to represent +50000 or -50000 (with either choice its negation can't be represented). It's best to choose the negative integer (-50000) for the natural number 50000 because the sign of an integer is easy to determine: the leftmost decimal digit of any negative integer is 5, 6, 7, 8, or 9 while the left-most digit of a non-negative integer is 0, 1, 2, 3, or 4.

In MIPS the offset for a branch target is represented as a 16-bit two's complement integer. The modulus is M = 2^16 = 65536 and signed integers are represented as follows:

16-bit Two's Complement Integers

     binary bits        decimal integer
0000 0000 0000 0000             0
0000 0000 0000 0001             1
0000 0000 0000 0010             2
0000 0000 0000 0011             3
       . . .                  . . .
0111 1111 1111 1101        32,765
0111 1111 1111 1110        32,766
0111 1111 1111 1111        32,767
1000 0000 0000 0000       -32,768
1000 0000 0000 0001       -32,767
1000 0000 0000 0010       -32,766
1000 0000 0000 0011       -32,765
       . . .                  . . .
1111 1111 1111 1101            -3
1111 1111 1111 1110            -2
1111 1111 1111 1111            -1

The left-most bit is the sign bit: it equals 1 for any negative integer and equals 0 for any non-negative integer. What is the value of the signed integer represented in this form? The sum of the weights of the 1-bits. The sign bit has a negative weight: -2^15 = -32,768. The weight of every other bit is positive:

Bit Weights

-32768  16384  8192  4096  2048  1024  512  256  128  64  32  16  8  4  2  1

For any integers p and q where p is greater than q:

2^p + 2^(p-1) + 2^(p-2) + ... + 2^(q+1) + 2^q = 2^(p+1) - 2^q

This is useful to evaluate long strings of 1-bits. For example: 0111 1111 1111 0000 has the value 2^15 - 2^4 = 32768 - 16 = 32752.
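The weighted-sum rule can be verified with a short sketch (twos_value is a made-up helper name, not MIPS code):

```python
def twos_value(bits):
    """Value of a two's complement bit string; the sign bit has negative weight."""
    n = len(bits)
    return -int(bits[0]) * 2**(n - 1) + int(bits[1:], 2)

print(twos_value('0111111111110000'))   # 32752 = 2^15 - 2^4
print(twos_value('1111111111111111'))   # -1
```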

The MIPS computer performs integer arithmetic in its 32-bit registers. The modulus is M = 2^32 = 4,294,967,296 and signed integers are represented as follows:

32-bit Two's Complement Integers
             binary bits                 decimal integer
0000 0000 0000 0000 0000 0000 0000 0000 0
0000 0000 0000 0000 0000 0000 0000 0001 1
0000 0000 0000 0000 0000 0000 0000 0010 2
0000 0000 0000 0000 0000 0000 0000 0011 3
                 . . .                        . . .
0111 1111 1111 1111 1111 1111 1111 1101 2,147,483,645
0111 1111 1111 1111 1111 1111 1111 1110 2,147,483,646
0111 1111 1111 1111 1111 1111 1111 1111 2,147,483,647
1000 0000 0000 0000 0000 0000 0000 0000 -2,147,483,648
1000 0000 0000 0000 0000 0000 0000 0001 -2,147,483,647
1000 0000 0000 0000 0000 0000 0000 0010 -2,147,483,646
1000 0000 0000 0000 0000 0000 0000 0011 -2,147,483,645
                 . . .                        . . .
1111 1111 1111 1111 1111 1111 1111 1101 -3
1111 1111 1111 1111 1111 1111 1111 1110 -2
1111 1111 1111 1111 1111 1111 1111 1111 -1

The left-most bit is the sign bit with a weight of -2^31 = -2,147,483,648. The weight of every other bit is positive.

Bit Weights

-2,147,483,648  1,073,741,824  536,870,912  . . .  128  64  32  16  8  4  2  1

The MIPS computer extends 16-bit two's complement signed integers to 32 bits by sign extension: the bits of the short integer are copied into the right-most part of the long register and its sign bit is copied repeatedly into the left-most part of the long register.
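Sign extension can be sketched on bit patterns as follows (sign_extend16 is a hypothetical helper, not a MIPS instruction):

```python
def sign_extend16(x):
    """Extend a 16-bit pattern (0..65535) to a 32-bit pattern by copying the sign bit."""
    return x | 0xFFFF0000 if x & 0x8000 else x

print(hex(sign_extend16(0xFFFE)))   # 0xfffffffe: -2 stays -2
print(hex(sign_extend16(0x7FFF)))   # 0x7fff: +32767 stays +32767
```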

What is the negative of a signed integer in two's complement form? Complement each bit (change 0 to 1 and 1 to 0) and then add unity.
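Complement-and-add-unity can be sketched on 32-bit patterns (a minimal illustration; the mask stands in for the fixed register width):

```python
def negate32(x):
    """Two's complement negation of a 32-bit pattern: complement each bit, add one."""
    return (~x + 1) & 0xFFFFFFFF

print(hex(negate32(1)))            # 0xffffffff, the pattern for -1
print(negate32(negate32(12345)))   # 12345: negating twice restores the value
```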

A register may contain a signed integer in two's complement form or it may contain a memory address which is always non-negative. This affects how comparisons are performed. For example, 1111 1111 1111 1111 1111 1111 1111 1111 has the value of -1 if it represents a signed integer or the value +4,294,967,295 if it is a memory address. To handle this problem the MIPS instruction set contains both signed (slt and slti) and unsigned (sltu and sltiu) comparison instructions.
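The difference between the two comparisons can be sketched like this (slt and sltu here are illustrative Python stand-ins for the MIPS instructions, operating on 32-bit patterns):

```python
def slt(a, b):
    """Compare two 32-bit patterns as signed integers."""
    sa = a - 2**32 if a >> 31 else a
    sb = b - 2**32 if b >> 31 else b
    return int(sa < sb)

def sltu(a, b):
    """Compare two 32-bit patterns as unsigned integers (e.g., addresses)."""
    return int(a < b)

x, y = 0xFFFFFFFF, 0x00000001
print(slt(x, y))    # 1: as signed integers, -1 < 1
print(sltu(x, y))   # 0: as addresses, 4,294,967,295 is not less than 1
```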


3.3 - Addition and Subtraction

Addition of binary numbers is similar to paper-and-pencil addition of decimal numbers: start at the right-most column and work toward the left adding the digits in each column and passing any carries into the next column. For binary numbers the only digits are 0 and 1 and a carry is created whenever the sum is two or more. For example, the addition of 0000 0000 0110 1100 and 0000 0000 0101 1010 looks like:

0 0 0 0 0 0 0 0 0 1 1 0 1 1 0 0   (augend)
0 0 0 0 0 0 0 0 0 1 0 1 1 0 1 0   (addend)
                1 1 1 1           (carries)
-------------------------------
0 0 0 0 0 0 0 0 1 1 0 0 0 1 1 0   (sum)
The addition in decimal is 108 + 90 = 198.

Subtraction is performed by adding the negative of the subtrahend (in two's complement form) to the minuend to get the difference. As an example 0000 0000 0101 1010 is subtracted from 0000 0000 0110 1100. First, the subtrahend (0000 0000 0101 1010) is negated by complementing each bit (1111 1111 1010 0101) and adding unity to obtain 1111 1111 1010 0110. Then the negated subtrahend is added to the minuend:

    0 0 0 0 0 0 0 0 0 1 1 0 1 1 0 0   (minuend)
    1 1 1 1 1 1 1 1 1 0 1 0 0 1 1 0   (negated subtrahend)
(1) 1 1 1 1 1 1 1 1 1 1   1 1         (carries)
    -------------------------------
    0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0   (difference)
This example assumes registers are only 16 bits long so the carry out of the left-most column (shown in parentheses) was neglected. The subtraction in decimal is 108 - 90 = 18.
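The subtract-by-adding-the-negation method can be sketched for 16-bit registers (a minimal illustration; the mask plays the role of discarding the carry out of the left-most column):

```python
M = 1 << 16   # modulus of a 16-bit register

def sub16(minuend, subtrahend):
    """Subtract by adding the two's complement of the subtrahend, modulo 2^16."""
    negated = (~subtrahend + 1) & (M - 1)   # complement each bit, add unity
    return (minuend + negated) & (M - 1)    # add and discard the carry out

print(sub16(108, 90))   # 18
```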

Overflow Detection: Integer addition (subtraction) overflows when the magnitude of the sum (difference) is too large to be stored in the result register. In the MIPS computer the length of each register is 32 bits so addition (subtraction) is performed with a modulus, M = 232 = 4,294,967,296. If overflow occurs the value stored in the result register has an error of +M or -M.

In 32-bit two's complement arithmetic, a number with a 0 sign bit is in the range of 0 through 2^31 - 1 and a number with a 1 sign bit is in the range of -2^31 through -1. Using these facts plus the fact that overflow commits an error of +M or -M, one can derive the following rules for detecting when overflow occurs:

  1. adding two numbers with different sign bits can never overflow;

  2. adding two numbers with the same sign bit overflows if and only if the sign bit of the result is different;

  3. subtracting two numbers with the same sign bit can never overflow; and

  4. subtracting a number from a minuend with a different sign bit overflows if and only if the sign bit of the result differs from the sign bit of the minuend.

The MIPS computer has two sets of instructions for integer addition and subtraction: instructions in one set detect overflow on signed integer operations and instructions in the other set don't detect overflows.

The add, add immediate, and subtract instructions (add, addi, sub) assume the operands are signed integers and cause an exception if any result overflows.

The add unsigned, add immediate unsigned, and subtract unsigned instructions (addu, addiu, subu) do not cause exceptions on overflow.
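The two behaviors can be sketched as follows (add_like_mips is a made-up name; it wraps modulo 2^32 like addu, and optionally traps like add):

```python
def add_like_mips(a, b, trap=True):
    """Add two signed 32-bit integers modulo 2^32; optionally raise on overflow."""
    wrapped = (a + b + 2**31) % 2**32 - 2**31   # result as stored in the register
    if trap and wrapped != a + b:               # add/addi/sub behavior
        raise OverflowError("result does not fit in 32 bits")
    return wrapped

print(add_like_mips(3, 4))                      # 7
print(add_like_mips(2**31 - 1, 1, trap=False))  # -2147483648: wrapped silently, addu-style
```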


3.4 - Multiplication

As an example we use the paper-and-pencil method to multiply decimal 396 by decimal 302:

    396
x   302
 ------
    792  <-- 2 * 396
   000   <-- 0 * 396
 1188    <-- 3 * 396
 ------
 119592
The multiplier digits are treated one at a time starting at the right and going to the left. The multiplicand (396) is multiplied by each multiplier digit and the products added together with appropriate shifts to get the final product (119592).

Binary numbers can be multiplied in similar fashion. The method is actually simpler than decimal multiplication; each multiplier bit is just 0 or 1 so the multiplicand-by-multiplier-bit products are easily generated. The following example multiplies 1110 (decimal 14) by 1011 (decimal 11):

     1110
x    1011
 --------
     1110  <-- 1 * 1110
    1110   <-- 1 * 1110
   0000    <-- 0 * 1110
  1110     <-- 1 * 1110
 --------
 10011010  <-- final product (decimal 154)
Multiplying an m-bit number by an n-bit number produces a product with m+n bits; the product of a 32-bit multiplicand and a 32-bit multiplier has 64 bits. Figure 3.5 in the text shows a scheme with a 64-bit ALU.

Figure 3.7 simplifies the hardware by shifting the product to the right instead of shifting the multiplicand to the left and using the 64-bit product register for both the product and the multiplier.

Either scheme performs 32 iterations. Each iteration:

  1. adds the multiplicand to the product if the rightmost multiplier bit equals 1;

  2. shifts the multiplicand to the left (fig. 3.5) or the product to the right (fig. 3.7) one bit position; and

  3. shifts the multiplier one bit position to the right to discard the multiplier bit just treated and bring the next bit into the rightmost position.
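The three steps above can be sketched directly (a minimal shift-and-add sketch of the algorithm, not the register layout of either figure):

```python
def multiply(multiplicand, multiplier, bits=32):
    """Shift-and-add multiplication of unsigned integers."""
    product = 0
    for _ in range(bits):
        if multiplier & 1:       # step 1: add if the rightmost multiplier bit is 1
            product += multiplicand
        multiplicand <<= 1       # step 2: shift the multiplicand left one place
        multiplier >>= 1         # step 3: discard the multiplier bit just treated
    return product

print(multiply(14, 11))   # 154
```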

Multiplying Signed Integers: The multiplicand and the product are in two's complement form so additions and subtractions use two's complement arithmetic.

The multiplier is in two's complement form so its sign bit has a negative weight. Does this complicate the algorithm at all? Not really. If the sign bit of the multiplier is 1 then subtract the multiplicand from the product instead of adding it to the product.

Using Carry Save Adders: The number of iterations is reduced if each iteration examines several multiplier bits and adds several multiples of the multiplicand to the product simultaneously. A tree of carry save adders is used to quickly perform the many additions.

A carry save adder receives three numbers and partially adds them together; instead of producing their sum as one number it produces two numbers which have the same sum. An n-bit carry save adder is a set of n full adders. The full adders are not chained together; each full adder receives three bits, a, b, and c, and produces two bits, r and s, such that 2*r + s = a + b + c. If corresponding bits of three n-bit numbers, A, B, and C, are fed to the set of n full adders the set produces two n-bit numbers, R and S, such that 2*R + S = A + B + C. Number R is shifted left one place to produce a number R' with n+1 bits. R' = 2*R so R' + S = A + B + C. A carry save adder is very fast because it simply outputs the carry bits (as the number R') instead of propagating them to the left.
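A carry save add is just the bitwise parity and majority of its three inputs, as this sketch shows (csa is an illustrative name):

```python
def csa(a, b, c):
    """Carry save addition: returns (r, s) such that 2*r + s == a + b + c."""
    s = a ^ b ^ c                      # sum bit of each full adder
    r = (a & b) | (a & c) | (b & c)    # carry bit of each full adder
    return r, s

r, s = csa(9, 5, 3)
print(2 * r + s)   # 17 == 9 + 5 + 3
```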

A carry save adder (CSA) is shown as a block with three inputs and two outputs. Each input and output is a binary number.

   +-----+
-->|     |
   |     |-->
-->| CSA |
   |     |-->
-->|     |
   +-----+
To add K numbers together, feed them into a tree of K-2 carry save adders to produce two numbers with the same sum; then add those two numbers together with a carry lookahead adder. The delay is not much longer than the carry propagation delay in the carry lookahead adder. Nine numbers are summed in the following diagram:
   +-----+
-->|     |   +-----+
   |     |-->|     |
-->| CSA |   |     |   +-----+
   |     |-->|     |-->|     |
-->|     |   | CSA |   |     |
   +-----+   |     |   |     |   +-----+
             |     |-->|     |-->|     |
   +-----+   |     |   |     |   |     |
-->|     |-->|     |   | CSA |   |     |   +-------+
   |     |   +-----+   |     |   |     |-->|       |
-->| CSA |             |     |-->|     |   | Carry |
   |     |   +-----+   |     |   | CSA |   | Look- |--> 
-->|     |-->|     |   |     |   |     |   | ahead |
   +-----+   |     |-->|     |   |     |   | Adder |
             |     |   +-----+   |     |-->|       |
   +-----+   | CSA |             |     |   +-------+
-->|     |   |     |             |     |
   |     |-->|     |------------>|     |
-->| CSA |   |     |             +-----+
   |     |-->|     |
-->|     |   +-----+
   +-----+
The tree diagrammed above can be used to rapidly add eight multiples of the multiplicand to the product. The product of two 32-bit numbers can be computed in 4 iterations where each iteration examines 8 bits of the multiplier. A copy of the multiplicand (shifted appropriately) is fed into the tree for each 1-bit in the multiplier (zero is fed in for each 0-bit.) The ninth input of the tree receives the previous product and the tree produces the next product.

The fastest way to multiply is to examine all multiplier bits in parallel and feed all required multiples of the multiplicand into a large carry save adder tree. With today's technology a VLSI chip can easily hold the large number of full adders required by this scheme.

MIPS Multiplication: The product of two 32-bit numbers has 64 bits. Registers only have 32 bits so where is the product placed? Some computers place it in a pair of registers: if the destination specified in the instruction is $4, the product is placed in $4 and $5. MIPS has a different approach: store the product in a special 64-bit register and let the program access the left half, Hi, and/or the right half, Lo, with other instructions. Move from Hi (mfhi) copies the left half to a general register and Move from Lo (mflo) copies the right half. The multiplication instructions specify the sources but not the destination, another example of the maxim that good design demands compromise.


3.5 - Division

Division is harder than the other three basic arithmetic operations (addition, subtraction, and multiplication.) It's harder to learn in grammar school and it's harder to implement in a computer. The divide instruction is slower than the other three arithmetic instructions in almost every computer.

Division starts with two operands, a dividend and a divisor, and computes two quantities, a quotient and a remainder, such that the remainder is less than the divisor and:

(dividend) = (quotient) * (divisor) + (remainder).

A Decimal Example: Here we use long-division to divide decimal 61927 by decimal 469:

        132
    -------
469 ) 61927
    - 469    <-- 1 * 469
      ---
      1502
    - 1407   <-- 3 * 469
      ----
        957
      - 938  <-- 2 * 469
        ---
         19
The dividend is 61927 and the divisor is 469. Long-division found a quotient of 132 and a remainder of 19.

A Binary Example: Long-division with binary numbers is similar. Here we divide 100001010 by 10111:

             1011
      -----------
10111 ) 100001010
       - 10111    <-- first position of divisor
        ------
          101001
       -   10111  <-- divisor shifted right two places
          ------
           100100
       -    10111 <-- divisor shifted right one more place
           ------
             1101
The dividend is 100001010 (decimal 266) and the divisor is 10111 (decimal 23). Long-division found a quotient of 1011 (decimal 11) and a remainder of 1101 (decimal 13).

How was the first position of the divisor determined? We first tried to align the divisor (10111) with the leftmost five bits of the dividend but the divisor is larger than that part of the dividend (10000). We shifted the divisor right one place so it is less than or equal to dividend bits 100001. This alignment determines where the leftmost 1-bit of the quotient occurs.

Why is the next quotient bit zero? The first subtraction gave us a difference of 1010 and the next dividend bit was appended (brought down) to make it 10100. But 10100 was less than the divisor so the next quotient bit is 0 and another dividend bit was appended to get 101001. Since 101001 is larger than the divisor we now subtract the divisor again.

The second subtraction gives a difference 10010 and the last dividend bit was appended to make it 100100. Since 100100 is larger than the divisor the last quotient bit is 1 and the divisor was subtracted to get a remainder of 1101.

This example illustrated the basic steps a computer performs in a division operation:

  1. align the divisor under the leftmost bits of the dividend;

  2. compare the aligned divisor to the dividend bits above it;

  3. if those dividend bits are greater than or equal to the divisor, subtract it and generate a 1 quotient bit; otherwise generate a 0 quotient bit; and

  4. shift the divisor one place to the right and repeat until the last quotient bit is generated.

Register Sizes: A 32-bit register holds a natural number in the range of 0 through 2^32 - 1. If we use 32-bit registers for the divisor, the quotient, and the remainder how many bits should we allow for the dividend? The remainder must be less than the divisor so it is in the range of 0 through 2^32 - 2. Since:

(dividend) = (quotient) * (divisor) + (remainder)

the dividend is in the range of 0 through 2^64 - 2^32 - 1 so it should be held in a 64-bit register.

Division Overflow: The divisor is in a 32-bit register but its value can be very small. If a large dividend is divided by a small divisor the quotient may be too large for a 32-bit register so the division overflows.

Given the value of a divisor what's the largest dividend value that can be divided without overflow? The largest quotient is 2^32 - 1 and (largest remainder) = (divisor) - 1 so (largest dividend) = 2^32 * (divisor) - 1.

Before dividing, overflow can be checked by shifting the divisor 32 places to the left and comparing its value to the dividend: the dividend value should be less than the value of the divisor with this alignment. No dividend can be negative so this check also discovers divide-by-zero errors.
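That check can be sketched as follows (divide_overflows is a hypothetical helper name):

```python
def divide_overflows(dividend, divisor):
    """True if dividing a 64-bit dividend by a 32-bit divisor overflows a 32-bit quotient."""
    return dividend >= (divisor << 32)   # divisor shifted 32 places to the left

print(divide_overflows(2**32 * 7 - 1, 7))   # False: quotient just fits
print(divide_overflows(2**32 * 7, 7))       # True: quotient needs 33 bits
print(divide_overflows(1, 0))               # True: divide-by-zero is caught too
```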

The Basic Steps in One Iteration: Each iteration compares the dividend to the divisor with a certain alignment. If the dividend is less than the divisor then the next quotient bit is 0 and the dividend is not changed. Otherwise, if the dividend is greater than or equal to the divisor then the next quotient bit is 1 and the aligned divisor is subtracted from the dividend.

There are (at least) three different methods of performing these steps:

  1. Do the comparison but don't change the dividend until the next quotient bit is determined. Subtract the aligned divisor from the dividend only if the next quotient bit is 1.

  2. Subtract the aligned divisor from the dividend and then check the sign of the new dividend. If the new dividend is negative then generate a 0 quotient bit and add the aligned divisor back into the dividend to restore the original dividend. Otherwise, if the new dividend is non-negative then generate a 1 quotient bit and leave the dividend as is (because the subtraction has already been done.)

  3. Start the iteration with the aligned divisor already subtracted from the dividend. Check the dividend sign; the next quotient bit equals 0 if the dividend is negative or equals 1 if the dividend is non-negative. Then fix the dividend for the following iteration by adding half the aligned divisor to the dividend if the quotient bit is 0 or subtracting half the aligned divisor from the dividend if the quotient bit is 1.

Method 2 is called restoring division and method 3 is called non-restoring division. Very few computers use method 1.
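Method 1 can be sketched as a simple loop (an illustrative sketch on unsigned integers, not the register organization of the figures in the text):

```python
def divide(dividend, divisor, bits=32):
    """Unsigned long division: compare, subtract only when possible, emit a quotient bit."""
    quotient, remainder = 0, dividend
    for i in range(bits - 1, -1, -1):
        quotient <<= 1
        if remainder >= (divisor << i):   # aligned divisor fits under the dividend
            remainder -= divisor << i
            quotient |= 1                 # next quotient bit is 1
    return quotient, remainder

print(divide(266, 23))     # (11, 13), matching the binary example
print(divide(61927, 469))  # (132, 19), matching the decimal example
```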

Implementing Restoring-Division: Figure 3.10 in the text shows one scheme for doing restoring-division using a 64-bit ALU. One 64-bit register holds the dividend at the start and the remainder at the end. Another 64-bit register holds the 32-bit divisor: the divisor starts out in the left half of the register and is shifted right one place with each iteration. Another 32-bit register receives the quotient bits as they are developed: each iteration shifts the register left one place and enters a quotient bit at the right end.

Figure 3.11 shows the flow chart for division using this hardware. Note that the loop is performed 33 times and produces 33 quotient bits! The first iteration is really the overflow check: the first quotient bit equals 1 if and only if division overflows.

Figure 3.13 cuts the size of the ALU and the divisor register to 32 bits. Rather than shift the divisor to the right one place with each iteration the dividend (remainder) is shifted to the left one place.

The 64-bit remainder register in fig. 3.13 is initialized with the dividend but each iteration shifts it left one place. At the end of division the quotient will be found in the right half of the remainder register (and the remainder will be found in the left half.)

Signed Integer Division: Recall the fundamental equation for division:

(dividend) = (quotient) * (divisor) + (remainder).

Dividing +7 by +2 generates a quotient of +3 and a remainder of +1 to satisfy this equation. A quotient of +4 and a remainder of -1 also satisfies this equation but that solution is rejected because most programmers expect to see a non-negative remainder when a positive number is divided by a positive number.

Suppose -7 is divided by +2. One solution to the fundamental equation is a quotient of -3 and a remainder of -1. Another solution is a quotient of -4 and a remainder of +1. Which solution is best? Since (+7)/(+2) gives a quotient of +3 most programmers would expect to see a quotient of -3 for (-7)/(+2).

Now suppose +7 is divided by -2. One solution to the fundamental equation is a quotient of -3 and a remainder of +1. Another solution is a quotient of -4 and a remainder of -1. Again, since (+7)/(+2) gives a quotient of +3 most programmers would expect to see a quotient of -3 for (+7)/(-2).

Finally, suppose -7 is divided by -2. One solution to the fundamental equation is a quotient of +3 and a remainder of -1. Another solution is a quotient of +4 and a remainder of +1. Again, since (+7)/(+2) gives a quotient of +3 most programmers would expect to see a quotient of +3 for (-7)/(-2).

In all four cases the best solution to the fundamental equation generates a non-zero remainder whose sign is the same as the dividend sign. The general rule is: make the quotient negative if and only if the dividend and divisor have opposite signs, and give a non-zero remainder the same sign as the dividend.

Some computers don't follow this rule. The real quotient of (699)/(100) is 6.99 so a quotient of 7 (remainder = -1) is more accurate than a quotient of 6 (remainder = +99.) To get the integer quotient closest to the real quotient pick the solution whose remainder has the least absolute value.

A programmer can easily change the machine's rule by adding a bias to the dividend before division and then subtracting the bias from the remainder after division. The bias value usually depends on the divisor. A bias with half the magnitude of the divisor and the same sign as the dividend will change the general rule so the integer quotient closest to the real quotient is selected.

A useful exercise might be to program various cases on your favorite computer to see what quotients signed integer division produces. You may have to program the cases in assembly code because some compilers modify the machine's rule to fit the rule of the source language.

Implementing Signed Integer Division: One way to do signed integer division is: (1) negate the dividend if it is negative; (2) negate the divisor if it is negative; (3) use the unsigned integer division algorithm to get a quotient and a remainder; (4) negate the quotient if the original dividend and divisor had opposite signs; and (5) negate the remainder if the original dividend was negative.
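That five-step recipe can be sketched as follows (signed_divide is a made-up helper; Python's // truncates toward negative infinity, so the unsigned step works on absolute values):

```python
def signed_divide(dividend, divisor):
    """Signed division: quotient truncates toward zero, remainder takes the dividend's sign."""
    q = abs(dividend) // abs(divisor)        # steps 1-3: divide the magnitudes
    r = abs(dividend) - q * abs(divisor)
    if (dividend < 0) != (divisor < 0):      # step 4: opposite signs negate the quotient
        q = -q
    if dividend < 0:                         # step 5: negative dividend negates the remainder
        r = -r
    return q, r

print(signed_divide(-7, 2))   # (-3, -1)
print(signed_divide(7, -2))   # (-3, 1)
```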

Signed integer division can also be performed using two's complement arithmetic and modifying the unsigned algorithm appropriately. For example, each iteration selects a quotient bit by comparing the dividend sign to the divisor sign.

MIPS Division: Like the multiply instructions, the divide instructions don't specify a destination. The remainder is left in Hi and the quotient is left in Lo.


3.6 - Floating Point

Many numerical operands are not integers; e.g., 1.5, 23.67, 0.0021, pi = 3.141592653589793238..., e = 2.7182818284590... A real number may or may not have an integer value.

Early computers could only handle integers. How was pi stored in ENIAC? The closest integer to pi is 3 but that's a very poor approximation. Since an ENIAC register held ten decimal digits, pi was stored in a register as the closest integer to pi * 10^9, 3141592654, and the programmer had to remember its scale factor, 10^9.

A variable, x, with a maximum value of 14.1421 would be stored in an ENIAC register with a scale factor of 10^8 to give it the maximum precision with no danger of overflow (the maximum register value is 1414210000.)

ENIAC code for y = pi + x would be: a copy of pi's register into y's register; a shift of y's register one digit to the right (since pi's scale factor is ten times the scale factor of x); and an add of x's register to y's register. The maximum value of pi + x is 17.28369265 so the addition can't overflow and the scale factor for y is 10^8.

Good estimates of the ranges of the real variables were needed so they could be scaled for maximum precision with no danger of overflow. Overflow bugs were common and difficult to fix. To reduce the scale factor of a variable because it overflowed, a programmer would have to change the code in every place that defined that variable or used that variable.

An early example of system software was a package of subroutines to perform floating point arithmetic. Precision was reduced so a scale factor could be stored with the value of each real operand. The add subroutine, for example, examined the scale factors of its source operands to shift their values into alignment before summing and then shifted the sum for maximum precision (normalization.)

Even though they sacrificed precision and speed, floating-point subroutine packages were popular in early computers because they eliminated overflow bugs. Later computers included floating-point hardware to regain speed (precision is still sacrificed.)

In a binary computer, the value of a floating-point number is:

(-1)^S * 2^E * F

where the sign bit, S, is 0 or 1, the exponent, E, is a signed integer, and the significand, F, is a non-negative real number less than some constant. Other names for F are fraction or mantissa.

It's best if the length of a floating-point number is compatible with the machine word size; the MIPS computer has 32-bit single precision numbers and 64-bit double precision numbers. The format of a floating-point number has three fields to store S, E, and F. A single bit suffices for the sign bit, S. How long are the fields for E and F, respectively?

The length of the field storing the exponent, E, dictates the range of floating-point numbers. To show this, assume the field for E is only five bits long. E ranges from -16 to +15 and the magnitude of a floating-point number is less than 2^15 = 32768 (if F is less than unity.) Many programs have operands larger than 32768 so overflows will occur frequently; a 5-bit field for E is much too short.

The length of the field storing the significand, F, dictates the precision of floating-point numbers. For example, if the field for F is only ten bits long then F can only be stored with a precision of ten bits which is about three decimal digits. Any user expects a computer to be able to compute numbers with much higher precision so clearly the F field must be much longer than 10 bits.

If the total length of a floating-point number is fixed then each bit added to the field for F comes from the field for E. Should a designer pick a long field for F (for high precision) or a long field for E (to minimize overflow occurrences)? Before the 1980s machine designers made many different choices. Now, almost all computers (including the MIPS computer) follow the IEEE 754 floating-point standard.

MIPS Single Precision Numbers: The significand, F, is less than two and the format of a non-zero normalized MIPS 32-bit single precision number is:

  S     E+127              F-1
1 bit  8 bits   <----- 23 bits ----->

A single precision zero has 0-bits in all 32 places:

  0   00000000  00000000000000000000000
1 bit  8 bits   <----- 23 bits ----->

The exponent, E, of a non-zero number is in the range of -126 to +127 but a bias of 127 is added to E. The 8-bit exponent field really stores E+127 as a natural number in the range of 1 to 254. Biasing the exponent means that a single precision zero has an exponent of -127.

To preserve precision, every floating-point operation normalizes its results so the significand, F, of any non-zero result is at least unity and less than two. There is always a 1-bit in the leftmost position of F so it doesn't need to be stored, allowing a 24-bit significand to be stored in a 23-bit field. The number stored in the significand field of a normalized non-zero number is really F - 1 since the hidden leftmost bit has a weight of unity. The precision is 24 bits (about 7.2 decimal digits.)
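Decoding a normalized bit pattern follows directly from the format (decode_single is a hypothetical helper for illustration; it does not handle zero, infinity, NaN, or de-normalized numbers):

```python
def decode_single(word):
    """Value of a normalized non-zero 32-bit single precision pattern."""
    s = (word >> 31) & 1                  # sign bit
    e = ((word >> 23) & 0xFF) - 127       # remove the bias of 127
    f = 1 + (word & 0x7FFFFF) / 2**23     # restore the hidden 1-bit
    return (-1)**s * 2**e * f

print(decode_single(0x3FC00000))   # 1.5
print(decode_single(0xC0000000))   # -2.0
```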

What's the largest number that can be stored? The bit sequence for the most positive number is:

  0   11111110  11111111111111111111111
1 bit  8 bits   <----- 23 bits ----->

Changing the sign bit to 1 gives us the bit sequence for the most negative number. The 8-bit exponent field contains decimal 254 so the exponent, E, equals 127. The 23-bit significand field has 23 1-bits so adding the hidden 1-bit makes F = 2 - 2^-23. The most positive number that can be stored is (2 - 2^-23) * 2^127 or about 3.4 * 10^38. The most negative number has the same magnitude. Very few good operands exceed these magnitudes so the most likely cause of an overflow is a programming bug.

One way to treat an overflow is to substitute infinity which has the following bit sequence:

  0   11111111  00000000000000000000000
1 bit  8 bits   <----- 23 bits ----->

An operand set to infinity has some unknown magnitude greater than about 3.4 * 10^38. Another special case is Not a Number or NaN with the bit sequence:

  0   11111111  (not all 0-bits)
1 bit  8 bits   <----- 23 bits ----->

An operand set to NaN has an unknown magnitude which may be any value: large or small. For example: infinity divided by infinity equals NaN.

What's the least positive normalized number that can be stored? The bit sequence is:

  0   00000001  00000000000000000000000
1 bit  8 bits   <----- 23 bits ----->

The exponent is 1 - 127 = -126 and the significand is 1 so the value is 2^-126 or about 1.2 * 10^-38. What happens with a positive number with a lesser value? The simplest scheme is to change it to zero. An option in the IEEE standard is to store it as a de-normalized number with an all-zero exponent field and a non-zero significand field. A negative number very close to zero is treated in a similar manner.

MIPS Double Precision Numbers: The significand, F, is at least unity and less than two, and the format of a non-zero normalized MIPS 64-bit double precision number is:

S  E+1023  F-1
1 bit  11 bits  <----- 52 bits ----->

A double precision zero has 0-bits in all 64 places. The exponent, E, of a non-zero number is in the range of -1022 to +1023 but a bias of 1023 is added to E. The 11-bit exponent field really stores E+1023 as a natural number in the range of 1 to 2046. Biasing the exponent means that a double precision zero has an exponent of -1023.

The significand, F, is at least unity and less than two. The precision of F is 53 bits (almost 16 decimal digits) and it is stored in a 52-bit field by hiding its leftmost bit which is always a 1-bit.
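The 53-bit precision is easy to demonstrate (an aside, not part of the notes): Python floats are IEEE double precision, so 2^53 is the first integer whose successor cannot be represented exactly:

```python
# Python floats are IEEE double precision, so the significand holds 53 bits.
x = 2.0**53
print(x + 1 == x)    # True: the added 1 is lost below the 53-bit precision
print(x - 1 != x)    # True: 2**53 - 1 still fits in 53 bits
```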

The bit sequence for the most positive number is:

0 11111111110 111 . . . 111
1 bit  11 bits  <----- 52 bits ----->

Changing the sign bit to 1 gives us the bit sequence for the most negative number. The 11-bit exponent field contains decimal 2046 so the exponent, E, equals 1023. The significand field has 52 1-bits so adding the hidden 1-bit makes F = 2 - 2^-52. The most positive number that can be stored is (2 - 2^-52) * 2^1023 or about 1.8 * 10^308. The most negative number has the same magnitude. The least positive normalized number is 2^-1022 or about 2.2 * 10^-308.
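These extreme double precision values can be checked against what Python itself reports (an aside, not part of the notes; Python floats are IEEE doubles):

```python
import sys

# The derived extremes should match the machine's reported limits.
print((2 - 2**-52) * 2.0**1023 == sys.float_info.max)    # True
print(2.0**-1022 == sys.float_info.min)                  # True
```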

Double precision infinity has the following bit sequence:

0 11111111111 000 . . . 000
1 bit  11 bits  <----- 52 bits ----->

An operand set to infinity has some unknown magnitude greater than about 1.8 * 10^308. The other special case, NaN, has the bit sequence:

0 11111111111 not all 0-bits
1 bit  11 bits  <----- 52 bits ----->

Floating-Point Addition/Subtraction: Subtraction is performed by complementing the sign bit of the subtrahend and adding it to the minuend. Adding two floating-point numbers, A and B, takes four basic steps:

  1. Alignment: Compare the exponents of A and B. Let the operand with the greater exponent be G and the operand with the lesser exponent be L (the choice is arbitrary if the exponents are equal.)

    Subtract the exponent of L from the exponent of G to get the exponent difference, D.

    Add the hidden bits to the significands of G and L (a hidden bit is a 1-bit if the operand is normalized.)

    Shift the significand of L to the right D bit positions to align it properly with the significand of G.

  2. Addition: Compare the sign bits of G and L.

    If the sign bits agree then add the aligned significand of L to the significand of G.

    If the sign bits disagree then subtract the aligned significand of L from the significand of G. If the difference is negative then negate the difference to make it positive and complement the sign bit of G.

  3. Normalization: The result is G but its significand may be out of range.

    If the significand of G is greater than or equal to two then shift the significand one bit position to the right and increment the exponent of G by unity.

    If the significand of G is non-zero but less than unity then shift it to the left until it is greater than or equal to unity. Decrement the exponent of G by the number of bit positions the significand is shifted to the left.

    If the significand of G equals zero then the result is zero so clear the sign bit and exponent fields of G to zero.

    The exponent of G may be out of range after incrementing or decrementing. If the exponent is too large then change G to infinity. If the exponent is too small then change G to zero.

  4. Rounding: The significand of G may be longer than the limit (24 bits for single precision or 53 bits for double precision.) The simplest method is truncation: just drop the extra bits. The usual rounding method is round-to-nearest: add the leftmost bit that won't be stored to the bit on its left (the rightmost bit that will be stored) and then drop the extra bits.

    In the worst case, the significand could be rounded up to two. In this case, G must be re-normalized by shifting the significand one bit place to the right and incrementing the exponent by unity. Incrementing the exponent may make it too large so G could be changed to infinity.
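The first three steps can be sketched in Python on hypothetical (sign, exponent, significand) triples, where the significand is carried as a 24-bit integer (hidden bit included) scaled by 2^-23. This is a simplification: zero operands, overflow/underflow to infinity/zero, denormalized numbers, and the rounding of step 4 are omitted (bits shifted out during alignment are simply truncated):

```python
def fp_add(a, b):
    """Add two floats given as (sign, exponent, 24-bit significand) triples."""
    (sa, ea, fa), (sb, eb, fb) = a, b
    # Step 1: alignment - G gets the greater exponent, L the lesser.
    if (ea, fa) >= (eb, fb):
        (sg, eg, fg), (sl, el, fl) = (sa, ea, fa), (sb, eb, fb)
    else:
        (sg, eg, fg), (sl, el, fl) = (sb, eb, fb), (sa, ea, fa)
    fl >>= eg - el                # shift L right by the exponent difference
    # Step 2: add or subtract the aligned significands.
    if sg == sl:
        fg += fl
    else:
        fg -= fl
        if fg < 0:
            fg, sg = -fg, 1 - sg  # negate the difference, flip the sign of G
    # Step 3: normalization.
    if fg == 0:
        return (0, 0, 0)          # the result is zero
    while fg >= 2**24:            # significand >= 2: shift right, bump exponent
        fg >>= 1
        eg += 1
    while fg < 2**23:             # significand < 1: shift left, drop exponent
        fg <<= 1
        eg -= 1
    return (sg, eg, fg)

# 1.5 is (0, 0, 12582912) since 1.5 * 2**23 = 12582912:
print(fp_add((0, 0, 12582912), (0, 0, 12582912)))  # (0, 1, 12582912), i.e. 3.0
```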

Figure 3.17 in the text shows the hardware for floating-point addition/subtraction.

Examples of floating-point arithmetic are much simpler when significand fields are very short. The following examples use the MIPS single precision format with the significand field cut to 7 bits:

S  E+127  F-1
1 bit  8 bits  7 bits

The following examples are left as exercises. With a short 7-bit significand field the last example has a small round-off error.

Floating-Point Multiplication: Let A, B, and C be floating-point numbers where C is the product of A and B. The sign bit, exponent field, and significand field of C are determined as follows: the sign bit of C is the exclusive-OR of the sign bits of A and B; the exponent of C is the sum of the exponents of A and B (subtracting the bias once so it isn't counted twice in the biased exponent fields); and the significand of C is the product of the significands of A and B, normalized and rounded as in addition/subtraction.

For an example we let A = decimal 16.5 and B = 15.75 and use the same format as the addition/subtraction examples above. The operands are the same as for the first addition/subtraction example so the bit sequence for A is 0 10000011 0000100 and the bit sequence for B is 0 10000010 1111100.
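These bit sequences can be reproduced with a small helper (hypothetical, not part of the notes) that encodes a value in the shortened 1/8/7-bit format; rounding overflow of the 7-bit field is ignored since both operands here are exact:

```python
def encode_short(x):
    """Encode x as sign bit, 8-bit biased exponent, 7-bit significand field."""
    sign = "0" if x >= 0 else "1"
    x, e = abs(x), 0
    while x >= 2:                   # normalize so that 1 <= x < 2
        x, e = x / 2, e + 1
    while x < 1:
        x, e = x * 2, e - 1
    frac = round((x - 1) * 2**7)    # the stored field is F - 1, in 7 bits
    return f"{sign} {e + 127:08b} {frac:07b}"

print(encode_short(16.5))    # 0 10000011 0000100
print(encode_short(15.75))   # 0 10000010 1111100
```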

The following examples are left as exercises. With a short 7-bit significand field all examples have round-off errors.
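The multiplication rule can be sketched in Python on hypothetical (sign, exponent, significand) triples, with the significand carried as a 24-bit integer (hidden bit included) scaled by 2^-23; rounding and overflow handling are omitted:

```python
def fp_mul(a, b):
    """Multiply two floats given as (sign, exponent, significand) triples."""
    (sa, ea, fa), (sb, eb, fb) = a, b
    s = sa ^ sb              # the product is negative iff the signs disagree
    e = ea + eb              # exponents add
    f = (fa * fb) >> 23      # 48-bit product scaled back down (truncated)
    if f >= 2**24:           # product significand in [2, 4): normalize
        f >>= 1
        e += 1
    return (s, e, f)

# 16.5 = 1.03125 * 2**4 and 15.75 = 1.96875 * 2**3:
a = (0, 4, int(1.03125 * 2**23))
b = (0, 3, int(1.96875 * 2**23))
print(fp_mul(a, b))    # (0, 8, 8515584): 8515584 * 2**(8 - 23) = 259.875
```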

Floating-Point Division: Floating-point division is not discussed in the text. As an exercise try to outline its basic steps.


3.8 - Fallacies and Pitfalls

The text describes the infamous floating-point divide problem in the first Pentium chips. Some of the most significant bits of the divisor and the dividend are fed into a lookup table to guess the next quotient digit, worth two bits: -2, -1, 0, +1, or +2. Any error in the guess is corrected on a subsequent pass.

Unfortunately five entries in the lookup table were wrong - the first 11 quotient bits were always correct but bits 12 through 52 might have errors.

Intel discovered the bug in July 1994 but it would take several months to change, reverify, and put the corrected chip into production. They decided that the error was so insignificant that they would just go ahead and produce bad chips until they got a good chip design in January 1995. A big mistake!

In September 1994, Prof. Nicely at Lynchburg College in Virginia discovered the bug, called Intel technical support, and, when he got no official reaction, posted his discovery on the Web.

In November 1994 Electronic Engineering Times put the story on its front page and other newspapers picked it up. Intel issued a press release calling the problem a glitch that would affect only a few dozen users. Customers were told to describe their application to Intel and Intel would decide whether or not they needed a new chip without the bug.

In December 1994 IBM stopped shipping PCs with the Pentium and Intel finally issued a press release apologizing for the way they handled the problem and offering to replace any bad chip with a good chip for any owner that requests it. It's estimated that this recall cost Intel $300 million.

Apparently, Intel learned a lesson. In April 1997 a bug was found in the Pentium Pro and Pentium II microprocessors but Intel publicly acknowledged the problem and offered a software patch to get around it.


Kenneth E. Batcher - 8/27/2004