#### **Cray-1 History**

- Jan 1978 CACM article says there are only 12 non-Cray-1 vector processors worldwide:
  - Illiac IV is the most powerful processor
  - TI ASC (7 installations) is the most populous
  - CDC STAR 100 (4 installations) is the most publicized
- Recent report says the Cray-1 is more powerful than any of its competitors
  - 138 MFLOPS for sustained periods
  - 250 MFLOPS for short bursts
- Features: chaining (access intermediate results w/o memory references), small size (allows 12.5 ns clock = 80 MHz), memory with 1M 64-bit words
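
The clock and MFLOPS figures above can be related with a little arithmetic. The sketch below is only an illustration; the helper names `clock_mhz` and `results_per_cycle` are made up for this example.

```python
CLOCK_PERIOD_NS = 12.5  # clock period quoted on the slide


def clock_mhz(period_ns: float) -> float:
    """Convert a clock period in nanoseconds to a frequency in MHz."""
    return 1_000.0 / period_ns


def results_per_cycle(mflops: float, period_ns: float) -> float:
    """Average floating-point results per clock implied by an MFLOPS rate."""
    return mflops / clock_mhz(period_ns)


freq = clock_mhz(CLOCK_PERIOD_NS)                   # 80 MHz
sustained = results_per_cycle(138, CLOCK_PERIOD_NS)  # about 1.7 results per clock
burst = results_per_cycle(250, CLOCK_PERIOD_NS)      # about 3.1 results per clock
print(freq, sustained, burst)
```

Both rates work out to more than one result per clock, which is only possible when several functional units deliver results in the same cycle.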

Fall 2000, Lecture 35

# **Cray-1 Architecture**

- Computer architecture
  - 12 I/O channels, 16 memory banks, 12 functional units, 4KB of register storage
  - Only 4 chip types
  - Fast main memory, fast computation
- 4 chip types
  - 16x4 bit register chips (6 ns)
  - 1024x1 bit memory chips (50 ns)
  - Simple low-speed and high-speed logic gate chips (2 types), each with both a 5-wide and a 4-wide gate (5/4 NAND)
- Fabrication
  - 6"x8" printed circuit boards
  - ICs in 16-pin packages, up to 288 packages per board to build 113 different module types, up to 72 modules per 28 inch high chassis

#### **Cray-1 Physical Architecture**

- Physical architecture
  - "World's most expensive love-seat"
  - Cylindrical, 8.5' in diameter (seat), 4.5' in diameter (tower), 6.5' tall (tower)
  - Composed of 12 wedge-like columns in a 270° arc; a "reasonably trim individual" can get inside to work
  - "Love seat" hides power supplies and plumbing for Freon cooling system
- Freon cooling system
  - In each chassis are vertical cooling bars lining each wall
  - Freon is pumped through a stainless steel tube inside an aluminum casing
  - Modules have a copper heat transfer plate that attaches to the cooling bars
  - A 70°F tube temperature corresponds to 130°F at the center of a module

# **Cray-1 Architecture (cont.)**

- Memory (16 banks, 72 modules / bank)
  - 64 modules, each supplying 1 bit of the 64-bit word
  - 8 modules supplying the check byte for single-bit error correction and double-bit error detection
- Functional units
  - 12 pipelined functional units in 4 groups: address, scalar, vector, and floating point
  - Scalar add = 3 cycles, vector add = 3 cycles, floating-point add = 6 cycles, floating-point multiply = 7 cycles, reciprocal approximation = 14 cycles
- Instruction formats
  - Either one or two 16-bit "parcels"
  - Arithmetic and logical instructions operate on 3 registers
  - Read & store instructions access memory
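
The 64 data bits plus 8 check bits in the memory bullets above match the shape of an extended Hamming code (7 Hamming parity bits plus 1 overall parity bit). The sketch below shows that generic technique. It is an assumption that it resembles the Cray-1's actual check-bit layout, and the names `encode` and `decode` are made up for this example.

```python
PARITY_POS = [1, 2, 4, 8, 16, 32, 64]              # Hamming parity positions
DATA_POS = [p for p in range(1, 72) if p not in PARITY_POS]  # 64 data positions


def encode(word: int) -> list[int]:
    """Encode a 64-bit word into 72 bits; index 0 holds the overall parity."""
    code = [0] * 72
    for i, pos in enumerate(DATA_POS):
        code[pos] = (word >> i) & 1
    for p in PARITY_POS:
        # Each parity bit makes its group (positions with bit p set) even.
        code[p] = sum(code[pos] for pos in range(1, 72) if pos & p) % 2
    code[0] = sum(code) % 2  # overall parity over positions 1..71
    return code


def decode(code: list[int]) -> tuple[int, str]:
    """Return (word, status); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 72):
        if code[pos]:
            syndrome ^= pos  # XOR of set positions is 0 for a valid codeword
    overall = sum(code) % 2
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1 and syndrome < 72:
        code[syndrome] ^= 1  # single-bit error: the syndrome names the position
        status = "corrected"
    else:
        status = "uncorrectable"  # e.g. a double-bit error: detected, not fixed
    word = sum(code[pos] << i for i, pos in enumerate(DATA_POS))
    return word, status


word = 0x0123456789ABCDEF
stored = encode(word)
flipped = list(stored)
flipped[37] ^= 1  # simulate one faulty memory bit
print(decode(stored)[1], decode(flipped)[1])
```

Flipping any one of the 72 bits yields status `corrected` with the original word restored, while flipping two bits yields `uncorrectable`.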

## **Cray-1 Registers**

- Registers
  - 8 address registers (A), 64 address-save registers (B), 8 scalar registers (S), 64 scalar-save registers (T), & 8 64-word vector registers (V)
- 8 24-bit address registers (A)
  - Used as address registers for memory references and as index registers
  - Index the base register for scalar memory references, provide base address and index for vector memory references
  - 24-bit integer address functional units (add, multiply) operate on A data
- 64 24-bit address-save registers (B)
  - Used to store contents of A registers

■ A vector operation operates on either two vector registers, or one vector register and one scalar register

#### ■ Chaining

- Parallel vector operations may be processed two ways:
  - Using different functional units and V registers, or
  - By chaining using the result stream from one vector register simultaneously as the operand set for another operation in a different functional unit
    - Intermediate results do not have to be stored in memory, and can even be used before a particular vector operation has finished
    - Similar to data forwarding in the IBM 360's pipeline
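
A rough timing sketch of why chaining helps, using the floating-point multiply (7 cycles) and add (6 cycles) latencies from the functional-unit list. The timing model is an assumption for illustration, not a cycle-accurate account of the Cray-1: a unit delivers its first result after its latency, then one result per clock, with no other startup overhead.

```python
def unchained_cycles(n: int, latencies: list[int]) -> int:
    """Each operation waits for the previous one to finish all n elements."""
    return sum(lat + n - 1 for lat in latencies)


def chained_cycles(n: int, latencies: list[int]) -> int:
    """Each unit consumes the previous unit's result stream as it appears."""
    return sum(latencies) + n - 1


N = 64                 # elements in one vector register
LATENCIES = [7, 6]     # floating-point multiply feeding a floating-point add

print(unchained_cycles(N, LATENCIES))  # 139
print(chained_cycles(N, LATENCIES))    # 76
```

Under this model the chained pair finishes in roughly half the time, because the add starts as soon as the first product leaves the multiplier.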

## Cray-1 Registers (cont.)

- 8 64-bit scalar registers (S)
  - Used in scalar operations
  - 64-bit integer scalar functional units (add, shift, logical, population/leading zero count) operate on S data
- 64 64-bit scalar-save registers (T)
  - Used to store contents of S registers, typically intermediate results of complex computations
- 8 64-element vector registers (V)
  - Each element is 64 bits wide
  - Each register can contain a vector of data (row of a matrix, etc.)
  - Vector Mask register (VM) controls elements to be accessed, Vector Length register (VL) specifies number of elements to be processed
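
A simplified model of how VL and VM affect a vector operation. The function names, the choice of bit `e` of a Python integer as the mask bit for element `e`, and the merge semantics are assumptions made for this sketch, not a statement of the Cray-1 instruction set.

```python
VECTOR_LENGTH_MAX = 64
MASK64 = (1 << 64) - 1  # elements are 64 bits wide


def vector_add(vi, vj, vk, vl):
    """Vi = Vj + Vk for the first VL elements; the rest of Vi is unchanged."""
    assert 0 <= vl <= VECTOR_LENGTH_MAX
    out = list(vi)
    for e in range(vl):
        out[e] = (vj[e] + vk[e]) & MASK64
    return out


def vector_merge(vi, vj, vk, vl, vm):
    """For the first VL elements, take Vj where the VM bit is 1, else Vk."""
    assert 0 <= vl <= VECTOR_LENGTH_MAX
    out = list(vi)
    for e in range(vl):
        out[e] = vj[e] if (vm >> e) & 1 else vk[e]
    return out


vj = list(range(64))
vk = [100] * 64
vi = [0] * 64

print(vector_add(vi, vj, vk, vl=4)[:5])                # [100, 101, 102, 103, 0]
print(vector_merge(vi, vj, vk, vl=4, vm=0b0101)[:5])   # [0, 100, 2, 100, 0]
```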

## **Handling Data Hazards**

■ Write / read data hazard example (inst 1: `ADD R2, R3, R4`; inst 2: `ADD R1, R2, R6`):

| Step 1 | Step 2 | Step 3 | Step 4 | Step 5 |
|--------|--------|--------|--------|--------|
| fetch inst 1 | fetch inst 2 | | | |
| | get R3,R4 | get <b>R2</b>,R6 | | |
| | | add R3,R4 | add <b>R2</b>,R6 | |
| | | | store into R2 | store into R1 |

■ Can be avoided with *register interlocks* 



■ Can also be avoided with data forwarding

| Step 1 | Step 2 | Step 3 | Step 4 | Step 5 |
|--------|--------|--------|--------|--------|
| fetch inst 1 | fetch inst 2 | | | |
| | get R3,R4 | get sum,R6 | | |
| | | add R3,R4 | add sum,R6 | |
| | | | store into R2 | store into R1 |

## **Handling Data Hazards (cont.)**

#### ■ Register interlocks

- An instruction gets blocked until <u>all</u> its source registers are loaded with the appropriate values by earlier instructions
- A "valid / invalid" bit is associated with each register
  - During decode stage, destination register is set to invalid (it will change)
  - Decode stage blocks until all its source (and destination) registers are valid
  - Store stage sets destination register to valid
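
The valid / invalid bit scheme above can be sketched as a tiny scoreboard. The class and method names are hypothetical, chosen for this illustration only.

```python
class Scoreboard:
    """One valid bit per register, as in the interlock bullets above."""

    def __init__(self, num_registers: int) -> None:
        self.valid = [True] * num_registers

    def can_decode(self, dest: int, sources: tuple[int, ...]) -> bool:
        # Decode proceeds only if all source and destination registers are valid.
        return all(self.valid[r] for r in (*sources, dest))

    def decode(self, dest: int, sources: tuple[int, ...]) -> bool:
        """Try to decode; on success mark the destination invalid."""
        if not self.can_decode(dest, sources):
            return False  # instruction is blocked this cycle
        self.valid[dest] = False
        return True

    def store(self, dest: int) -> None:
        """Store stage: the destination now holds its new value."""
        self.valid[dest] = True


sb = Scoreboard(8)
first = sb.decode(dest=2, sources=(3, 4))    # ADD R2, R3, R4 proceeds
blocked = sb.decode(dest=1, sources=(2, 6))  # ADD R1, R2, R6 must wait for R2
sb.store(2)                                  # first instruction writes R2
retry = sb.decode(dest=1, sources=(2, 6))    # now it can proceed
print(first, blocked, retry)  # True False True
```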

#### ■ Data forwarding

- Output of ALU is connected directly to ALU input buses
- Result of an ALU operation is now available <u>immediately</u> to later instructions (i.e., even before it gets stored in its destination register)
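
A small timing sketch comparing the two approaches on the hazard example (`ADD R2, R3, R4` followed by `ADD R1, R2, R6`). It assumes a 4-stage in-order pipeline (fetch, get, add, store) with one cycle per stage, and maps the decode-stage blocking described above onto the "get" stage. The function `schedule` is made up for this illustration.

```python
def schedule(instructions, forwarding):
    """Return the cycle in which each stage of each instruction runs."""
    ready = {}            # register -> first cycle its new value is usable
    prev_get = prev_add = 0
    timeline = []
    for n, (dest, sources) in enumerate(instructions):
        fetch = n + 1
        operands_ready = max((ready.get(r, 0) for r in sources), default=0)
        get = max(fetch + 1, prev_get + 1)
        if not forwarding:
            get = max(get, operands_ready)  # interlock: wait for the register
        add = max(get + 1, prev_add + 1)
        if forwarding:
            add = max(add, operands_ready)  # ALU output fed straight back in
        store = add + 1
        # With forwarding the result is usable right after the add stage;
        # otherwise only after it has been stored in the register.
        ready[dest] = add + 1 if forwarding else store + 1
        prev_get, prev_add = get, add
        timeline.append({"fetch": fetch, "get": get, "add": add, "store": store})
    return timeline


program = [("R2", ("R3", "R4")), ("R1", ("R2", "R6"))]
interlocked = schedule(program, forwarding=False)
forwarded = schedule(program, forwarding=True)
print(interlocked[1]["store"], forwarded[1]["store"])  # 7 5
```

With forwarding, the second instruction stores in step 5, which matches the forwarding table above; with interlocks alone it waits for R2 to be stored and finishes two steps later under this model.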

# Cray-2 & Cray-3

- Cray Research (Steve Chen) continued to update the Cray-1 with improved technologies: X-MP, Y-MP, etc.
- Cray-2 (Seymour Cray) 1985
  - A 4-way multiprocessor with vectors
  - Whole machine immersed in Fluorinert (artificial blood substitute)
  - 4.1 ns cycle time (3x faster than Cray-1)
  - DRAM memory (instead of SRAM), highly interleaved since DRAM is slower
- Cray-3 (Seymour Cray) 1993
  - Replace the "C" shape with a cube so all signals take the same time to travel
  - Spun off to Cray Computer in 1989
  - Supposed to have 16 processors, had 1 with a 2 ns cycle time

#### **Miscellaneous**

#### ■ Evolution

- Seymour Cray was a founder of Control Data Corp. (CDC) and principal architect of CDC 1604 (non-vector machines)
- The 8600 at CDC was to be made of tightly coupled multiprocessors; it was cancelled, so Cray left to form another company

#### ■ Software

- Cray Operating System (COS): up to 63 jobs in a multiprogramming environment
- Cray Fortran Compiler (CFT) optimizes Fortran IV (1966) for the Cray-1
  - Automatically vectorizes many loops that manipulate arrays
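
One common way to map an array loop onto 64-element vector registers is strip-mining: splitting the loop into chunks of at most 64 elements, each handled by one vector operation. The sketch below illustrates that idea only. It is not a description of how the Cray compiler actually generates code, and the function names are made up.

```python
def strip_mine(n: int, max_vl: int = 64):
    """Yield (start, vl) pairs covering n elements in chunks of at most max_vl."""
    start = 0
    while start < n:
        vl = min(max_vl, n - start)
        yield start, vl
        start += vl


def vectorized_add(a, b):
    """c[i] = a[i] + b[i], processed one vector-register-sized chunk at a time."""
    c = [0] * len(a)
    for start, vl in strip_mine(len(a)):
        # Each chunk stands in for one vector add with VL set to vl.
        chunk_a = a[start:start + vl]
        chunk_b = b[start:start + vl]
        c[start:start + vl] = [x + y for x, y in zip(chunk_a, chunk_b)]
    return c


print(list(strip_mine(150)))  # [(0, 64), (64, 64), (128, 22)]
print(vectorized_add([1] * 150, [2] * 150) == [3] * 150)  # True
```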

#### ■ Front-end computer

- Any computer, such as a Data General Eclipse or IBM 370/168
