The Parallel Computing Ceiling
Jerry L. Potter
Department of Computer Science
potter@cs.kent.edu
1.0 Introduction
Since the beginnings of computing, the parallel computing
ceiling, as illustrated in Figure 1, has limited the number of processors that
can be effectively utilized simultaneously.
Many parallel applications do not hit the parallel ceiling because they
use limited resources and operate in the near linear portion of the curve. Advances in hardware cost reduction and speed
improvements have provided substantially more throughput so that over time
there is apparent progress. The result
is the impression that parallel computation can, in general, be extended
forever on a linear or near linear projection. Thus it is assumed that since cluster hardware is plentiful, it is just a matter of figuring out how to program it to
achieve supercomputer performance. Indeed some
researchers have reported algorithms that have broken the parallel ceiling,
even exceeding it in a superlinear manner using
statistical approaches. Statistical
approaches work well for average cases, but do not address real time or worst case
situations.

Figure 1 – The Parallel Ceiling – Throughput versus Number of Processors

Other applications have been parallelized using a minimum of communication between parallel components, which accounts for the near-linear algorithms that extend the graph to the right. But most of today's remaining applications, such as data mining, air traffic control, and Web routing, are proving very difficult to conquer even with large numbers of processors.
One common thread that exists among these
demanding applications is the need for a dynamic real-time database. A dynamic real-time database is in essence a
single global resource shared by all processors. In order for multiple
processors to work on the same resource, they must be synchronized. This
introduces a complicating communication component not present in any sequential
algorithm. Indeed, it has been shown
that real time dynamic scheduling, also encumbered with synchronization
overhead, is
intractable [1] and that some tasks such as Air Traffic Control (ATC) are NP
hard [2].
2.0 Single Global Address Space
In order to search databases efficiently, they must be sorted; this makes them a single global resource. Sorting is a sequential task. It cannot be done completely in
parallel. Even a hybrid system whereby
the individual packets of data are sorted in parallel on separate machines, and
then only globally sorted when necessary is unacceptable for a large dynamic
real-time database. Normally, all would work well, but in the worst case, when new data is added and the processor to
which it is assigned is full, the overflow must be cascaded to one or more of
its neighbors, and the sort becomes sequential.
No matter how large a portion can be done in parallel, some portion must be done sequentially. By Amdahl's law [3], it is this sequential portion that ultimately dominates parallel sorting algorithms.
Moreover, the transition from many
parallel tasks to a sequential one adds complexity to the original sequential
algorithm. It creates a
parallel-to-sequential reduction bottleneck that must be resolved by introducing communication
overhead. This communication overhead
has O(n²) complexity [4], where n is the
number of processors. Thus at some point
adding more processors to solve an intractable problem makes it worse not
better as illustrated by Figure 1. That
is, when the increased cost of the overhead complexity exceeds the savings
gained on the execution of the original algorithm, the throughput actually
decreases as more processors are added.
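To make the shape of Figure 1 concrete, the following sketch is a toy cost model, not a measurement: the serial fraction and the per-pair communication constant are assumed values chosen only for illustration. It combines an Amdahl-style term with an O(n²) communication term, and relative throughput rises, peaks, and then falls as processors are added.

# A minimal sketch of the parallel ceiling: Amdahl-style speedup plus an
# assumed O(n^2) communication/synchronization cost.  The constants
# (serial fraction, per-pair communication cost) are illustrative only.

def throughput(n, serial=0.05, comm=0.0005):
    """Relative throughput on n processors for a job of unit sequential time."""
    # run time = serial part + parallel part spread over n + communication overhead
    run_time = serial + (1.0 - serial) / n + comm * n * n
    return 1.0 / run_time

best = max(range(1, 257), key=throughput)
for n in (1, 2, 4, 8, 16, 32, 64, 128):
    print(f"{n:4d} processors -> relative throughput {throughput(n):5.2f}")
print(f"peak near {best} processors; beyond it, added overhead outweighs the savings")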
Unfortunately, in many
algorithm analyses, the communication component is ignored even
though, by Amdahl's law, it is an important, possibly the dominant, component. This attitude is so pervasive that textbooks
actually teach that it is acceptable to ignore such factors [5].
The difficulty of sorting a global
database in parallel is due to the need for a single global address space inherent in the
basic von Neumann model of computation, which is characterized as "...data are kept in named locations, and correct computational behavior is enforced by the programmer who describes the exact order in which computations are to be performed" [6]. The single address space concept
causes other complications as well. That
is, complex auxiliary data structures such as linked lists, queues, hash tables, heaps and the
algorithms to support them are required by the single address space model. Even
the quintessential computer science problem of dynamic memory allocation and
garbage collection is a result of the von Neumann model of computation.
2.1 The Impact on Hardware
The single address space bottleneck
caused by the von Neumann model is carried over into today’s conventional
architecture. For example, large volumes of
data must be processed by looping. Each
datum is processed one at a time by repeatedly fetching the same instructions
over and over. Parallelism does not help; the same number of instructions (in fact more, if overhead is counted) must be fetched whether the data is processed on a single sequential machine or in a parallel cluster. This requirement puts a
strain on hardware design. As advances make
possible the access and manipulation of larger volumes of data and instructions
in parallel with word widths of 64, 128 and even 256 bits, baroque designs such as the very wide word
architecture must be instituted to accommodate fetching several sequential
instructions simultaneously even though there is a very high probability that
some of the instructions will not be executed due to loop branches. There is a basic imbalance between the ability to fetch large numbers of instructions and the ability to execute them efficiently. Yet the throughput in compute-intensive applications is often dependent on the amount of data that can be delivered in parallel. Skillicorn's von Neumann uniprocessor design, which contains separate memories for program and data, would help reduce memory-to-CPU congestion, but with a conventional single combined memory, the burden of sharing it efficiently falls on the compiler and other support software such as profilers. These software tools are run "off-line" and cannot adapt to real-time or worst-case situations.
Not only are straightforward advances such as wide memory-to-processor buses hindered by the von Neumann model, but innovative approaches to computing hardware such as processor-in-memory or smart memories are also limited [7]. Similarly, the promise of nano-computation, i.e. bio, molecular and optical computing, is hindered. Any new initiative to build a supercomputer for specific areas of research based on the conventional hardware implementation of the von Neumann model cannot produce any great advancement, because of the hindrances the von Neumann model places on parallel computing.
2.2 The Impact on Software
Conventional programming languages
cannot solve the single address space limitation of the von Neumann
model. Indeed, the von Neumann model of
computation is the foundation upon which all modern procedural languages such
as C++ and Java are based. Modern software techniques such as structured programming and object-oriented programming do not change the sequential nature of von Neumann computing; they only complicate it. OOP tries to hide
the complexity of the sequential nature of the von Neumann model using
techniques such as encapsulation and inheritance, but ultimately it only makes
programming more complex. "...in some sense, the language used on a particular machine corresponds to the model of computation that it uses. If this is not the case, then there is a very bad 'fit' between language and implementation ..." [6].
Debugging is a major component of the
software design and development effort.
Data hiding and the black box approach to programming make debugging harder. By definition, bugs exist where
they are not expected. And by design of
the von Neumann model, computers execute code sequentially. Therefore, to a large extent, debugging must
be sequential in nature even though the bug itself may not be. That is, a bug may be due to unanticipated
data, a design flaw or a simple typing error, but to find the actual section of
code responsible for the bug requires that the program be traced sequentially
from the beginning. Tracing the
sequential execution of a program in a data hiding and inherited object
environment is extremely complex since any one of a number of classes in a
hierarchy may contain the offending code and the code for any one class may be
located in several different locations.
The point that software design principles may have been violated in the
design of the program is moot because again, by definition, a bug is a mistake
and/or a design violation.
3.0 Alternatives to the von Neumann Model
Despite the stated limitations, the
von Neumann model of computing has been very successful in solving many complex
problems. However, many of the problems
which dominate today’s parallel computing research activities require a single
global address space when using the von Neumann model. It would appear that the state of the art in modern hardware design, algorithms, software and languages is banging against the ceiling limit imposed by the von Neumann model. No computing system built under this model for "intractable problems" has been able to combine more than a few (8 to 16) processors, whether in a NUMA, a SUMA, an SMP, a cluster or any other configuration, to efficiently overcome the problems imposed by the single address space.
There are alternative computing models, for example dataflow, systolic arrays and associative computing, among others. Consider the associative computing model: it addresses the single address space problem directly by using a two-dimensional address space instead of a single global address space. A two-dimensional memory maps easily onto natural, conventional two-dimensional data organizations such as arrays, tables and charts. One dimension can be treated as sequential while the other is parallel. Data can be processed in the parallel dimension without looping, and logical conditions are handled by masking, thus minimizing branching.
The associative model is easy to use, and because memory is partitioned into program memory and data memory, as in the von Neumann uniprocessor machine design, the hardware can be optimized for each task. The parallel von Neumann "array processor" hardware implementations are well suited to the associative model [6].
3.1 The Associative Memory Model
Records in the associative model are
accessed by content, not address.
Records can be found in constant time regardless of their location in
memory. Records are retrieved based on
inherent data such as date of order or length of time on back order or both
simultaneously. There is no need to
reorganize data to optimize a search or for auxiliary fields or data structures
such as inverted lists. Since there is
no single global address space, there is no need to sort and consequently all
data can be posted as soon as it arrives in the system.
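As an illustration only, the following sketch mimics retrieval by content over a small table; the record fields and the select helper are assumptions, not the paper's, and the list comprehension merely simulates the comparison that associative hardware would perform in every cell simultaneously.

# A minimal sketch of retrieval by content over an unsorted, unindexed table.
# On associative hardware every cell evaluates the comparison at once; the
# comprehension below only simulates that parallel comparison step.

records = [
    {"part": "bolt",   "order_date": "2001-03-02", "days_on_backorder": 12},
    {"part": "washer", "order_date": "2001-04-17", "days_on_backorder": 0},
    {"part": "gear",   "order_date": "2001-03-02", "days_on_backorder": 45},
]

def select(table, **criteria):
    """Return the responders: records whose fields match every criterion."""
    return [rec for rec in table
            if all(rec[field] == value for field, value in criteria.items())]

# Retrieve by one field, or by several simultaneously, with no sorting.
print(select(records, order_date="2001-03-02"))
print(select(records, order_date="2001-03-02", days_on_backorder=45))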
Logical relationships between
records, such as those requiring trees or graphs, are implemented directly in
structure codes, not indirectly by layers of indirect addressing. Structure codes are like any other field in
a record. Thus the data at the nodes and
the structural relationship of the nodes can both be accessed directly and in
constant time. Instructions required to search the logical structure are open and direct; they are not obscured by extraneous operations, such as pointer chasing, demanded by auxiliary data structures. Given a record, its structure code can be
used to find any other record's structure code related to it in constant time,
e.g. parent, children, siblings, nearest left sibling, nearest right sibling,
etc.
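The paper does not fix a particular encoding, so the sketch below assumes a simple heap-style positional code stored as an ordinary integer field; under that assumption, the codes of a record's parent, children and siblings follow from constant-time arithmetic rather than pointer chasing.

# A minimal sketch of tree navigation through structure codes.  The specific
# encoding (a heap-style positional code kept as an integer field) is an
# assumption for illustration; related codes are computed directly.

def parent(code):        return code // 2 if code > 1 else None
def left_child(code):    return 2 * code
def right_child(code):   return 2 * code + 1
def left_sibling(code):  return code - 1 if code % 2 == 1 else None   # right children only
def right_sibling(code): return code + 1 if code % 2 == 0 else None   # left children only

# Records carry their structure code like any other field.
tree = {1: "root", 2: "A", 3: "B", 4: "A1", 5: "A2"}

code = 5                                   # the record "A2"
print(tree[parent(code)])                  # -> "A"
print(tree[left_sibling(code)])            # -> "A1"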
Since data is located by content, it can be placed in any associative cell; when new memory is needed, it can be allocated from any available data memory.
And when it is no longer needed, it is simply flagged as "idle." There is no need for sorting. There is no need for heaps or garbage
collection.
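A minimal sketch of this flag-based storage management follows, assuming each associative cell carries a busy/idle flag field; the cell count and field names are illustrative only.

# New data goes into any idle cell; releasing a record just resets the flag,
# so no heap, free list or garbage collector is needed.

cells = [{"busy": False, "data": None} for _ in range(8)]

def allocate(data):
    for cell in cells:                 # any idle cell will do; order is irrelevant
        if not cell["busy"]:
            cell["busy"], cell["data"] = True, data
            return cell
    raise MemoryError("no idle cells")

def release(cell):
    cell["busy"], cell["data"] = False, None

rec = allocate({"part": "bolt", "qty": 40})
release(rec)                           # the cell is simply marked idle again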
3.2 Associative Programming
The associative model is easy to use
and program, closely paralleling human
thought. Human thought is basically
sequential, but we have no problem understanding repetitive execution of a
single task. Thus when we plan, we
think, "wash our hands then wash the dishes", with no concern about how many
hands or dishes are being washed. The
sequential process of planning and the "parallel" process of
execution are different and separate. In the associative model, there is a
definite distinction between program or planning memory and data or execution
memory. Program storage, fetching and
execution is sequential and the hardware is the same as in the conventional
computers. However, data is stored in a
separate memory. It is accessed associatively
and processed in parallel without any concern as to how many records there
are. Consequently, in the associative
model there is no commingling of planning tasks and executing tasks. For example, FOR loops are not required to process arrays of similar data; such tasks are accomplished with one statement in a program, as sketched below. The conversion of "natural parallel thoughts" into loops of sequential control is a cause of considerable complexity in many of today's programming languages.
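The sketch below contrasts the two styles, using NumPy arrays as a stand-in for the associative data memory; the field name and threshold are assumptions chosen only for illustration.

# A minimal sketch of loop-free, data-parallel processing.
import numpy as np

days_on_backorder = np.array([12, 0, 45, 3, 0, 60])
expedite = np.zeros(6, dtype=bool)

# Conventional (von Neumann) style: the repetition is planned explicitly.
for i in range(len(days_on_backorder)):
    if days_on_backorder[i] > 30:
        expedite[i] = True

# Data-parallel style: one statement, no loop; the condition becomes a mask.
expedite = days_on_backorder > 30
print(expedite)          # [False False  True False False  True]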
Humans also use associative
techniques in everyday life. They cannot remember a person's name (i.e. their "address"), but can remember what they were wearing, the color of their hair, etc. (i.e. some properties "associated" with the person). This natural ability
to remember descriptions rather than addresses is the basis of natural language
[8]. Indeed, a type of associative
memory is used in programming languages in the form of hash tables. That is, you access a symbol table record by
the symbol's name, not by the address of the location in which it was
stored. Of course, in the von Neumann model, in order to use hash tables you must incur overhead to translate the name into an address in a special address space, and once in address form you get involved with the problems of keeping the address space organized, i.e. sorted. Programmers appreciate the convenience of hash functions, but they use them only sparingly because of the inefficient memory utilization, the slow execution and the crude associative capability they provide. The associative model provides comprehensive associative access with the ease of use of hash tables and none of the memory or execution overhead.
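For example, a compiler's symbol table offers exactly this name-based access, even though the underlying dictionary still hashes each name to an address in a special address space behind the scenes; the symbol names and attributes below are made up for illustration.

# A minimal sketch of hash-table associative access: records keyed by name,
# not by the address of the location in which they are stored.

symbol_table = {}
symbol_table["count"] = {"type": "int",   "scope": "local"}
symbol_table["ratio"] = {"type": "float", "scope": "global"}

# Lookup is by the symbol's name, not by where the record happens to live.
print(symbol_table["ratio"]["type"])    # -> "float"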
3.3 Associative Hardware
Traditional associative memories are
comparatively expensive. However, associative computer designs which use
conventional sequential hardware components have been around for years [9] and
fall into Skillicorn's class of von Neumann "array processors", having a single instruction memory and processor and multiple data processors and memories [6]. The main reason for the lack of acceptance
of these computers is a poor understanding of the associative model versus the
von Neumann model. Indeed most of these
computers such as the Connection Machine were programmed using the von Neumann
model [10].
4.0 Pro von Neumann Biases
The associative model holds many
advantages over the von Neumann model, but it is not necessarily the solution
to all of today’s computing problems.
Much more research and development is required. For example, parallel associative programming
needs to be explored. However, the
biases of the single memory von Neumann programming model need to be made
clear, so that the distinction between it and other models can be accurately
compared. Much, indeed virtually all, of today's computer science is not only dominated by but biased towards the conventional single-memory von Neumann model. For example, in complexity analysis, the
effect of a shared program/data memory is ignored. Moreover, the efficiency of a parallel algorithm is judged by the standard ratio of its performance to that of the best sequential (that is, conventional single-memory von Neumann) algorithm. When alternative algorithms from other programming models claim to accomplish tasks "faster" than is possible in the von Neumann model, the claim is said to be impossible. This misunderstanding leads to the Falkoff dichotomy.
Specifically, Falkoff’s maximum value
algorithm can find the maximum value in an unsorted list of numbers in constant
time, while it takes O(n) to find the maximum value in the von Neumann model. Since everyone knows that an improvement of
O(n) is impossible, computer scientists analyze the algorithm and the associative model, rejecting assumptions that they take for granted in the von Neumann model (such as the speed of electrons in a wire), and in essence penalize the associative model for possessing an intelligent memory. As a result, considerable effort has been
expended to prove that the assumptions for the associative model are just as
valid as the assumptions for the von Neumann model [11].
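For concreteness, the sketch below simulates the classic bit-serial associative maximum search commonly attributed to Falkoff; the inner comprehension stands in for the parallel bit test that associative hardware performs in every cell at once, so the number of passes depends on the assumed word width (8 bits here), not on the number of records.

# A minimal simulation of the bit-serial associative maximum search.
# Each pass examines one bit position of every still-active record; on
# associative hardware that examination happens in all cells simultaneously.

def associative_max(values, bits=8):
    active = list(range(len(values)))            # every cell starts as a responder
    for bit in reversed(range(bits)):            # most significant bit first
        ones = [i for i in active if values[i] >> bit & 1]   # parallel bit test
        if ones:                                 # keep only cells with a 1 in this bit
            active = ones
    return values[active[0]]                     # any remaining responder holds a maximum

print(associative_max([23, 7, 250, 99, 250]))    # -> 250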
In the von Neumann model, control
parallelism is used to achieve data parallelism, thus in complexity analysis,
when comparing algorithms from the same programming model, it is possible to combine the cost of program execution (i.e. the instruction
stream or IS) with the cost of data processing (i.e. the ALU) into a single
entity, the CPU. Because of this von Neumann bias, associative computers in general, and Falkoff's algorithm in particular, are charged with multiple ISs where only one is used [12]. The separation of the CPU into ALU and
IS components allows a much more accurate analysis of parallel architectures
such as clusters where the true cost of communication overhead can be separated
and analyzed. For example, as described
above, it is much easier to program in a data parallel model than a control
parallel one. Separating the CPU into ALU (data) and IS (control) components will give insight not only into the most efficient algorithms for execution, but is important for hardware design as well, since the cost of an IS and its need for complex cache memory hierarchies is as much as, if not more than, the cost of the ALU. The recognition of the true cost of these components is necessary if more efficient computer implementations are to be achieved and the full promise of computer nano-technology realized.
5.0 Conclusion
Certain problems are intractable
using the von Neumann model of computation.
They are characterized as requiring a global, real-time database. By Amdahl's law, after the parallel component has been reduced toward zero, adding more processors increases the communication demands of the remaining sequential portion, actually reducing throughput rather than increasing it. The parallel ceiling is
due to the von Neumann model's use of a single global address space and cannot
be overcome by new or different parallel configurations. Specifically, clusters and “new”
supercomputer designs suffer from the same constraint as the older traditional
parallel machines – reliance on the von Neumann model of computing. Faster, cheaper components allow more computation in a given time for a given dollar amount, giving the illusion that progress in parallel computing is being made, but it is the von Neumann computing model itself that causes the ceiling. The hardware merely raises it; it cannot eliminate it.
Alternative models, such as associative computing, do exist which eliminate the reliance on a single address space. These models need much more
exploration if today’s intractable problems are to be solved and the promise of
massive nano-based computation is to be achieved.
6.0 Bibliography
[1]. Stankovic, J. A., M. Spuri, M. Di Natale and G. C. Buttazzo, "Implications of Classical Scheduling Results for Real-Time Systems," Computer, June 1995, pp. 16-25.

[2]. Meilander, Willard, J. Baker and J. Potter, "Predictability for Real-Time Command and Control," Proceedings of the 15th International Parallel and Distributed Processing Symposium, San Francisco, April 27, 2001.

[3]. Amdahl, G., "Validity of the Single Processor Approach to Achieving Large Scale Computing Capabilities," Spring Joint Computer Conference, 1967, pp. 483-485.

[4]. Potter, Jerry L., "Limits on High Performance Applications," in Proceedings of the SPIE's International Symposium ITCom 2001, Commercial Applications for High-Performance Computing, Denver, CO, August 21-22, 2001, pp. 61-67.

[5]. Schneider, G. and J. Gersting, An Invitation to Computer Science, PWS, 1998.

[6]. Skillicorn, David B., "A Taxonomy for Computer Architectures," Computer, November 1988, pp. 46-57.

[7]. Patterson, D., et al., "Intelligent RAM (IRAM): The Industrial Setting, Applications and Architectures," ICCD '97, The International Conference on Computer Design, Austin, Texas, Oct. 10-12, 1997.

[8]. Winston, Patrick, Artificial Intelligence, Addison-Wesley, Reading, MA, 1984.

[9]. Rudolph, J. A., "STARAN, A Production Implementation of an Associative Array Processor," Proceedings of the Fall Joint Computer Conference, December 1972.

[10]. Connection Machine Model CM-2 Technical Summary, Thinking Machines Corporation Technical Report Series HA87-4, 1987.

[11]. Jin, M., J. Baker and K. Batcher, "Timings for Associative Operations on the MASC Model," Proceedings of the 15th International Parallel and Distributed Processing Symposium, San Francisco, April 27, 2001.

[12]. Potter, Jerry L. and W. C. Meilander, "Interarchitecture Comparative Analysis," Proceedings of the International Conference on Communication in Computing, Monte Carlo Resort, Las Vegas, NV, June 26-29, 2000.