Interarchitecture Comparative Analysis
Department of Mathematics and Computer Science
Kent State University
Kent, Ohio 44242
This paper addresses the issue of inter-architecture comparative analysis. It proposes that the computation performed by the Arithmetic/Logical Unit (ALU) and the communication control performed by the Instruction Processing Unit (IPU) be considered separately. Under this proposal, the traditional multiple-processor subgrouping (SIMDs with MIMDs) is replaced by a single instruction stream grouping (SISDs with SIMDs). This reorganization gives a new interpretation to Amdahl's law. These changes emphasize the importance of communication in parallel algorithms and more accurately reflect the relationship between multiprocessors (MPs) and associative processors (APs).
Keywords: interprocessor overhead, Flynn's taxonomy, Amdahl's law
Big problems, such as image processing, real-time databases, data mining, knowledge acquisition, real-time command and control, and finite element analysis, require big computers. The trend has been toward multiple von Neumann processors organized into a wide variety of multiprocessor (MP) configurations to provide the needed processing power. Even though SIMD processors are widely recognized to be well suited for data-parallel and real-time database applications, they are considered special purpose and are seldom studied.
Unfortunately, experience has shown that MPs do not produce the unlimited computing power that many assume they promise. Most users find that instead of obtaining linear or near-linear speedup, as more processors are added, speedup plateaus and then, under certain conditions, actually decreases. It is widely recognized that the major reason for this disappointment is that MP architectures introduce a communication component not present in the original sequential algorithm. Nevertheless, the impact of this communication component is often ignored or trivialized. One introductory computer science text describes how easy it is to replicate "embarrassingly parallel" sequential algorithms, such as the Manhattan telephone book problem, to run on an MP, mentioning only in passing the complications of the induced communication. In contrast, associative SIMD processors, or APs, are general purpose and do not need multiprocessor communication overhead. Indeed, in many cases, their utilization actually simplifies the code.
In many texts on parallel computing, Amdahl's law is discussed, ignoring overhead, as establishing the maximal speedup of parallel systems. This implies that overhead is a minor inconvenience for which a cure is available; it just has to be found. Mischaracterizing the nature of MP overhead is detrimental for two reasons. First, many programmers are led to believe that MPs are the best approach to parallelism. Second, it steals the "no overhead" advantage from APs. That is, it implies that all parallel implementations with no or minor communication overhead are special cases, just as they are on MPs, and that real, serious AP applications will suffer from intractable overhead as well.
The next sections will address MP versus AP overhead issues, including communication overhead, speedup and efficiency, partitioning the CPU, Flynn's taxonomy and Amdahl's law. Conclusions and References follow.
The assumption that a sequential algorithm can be implemented on an MP by simple replication assumes that the additional code introduced when a sequential algorithm is converted into an MP algorithm can be ignored, or at worst is bounded by a constant. This is a major misunderstanding of the nature of control parallelism. In reality, in some applications such as real-time scheduling, implementing a sequential algorithm on an MP converts it to an NP-hard algorithm. That is, this assumption allows an O(n) sequential algorithm to be replaced by an NP-hard MP parallel algorithm without a true accounting of the cost. If any simplification is to be employed, the basic sequential algorithm should be ignored and the analysis should concentrate on the communication overhead.
Figure 1 is a simple example to emphasize the obvious point that, in the limit, overhead is dominant in MP implementations. That is, when converting the sequential pseudo code of Figure 1a into the MP pseudo code of Figure 1b, four additional lines of code are needed for interprocessor communication and synchronization. These are new, induced components not in the original sequential algorithm, and they are dependent on p, the number of CPUs, not n, the size of the problem.
for (I = 0; I < n; I++)
    if (searchName == dataName[I])
        report a match;

a - The Sequential Solution

synchronize;
send the searchName to the p processors;
for (I = 0; I < n/p; I++)
    if (searchName == dataName[I])
        report a match;
synchronize;
retrieve the results from the p processors;

b - The MP Solution

Figure 1 - The Manhattan Telephone Book Problem
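For concreteness, the sequential solution of Figure 1a can be written out as a small, runnable C sketch. The names find_name, searchName and dataName are illustrative; since the figure's pseudo code leaves the match action unspecified, this version simply reports the matching index.

```c
#include <string.h>

/* A runnable version of Figure 1a: scan the whole book, one entry
 * at a time.  O(n) compares, and no communication of any kind. */
static int find_name(const char *searchName, const char *dataName[], int n)
{
    for (int i = 0; i < n; i++)
        if (strcmp(searchName, dataName[i]) == 0)
            return i;            /* report the match by its index */
    return -1;                   /* name not in the book */
}
```

Every line of this sketch is part of the original algorithm; the four extra lines of Figure 1b have no counterpart here.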
The induced communication complexity is more than processor synchronization, message passing and data movement. Multiple processors introduce a real-time scheduling aspect which can, and frequently does, dominate. That is, the task of determining which processors are available, where the data is located, which are the best processors for a job, and when they should do the job is a major component of the added complexity. Scheduling of shared resources is an NP-hard problem. The induced overhead is not "small," "constant" or inconsequential. Indeed, it cannot be ignored at all, since the time for communication dominates the analysis of the parallel algorithm.
APs can achieve superlinear speedup by reducing communication overhead. Consider the problem of finding a largest number in an unordered list. If Falkoff's algorithm is used, it can be done in constant time on an AP. Implementing the algorithm on a conventional sequential 32-bit processor requires that it be executed n/32 times for the basic searches, producing n/32 candidates. Additional searches are required to find the largest number among the candidates. These additional searches are equivalent to the communication component in an MP implementation. Thus an AP implementation can reduce the need for communication for certain problems. APs can be faster and more efficient than conventional machines, i.e.

APs <= conventional
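The constant-time claim rests on the bit-serial strategy of Falkoff's algorithm: examine one bit slice at a time, most significant bit first, discarding candidates as bits disagree. The following C sketch simulates that strategy sequentially; the inner loops over i stand in for what an AP performs as a single associative step per bit slice, so on an AP the whole search takes a fixed 32 steps regardless of n. Function and variable names are illustrative.

```c
#include <stdbool.h>
#include <stdint.h>

/* Sequential sketch of the bit-serial maximum search.  Each pass of
 * the loop over b handles one bit slice; a word stays a candidate
 * only while it matches the best bit pattern seen so far.  On an AP
 * the two inner loops over i are each one associative operation. */
static uint32_t falkoff_max(const uint32_t *val, int n)
{
    bool cand[n];                           /* candidate flags, n >= 1 */
    for (int i = 0; i < n; i++) cand[i] = true;

    for (int b = 31; b >= 0; b--) {
        bool any = false;                   /* any candidate with bit b set? */
        for (int i = 0; i < n; i++)
            if (cand[i] && (val[i] >> b & 1)) any = true;
        if (any)                            /* drop candidates lacking bit b */
            for (int i = 0; i < n; i++)
                if (!(val[i] >> b & 1)) cand[i] = false;
    }
    for (int i = 0; i < n; i++)
        if (cand[i]) return val[i];         /* all survivors hold the max */
    return 0;                               /* unreachable for n >= 1 */
}
```

Run on a 32-bit sequential machine, the same strategy must scan all n words per bit slice, which is where the n/32 factor and the extra candidate searches come from.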
Consider a second problem: finding duplicate entries in an unordered list. On a conventional machine it is an O(n^2) problem; on an AP it is O(n). On an MP it becomes NP-hard because of the overhead required to "share" the data between processors. MPs are never more efficient than conventional processors, i.e.
conventional < MPs
APs < MPs
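The conventional O(n^2) bound comes from comparing every pair of entries explicitly, as in this sketch (names illustrative):

```c
#include <stdbool.h>

/* Duplicate detection on a conventional machine: every pair must be
 * compared one at a time, giving O(n^2) compares in the worst case. */
static bool has_duplicate(const int *a, int n)
{
    for (int i = 0; i < n; i++)
        for (int j = i + 1; j < n; j++)
            if (a[i] == a[j])
                return true;
    return false;
}
```

On an AP the inner loop collapses: entry a[i] is broadcast once and compared against every other entry in a single associative step, so only the outer loop remains, giving O(n).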
These results suggest that conventionally difficult tasks, such as real-time scheduling, can be implemented on APs in polynomial time, but are NP-hard on MPs.
The standard practice of coupling the data processing (ALU) and instruction processing (IPU) components of a computer into a CPU is one major cause of inaccurate interarchitecture comparative analysis. In order to more accurately compare algorithms between different architectures, it is essential that the CPU be divided into IPU and ALU components. In the case of algorithm analysis on a sequential machine, the analysis is the same as traditional analysis. However, in the case of parallel architectures and algorithms, the importance of communication between IPUs in control parallelism can be addressed explicitly.
The separation of the CPU into IPU and ALU suggests that interprocessor complexity analysis should be separated into a communication/control parallel portion and a data parallel portion. Let us reconsider the unordered Manhattan telephone book problem. Traditional analysis assigns an O(n) complexity in the worst case on a sequential processor. Dividing the analysis into ALU and IPU parts gives 0 for the IPU part and O(n) for the ALU part. The relationship CPU = IPU + ALU holds. Consider a p processor MP implementation. In the worst case analysis, the ALUs have O(n/p) complexity and the IPU portion has O(2 log p) complexity to account for the synchronize/broadcast and synchronize/retrieve components shown in Figure 1b. The total is a combined O(2 log p + n/p) complexity. When p = n, the result is O(2 log n). For an AP, when p = n, the complexity is O(1) because there is only one IPU.
Let us consider a divide and conquer algorithm to distribute the data. We will look at the work involved, defined here as the sum of all of the moves for the IPUs and compares for the ALUs. Consider first the overhead of dividing the telephone book entries among the p processors. At each step, half of the data must be moved, and a total of p-1 IPUs are involved. This results in O((log p)/2 * (p-1) + 2 log p), or O((p log p)/2), complexity for the IPU portion. Once the data has been distributed, the work to find the number is O(n) for the ALUs. The total work is O((p log p)/2 + n).
In an AP, the data does not need to be distributed; thus there is no IPU component, so the total work is O(n). Note that the ALU work is the same for all architecture classes, O(n). The IPU cost for the MP algorithm is O((p log p)/2), but it is 0 for the AP and the conventional machine since they have only one IPU. Thus we see that even for this trivial problem, using interarchitecture comparative analysis produces a clearer picture of the relationships, summarized in Table 1, between solving this problem on different architectures.
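The IPU work estimate can be checked by counting moves directly. Assuming p = n and that each split passes half of a holder's data onward, the moves obey the recurrence M(n) = n/2 + 2 M(n/2), which solves to (n/2) log2 n, i.e. (p log p)/2 when p = n. A sketch (power-of-two n assumed; the function name is illustrative):

```c
/* IPU moves made by recursive halving distribution with p = n: each
 * split passes half of a holder's data onward, so
 * M(n) = n/2 + 2*M(n/2), which solves to (n/2) log2 n.
 * Assumes n is a power of two. */
static long halving_moves(long n)
{
    if (n <= 1) return 0;                 /* one item: nothing to move */
    return n / 2 + 2 * halving_moves(n / 2);
}
```

For n = 8 this counts 12 moves, matching (8/2) * log2 8; the O(n) ALU compares are then added on top to give the MP work entry of Table 1.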
Table 1 - Interarchitecture Comparative Analysis

             AP      conventional   MP
complexity   O(1)    O(n)           O(2 log n)
work         O(n)    O(n)           O(n + (n/2) log n)
It is often claimed that MPs are better than APs because they are more general purpose. The fact is that both a single processor MP and an AP are basically the same as conventional sequential machines. The difference between them occurs when they are modified to increase their computing power.
MPs achieve more power by replicating CPUs - both IPUs and ALUs. APs achieve more power by extending the ALU only. Both architectures are general purpose.
The replicated CPUs give flexibility to the MP architecture, but it comes at a price: interprocessor communication. The single IPU of the AP avoids any MP communication overhead. This simplicity comes at a price of its own: some algorithms may not be efficiently implemented. The inefficiency is in underutilized ALUs. Briefly, MPs are more complex to program and use IPUs inefficiently on communication overhead; APs are simple to program and use ALUs inefficiently by leaving them idle when there is work to do.
Flynn's taxonomy of MIMD, SIMD, SISD and MISD has been supplemented with many more complex taxonomies to reflect the diversity of the many MP architectures, but it is still the most fundamental when distinguishing between architectures based on the number of IPUs and ALUs (SI/MI and SD/MD in Flynn's terms). MIMDs and SIMDs have traditionally been grouped together as parallel processors based on the fact that they both process data in parallel. However, we feel that if the CPU is divided into an ALU and an IPU, then SIMDs should be grouped with SISDs, as illustrated in Figure 2. Of the two components, the IPU is by far the more powerful (and expensive), both in hardware and software. That is, the IPU embodies the Turing machine/FSA capability central to the power of a von Neumann general purpose digital computer. It is the portion that carries out, and requires, interprocessor communication analysis. The ALU by itself can only process two numbers.
When dealing with different algorithms on similar architectures with similar CPUs, separating the ALU and IPU components is not necessary, but when comparing the effectiveness of machines on the same problem using different algorithms on different architectures, bundling the IPU and ALU into one component leads to grossly inaccurate comparisons. For example, traditionally, a p processor MP and a p processor SIMD would be considered to be of similar power. But this equates p IPU/ALU pairs with a single IPU driving p ALUs. It is the number of IPUs used that determines the complexity of the software and the expense of communication, not the number or size of the ALUs. Thus, the grouping based on IPUs (i.e., MIMDs with MISDs and SIMDs with SISDs) is more profound than the traditional grouping based on ALUs.
SISD   MISD             SISD | MISD
-----------                  |
SIMD   MIMD             SIMD | MIMD

a - MD or ALU View      b - SI or IPU View

Figure 2 - Flynn's Taxonomy
An associative processor (AP) is a special type of SIMD processor. A bit-serial SIMD computer can be thought of as a SISD with a 512-, 1024- or 2048-bit-wide ALU. This massive SIMD parallelism is used for parallel numerical computation, but it is also used for massively parallel searching and can simulate an associative memory; hence the "associative processor" moniker. The primary difference between an AP and a SISD is the extended ALU and the software needed to utilize it. In an AP, the data is corner turned so that searching and bit-serial algorithms (such as Falkoff's maximum value algorithm) are efficiently implemented. Address-based computation in general, and pointers in particular, are avoided. There are, however, a few hardware differences. In an AP, a single responder can be selected by hardware in constant time from a group of responders to a search. In addition, as with all SIMDs, the ALU and IPU are physically separated. Frequently, the IPU is a complete processor by itself, having its own dedicated IPU, ALU and memory. Thus APs frequently have two IPUs; however, they are always organized in a master/slave relationship so that no communication overhead is incurred. The close relationship of APs to SISDs, versus MPs, is evident in Table 1.
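Corner turning can be pictured as a bit-level transpose: bit b of every record is packed into its own machine word, so a bit-serial AP touches one bit slice per step. A C sketch under assumed sizes of up to 32 records of 8 bits each (names illustrative):

```c
#include <stdint.h>

/* Corner turning as a bit-level transpose: plane[b] collects bit b of
 * every record, so record i occupies bit i of each plane.  A bit-serial
 * machine then operates on one plane per step. */
static void corner_turn(const uint8_t *val, int n, uint32_t plane[8])
{
    for (int b = 0; b < 8; b++) {
        plane[b] = 0;
        for (int i = 0; i < n; i++)
            if (val[i] >> b & 1)
                plane[b] |= (uint32_t)1 << i;
    }
}
```

Once the data is stored this way, a bit-serial search such as Falkoff's maximum value algorithm needs to examine only the eight planes, most significant first, rather than the individual records.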
6.0 Amdahl's Law
Amdahl's law has been restated and rephrased in many ways over the years, so we have included the original quote here:
"A fairly obvious conclusion which can be drawn at this point is that the effort expended on achieving high parallel processing rates is wasted unless it is accompanied by achievements in sequential processing rates of nearly the same magnitude."
Clearly Dr. Amdahl felt that speeding up the sequential portion of an algorithm was just as important as speeding up the parallel portion. In our interpretation, given the discussion so far, "parallel processing" refers to IPU parallelism, and ALU "parallelism" would be acceptable as speeding up sequential processing. If our interpretation is correct, then the claim in Section 3.0 that APs are faster than sequential machines is acceptable under Amdahl's law. That is, Falkoff's algorithm on an extended ALU is simply a method of speeding up the sequential portion of an algorithm. In light of the discussion of APs and MPs in this paper, it would seem that ignoring overhead, an additional burden required to achieve a high parallel processing rate, violates the spirit of Amdahl's law. That is, invoking privilege for a component that is known to be intractable for at least some MP algorithms, and then using the violated law to impose limits on other types of parallelism, seems unfair at the least.
An important application of complexity analysis is the ability to accurately compare problem solutions across different architectures, sequential and parallel. One critical aspect of MP analysis is the impact on complexity of the communications between processors. We suggest that a more accurate comparison between architectures can be made if the CPU aspect of the traditional sequential processor analysis is divided into an ALU component which reflects data parallelism and an IPU component which reflects control parallelism. Such a separation will allow the communication component of MP implementations to be more accurately analyzed without penalizing data parallel components. We point out that when interprocessor communication issues of control parallelism are analyzed, APs are closer to conventional processors than MPs are because they both only have a single instruction stream.
Traditionally, APs have been considered suitable only for large data-parallel problems. But APs are general purpose, and we propose that the basic AP architecture be extensively considered for the next generation of computers. The primary problem in today's architecture designs is the von Neumann bottleneck, which restricts the flow of data from memory to the CPU. The AP is easily scalable to take advantage of any increase in the size of the data path. We feel that a more rigorous interarchitecture analysis of problems and applications will show that the multi-AP is a good design for large, time-consuming problems, and that the standard AP is a good design for the next generation of the common working-class general purpose computer.
References

Kumar, V., et al., Introduction to Parallel Computing, Benjamin/Cummings, Redwood City, CA, 1994.
Schneider, G. and J. Gersting, An Invitation to Computer Science, PWS, 1998.
Potter, J. L., et al., "ASC: An Associative Computing Paradigm," Computer, November 1994, p. 19ff.
Potter, J. L., Associative Computing, Plenum Press, New York, 1992.
Stankovic, J., et al., "Implications of Classical Scheduling Results for Real-Time Systems," Computer, June 1995, pp. 16-25.
Falkoff, A., "Algorithms for Parallel-Search Memories," J. Assoc. Comput. Mach. 9, 1962, p. 488ff.
Flynn, M. J., "Some Computer Organizations and Their Effectiveness," IEEE Trans. on Computers, Vol. C-21, 1972, p. 948ff.
Amdahl, G. M., "Validity of the Single Processor Approach to Achieving Large Scale Computing Capabilities," Proceedings of AFIPS Spring Joint Computer Conference, 1967, pp. 483-485.