Interarchitecture Comparative Analysis
Department of Mathematics and Computer Science
Kent State University
Kent, Ohio 44242
This paper addresses the issue of inter-architecture comparative analysis. It proposes that the computation performed by the Arithmetic/Logical Unit (ALU) and the communication control performed by the Instruction Processing Unit (IPU) be considered separately. Under this proposal, the traditional multiple-processor subgrouping (SIMDs with MIMDs) is replaced by a single instruction stream grouping (SISDs with SIMDs). This reorganization gives a new interpretation to Amdahl's law. These changes emphasize the importance of communication in parallel algorithms and more accurately reflect the relationship between multiprocessors (MPs) and associative processors (APs).
Keywords: interprocessor overhead, Flynn's taxonomy, Amdahl's law
Big problems, such as image processing, real-time databases, data mining, knowledge acquisition, real-time command and control, and finite element analysis, require big computers. The trend has been toward multiple von Neumann processors organized into a wide variety of multiprocessor (MP) configurations to provide the needed processing power. Even though SIMD processors are widely recognized to be well suited for data-parallel and real-time database applications, they are considered special purpose and are seldom studied.
Unfortunately, experience has shown that MPs do not produce the unlimited computing power that many assume they promise. Most users find that instead of obtaining linear or near-linear speedup, as more processors are added, speedup plateaus and then, under certain conditions, actually decreases. It is widely recognized that the major reason for this disappointment is that MP architectures introduce a communication component not present in the original sequential algorithm. Nevertheless, the impact of this communication component is often ignored or trivialized. One introductory computer science text describes how easy it is to replicate "embarrassingly parallel" sequential algorithms, such as the Manhattan telephone book problem, to run on an MP, mentioning only in passing the complications of the induced communication. In contrast, associative SIMD processors, or APs, are general purpose and do not need multiprocessor communication overhead. Indeed, in many cases, their utilization actually simplifies the code.
In many texts on parallel computing, Amdahl's law is discussed, ignoring overhead, as establishing the maximal speedup of parallel systems. This implies that overhead is a minor inconvenience for which a cure is available; it just has to be found. Mischaracterizing the nature of MP overhead is detrimental for two reasons. First, many programmers are led to believe that MPs are the best approach to parallelism. Second, it steals the "no overhead" advantage from APs. That is, it implies that all parallel implementations with no or minor communication overhead are special cases, just as they are on MPs, and that real, serious AP applications will suffer from intractable overhead as well.
The next sections will address MP versus AP overhead issues, including communication overhead, speedup and efficiency, partitioning the CPU, Flynn's taxonomy and Amdahl's law. Conclusions and References follow.
The assumption that a sequential algorithm can be implemented on an MP by simple replication assumes that the additional code introduced when a sequential algorithm is converted into an MP algorithm can be ignored, or at worst is bounded by a constant. This is a major misunderstanding of the nature of control parallelism. In reality, in some applications such as real-time scheduling, implementing a sequential algorithm on an MP converts it to an NP-hard algorithm. That is, this assumption allows an O(n) sequential algorithm to be replaced by an NP-hard MP parallel algorithm without a true accounting of the cost. If any simplification is to be employed, the basic sequential algorithm should be ignored and the analysis should concentrate on the communication overhead.
Figure 1 is a simple example to emphasize the obvious point that, in the limit, overhead is dominant in MP implementations. That is, when converting the sequential pseudo code of Figure 1a into the MP pseudo code of Figure 1b, four additional lines of code are needed for interprocessor communication and synchronization. These are new, induced components not in the original sequential algorithm, and they are dependent on p, the number of CPUs, not n, the size of the problem.
for (I = 0; I < n; I++)
    if (searchName == dataName[I])
        report a match;

a - The Sequential Solution

synchronize;
send the searchName to the p processors;
for (I = 0; I < n/p; I++)
    if (searchName == dataName[I])
        report a match;
synchronize;
retrieve the results from the p processors;

b - The MP Solution

Figure 1 - The Manhattan Telephone Book Problem
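For concreteness, the sequential solution of Figure 1a can be written out as a small, runnable C sketch. The names find_name, searchName and dataName are illustrative; since the figure's pseudo code leaves the match action unspecified, this version simply reports the matching index.

```c
#include <string.h>

/* A runnable version of Figure 1a: scan the whole book, one entry
 * at a time.  O(n) compares, and no communication of any kind. */
static int find_name(const char *searchName, const char *dataName[], int n)
{
    for (int i = 0; i < n; i++)
        if (strcmp(searchName, dataName[i]) == 0)
            return i;            /* report the match by its index */
    return -1;                   /* name not in the book */
}
```

Every line of this sketch is part of the original algorithm; the four extra lines of Figure 1b have no counterpart here.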
The induced communication complexity is more than processor synchronization, message passing and data movement. Multiple processors introduce a real-time scheduling aspect which can, and frequently does, dominate. That is, the task of determining which processors are available, where the data is located, which are the best processors for a job, and when they should do the job is a major component of the added complexity. Scheduling of shared resources is an NP-hard problem. The induced overhead is not "small," "constant" or inconsequential. Indeed, it cannot be ignored at all, since the time for communication dominates the analysis of the parallel algorithm.
APs can achieve superlinear speedup by reducing communication overhead. Consider the problem of finding a largest number in an unordered list. If Falkoff's algorithm is used, it can be done in constant time on an AP. Implementing the algorithm on a conventional sequential 32-bit processor requires that it be executed n/32 times for the basic searches, producing n/32 candidates. Additional searches are required to find the largest number among the candidates. These additional searches are equivalent to the communication component in an MP implementation. Thus an AP implementation can reduce the need for communication for certain problems. APs can be faster and more efficient than conventional machines, i.e.

APs <= conventional
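The constant-time claim rests on the bit-serial strategy of Falkoff's algorithm: examine one bit slice at a time, most significant bit first, discarding candidates as bits disagree. The following C sketch simulates that strategy sequentially; the inner loops over i stand in for what an AP performs as a single associative step per bit slice, so on an AP the whole search takes a fixed 32 steps regardless of n. Function and variable names are illustrative.

```c
#include <stdbool.h>
#include <stdint.h>

/* Sequential sketch of the bit-serial maximum search.  Each pass of
 * the loop over b handles one bit slice; a word stays a candidate
 * only while it matches the best bit pattern seen so far.  On an AP
 * the two inner loops over i are each one associative operation. */
static uint32_t falkoff_max(const uint32_t *val, int n)
{
    bool cand[n];                           /* candidate flags, n >= 1 */
    for (int i = 0; i < n; i++) cand[i] = true;

    for (int b = 31; b >= 0; b--) {
        bool any = false;                   /* any candidate with bit b set? */
        for (int i = 0; i < n; i++)
            if (cand[i] && (val[i] >> b & 1)) any = true;
        if (any)                            /* drop candidates lacking bit b */
            for (int i = 0; i < n; i++)
                if (!(val[i] >> b & 1)) cand[i] = false;
    }
    for (int i = 0; i < n; i++)
        if (cand[i]) return val[i];         /* all survivors hold the max */
    return 0;                               /* unreachable for n >= 1 */
}
```

Run on a 32-bit sequential machine, the same strategy must scan all n words per bit slice, which is where the n/32 factor and the extra candidate searches come from.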
Consider a second problem: finding duplicate entries in an unordered list. On a conventional machine it is an O(n^2) problem; on an AP it is O(n). On an MP it becomes NP-hard because of the overhead required to "share" the data between processors. MPs are never more efficient than conventional processors, i.e.
conventional < MPs
APs < MPs
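The conventional O(n^2) bound comes from comparing every pair of entries explicitly, as in this sketch (names illustrative):

```c
#include <stdbool.h>

/* Duplicate detection on a conventional machine: every pair must be
 * compared one at a time, giving O(n^2) compares in the worst case. */
static bool has_duplicate(const int *a, int n)
{
    for (int i = 0; i < n; i++)
        for (int j = i + 1; j < n; j++)
            if (a[i] == a[j])
                return true;
    return false;
}
```

On an AP the inner loop collapses: entry a[i] is broadcast once and compared against every other entry in a single associative step, so only the outer loop remains, giving O(n).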
These results suggest that conventionally difficult tasks, such as real-time scheduling, can be implemented on APs in polynomial time, but are NP-hard on MPs.
The standard practice of coupling the data processing (ALU) and instruction processing (IPU) components of a computer into a CPU is one major cause of inaccurate interarchitecture comparative analysis. In order to more accurately compare algorithms between different architectures, it is essential that the CPU be divided into IPU and ALU components. In the case of algorithm analysis on a sequential machine, the analysis is the same as traditional analysis. However, in the case of parallel architectures and algorithms, the importance of communication between IPUs in control parallelism can be addressed explicitly.
The separation of the CPU into IPU and ALU suggests that interprocessor complexity analysis should be separated into a communication/control parallel portion and a data parallel portion. Let us reconsider the unordered Manhattan telephone book problem. Traditional analysis assigns an O(n) complexity in the worst case on a sequential processor. Dividing the analysis into ALU and IPU parts gives 0 for the IPU part and O(n) for the ALU part. The relationship CPU = IPU + ALU holds. Consider a p processor MP implementation. In the worst case analysis, the ALUs have O(n/p) complexity and the IPU portion has O(2 log p) complexity to account for the synchronize/broadcast and synchronize/retrieve components shown in Figure 1b. The total is a combined O(2 log p + n/p) complexity. When p = n, the result is O(2 log n). For an AP, when p = n, the complexity is O(1) because there is only one IPU.
Let us consider a divide and conquer algorithm to distribute the data. We will look at the work involved, defined here as the sum of all of the moves for the IPUs and compares for the ALUs. Consider first the overhead of dividing the telephone book entries among the p processors. At each step, half of the data must be moved, and a total of p-1 IPUs are involved. This results in O((log p)/2 * (p-1) + 2 log p), or O((p log p)/2), complexity for the IPU portion. Once the data has been distributed, the work to find the number is O(n) for the ALUs. The total work is O((p log p)/2 + n).
In an AP, the data does not need to be distributed; thus there is no IPU component, so the total work is O(n). Note that the ALU work is the same for all architecture classes, O(n). The IPU cost for the MP algorithm is O((p log p)/2), but it is 0 for the AP and the conventional machine since they have only one IPU. Thus we see that even for this trivial problem, using interarchitecture comparative analysis produces a clearer picture of the relationships, summarized in Table 1, between solving this problem on different architectures.
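The IPU work estimate can be checked by counting moves directly. Assuming p = n and that each split passes half of a holder's data onward, the moves obey the recurrence M(n) = n/2 + 2 M(n/2), which solves to (n/2) log2 n, i.e. (p log p)/2 when p = n. A sketch (power-of-two n assumed; the function name is illustrative):

```c
/* IPU moves made by recursive halving distribution with p = n: each
 * split passes half of a holder's data onward, so
 * M(n) = n/2 + 2*M(n/2), which solves to (n/2) log2 n.
 * Assumes n is a power of two. */
static long halving_moves(long n)
{
    if (n <= 1) return 0;                 /* one item: nothing to move */
    return n / 2 + 2 * halving_moves(n / 2);
}
```

For n = 8 this counts 12 moves, matching (8/2) * log2 8; the O(n) ALU compares are then added on top to give the MP work entry of Table 1.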
Table 1 - Interarchitecture Comparative Analysis

             AP      conventional   MP
complexity   O(1)    O(n)           O(2 log n)
work         O(n)    O(n)           O(n + (n/2) log n)
It is often claimed that MPs are better than APs because they are more general purpose. The fact is that both a single processor MP and an AP are basically the same as conventional sequential machines. The difference between them occurs when they are modified to increase their computing power.
MPs achieve more power by replicating CPUs - both IPUs and ALUs. APs achieve more power by extending the ALU only. Both architectures are general purpose.
The replicated CPUs give flexibility to the MP architecture, but it comes at a price: interprocessor communication. The single IPU of the AP avoids any MP communication overhead. This simplicity comes at a price of its own: some algorithms may not be efficiently implemented. The inefficiency is in underutilized ALUs. Briefly, MPs are more complex to program and use IPUs inefficiently on communication overhead; APs are simple to program and use ALUs inefficiently by leaving them idle when there is work to do.
Flynn's taxonomy of MIMD, SIMD, SISD and MISD has been supplemented with many more complex taxonomies to reflect the diversity of the many MP architectures, but it is still the most fundamental when distinguishing between architectures based on the number of IPUs and ALUs (SI/MI and SD/MD in Flynn's terms). MIMDs and SIMDs have traditionally been grouped together as parallel processors based on the fact that they both process data in parallel. However, we feel that if the CPU is divided into an ALU and an IPU, then SIMDs should be grouped with SISDs, as illustrated in Figure 2. Of the two components, the IPU is by far the more powerful (and expensive), both in hardware and software. That is, the IPU embodies the Turing machine/FSA capability central to the power of a von Neumann general purpose digital computer. It is the portion that carries out, and requires, interprocessor communication analysis. The ALU by itself can only process two numbers.
When dealing with different algorithms on similar architectures with similar CPUs, separating the ALU and IPU components is not necessary, but when comparing the effectiveness of machines on the same problem using different algorithms on different architectures, bundling the IPU and ALU into one component leads to grossly inaccurate comparisons. For example, traditionally, a p processor MP and a p processor SIMD would be considered to be of similar power. But this equates p IPU/ALU pairs with a single IPU driving p ALUs. It is the number of IPUs used that determines the complexity of the software and the expense of communication, not the number or size of the ALUs. Thus, the grouping based on IPUs (i.e., MIMDs with MISDs and SIMDs with SISDs) is more profound than the traditional grouping based on ALUs.
SISD   MISD             SISD | MISD
-----------                  |
SIMD   MIMD             SIMD | MIMD

a - MD or ALU View      b - SI or IPU View

Figure 2 - Flynn's Taxonomy
An associative processor (AP) is a special type of SIMD processor. A bit-serial SIMD computer can be thought of as a SISD with a 512-, 1024- or 2048-bit-wide ALU. This massive SIMD parallelism is used for parallel numerical computation, but it is also used for massively parallel searching and can simulate an associative memory; hence the "associative processor" moniker. The primary difference between an AP and a SISD is the extended ALU and the software needed to utilize it. In an AP, the data is corner turned so that searching and bit-serial algorithms (such as Falkoff's maximum value algorithm) are efficiently implemented. Address-based computation in general, and pointers in particular, are avoided. There are, however, a few hardware differences. In an AP, a single responder can be selected by hardware in constant time from a group of responders to a search. In addition, as with all SIMDs, the ALU and IPU are physically separated. Frequently, the IPU is a complete processor by itself, having its own dedicated IPU, ALU and memory. Thus APs frequently have two IPUs; however, they are always organized in a master/slave relationship so that no communication overhead is incurred. The close relationship of APs to SISDs, versus MPs, is evident in Table 1.
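Corner turning can be pictured as a bit-level transpose: bit b of every record is packed into its own machine word, so a bit-serial AP touches one bit slice per step. A C sketch under assumed sizes of up to 32 records of 8 bits each (names illustrative):

```c
#include <stdint.h>

/* Corner turning as a bit-level transpose: plane[b] collects bit b of
 * every record, so record i occupies bit i of each plane.  A bit-serial
 * machine then operates on one plane per step. */
static void corner_turn(const uint8_t *val, int n, uint32_t plane[8])
{
    for (int b = 0; b < 8; b++) {
        plane[b] = 0;
        for (int i = 0; i < n; i++)
            if (val[i] >> b & 1)
                plane[b] |= (uint32_t)1 << i;
    }
}
```

Once the data is stored this way, a bit-serial search such as Falkoff's maximum value algorithm needs to examine only the eight planes, most significant first, rather than the individual records.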
6.0 Amdahl's Law
Amdahl's law has been restated and rephrased in many ways over the years, so we have included the original quote here:
"A fairly obvious conclusion which can be drawn at this point is that the effort expended on achieving high parallel processing rates is wasted unless it is accompanied by achievements in sequential processing rates of nearly the same magnitude."
Clearly Dr. Amdahl felt that speeding up the sequential portion of an algorithm was just as important as speeding up the parallel portion. In our interpretation, given the discussion so far, "parallel processing" refers to IPU parallelism, and ALU "parallelism" would be acceptable as speeding up sequential processing. If our interpretation is correct, then the claim in Section 3.0 that APs are faster than sequential machines is acceptable under Amdahl's law. That is, Falkoff's algorithm on an extended ALU is simply a method of speeding up the sequential portion of an algorithm. In light of the discussion of APs and MPs in this paper, it would seem that ignoring overhead, an additional burden required to achieve a high parallel processing rate, violates the spirit of Amdahl's law. That is, invoking privilege for a component that is known to be intractable for at least some MP algorithms, and then using the violated law to impose limits on other types of parallelism, seems unfair at the least.
An important application of complexity analysis is the ability to accurately compare problem solutions across different architectures, sequential and parallel. One critical aspect of MP analysis is the impact on complexity of the communications between processors. We suggest that a more accurate comparison between architectures can be made if the CPU aspect of the traditional sequential processor analysis is divided into an ALU component which reflects data parallelism and an IPU component which reflects control parallelism. Such a separation will allow the communication component of MP implementations to be more accurately analyzed without penalizing data parallel components. We point out that when interprocessor communication issues of control parallelism are analyzed, APs are closer to conventional processors than MPs are because they both only have a single instruction stream.
Traditionally, APs have been considered suitable only for large data-parallel problems. But APs are general purpose, and we propose that the basic AP architecture be extensively considered for the next generation of computers. The primary problem in today's architecture designs is the von Neumann bottleneck, which restricts the flow of data from memory to the CPU. The AP is easily scalable to take advantage of any increase in the size of the data path. We feel that a more rigorous interarchitecture analysis of problems and applications will show that the multi-AP is a good design for large, time-consuming problems, and that the standard AP is a good design for the next generation of the common working-class general purpose computer.
References

Kumar, V., et al., Introduction to Parallel Computing, Benjamin/Cummings, Redwood City, CA, 1994.
Schneider, G. and J. Gersting, An Invitation to Computer Science, PWS, 1998.
Potter, J. L., et al., "ASC: An Associative Computing Paradigm," Computer, November 1994, p. 19ff.
Potter, J. L., Associative Computing, Plenum Press, New York, 1992.
Stankovic, J., et al., "Implications of Classical Scheduling Results for Real-Time Systems," Computer, June 1995, pp. 16-25.
Falkoff, A., "Algorithms for Parallel-Search Memories," J. Assoc. Comput. Mach. 9, 1962, p. 488ff.
Flynn, M. J., "Some Computer Organizations and Their Effectiveness," IEEE Trans. on Computers, Vol. C-21, 1972, p. 948ff.
Amdahl, G. M., "Validity of the Single Processor Approach to Achieving Large Scale Computing Capabilities," Proceedings of AFIPS Spring Joint Computer Conference, 1967, pp. 483-485.