Interarchitecture Comparative Analysis


 

Jerry Potter

Department of Mathematics and Computer Science

Kent State University

Kent, Ohio 44242

 


 

 

Willard Meilander

Department of Mathematics and Computer Science

Kent State University

Kent, Ohio 44242

 

 


Abstract

This paper addresses the issue of inter-architecture comparative analysis. It proposes that the computation performed by the Arithmetic/Logical Unit (ALU) and the communication control performed by the Instruction Processing Unit (IPU) be considered separately.  Under this proposal, the traditional multiple processor subgrouping (SIMDs with MIMDs) is replaced by a single instruction stream grouping (SISDs with SIMDs).  This reorganization gives a new interpretation to Amdahl's law.  These changes emphasize the importance of communication in parallel algorithms and more accurately reflect the relationship between multiprocessors (MPs) and associative processors (APs).

 

Keywords:  Interprocessor overhead, Flynn's Taxonomy, Amdahl's law

1.0 Introduction

Big problems such as image processing, real time databases, data mining, knowledge acquisition, real time command and control, and finite element analysis require big computers.  The trend has been toward multiple von Neumann processors organized into a wide variety of multiprocessor (MP) configurations to provide the needed processing power.  Even though SIMD processors are widely recognized to be well suited for data parallel and real time database applications, they are considered special purpose and are seldom studied.

Unfortunately, experience has shown that MPs do not produce the unlimited computing power that many assume they promise.  Most users find that instead of obtaining linear or near linear speedup, as more processors are added, speedup plateaus and then, under certain conditions, actually decreases [1].  It is widely recognized that the major reason for this disappointment is that MP architectures introduce a communication component not present in the original sequential algorithm.  Nevertheless, the impact of this communication component is often ignored or trivialized.  One introductory computer science text [2] describes how easy it is to replicate "embarrassingly parallel" sequential algorithms such as the Manhattan telephone book problem to run on an MP, mentioning the complications of the induced communication only in passing.  In contrast, associative SIMD processors [3], or APs, are general purpose and do not need multiprocessor communication overhead.  Indeed, in many cases, their utilization actually simplifies the code [4].

In many texts on parallel computing, Amdahl's law is discussed, ignoring overhead, as establishing the maximal speedup of parallel systems.  This implies that overhead is a minor inconvenience for which a cure is available; it just has to be found.  Mischaracterizing the nature of MP overhead is detrimental for two reasons.  First, many programmers are led to believe that MPs are the best approach to parallelism.  Second, it steals the "no overhead" advantage from APs.  That is, it implies that all parallel implementations with no or minor communication overhead are special cases, just as they are in MPs, and that real, serious AP applications will also suffer from intractable overhead.

The next sections will address MP versus AP overhead issues, including communication overhead, speedup and efficiency, partitioning the CPU, Flynn's taxonomy and Amdahl's law.  Conclusions and References follow.

 

2.0 Communication Overhead

 


The assumption that a sequential algorithm can be implemented on an MP by simple replication assumes that the additional code introduced when a sequential algorithm is converted into an MP algorithm can be ignored, or at worst is bounded by a constant.  This is a major misunderstanding of the nature of control parallelism.  In reality, in some applications such as real-time scheduling, implementing a sequential algorithm on an MP converts it to an NP-hard algorithm [5].  That is, this assumption allows an O(n) sequential algorithm to be replaced by an MP parallel NP-hard algorithm without a true accounting of the cost.  If any simplification is to be employed, the basic sequential algorithm should be ignored and the analysis should concentrate on the communication overhead.

Figure 1 is a simple example to emphasize the obvious point that, in the limit, overhead is dominant in MP implementations.  That is, when converting the sequential pseudo code of Figure 1a into the MP pseudo code of Figure 1b, four additional lines of code are needed for interprocessor communication and synchronization.  These are new, induced components not in the original sequential algorithm, and they are dependent on p, the number of CPUs, not n, the size of the problem.

 

for (I = 0; I < n; I++)
    if (searchName == dataName[I])
        return phoneNumber[I];

 

a - The Sequential Solution

 

synchronize
send the searchName to p processors;
for (I = 0; I < n/p; I++)
    if (searchName == dataName[I])
        return phoneNumber[I];
synchronize
reduce answer

 

b - The MP Solution

Figure 1 - The Manhattan Telephone Book Problem

 

 

The induced communication complexity is more than processor synchronization, message passing, and data movement.  Multiple processors introduce a real time scheduling aspect which can, and frequently does, dominate.  That is, the tasks of determining which processors are available, where the data is located, which processors are best for a job, and when they should do the job are a major component of the added complexity.  Scheduling of shared resources is an NP-hard problem [5].  The induced overhead is not "small," "constant," or inconsequential.  Indeed, it cannot be ignored at all, since the time for communication dominates the analysis of the parallel algorithm.
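To make the induced components of Figure 1b concrete, the following is a minimal sketch of one possible MP implementation using MPI.  The function name, the fixed-width name field, and the data layout are assumptions made only for illustration; the broadcast and the reduction correspond to the induced synchronize/send and synchronize/reduce lines of Figure 1b.

/* Hypothetical sketch, not a specification: an MPI version of Figure 1b.
 * The broadcast and the reduction are the induced communication components;
 * the search loop is the original sequential work divided over p processors. */
#include <mpi.h>
#include <string.h>

long parallel_lookup(const char *searchName, char dataName[][32],
                     long phoneNumber[], long localEntries)
{
    char query[32];
    long localAnswer = -1, answer = -1;
    int rank;

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* induced: send the searchName to the p processors */
    if (rank == 0) {
        strncpy(query, searchName, sizeof query - 1);
        query[sizeof query - 1] = '\0';
    }
    MPI_Bcast(query, (int)sizeof query, MPI_CHAR, 0, MPI_COMM_WORLD);

    /* original work, now over n/p local entries */
    for (long i = 0; i < localEntries; i++)
        if (strcmp(query, dataName[i]) == 0) {
            localAnswer = phoneNumber[i];
            break;
        }

    /* induced: synchronize and reduce the answer (-1 means "not found") */
    MPI_Allreduce(&localAnswer, &answer, 1, MPI_LONG, MPI_MAX, MPI_COMM_WORLD);
    return answer;
}

Note that the cost of both MPI calls depends on p, the number of processors, not on n, which is exactly the induced component discussed above.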

 

3.0 Speedup and Efficiency

 

APs can achieve superlinear speedup by reducing communication overhead.  Consider the problem of finding the largest number in an unordered list.  If Falkoff's algorithm [6] is used, it can be done in constant time on an AP.  Implementing the algorithm on a conventional sequential 32 bit processor requires that it be executed n/32 times for the basic searches, producing n/32 candidates.  Additional searches are required to find the largest number among the candidates.  These additional searches are equivalent to the communication component in an MP implementation.  Thus an AP implementation can reduce the need for communication for certain problems.  APs can be faster and more efficient than conventional machines, i.e.

APs ≤ conventional
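As a sketch of why the AP bound is constant, the following sequential simulation of the bit-serial maximum search (our own rendering, not Falkoff's original notation) makes the structure explicit: on an AP each loop over the cells is a single parallel search step, so only the 32 passes over the bit columns remain.

/* Sketch (a simulation for illustration) of the bit-serial maximum search.
 * responder[] models the AP's responder flags, one per memory cell. */
#include <stdint.h>
#include <stddef.h>

uint32_t falkoff_max(const uint32_t cell[], size_t n, uint8_t responder[])
{
    for (size_t i = 0; i < n; i++)          /* initially every cell responds */
        responder[i] = 1;

    uint32_t value = 0;
    for (int bit = 31; bit >= 0; bit--) {   /* one step per bit column */
        uint32_t mask = 1u << bit;
        int any = 0;
        for (size_t i = 0; i < n; i++)      /* a single parallel search on an AP */
            if (responder[i] && (cell[i] & mask)) { any = 1; break; }
        if (any) {
            value |= mask;                   /* the maximum has a 1 in this column */
            for (size_t i = 0; i < n; i++)   /* keep only the cells that match */
                if (!(cell[i] & mask)) responder[i] = 0;
        }
    }
    return value;                            /* surviving responders hold the maximum */
}

Run sequentially as written, the inner loops reintroduce the dependence on n that the AP removes.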

Consider a second problem - finding duplicate entries in an unordered list.  On a conventional machine, it is an O(n²) problem; on an AP it is O(n).  On an MP it becomes NP-hard because of the overhead required to "share" the data between processors.  MPs are never more efficient than conventional processors, i.e.

 conventional < MPs

Therefore,

APs < MPs

These results suggest that conventionally difficult tasks such as real time scheduling can be implemented on APs in polynomial time, but are NP-hard on MPs.
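Returning to the duplicate-entry example, the sketch below (an illustration with hypothetical names, not a specification of any particular AP) shows where the two bounds come from: the inner comparison is a single broadcast-and-compare step over all cells on an AP, so only the outer loop contributes, while a conventional machine must also execute the inner loop, giving O(n²).

/* Sketch of duplicate detection.  On an AP the inner loop is one
 * broadcast-and-compare step over all cells, so the total is O(n);
 * run sequentially as written here it is the conventional O(n^2). */
#include <stddef.h>

size_t mark_duplicates(const long key[], size_t n, unsigned char dup[])
{
    size_t duplicates = 0;
    for (size_t i = 0; i < n; i++) {         /* n broadcast steps */
        dup[i] = 0;
        for (size_t j = 0; j < i; j++)        /* one parallel compare on an AP */
            if (key[j] == key[i]) {
                dup[i] = 1;
                duplicates++;
                break;
            }
    }
    return duplicates;
}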

 

4.0 CPU = IPU + ALU

The standard practice of coupling the data processing (ALU) and instruction processing (IPU) components of a computer into a CPU is one major cause of inaccurate interarchitecture comparative analysis.  In order to more accurately compare algorithms between different architectures, it is essential that the CPU be divided into IPU and ALU components.  In the case of algorithm analysis on a sequential machine, the analysis is the same as traditional analysis.  However, in the case of parallel architectures and algorithms, the importance of communication between IPUs in control parallelism can be addressed explicitly.

The separation of the CPU into IPU and ALU suggests that interprocessor complexity analysis should be separated into a communication/control parallel portion and a data parallel portion.  Let us reconsider the unordered Manhattan telephone book problem.  Traditional analysis assigns an O(n) complexity in the worst case on a sequential processor.  Dividing the analysis into ALU and IPU parts gives 0 for the IPU part and O(n) for the ALU part.  The relationship CPU = IPU + ALU holds.  Consider a p processor MP implementation.  In the worst case analysis, the ALUs have O(n/p) complexity and the IPU portion has O(2 log p) complexity to account for the synchronize/broadcast and synchronize/retrieve components shown in Figure 1b.  The total is a combined O(2 log p + n/p) complexity.  When p = n, the result is O(2 log n).  For an AP, when p = n, the complexity is O(1) because there is only one IPU.
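The decomposition just described can be summarized in symbols as follows (the T notation is shorthand introduced here, not part of the traditional analysis):

\[
  T_{\mathrm{CPU}} = T_{\mathrm{IPU}} + T_{\mathrm{ALU}}, \qquad
  \begin{array}{lll}
    \text{sequential:}                 & T_{\mathrm{IPU}} = 0,          & T_{\mathrm{ALU}} = O(n)   \\
    \text{MP, } p \text{ processors:}  & T_{\mathrm{IPU}} = O(2\log p), & T_{\mathrm{ALU}} = O(n/p) \\
    \text{AP, } p = n:                 & T_{\mathrm{IPU}} = 0,          & T_{\mathrm{ALU}} = O(1)
  \end{array}
\]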

Let us consider a divide and conquer algorithm to distribute the data.  We will look at the work involved - defined here as the sum of all of the moves for the IPUs and compares for the ALUs.  Consider first the overhead of dividing the telephone number entries among the p processors.  At each step, half of the data must be moved, and there are a total of p-1 IPUs involved.  This results in an O((log p)/2 * (p-1) + 2 log p), or O((p log p)/2), complexity for the IPU portion.  Once the data has been distributed, the work to find the number is O(n) for the ALUs.  The total work is O((p log p)/2 + n).

In an AP, the data does not need to be distributed; thus there is no IPU component, so the total work is O(n).  Note that the ALU work is the same for all architecture classes, O(n).  The IPU cost for the MP algorithm is O((p log p)/2), but it is 0 for the AP and the conventional machine since they have only one IPU.  Thus we see that even for this trivial problem, using interarchitecture comparative analysis produces a clearer picture of the relationships, summarized in Table 1, between solving this problem on different architectures.

 

Table 1 - Interarchitecture Comparative Analysis

                 AP      conventional      MP

complexity       O(1)    O(n)              O(2 log n)
work             O(n)    O(n)              O(n + (n/2) log n)

It is often claimed that MPs are better than APs because they are more general purpose.  The fact is that both a single processor MP and an AP are basically the same as conventional sequential machines.  The difference between them occurs when they are modified to increase their computing power.

MPs achieve more power by replicating CPUs - both IPUs and ALUs.  APs achieve more power by extending the ALU only.  Both architectures are general purpose.

The replicated CPUs give flexibility to the MP architecture, but it comes at a price - interprocessor communication.  The single IPU of the AP avoids any MP communication overhead.  This simplicity also comes at a price: some algorithms may not be efficiently implemented.  The inefficiency is in underutilized ALUs.  Briefly, MPs are more complex to program and use IPUs inefficiently for communication overhead.  APs are simple to program and use ALUs inefficiently by leaving them idle when there is work to do.

5.0 MIMDs vs SISDs and SIMDs

 

Flynn's taxonomy [7] of MIMD, SIMD, SISD and MISD has been supplemented with many more complex taxonomies to reflect the diversity of the many MP architectures, but it is still the most fundamental when distinguishing between architectures based on the number of IPUs and ALUs (SI/MI and SD/MD in Flynn's terms).  MIMDs and SIMDs have traditionally been grouped together as parallel processors based on the fact that they both process data in parallel.  However, we feel that if the CPU is divided into an ALU and an IPU, then SIMDs should be grouped with SISDs, as illustrated in Figure 2.  Of the two components, the IPU is by far the more powerful (and expensive), both in hardware and in software.  That is, the IPU embodies the Turing machine/FSA capability central to the power of a von Neumann general purpose digital computer.  It is the portion that carries out interprocessor communication and is therefore the subject of interprocessor communication analysis.  The ALU by itself can only process two numbers.

When dealing with different algorithms on similar architectures with similar CPUs, separating the ALU and IPU components is not necessary, but when comparing the effectiveness of machines on the same problem using different algorithms on different architectures, bundling the IPU and ALU into one component leads to grossly inaccurate comparisons.  For example, traditionally, a p processor MP and a p processor SIMD would be considered to be of similar power.  But this equates p IPU/ALUs with only one IPU with p ALUs.  It is the number of IPUs used that determines the complexity of the software and the expense of communication, not the number or size of the ALUs.  Thus, the grouping based on IPUs (i.e. MIMDs with MISDs and SIMDs with SISDs) is more profound than the traditional grouping based on ALUs.


 

 SISD   MISD                SISD | MISD
 -----------                     |
 SIMD   MIMD                SIMD | MIMD

 a - MD or ALU View         b - SI or IPU View

 

Figure 2 - Flynn's Taxonomy

 

An associative processor (AP) is a special type of SIMD processor.  A bit serial SIMD computer can be thought of as a SISD with a 512, 1024 or 2048 bit wide ALU.  This massive SIMD parallelism is used for parallel numerical computation, but it is also used for massively parallel searching and can simulate an associative memory; hence the associative processor moniker.  The primary difference between an AP and a SISD is the extended ALU and the software needed to utilize it.  In an AP, the data is corner turned so that searching and bit serial algorithms (such as Falkoff's maximum value algorithm) are efficiently implemented.  Address based computation in general, and pointers in particular, are avoided.  There are, however, a few hardware differences [4].  In an AP, a single responder can be selected by hardware in constant time from a group of responders to a search.  In addition, as with all SIMDs, the ALU and IPU are physically separated.  Frequently, the IPU is a complete processor by itself, having its own dedicated IPU, ALU and memory.  Thus APs frequently have two IPUs; however, they are always organized in a master/slave relationship so that no communication overhead is incurred.  The close relationship of APs to SISDs, as opposed to MPs, is evident in Table 1.
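As an illustration of corner turning, the following sketch (with assumed sizes of 64 values and 32 bit fields, not the parameters of any particular AP) re-stores conventional words as bit planes so that one bit column of every value can be examined in a single wide-word operation, which is what makes bit serial algorithms such as Falkoff's efficient.

/* Sketch of corner turning (hypothetical sizes: 64 values, 32 bits each).
 * plane[b] holds bit b of every value, one value per bit position, so a
 * bit-serial step touches one plane instead of walking all the values. */
#include <stdint.h>

void corner_turn(const uint32_t value[64], uint64_t plane[32])
{
    for (int b = 0; b < 32; b++) {
        plane[b] = 0;
        for (int i = 0; i < 64; i++)
            plane[b] |= (uint64_t)((value[i] >> b) & 1u) << i;
    }
}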

 

6.0 Amdahl's Law

 

Amdahl's law has been restated and rephrased in many ways over the years, so we have included the original quote here:

"A fairly obvious conclusion which can be drawn at this point is that the effort expended on achieving high parallel processing rates is wasted unless it is accompanied by achievements in sequential processing rates of nearly the same magnitude.[8]"


Clearly Dr. Amdahl felt that speeding up the sequential portion of an algorithm was just as important as speeding up the parallel portion.  In our interpretation, given the discussion so far, "parallel processing" refers to IPU parallelism, and ALU "parallelism" is acceptable as a way of speeding up sequential processing.  If our interpretation is correct, then the claim in Section 3.0 that APs are faster than sequential machines is acceptable under Amdahl's law.  That is, Falkoff's algorithm on an extended ALU is simply a method of speeding up the sequential portion of an algorithm.  In light of the discussion of APs and MPs in this paper, it would seem that ignoring overhead, an additional burden required to achieve a high parallel processing rate, violates the spirit of Amdahl's law.  That is, invoking privilege for a component that is known to be intractable for at least some MP algorithms, and then using the violated law to impose limits on other types of parallelism, seems unfair at the least.
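For reference, the common modern restatement of Amdahl's law (not his original wording) is the first expression below, where f is the parallelizable fraction of the work and p the number of processors; the second adds an explicit induced-overhead term t_o(p), which grows with p for an MP but is zero for the single instruction stream of an AP, to make the point about ignoring overhead concrete:

\[
  S(p) = \frac{1}{(1-f) + f/p}
  \qquad \text{versus} \qquad
  S(p) = \frac{1}{(1-f) + f/p + t_o(p)}
\]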

 

7.0 Conclusion

 

An important application of complexity analysis is the ability to accurately compare problem solutions across different architectures, sequential and parallel.  One critical aspect of MP analysis is the impact on complexity of the communication between processors.  We suggest that a more accurate comparison between architectures can be made if the CPU aspect of the traditional sequential processor analysis is divided into an ALU component, which reflects data parallelism, and an IPU component, which reflects control parallelism.  Such a separation will allow the communication component of MP implementations to be more accurately analyzed without penalizing data parallel components.  We point out that when the interprocessor communication issues of control parallelism are analyzed, APs are closer to conventional processors than MPs are, because both have only a single instruction stream.

Traditionally, APs have been considered suitable only for large data parallel problems.  But APs are general purpose, and we propose that the basic AP architecture be extensively considered for the next generation of computers.  The primary problem in today's architecture designs is the von Neumann bottleneck, which restricts the flow of data from memory to the CPU.  The AP is easily scalable to take advantage of any increase in the size of the data path.  We feel that a more rigorous interarchitecture analysis of problems and applications will show that the multi-AP [4] is a good design for large, time consuming problems, and that the standard AP is a good design for the next generation of the common working class general purpose computer.

 

8.0 References

 

[1] Kumar, V., et al., INTRODUCTION TO PARALLEL COMPUTING, Benjamin/Cummings, Redwood City, CA, 1994.

[2] Schneider, G., and J. Gersting, AN INVITATION TO COMPUTER SCIENCE, PWS, 1998.

[3] Potter, J. L., et al., "ASC: An Associative Computing Paradigm," COMPUTER, November 1994, p. 19ff.

[4] Potter, J. L., ASSOCIATIVE COMPUTING, Plenum Press, New York, 1992.

[5] Stankovic, J., et al., "Implications of Classical Scheduling Results for Real-Time Systems," COMPUTER, June 1995, pp. 16-25.

[6] Falkoff, A., "Algorithms for Parallel-Search Memories," J. Assoc. Comput. Mach. 9, 1962, p. 488ff.

[7] Flynn, M. J., "Some Computer Organizations and Their Effectiveness," IEEE TRANS. ON COMPUTERS, Vol. C-21, 1972, p. 948ff.

[8] Amdahl, G. M., "Validity of the Single Processor Approach to Achieving Large Scale Computing Capabilities," in THE PROCEEDINGS OF THE AFIPS SPRING JOINT COMPUTER CONFERENCE, 1967, pp. 483-485.