The Parallel Computing Ceiling

 

 

Jerry L. Potter

 

Department of Computer Science

Kent State University

Kent, OH 44242

potter@cs.kent.edu

 

1.0 Introduction

 

Since the beginnings of computing, the parallel computing ceiling, as illustrated in Figure 1, has limited the number of processors that can be effectively utilized simultaneously.  Many parallel applications do not hit the parallel ceiling because they use limited resources and operate in the near-linear portion of the curve.  Advances in hardware cost reduction and speed have provided substantially more throughput, so that over time there is apparent progress.  The result is the impression that parallel computation can, in general, be extended forever on a linear or near-linear projection.  Thus it is assumed that since cluster hardware is plentiful, it is just a matter of figuring out how to program it to achieve supercomputer performance.  Indeed, some researchers have reported algorithms that have broken the parallel ceiling, even exceeding it in a superlinear manner using statistical approaches.  Statistical approaches work well for average cases, but do not address real-time or worst-case situations.

 

 

Figure 1 – The Parallel Ceiling – Throughput versus Number of Processors

 

 

Other applications have been parallelized using a minimum of communication between parallel components, accounting for the near-linear algorithms that extend the graph to the right.  But most of today's remaining applications, such as data mining, air traffic control, and Web routing, are proving very difficult to conquer even using large numbers of processors.

One common thread that exists among these demanding applications is the need for a dynamic real-time database.  A dynamic real-time database is in essence a single global resource shared by all processors.  In order for multiple processors to work on the same resource, they must be synchronized.  This introduces a complicating communication component not present in any sequential algorithm.  Indeed, it has been shown that real-time dynamic scheduling, also encumbered with synchronization overhead, is intractable [1] and that some tasks, such as Air Traffic Control (ATC), are NP-hard [2].


 

2.0 Single Global Address Space         

In order to search databases efficiently, they must be sorted; this makes them a single global resource.  Sorting is a sequential task; it cannot be done completely in parallel.  Even a hybrid system, whereby the individual packets of data are sorted in parallel on separate machines and then only globally sorted when necessary, is unacceptable for a large dynamic real-time database.  Normally, all would work well, but in the worst case, when new data is added and the processor to which it is assigned is full, the overflow must be cascaded to one or more of its neighbors, and the sort becomes sequential.  No matter how large a portion can be done in parallel, some portion must be done sequentially.  Amdahl's law [3] implies that it is this sequential portion that dominates the performance of parallel sorting algorithms.

Moreover, the transition from many parallel tasks to a sequential one adds complexity to the original sequential algorithm.  It creates a parallel-to-sequential reduction bottleneck that must be resolved by introducing communication overhead.  This communication overhead has O(n²) complexity [4], where n is the number of processors.  Thus at some point, adding more processors to solve an intractable problem makes it worse, not better, as illustrated by Figure 1.  That is, when the increased cost of the overhead complexity exceeds the savings gained on the execution of the original algorithm, the throughput actually decreases as more processors are added.  Unfortunately, in many algorithm analyses the communication component is ignored, even though by Amdahl's law it is an important, possibly the dominant, component.  This attitude is so pervasive that textbooks actually teach that it is acceptable to ignore such factors [5].
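
To make the shape of Figure 1 concrete, the following sketch combines Amdahl's law with a quadratic communication term.  The serial fraction f and the communication coefficient c are assumed values chosen only for illustration, not measurements of any particular system.

    # Illustrative cost model (assumed constants, not measurements): Amdahl-style
    # speedup with an added O(n^2) communication term.  Throughput rises, peaks,
    # and then falls as processors are added, which is the parallel ceiling of
    # Figure 1.
    def speedup(n, f=0.05, c=1e-4):
        """Speedup on n processors for a job with serial fraction f and a
        communication overhead growing as c * n**2 (single-processor time = 1.0)."""
        parallel_time = f + (1.0 - f) / n   # Amdahl's law: the serial part never shrinks
        comm_time = c * n * n               # parallel-to-sequential reduction overhead
        return 1.0 / (parallel_time + comm_time)

    if __name__ == "__main__":
        for n in (1, 2, 4, 8, 16, 32, 64, 128, 256):
            print(f"{n:4d} processors -> speedup {speedup(n):6.2f}")
        # With these assumed constants the curve peaks near 16 processors and
        # then declines as more processors are added.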

The difficulty of sorting a global database in parallel is due to the need for a single global address space inherent in the basic von Neumann model of computation, which is characterized as one in which "...data are kept in named locations, and correct computational behavior is enforced by the programmer who describes the exact order in which computations are to be performed [6]."  The single-address-space concept causes other complications as well.  That is, complex auxiliary data structures such as linked lists, queues, hash tables, and heaps, and the algorithms to support them, are required by the single-address-space model.  Even the quintessential computer science problem of dynamic memory allocation and garbage collection is a result of the von Neumann model of computation.


2.1 The Impact on Hardware

The single-address-space bottleneck caused by the von Neumann model is carried over into today's conventional architecture.  For example, large volumes of data must be processed by looping.  Each datum is processed one at a time by repeatedly fetching the same instructions over and over.  Parallelism does not help; the same number of instructions (in fact more, if overhead is counted) must be fetched whether the data is processed on a single sequential machine or in a parallel cluster.  This requirement puts a strain on hardware design.  As advances make possible the access and manipulation of larger volumes of data and instructions in parallel, with word widths of 64, 128 and even 256 bits, baroque designs such as the very wide word architecture must be instituted to accommodate fetching several sequential instructions simultaneously, even though there is a very high probability that some of the instructions will not be executed due to loop branches.  There is a basic imbalance between the ability to fetch large numbers of instructions and the ability to execute them efficiently.  Yet the throughput in compute-intensive applications is often dependent on the amount of data that can be delivered in parallel.  Skillicorn's von Neumann uniprocessor design, which contains a separate memory for program and data, would help reduce the memory-to-CPU congestion, but with a conventional single combined memory design, the burden of sharing it efficiently falls on the compiler and other support software such as profilers.  These software tools are run "off-line" and cannot adapt to real-time or worst-case situations.

Not only are straightforward advances such as wide memory-to-processor busses hindered by the von Neumann model, but innovative approaches to computing hardware, such as processor-in-memory or smart memories, are also limited [7].  Similarly, the promise of nano-computation, i.e., bio, molecular and optical computing, is hindered.  Any new initiative to build a supercomputer for specific areas of research based on the conventional hardware implementation of the von Neumann model cannot produce any great advancement, because of the hindrances the von Neumann model places on parallel computing.

 

2.2 The Impact on Software


Conventional programming languages cannot solve the single-address-space limitation of the von Neumann model.  Indeed, the von Neumann model of computation is the foundation upon which all modern procedural languages, such as C++ and Java, are based.  Modern software techniques such as structured programming and object-oriented programming do not change the sequential nature of von Neumann computing; they only complicate it.  OOP tries to hide the complexity of the sequential nature of the von Neumann model using techniques such as encapsulation and inheritance, but ultimately it only makes programming more complex.  "...in some sense, the language used on a particular machine corresponds to the model of computation that it uses.  If this is not the case, then there is a very bad 'fit' between language and implementation ... [6]."

Debugging is a major component of the software design and development effort.  Data hiding and the black-box approach to programming make debugging harder.  By definition, bugs exist where they are not expected.  And by design of the von Neumann model, computers execute code sequentially.  Therefore, to a large extent, debugging must be sequential in nature even though the bug itself may not be.  That is, a bug may be due to unanticipated data, a design flaw, or a simple typing error, but finding the actual section of code responsible for the bug requires that the program be traced sequentially from the beginning.  Tracing the sequential execution of a program in a data-hiding and inherited-object environment is extremely complex, since any one of a number of classes in a hierarchy may contain the offending code, and the code for any one class may be located in several different places.  The point that software design principles may have been violated in the design of the program is moot because, again, by definition a bug is a mistake and/or a design violation.

 

3.0 Alternatives to the von Neumann Model

Despite the stated limitations, the von Neumann model of computing has been very successful in solving many complex problems.  However, many of the problems which dominate today's parallel computing research activities require a single global address space when using the von Neumann model.  It would appear that the state of the art of modern hardware design, algorithms, software, and languages is banging against the ceiling limit imposed by the von Neumann model.  No computing system built under this model for "intractable problems" has been able to combine more than a few (8 to 16) processors together, whether in a NUMA, a SUMA, an SMP, a cluster, or any other configuration, to efficiently overcome the problems imposed by the single address space.


There are alternative computing models, for example, dataflow, systolic arrays, and associative computing, among others.  Consider the associative computing model: it addresses the single-address-space problem directly by using a two-dimensional address space instead of a single global address space.  A two-dimensional memory maps easily onto natural conventional two-dimensional data organizations such as arrays, tables, and charts.  One dimension can be treated as sequential while the other is parallel.  Data can be processed in the parallel dimension without looping, and logical conditions are handled by masking, thus minimizing branching.  The associative model is easy to use, and because memory is partitioned into program memory and data memory, as in the von Neumann uniprocessor machine design, the hardware can be optimized for each task.  Parallel von Neumann array-processor hardware implementations are well suited for the associative model [6].
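
As a rough illustration of this style, the sketch below treats a small table as the parallel dimension and replaces a conditional branch with a mask.  NumPy arrays merely stand in for the parallel data memory, and the field names (qty_on_hand, reorder_point) are invented for the example; real associative hardware would carry out these steps in the memory itself.

    # Sketch of the data-parallel, mask-based style implied by the associative
    # model.  The table and field names are hypothetical.
    import numpy as np

    # One record per position along the parallel dimension.
    qty_on_hand   = np.array([12, 0, 7, 30, 2])
    reorder_point = np.array([10, 5, 5, 20, 5])

    # A logical condition becomes a mask over all records, not a branch in a loop.
    needs_reorder = qty_on_hand < reorder_point

    # The operation applies to every selected record at once; no FOR loop and no
    # per-record branching appear in the program text.
    reorder_qty = np.where(needs_reorder, reorder_point - qty_on_hand, 0)
    print(reorder_qty)   # [0 5 0 0 3]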

 

3.1 The Associative Memory Model

Records in the associative model are accessed by content, not address.  Records can be found in constant time regardless of their location in memory.  Records are retrieved based on inherent data, such as date of order or length of time on back order, or both simultaneously.  There is no need to reorganize data to optimize a search, nor for auxiliary fields or data structures such as inverted lists.  Since there is no single global address space, there is no need to sort, and consequently all data can be posted as soon as it arrives in the system.

Logical relationships between records, such as those requiring trees or graphs, are implemented directly in structure codes, not indirectly by layers of indirect addressing.  Structure codes are like any other field in a record.  Thus the data at the nodes and the structural relationship of the nodes can both be accessed directly and in constant time.  The instructions required to search the logical structure are open and direct; they are not obscured by extraneous operations demanded by auxiliary data structures, such as pointer chasing.  Given a record, its structure code can be used to find any other record's structure code related to it in constant time, e.g., parent, children, siblings, nearest left sibling, nearest right sibling, etc.
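
One simple way such a code can work is sketched below for binary trees, using a heap-style numbering in which the root has code 1 and a node with code k has children 2k and 2k+1.  This encoding is only an illustration of the idea; it is not necessarily the structure code used in the associative computing literature.

    # Hypothetical structure code for binary trees: related records are found by
    # constant-time arithmetic on an ordinary code field, with no pointer chasing.
    def parent(code):      return code // 2
    def left_child(code):  return 2 * code
    def right_child(code): return 2 * code + 1
    def sibling(code):     return code ^ 1          # flip the low bit

    # Records carry the code as just another field; a content search on that
    # field retrieves the related record directly.
    records = {1: "root", 2: "A", 3: "B", 4: "C", 5: "D"}   # code -> node data
    print(records[parent(5)])       # "A"  (parent of code 5 is code 2)
    print(records[sibling(5)])      # "C"  (sibling of code 5 is code 4)
    print(records[left_child(2)])   # "C"  (left child of code 2 is code 4)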


Since data is located by content, it can be placed in any associative cell; when new memory is needed, it can be allocated from any available data memory.  And when it is no longer needed, it is simply flagged as "idle."  There is no need for sorting.  There is no need for heaps or garbage collection.
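
A minimal sketch of this allocation discipline follows; the cell layout and field names are hypothetical, and the Python search over cells stands in for what associative hardware would do as a single parallel "find any idle cell" operation.

    IDLE, BUSY = 0, 1
    cells = [{"state": IDLE, "data": None} for _ in range(8)]   # the data memory

    def allocate(record):
        for cell in cells:                 # sequential stand-in for one parallel
            if cell["state"] == IDLE:      # "find any idle cell" operation
                cell["state"] = BUSY
                cell["data"] = record
                return cell
        raise MemoryError("no idle cells")

    def release(cell):
        cell["state"] = IDLE               # no free lists, heaps, or compaction
        cell["data"] = None

    c = allocate({"part": "widget", "qty": 3})
    release(c)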

 

3.2 Associative Programming

The associative model is easy to use and program, closely paralleling human thought.  Human thought is basically sequential, but we have no problem understanding repetitive execution of a single task.  Thus when we plan, we think "wash our hands, then wash the dishes," with no concern about how many hands or dishes are being washed.  The sequential process of planning and the "parallel" process of execution are different and separate.  In the associative model, there is a definite distinction between program or planning memory and data or execution memory.  Program storage, fetching, and execution are sequential, and the hardware is the same as in conventional computers.  However, data is stored in a separate memory.  It is accessed associatively and processed in parallel without any concern as to how many records there are.  Consequently, in the associative model there is no commingling of planning tasks and executing tasks.  For example, FOR loops are not required to process arrays of similar data; such tasks are accomplished with one statement in a program, as sketched below.  The conversion of "natural parallel thoughts" into loops of sequential control is a cause of considerable complexity in many of today's modern programming languages.
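
The contrast can be seen in a small, hypothetical payroll example: the conventional version loops over the records one at a time, while the associative-style version states the operation once, independent of how many records are present (NumPy again stands in for the parallel data memory).

    import numpy as np

    hours = [38, 42, 40, 45]                  # conventional: one record at a time
    overtime = []
    for h in hours:                           # the loop is bookkeeping, not planning
        overtime.append(max(h - 40, 0))

    hours_p = np.array(hours)                 # associative style: one statement,
    overtime_p = np.maximum(hours_p - 40, 0)  # however many records there are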

Humans also use associative techniques in everyday life.  They cannot remember a person's name (i.e., their "address"), but can remember what they were wearing, the color of their hair, etc. (i.e., some properties "associated" with the person).  This natural ability to remember descriptions rather than addresses is the basis of natural language [8].  Indeed, a type of associative memory is used in programming languages in the form of hash tables.  That is, you access a symbol table record by the symbol's name, not by the address of the location in which it was stored.  Of course, in the von Neumann model, in order to use hash tables you must incur overhead to translate the name into an address in a special address space, and once in address form you get involved with the problems of keeping the address space organized, i.e., sorted.  Programmers appreciate the convenience of hash functions, but they use them only sparingly because of inefficient memory utilization, slow execution, and the crude associative capability they provide.  The associative model provides comprehensive associative access with the ease of use of hash tables and none of the memory or execution overhead.
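
The translation overhead referred to above can be made explicit with a small open-addressing sketch: the name is hashed into a slot in a special address space, after which collisions and table organization become the programmer's problem.  The table size and probing strategy here are arbitrary choices for illustration.

    TABLE_SIZE = 16
    table = [None] * TABLE_SIZE

    def put(name, value):
        slot = hash(name) % TABLE_SIZE          # name -> address translation
        while table[slot] is not None and table[slot][0] != name:
            slot = (slot + 1) % TABLE_SIZE      # linear probing on collision
        table[slot] = (name, value)

    def get(name):
        slot = hash(name) % TABLE_SIZE
        while table[slot] is not None:
            if table[slot][0] == name:
                return table[slot][1]
            slot = (slot + 1) % TABLE_SIZE
        raise KeyError(name)

    put("radius", 2.5)
    print(get("radius"))   # 2.5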


 

3.3 Associative Hardware

Traditional associative memories are comparatively expensive.  However, associative computer designs which use conventional sequential hardware components have been around for years [9] and fall into Skillicorn's class of von Neumann "array processors," having a single instruction memory and processor and multiple data processors and memories [6].  The main reason for the lack of acceptance of these computers is a poor understanding of the associative model versus the von Neumann model.  Indeed, most of these computers, such as the Connection Machine, were programmed using the von Neumann model [10].

 

4.0 Pro von Neumann Biases

The associative model holds many advantages over the von Neumann model, but it is not necessarily the solution to all of today's computing problems.  Much more research and development is required.  For example, parallel associative programming needs to be explored.  However, the biases of the single-memory von Neumann programming model need to be made clear, so that the distinction between it and other models can be accurately compared.  Much, virtually all, of today's computer science is not only dominated by but is biased towards the conventional single-memory von Neumann model.  For example, in complexity analysis, the effect of a shared program/data memory is ignored.  Moreover, the efficiency of a parallel algorithm is judged by the ratio of its performance to that of the best sequential (that is, conventional single-memory von Neumann) algorithm.  When alternative algorithms from other programming models claim to accomplish tasks "faster" than possible in the von Neumann model, they are said to be impossible.  This misunderstanding leads to the Falkoff dichotomy.  Specifically, Falkoff's maximum value algorithm can find the maximum value in an unsorted list of numbers in constant time, while it takes O(n) to find the maximum value in the von Neumann model.  Since everyone knows that an improvement of O(n) is impossible, computer scientists analyze the algorithm and the associative model rejecting assumptions that they take for granted in the von Neumann model (such as the speed of electrons in a wire), and in essence penalize the associative model for possessing an intelligent memory.  As a result, considerable effort has been expended to prove that the assumptions for the associative model are just as valid as the assumptions for the von Neumann model [11].
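
The flavor of such an algorithm can be conveyed by a bit-serial sketch whose loop runs over bit positions rather than over records; on associative hardware each bit step is a single parallel operation, so the running time depends on the word length rather than on n.  This rendering is only illustrative and is not claimed to be Falkoff's exact formulation.

    import numpy as np

    def associative_max(values, word_bits=16):
        candidates = np.ones(len(values), dtype=bool)        # every record responds
        for bit in range(word_bits - 1, -1, -1):             # MSB down to LSB
            bit_is_one = ((values >> bit) & 1).astype(bool)  # parallel bit test
            responders = candidates & bit_is_one
            if responders.any():         # some candidate has a 1 here, so any
                candidates = responders  # candidate with a 0 cannot be the maximum
        return int(values[candidates][0])

    data = np.array([23, 7, 54, 54, 19])
    print(associative_max(data))         # 54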


In the von Neumann model, control parallelism is used to achieve data parallelism; thus in complexity analysis, when comparing algorithms from the same programming model, it is possible to combine the cost of program execution (i.e., the instruction stream, or IS) with the cost of data processing (i.e., the ALU) into a single entity, the CPU.  Because of this von Neumann bias, associative computers in general, and Falkoff's algorithm in particular, are charged with multiple ISs where only one is used [12].  The separation of the CPU into ALU and IS components allows a much more accurate analysis of parallel architectures such as clusters, where the true cost of communication overhead can be separated and analyzed.  For example, as described above, it is much easier to program in a data-parallel model than in a control-parallel one.  Separating the CPU into ALU (data) and IS (control) components will give insight not only into the most efficient algorithms for execution, but is important for hardware design as well, since the cost of an IS, with its need for complex cache memory hierarchies, is as much as, if not more than, the cost of the ALU.  The recognition of the true cost of these components is necessary if more efficient computer implementations are to be achieved and the full promise of computer nano-technology realized.

 

5.0 Conclusion

Certain problems are intractable using the von Neumann model of computation.  They are characterized as requiring a global, real-time database.  By Amdahl's law, after the parallel component has been reduced to zero, adding more processors increases the communication demands of the remaining sequential portion, actually reducing throughput, not increasing it.  The parallel ceiling is due to the von Neumann model's use of a single global address space and cannot be overcome by new or different parallel configurations.  Specifically, clusters and "new" supercomputer designs suffer from the same constraint as the older traditional parallel machines: reliance on the von Neumann model of computing.  Faster, cheaper components allow more computation in a given time for a given dollar amount, giving the illusion that progress in parallel computing is being made, but it is the von Neumann computing model itself that causes the ceiling.  The hardware just raises it; it cannot eliminate it.

Alternative models such as associative computing do exist which eliminate the reliance on a single address space.  These models need much more exploration if today's intractable problems are to be solved and the promise of massive nano-based computation is to be achieved.

 

6.0 Bibliography

[1]. Stankovic, J. A., M. Spuri, M. Di Natale and G. C. Buttazzo, "Implications of Classical Scheduling Results for Real-Time Systems," Computer, June 1995, pp. 16-25.

[2]. Meilander, Willard, J. Baker and J. Potter, "Predictability for Real-Time Command and Control," in Proceedings of the 15th International Parallel and Distributed Processing Symposium, San Francisco, April 27, 2001.

[3]. Amdahl, G., "Validity of the Single Processor Approach to Achieving Large Scale Computing Capabilities," Spring Joint Computer Conference, 1967, pp. 483-485.

[4]. Potter, Jerry L., "Limits on High Performance Applications," in Proceedings of SPIE's International Symposium ITCom 2001, Commercial Applications for High-Performance Computing, Denver, CO, August 21-22, 2001, pp. 61-67.


[5]. Schneider, G. and J. Gersting, An Invitation to Computer Science, PWS, 1998.

[6]. Skillicorn, David B., "A Taxonomy for Computer Architectures," Computer, November 1988, pp. 46-57.

[7]. Patterson, D., et al., "Intelligent RAM (IRAM): The Industrial Setting, Applications and Architectures," ICCD '97, The International Conference on Computer Design, Austin, Texas, Oct. 10-12, 1997.

[8].  Winston, Patrick, Artificial Intelligence, Addison-Wesley, Reading, MA, 1984.

[9]. Rudolph, J. A., "STARAN: A Production Implementation of an Associative Array Processor," in Proceedings of the Fall Joint Computer Conference, December 1972.

[10].  Connection Machine Model CM-2 Technical Summary, Thinking Machines Corporation Technical Report Series HA87-4, 1987.

[11]. Jin, M., J. Baker and K. Batcher, "Timings for Associative Operations on the MASC Model," in Proceedings of the 15th International Parallel and Distributed Processing Symposium, San Francisco, April 27, 2001.

[12]. Potter, Jerry L. and W. C. Meilander, "Interarchitecture Comparative Analysis," in Proceedings of the International Conference on Communication in Computing, Monte Carlo Resort, Las Vegas, NV, June 26-29, 2000.

