Copyright 2004 by Jerry Potter
All current models of parallel computing assume a concurrent-write (CW) or data parallel programming model. That is, the programmer is encouraged to believe that each CPU in a parallel machine or cluster is independent of all others and can write to its own memory independently. In reality, however, the system software treats the parallel system as one large address space, in effect a single memory-multiple CPU structure. Unfortunately, the task of converting the assumed CW programming model into the exclusive-write (EW) hardware reality by parallel compilers and dynamic scheduling of the shared memory resource is very difficult. One possible solution is structured objects consisting of cells. A cell encapsulates the original von Neumann model of a unique data address space: it consists of a single ALU and a single dedicated memory, and it contains no instructions. Structured objects may contain traditional data structures in addition to cells.
Many of today’s problems, such as cluster I/O, the MIMD glass ceiling and the WEB data server, are rooted in the von Neumann bottleneck. The traditional single instruction, single data von Neumann model, the base case of conventional computing, assumes a single absolute address space, in essence an EREW (exclusive read, exclusive write) memory. However, when this model is replicated for multi-tasking, distributed or parallel computing, the memory component of the logical programming model over the distributed shared memory becomes CRCW (concurrent read, concurrent write) in nature. Unfortunately, the von Neumann model does not address the problem of multiple programs sharing a common data memory. That is, the conventional programmer assumes that memory will behave for threaded code just as it does for an isolated von Neumann program: that all components of the program are isolated in time and can write to memory without concern for aliasing, memory conflicts or violations of the assumed monolithic address space. However, all memory devices are by necessity EW; only one address can be written at a time. Conventionally, the program’s CW assumption is transformed into an EW component by dynamic scheduling of the shared memory resource, using semaphores, barriers, rendezvous and other synchronization software. This discussion is obvious; however, it is important to point out that this basic characteristic of the von Neumann computing model is true of ALL multi-tasking and parallel computing approaches, be they clusters of computers, WEB client/servers, smart memories (or PPIM), virtual or real parallel computers.
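The serialization described above can be sketched in a few lines. This is a minimal, hypothetical illustration, not an implementation from the text: two threads assume they can write concurrently (the CW programming model), while a lock performs the dynamic scheduling that reduces their writes to the EW reality of the memory device. All names here (shared_memory, writer) are illustrative assumptions.

```python
import threading

shared_memory = {"x": 0}       # a single shared address
lock = threading.Lock()        # the synchronization software serializing writes

def writer(value, repeats):
    for _ in range(repeats):
        with lock:             # exclusive write: one writer per address at a time
            shared_memory["x"] += value

t1 = threading.Thread(target=writer, args=(1, 10000))
t2 = threading.Thread(target=writer, args=(1, 10000))
t1.start(); t2.start()
t1.join(); t2.join()
print(shared_memory["x"])      # 20000: the lock made the CW assumption safe
```

Without the lock, the read-modify-write on the shared address can interleave and lose updates; the lock is exactly the CW-to-EW conversion the paragraph describes, purchased at the cost of synchronization overhead.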
As long as the speeds of the CPUs and memories are balanced and the memory-to-CPU bandwidth is adequate, the synchronization task is not critical. Moreover, when the speed of the CPU increases over that of the memory, the extra speed allows the synchronization overhead to be hidden; the CPU can set and check many flags in the time it takes to deliver the associated data. In addition, special memory systems consisting of layers of caches have been developed to accommodate the ever increasing gap between the speed of the CPUs and the memories.
However, in every non-SISD system, the fundamental nature of the single memory-multiple CPU relationship remains the same, and unfortunately the task of converting the assumed CW programming model into the EW hardware reality by dynamic scheduling of the shared memory resource is very difficult. This fact is becoming more and more obvious as the speedup in hardware has allowed all control parallel tasks to be executed at maximum speed, so that the remaining difficult tasks, such as data mining, weather modeling and ATC, are in essence data parallel. These data intensive tasks cannot be solved simply by mapping them onto any kind of multi-tasking or parallel computer, be it a cluster, a client/server, a shared memory machine, a NUMA, etc., because all of these machines must address the difficult nature of the memory sharing requirement.
The primary difference between these machines is where they place the CW/EW conflict. The parallel machines put it between the CPU and memory. Distributed computing systems, WEB servers and cluster machines put it between the computer and the disk (i.e., it becomes an I/O problem). Some of the various approaches may be better than others for certain applications, but they all hit the bottleneck at some point.
: : :
What is needed is a model of computation that recognizes the single address space-multiple ALU relationship and brings it to the surface so that the programmer can handle it efficiently. One possibility is a structured object. The confusion of Figure 2a in contrast to the structure of Figure 1b is reminiscent of the uncontrolled goto logic before the introduction of the structured code discipline. We propose to introduce a structured object discipline that, like structured code, imposes restrictions on the interaction between ALUs and memories, reducing the impact of the complexity. See Figure 2b.
The main concept is a cell. Structured objects may contain cells in addition to the traditional data structures. A cell encapsulates the original von Neumann model. It consists of a single ALU and a single dedicated memory. By definition no component of a cell may be replicated; however, cells themselves may be replicated (and a cell may serve as the "memory" component of a higher order cell). A cell isolates the control parallel component represented by the IS (Instruction Stream) of a program from the data parallel component, which consists of an ALU and its MEMORY. The division of computation into an IS + (ALU+MEMORY) organization versus an (IS+ALU) + MEMORY organization has a major impact on the ability to accurately characterize and compare grossly different architectures such as sequential machines, cluster machines, client servers and parallel machines.
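The IS + (ALU+MEMORY) organization described above can be sketched as follows. This is a hypothetical illustration under the definitions in the text, not an implementation of it; the names Cell, StructuredObject and broadcast are assumptions introduced here.

```python
class Cell:
    """One ALU plus one dedicated memory; a cell holds no instruction stream."""
    def __init__(self, memory):
        self.memory = list(memory)          # the cell's private address space

    def alu(self, op, addr, operand):
        # The ALU may operate only on this cell's own memory.
        self.memory[addr] = op(self.memory[addr], operand)

class StructuredObject:
    """Holds replicated cells; the single IS is applied to every cell."""
    def __init__(self, cells):
        self.cells = cells

    def broadcast(self, op, addr, operand):
        # One instruction stream driving many (ALU + MEMORY) pairs.
        for cell in self.cells:
            cell.alu(op, addr, operand)

obj = StructuredObject([Cell([1, 2]), Cell([3, 4])])
obj.broadcast(lambda a, b: a + b, 0, 10)    # IS: add 10 to address 0 in every cell
print([c.memory for c in obj.cells])        # [[11, 2], [13, 4]]
```

Note that the instruction stream lives in StructuredObject, outside every cell, while each memory is reachable only through its own ALU; this is the IS + (ALU+MEMORY) split, as opposed to the conventional (IS+ALU) + MEMORY organization where all processors share one memory.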
The cell structure in an object supports data parallel code directly. Random processes cannot talk to random memories inside a structured object. This is not to say that such access is never useful, just as random jumps can be useful; rather, a restriction of this nature greatly reduces the complexity of the design in both software and hardware.
This is a software model, so the mapping of a program written using this model may differ dramatically from hardware to hardware. The model has been implemented on a sequential machine with "good results." "Good results" means that the model is easy to use, but the additional overhead of emulating a parallel environment on a sequential machine results in slower execution. Parallel, distributed and cluster implementations should show significant improvement in both programming and execution.