By
Jerry Potter
Copyright 2004 by Jerry Potter
All current models of parallel computing assume a CW (concurrent write) or data parallel programming model. That is, the programmer is encouraged to believe that each CPU in a parallel machine or cluster is independent of all the others and can write to its own memory independently. In reality, however, the system software treats the parallel system as one large address space, in effect a single memory-multiple CPU structure. Unfortunately, the task of converting the assumed CW programming model into the EW (exclusive write) hardware reality by parallel compilers and dynamic scheduling of the shared memory resource is very difficult. One possible solution is structured objects consisting of cells. A cell encapsulates the original von Neumann model of a unique data address space. It consists of a single ALU and a single dedicated memory and contains no instructions. Structured objects may contain traditional data structures in addition to cells.
Many of today’s problems, such as cluster I/O, the MIMD glass ceiling[1] and the WEB data server, are rooted in the von Neumann bottleneck. The traditional single instruction single data von Neumann model, the base case of conventional computing, assumes a single absolute address space, in essence an EREW memory. However, when this model is replicated for multitasking, distributed or parallel computing, the memory component of the logical programming model over the distributed shared memory becomes CRCW in nature. Unfortunately, the von Neumann model does not address the problem of multiple programs sharing a common data memory. That is, the conventional programmer assumes that the memory will work the same way for threaded code as for an isolated von Neumann program - that all components of the program are isolated in time and can write to memory without concern for aliasing, memory conflicts, or violating the assumed monolithic address space. However, all memory devices by necessity are EW: only one address can be written at a time. Conventionally, the CW assumption of the program is transformed into an EW component by dynamic scheduling of the shared memory resource using semaphores, barriers, rendezvous and other synchronization software. These observations are obvious; however, it is important to point out that this basic characteristic of the von Neumann computing model is true of ALL multitasking and parallel computing approaches, be they clusters of computers, WEB client/servers, smart memories (or PPIM), virtual or real parallel computers.
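The transformation described above, a CW programming model serialized into EW writes by synchronization software, can be sketched in a few lines. This is an illustrative sketch only: the thread count, lock, and counter are hypothetical names, not part of the original text, but the lock plays exactly the role of the semaphores and barriers mentioned above.

```python
import threading

# Shared "memory": one word that several threads (standing in for CPUs)
# believe they may write concurrently (the CW assumption of the model).
memory = {"x": 0}
lock = threading.Lock()  # synchronization that enforces the EW reality

def writer(n):
    for _ in range(n):
        # Without the lock, the read-modify-write below could interleave
        # across threads and lose updates. The lock serializes writes:
        # one address, one writer at a time.
        with lock:
            memory["x"] += 1

threads = [threading.Thread(target=writer, args=(1000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(memory["x"])  # 4000: every write survived because writes were exclusive
```

The programmer writes as if all four threads update memory freely; the runtime's synchronization quietly converts that assumption into exclusive writes, which is the dynamic scheduling burden the text describes.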
As long as the speeds of the CPUs and memories are balanced and the memory-to-CPU bandwidth is balanced, the synchronization task is not critical. Moreover, when the speed of the CPU increases beyond that of the memory, the extra speed allows the synchronization overhead to be hidden. That is, the CPU can set and check many flags in the time it takes to deliver any associated data. In addition, special memory systems consisting of layers of caches have been developed to accommodate the ever-increasing gap between the speed of the CPUs and the memories.
However, in every non-SISD system, the fundamental nature of the single memory-multiple CPU relationship remains the same, and unfortunately, the task of converting the assumed CW programming model into the EW hardware reality by dynamic scheduling of the shared memory resource is very difficult. This fact is becoming more and more obvious as the speedup in hardware has allowed all control parallel tasks to be executed at maximum speed, so that the remaining difficult tasks, such as data mining, weather modeling and ATC, are in essence data parallel. These data intensive tasks cannot be solved simply by mapping them onto any kind of multitasking or parallel computer, be it a cluster, a client/server, a shared memory machine, a NUMA, etc., because all of these machines must address the difficult nature of the memory sharing requirement.
The primary difference between these machines is where they place the CW/EW conflict. The parallel machines put it between the CPU and the memory. Distributed computing, WEB servers and cluster machines put it between the computer and the disk (i.e., it becomes an I/O problem). Some of the various approaches may be better than others for certain applications, but they all hit the bottleneck at some point.
. . .
What is needed is a model of computation that recognizes the single address space-multiple ALU relationship and brings it to the surface so that the programmer can handle it efficiently. One possibility is a structured object. The confusion of Figure 2a, in contrast to the structure of Figure 1b, is reminiscent of the uncontrolled goto logic before the introduction of the structured code discipline. We propose to introduce a structured object discipline that, like structured code, imposes restrictions on the interaction between ALUs and memories, reducing the impact of the complexity. See Figure 2b.
The main concept is a cell. Structured objects may contain cells in addition to the
traditional data structures. A cell
encapsulates the original von Neumann model.
It consists of a single ALU and a single dedicated memory. By definition, no component of a cell may be replicated; however, cells themselves may be replicated (and a cell may serve as the "memory" component of a higher-order cell). A cell isolates the
control parallel component represented by the IS (Instruction Stream) of a
program from the data parallel component which consists of an ALU and its
MEMORY. The division of computation
into an IS + (ALU+MEMORY) organization versus an (IS+ALU) + MEMORY organization
has a major impact on the ability to accurately characterize and compare
grossly different architectures such as sequential machines, cluster machines,
client servers and parallel machines.
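The cell concept above can be sketched directly: one ALU bound to one dedicated memory, holding no instructions of its own, with a single instruction stream (IS) broadcast from outside to replicated cells. The `Cell` class and `apply` method below are hypothetical names for illustration under the paper's definitions, not the author's API.

```python
from dataclasses import dataclass, field

@dataclass
class Cell:
    """One ALU bound to one dedicated memory. A cell holds no
    instructions; the instruction stream (IS) lives outside the cell
    and is broadcast to it."""
    memory: dict = field(default_factory=dict)

    # The "ALU": operates only on this cell's own dedicated memory.
    def apply(self, op, dst, *srcs):
        self.memory[dst] = op(*(self.memory[s] for s in srcs))

# Cells may be replicated; each replica owns its memory exclusively.
cells = [Cell(memory={"a": i, "b": 10}) for i in range(4)]

# A single instruction stream drives all cells: the IS + (ALU+MEMORY)
# organization, as opposed to (IS+ALU) + MEMORY.
for c in cells:
    c.apply(lambda a, b: a + b, "sum", "a", "b")

print([c.memory["sum"] for c in cells])  # [10, 11, 12, 13]
```

Because each `apply` touches only the invoking cell's own memory, no write conflict between cells is possible by construction, which is the point of the encapsulation.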
The cell structure in an object supports data parallel code directly. Random processes cannot talk to random memories inside a structured object. It is not that such access is never useful, just as random jumps can be useful; rather, a restriction of this nature greatly reduces the complexity of the design in both software and hardware.
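The restriction can be made concrete with a small guard: only the owning ALU may write a cell's memory, much as structured code forbids arbitrary jumps. The `CellMemory` class, `store` function, and `PermissionError` enforcement below are hypothetical illustrations of the discipline, not a mechanism from the original text.

```python
class CellMemory(dict):
    """A dedicated memory that records which ALU owns it."""
    def __init__(self, owner_id):
        super().__init__()
        self.owner_id = owner_id

def store(memory, writer_id, key, value):
    # The structured object discipline: a random process may not
    # write to a random cell's memory.
    if writer_id != memory.owner_id:
        raise PermissionError("cross-cell write violates the cell discipline")
    memory[key] = value

m0, m1 = CellMemory(0), CellMemory(1)
store(m0, 0, "x", 42)          # an ALU writing its own memory: allowed
try:
    store(m1, 0, "x", 99)      # ALU 0 writing cell 1's memory: rejected
except PermissionError as e:
    print("blocked:", e)
```

Under this discipline the CW/EW conflict never arises inside the object, because no two ALUs can ever address the same memory word.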
This is a software model, so the mapping of a program written using this model may differ dramatically from hardware to hardware. The model has been implemented on a sequential machine with "good results." "Good results" means that the model is easy to use, but the additional overhead of emulating a parallel environment on a sequential machine results in slower execution. All parallel, distributed, and cluster implementations should show significant improvement in both programming and execution.