The challenges are to develop an efficient encoding of an application's parallel dependency graph and to reduce the area and power consumption of the microarchitecture that executes this graph. The vector-thread (VT) architectural paradigm addresses both by unifying the vector and multithreaded execution models. VT allows large amounts of structured parallelism to be compactly encoded in a form that lets a simple microarchitecture attain high performance at low power, by avoiding complex control and datapath structures and by reducing activity on long wires.

The VT programmer's model extends a conventional scalar control processor with an array of slave virtual processors (VPs). VPs execute strings of RISC-like instructions packaged into atomic instruction blocks (AIBs). To execute data-parallel code, the control processor broadcasts AIBs to all the slave VPs. To execute thread-parallel code, each VP directs its own control flow by fetching its own AIBs. Implementations of the VT architecture can also exploit instruction-level parallelism within AIBs. In this way, the VT architecture supports a modeless intermingling of all forms of application parallelism. This flexibility provides new ways to parallelize codes that are difficult to vectorize or that incur excessive synchronization costs when threaded. Instruction locality is improved by allowing common code to be factored out and executed only once on the control processor, and by executing the same AIB multiple times on each VP in turn. Data locality is improved because most operand communication is isolated within an individual VP.
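The two ways of delivering AIBs to VPs can be illustrated with a small software model. The sketch below is a toy simulation written for this post, not the SCALE instruction set or any real tool: the names `VirtualProcessor`, `ControlProcessor`, `vector_fetch`, `init_aib`, and `collatz_aib` are all hypothetical, an AIB is modelled as a plain Python function, and the VPs run one after another rather than concurrently. It assumes that a VP signals a fetch of its own next AIB by returning that AIB.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

# Hypothetical model: an AIB is a function that runs to completion on one VP
# and returns either the next AIB that VP fetches for itself, or None.
AIB = Callable[["VirtualProcessor"], Optional[Callable]]


@dataclass
class VirtualProcessor:
    index: int
    regs: dict = field(default_factory=dict)  # private per-VP registers

    def run(self, aib: Optional[AIB]) -> None:
        # Execute the AIB, then keep following the VP's own fetches
        # until it stops requesting further AIBs.
        while aib is not None:
            aib = aib(self)


class ControlProcessor:
    def __init__(self, num_vps: int) -> None:
        self.vps = [VirtualProcessor(i) for i in range(num_vps)]

    def vector_fetch(self, aib: AIB) -> None:
        # Data-parallel case: broadcast the same AIB to every VP.
        # (Run sequentially here for simplicity.)
        for vp in self.vps:
            vp.run(aib)


def init_aib(vp: VirtualProcessor) -> None:
    # Pure data-parallel work: every VP does the same thing on its own data.
    vp.regs["n"] = vp.index + 1
    vp.regs["steps"] = 0
    return None


def collatz_aib(vp: VirtualProcessor):
    # Thread-parallel work: each VP loops a data-dependent number of times
    # by fetching this AIB again for itself.
    n = vp.regs["n"]
    if n == 1:
        return None  # this VP's thread ends
    vp.regs["n"] = n // 2 if n % 2 == 0 else 3 * n + 1
    vp.regs["steps"] += 1
    return collatz_aib


cp = ControlProcessor(4)
cp.vector_fetch(init_aib)     # broadcast: same block on all VPs
cp.vector_fetch(collatz_aib)  # broadcast starts each VP's own thread
steps = [vp.regs["steps"] for vp in cp.vps]
print(steps)  # [0, 1, 7, 2]
```

The second broadcast shows the intermingling described above: the control processor launches the loop on all VPs at once, but each VP decides for itself how many times to fetch the loop body, so a data-dependent loop needs no vector-wide synchronization inside it.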
SCALE, a prototype processor, is an instantiation of the vector-thread architecture designed for low-power, high-performance embedded systems. As transistors have become cheaper and faster, embedded applications have evolved from simple control functions to cellphones that run multitasking networked operating systems with real-time video, three-dimensional graphics, and dynamic compilation of garbage-collected languages. Many other embedded applications require sophisticated high-performance information processing, including streaming media devices, network routers, and wireless base stations. Benchmarks taken from these embedded domains can be mapped efficiently to the SCALE vector-thread architecture. In many cases, the codes exploit multiple types of parallelism simultaneously for greater efficiency.