|
Abstract : |
NUMAchine is a cache-coherent shared-memory multiprocessor designed to have high-performance, be cost-effective, modular, and easy to program for efficient parallel execution. Processors, caches, and memory are distributed across a number of stations interconnected by a hierarchy of unidirectional bitparallel rings. The simplicity of the interconnection network permits the use of wide datapaths at each node, and a novel scheme for routing packets between stations enables high-speed operation of the rings in order to reduce latency. The ring hierarchy provides useful features, such as efficient multicasting and order-preserving message transfers, which are exploited by the cache coherence protocol, for low-latency invalidation of shared data. The hardware is designed so that cache coherence traffic is restricted to localized sections of the machine whenever possible. NUMAchine is optimized for applications with good locality, and system software is designed to maximize locality. Results from detailed behavioral simulations to evaluate architectural tradeoffs indicate that a prototype implementation will perform well for a variety of parallel applications. 1, |