Stanford Smart Memories Project
Advances in VLSI technology now permit multiple processors to reside on a single integrated circuit (IC). Such a processing system is known as a chip multiprocessor, or multi-core CPU system. Building on this technology, the Stanford Smart Memories Project places several processors on an IC, along with several independent memory blocks. The processors can be connected to the memory blocks in various ways, and connections can be formed and changed even while the processors are running. This ability is a kind of reconfigurable computing.

Depending on how processors talk to memory and to one another, they form a system that is tailored more or less closely to a given style of computation. A fixed (non-reconfigurable) compute system might support one style of computation well but perform poorly on a different style. A reconfigurable computer, however, can adapt to many different styles of computing, and thus provide reasonably good performance across a wide range. Smart Memories has been shown to be effective for diverse compute styles, including MESI-style shared-memory cache coherence, streaming, and transactional memory.

The Stanford Smart Memories Project is an effort to develop a computing infrastructure for the next generation of applications. It is a multicore system with coarse-grained reconfiguration capabilities for supporting diverse computing models, such as speculative multithreading and streaming architectures. These features allow the system to run a broad range of applications efficiently. Research in this area involves VLSI circuits, computer architecture, compilers, operating systems, computer graphics, and computer networking. Smart Memories is a project in the Computer Systems Laboratory, a joint laboratory of the Electrical Engineering and Computer Science departments at Stanford University.

Project overview

The Stanford Smart Memories Project aims to design a single-chip computing element that provides configurable hardware support for diverse computing models and that maps efficiently to future wire-limited VLSI technologies. The Smart Memories chip architecture exploits the fact that wire-delay limitations in future VLSI chips will impose a fine-grained partitioning of processors, memories, and interconnects. Adding programmable wires and logic to this inherently modular organization allows on-chip memories and communication paths to be customized to the particular computing problem at hand. This allows performance competitive with application-specific architectures, but with lower cost and increased flexibility.

This fine-grained partitioning of processing and memory resources also enables substantial hardware parallelism. Effectively exploiting this parallelism in the face of global wiring delays requires aggressive methods for reducing on-chip communication overhead between the various processing and memory structures.

To develop a configurable micro-architecture, the Smart Memories group is studying diverse classes of computing problems (such as ray tracing, multimedia and DSP, speech and voice recognition, and probabilistic reasoning) and the specialized architectures that have been optimized for these problems. This study will provide insight into the hardware primitives and configurable mechanisms required to implement a universal computing substrate. The group is mainly interested in the requirements that such classes of applications place on the memory system of a multiprocessor environment, and is investigating strategies for building a reconfigurable memory system.

Architecture overview

Smart Memories is a multiprocessor system with coarse-grained reconfiguration capabilities. Processing units in this system take the form of tiles, which are grouped in fours to form quads.
These elements connect in a hierarchical manner: a set of intra-quad connections provides communication among the tiles inside a quad, while a mesh interconnection network connects the quads together. Tiles inside a quad share a network interface to connect to the outside world.

Each tile in the Smart Memories system consists of four major parts: two processor cores, a set of configurable memory mats, a crossbar interconnect, and a load/store unit (LSU). Either or both of the processors inside a tile can be turned off when excess processing power is not required, which saves power and lets the tile act purely as a memory resource.

Tile: Processors, memory mats, crossbar and LSU

Processors

Smart Memories uses Xtensa LX commercial configurable processor cores from Tensilica. The cores are 32-bit RISC machines with a flexible 16/24-bit instruction length, and they have been configured as 3-way-issue VLIW machines with flexible instruction formats. The Xtensa LX has a seven-stage pipeline, with two stages for memory access. It has 64 general-purpose registers, a 32-bit floating-point unit, and 32 floating-point registers. Processors are configured and extended using the TIE (Tensilica Instruction Extension) language. The Smart Memories group has defined new interfaces to the memory, plus state registers and custom instructions, to support different programming models.

Memory mats

[Figure: block diagram of a reconfigurable memory mat]

Each memory mat has 1024 32-bit words in its main data array. Each word is associated with six control bits in a separate control array. A programmable PLA performs a read-modify-write operation on the control bits after each access to the memory word. The mat can perform read, write, and compare operations on each 32-bit data word. Each memory mat also has two pairs of pointer/stride registers, which can be used to implement two separate hardware FIFOs inside the mat.
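
The mat behaviour described above can be pictured with a small software model. This is only an illustrative sketch, not the project's hardware or simulator code. The sizes (1024 words of 32 bits, six control bits per word, two pointer/stride pairs) come from the text. The class name `MemoryMat`, the modelling of the PLA as a Python function, the example PLA program `written_flag_pla`, and the wrap-around of pointers at the end of the array are all assumptions made for illustration.

```python
from typing import Callable, List

WORDS = 1024             # words in the main data array (from the text)
DATA_MASK = 0xFFFFFFFF   # each data word is 32 bits
CTRL_MASK = 0x3F         # six control bits per word

# Hypothetical stand-in for the programmable PLA:
# (old control bits, access kind) -> new control bits.
Pla = Callable[[int, str], int]


class MemoryMat:
    """Toy model of one reconfigurable memory mat (illustrative only)."""

    def __init__(self, pla: Pla) -> None:
        self.data: List[int] = [0] * WORDS
        self.ctrl: List[int] = [0] * WORDS
        self.pla = pla
        # Two pointer/stride register pairs.
        self.pointer = [0, 0]
        self.stride = [1, 1]

    def _update_ctrl(self, addr: int, kind: str) -> None:
        # Read-modify-write of the control bits after each data access.
        self.ctrl[addr] = self.pla(self.ctrl[addr], kind) & CTRL_MASK

    def read(self, addr: int) -> int:
        value = self.data[addr]
        self._update_ctrl(addr, "read")
        return value

    def write(self, addr: int, value: int) -> None:
        self.data[addr] = value & DATA_MASK
        self._update_ctrl(addr, "write")

    def compare(self, addr: int, value: int) -> bool:
        hit = self.data[addr] == (value & DATA_MASK)
        self._update_ctrl(addr, "compare")
        return hit

    def next_address(self, pair: int) -> int:
        """Return the pointer of a register pair, then advance it by its stride."""
        addr = self.pointer[pair]
        self.pointer[pair] = (addr + self.stride[pair]) % WORDS  # assumed wrap-around
        return addr


def written_flag_pla(ctrl: int, kind: str) -> int:
    """Assumed example PLA program: set control bit 0 once a word is written."""
    return (ctrl | 0x1) if kind == "write" else ctrl


mat = MemoryMat(written_flag_pla)
mat.stride[0] = 2                            # pair 0 walks every other word
for value in (10, 20, 30):
    mat.write(mat.next_address(0), value)    # lands at addresses 0, 2 and 4
```

Changing the PLA function changes what the control bits mean without touching the data path, which is the kind of flexibility that lets the same array serve different roles.
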
Mats are connected via a two-bit inter-mat communication network, which allows them to exchange control information. They can be configured as caches, FIFOs, or scratchpads.

Crossbar

A crossbar inside the tile connects the memory mats to the two processor cores and to the tile's quad interface. The crossbar has four ports on the processor (LSU) side, two ports to the quad interface, and 16 ports to the memory mats.

Load/Store Unit (LSU)

The load/store unit interfaces the two Tensilica cores to the rest of the memory system. It provides basic interfacing and supports the custom memory operations that were defined using the TIE language. The LSU also communicates with the quad's cache controller to request cache refills, access off-tile memory, and report other special events, such as synchronization misses.

Quad: Four tiles, cache controller, network interface

Each group of four tiles forms one quad. Each quad has a shared cache (protocol) controller, which provides support for the processors inside. It also has a network interface, which sends, receives, and routes packets on the mesh-like network and provides an interface to the outside world.

Cache (protocol) controller

The protocol controller is the heart of the quad. It can perform a variety of actions to support the processors' memory access needs under different programming models. Briefly, the protocol controller services cache evictions and refills, provides access to memory mats in one tile for a processor in another tile (off-tile accesses), enforces the cache coherence invariants of the MESI protocol, acts as a DMA engine to move data in and out of the quad, and provides support for transactions.

Network interface

The network interface is a simple router that connects each quad to its neighbors via a set of wires. It receives packets from the protocol controller or from neighboring quads and routes them to the appropriate destinations.
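
To make the router's job concrete, here is a minimal sketch of per-hop routing on a mesh of quads. The text does not say which routing algorithm the network interface uses. Dimension-order (XY) routing is assumed here purely for illustration, and the `(x, y)` quad coordinates, the port names, and the functions `next_hop` and `route` are hypothetical.

```python
from typing import List, Tuple

Coord = Tuple[int, int]  # hypothetical (x, y) position of a quad in the mesh

# Coordinate change for each output port (assumed port naming).
_STEP = {"east": (1, 0), "west": (-1, 0), "north": (0, 1), "south": (0, -1)}


def next_hop(here: Coord, dest: Coord) -> str:
    """Pick the output port for a packet at quad `here` heading to `dest`.

    Assumed policy: dimension-order (XY) routing. The x offset is corrected
    first, then the y offset. "local" means the packet has arrived and is
    handed to this quad's protocol controller.
    """
    (x, y), (dx, dy) = here, dest
    if x != dx:
        return "east" if dx > x else "west"
    if y != dy:
        return "north" if dy > y else "south"
    return "local"


def route(src: Coord, dest: Coord) -> List[Coord]:
    """List every quad a packet visits from `src` to `dest`, inclusive."""
    path = [src]
    here = src
    port = next_hop(here, dest)
    while port != "local":
        sx, sy = _STEP[port]
        here = (here[0] + sx, here[1] + sy)
        path.append(here)
        port = next_hop(here, dest)
    return path
```

For example, `route((0, 0), (2, 1))` visits `(0, 0)`, `(1, 0)`, `(2, 0)` and `(2, 1)`. Each router needs only its own position and the destination to choose a port, which is consistent with the "simple router" description, although the actual design may differ.
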

Programming models / software

Smart Memories is designed to support different programming models efficiently, allowing an application to be programmed and run in the model that gives the best performance, the greatest programming ease, or both. Smart Memories can reconfigure its memory system to meet the distinct memory access requirements of each of three major models: shared memory, streaming, and transactional consistency.

Shared memory / multi-thread mode

This programming model gives the programmer a cache-coherent shared memory environment. Multi-threaded programs are supported through different APIs, such as pthreads or ANL macros. There are ongoing efforts to map different application classes to the Smart Memories architecture using this programming model, including probabilistic reasoning applications, global illumination, and data structure pre-fetching.

Probabilistic reasoning applications

Probabilistic reasoning is an influential approach in artificial intelligence, where it has been shown to successfully tackle difficult problems in growing fields such as data mining, image analysis, robotics, and genetics. Given the increasingly complex models and large data sets used in these emerging applications, the performance of reasoning algorithms is likely to become important for future computing systems. These algorithms tend to be inherently parallel, but are demanding in compute, memory, and bandwidth resources. Mapping these algorithms onto the Smart Memories architecture makes it possible to evaluate the effectiveness of the various reconfigurable components in the design.

Global illumination on parallel architectures

Monte Carlo ray tracing to generate scenes with global illumination is an application that places heavy demands on a memory system. The application has been coded using pthreads and simulated on the Smart Memories simulator. Although real-time performance on a single Smart Memories chip is achieved, higher performance over current processors is possible.

Related publications: C. Burns, Global Illumination on Parallel Architectures, Senior Thesis, University of Texas Department of Computer Sciences, Dec. 2004

Data structure pre-fetching

Hardware-based or compiler-assisted pre-fetching techniques work well for array-based programs but are less effective at hiding memory latency for pointer-intensive programs. By taking a data-structure-centric approach to pre-fetching (as opposed to control-flow-centric approaches), the Smart Memories project exploits libraries of data structures to help pre-fetch the data stored in those structures. On a chip multiprocessor, an idle or under-utilized processor can pre-fetch data using a pre-fetch thread. A library is modified by adding code for the pre-fetch thread, as well as a few lines that communicate information from the library code to the pre-fetch thread. The pre-fetch thread uses its knowledge of the data structures in the library to identify memory traversal patterns and issues pre-fetches accordingly. This contrasts with issuing pre-fetches for individual load instructions independently. The approach can obtain performance improvements without the assistance of a profiling compiler or costly hardware, even while restricted to the paradigm of sequential programming languages. Furthermore, it makes pre-fetching transparent to the programmer who uses the library, because the application code need not be modified at all.

Streaming

Streaming is the second programming model supported in the Smart Memories system. For data-parallel applications, such as those in the multimedia and DSP domains, the stream programming model gives high performance. Because a program's computation and communication are separated into kernels and streams of data, a compiler can perform many static optimizations. A high-level compiler such as Reservoir Labs R-Stream maps compute kernels to stream co-processors and manages the transfer of data to software-managed local memories.
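
The split between kernels (computation) and streams (communication) can be illustrated with a short sketch. This is not SVM API code and not R-Stream output. The names `stream_blocks`, `run_kernel` and `scale_by_two`, the constant `LOCAL_WORDS`, and the use of a list copy to stand in for a DMA transfer into local memory are all assumptions made for illustration.

```python
from typing import Callable, Iterator, List, Sequence

LOCAL_WORDS = 4  # hypothetical capacity of a software-managed local memory


def stream_blocks(stream: Sequence[int], block: int) -> Iterator[List[int]]:
    """Communication: bring the stream into local memory one block at a time.

    The list copy stands in for an explicit DMA transfer.
    """
    for start in range(0, len(stream), block):
        yield list(stream[start:start + block])


def run_kernel(kernel: Callable[[List[int]], List[int]],
               stream: Sequence[int],
               block: int = LOCAL_WORDS) -> List[int]:
    """Computation: apply `kernel` to each local block, appending the results
    to the output stream."""
    out: List[int] = []
    for local in stream_blocks(stream, block):
        out.extend(kernel(local))
    return out


def scale_by_two(local: List[int]) -> List[int]:
    """Example kernel: it touches only the data in its local block."""
    return [2 * x for x in local]


doubled = run_kernel(scale_by_two, list(range(10)))
```

Because the kernel reads only its local block and the transfers follow a fixed, known pattern, a compiler can decide block sizes and transfer order ahead of time, which is the kind of static optimization the stream model enables.
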
The high-level compiler generates SVM (Stream Virtual Machine) code, that is, C with SVM API calls, which is then compiled by the Tensilica XCC compiler. The SVM runtime implements the SVM API calls to allow a stream program to run on Smart Memories. Smart Memories is an active participant in the Morphware Forum, which develops standards such as the Stream Virtual Machine.

Related publications: F. Labonte, P. Mattson, I. Buck, C. Kozyrakis, and M. Horowitz, The Stream Virtual Machine, PACT, September 2004.

Transactional Coherence and Consistency (TCC)

The last major programming model in the Smart Memories system is transactions. By executing all code as transactions on the memory system, TCC offers a simpler way to parallelize applications than using separate threads. For more details about TCC, please refer to the Stanford TCC website.

Smart Memories test chips

Memory test chip

In February 2003, the Smart Memories group taped out a reconfigurable memory test chip on the TSMC 0.18 µm process. The test chip consisted of four memory blocks, a low-swing crossbar, and testing infrastructure circuits. The chips were successfully tested in the lab, operating at a 1.1 GHz clock frequency at the nominal voltage of 1.8 V (Figure 1). Results were published at ISSCC 2004 (K. Mai, R. Ho, E. Alon, D. Liu, Y. Kim, D. Patil, and M. Horowitz. Architecture and Circuit Techniques for a Reconfigurable Memory Block. ISSCC, February 2004).

[Figure 1]

Interconnect test chip

In April 2002, Smart Memories taped out a low-swing interconnect test chip on the TSMC 0.18 µm process (Figure 2). The test chip consisted of multiple low-swing bus topologies as well as some full-swing buses for comparison. The test chip also contained a sense amplifier offset measurement block (later re-spun on a National Semiconductor 0.25 µm process). The chips have been tested, and a paper was presented at the 2003 VLSI Circuits Symposium (R. Ho, K. Mai, M. Horowitz. Efficient On-Chip Global Interconnects. IEEE Symposium on VLSI Circuits, June 2003).

[Figure 2]