Stanford Smart Memories Project

Advances in VLSI technology now permit multiple processors to reside on a single integrated circuit chip, or IC. Such a processing system is known as a chip multiprocessor, or multi-core CPU system. Building on this technology, the Stanford Smart Memories Project places several processors on an IC, along with several independent memory blocks. In addition, the processors can be connected to the memory blocks in various ways, with the ability to form and change connections even while the processors are running. This ability is a kind of reconfigurable computing.
Depending on how the processors communicate with memory and with one another, the system can be tailored to a given style of computation. A fixed (non-reconfigurable) system might support one style of computation well but consequently perform poorly on a different style. A reconfigurable computer, however, can adapt to many different styles of computing and thus provide reasonably good performance across a wide range. Smart Memories has been shown to be effective for diverse compute styles, including MESI-style shared-memory cache coherence, streaming, and transactional memory.
The Stanford Smart Memories Project is an effort to develop a
computing infrastructure for the next generation of applications. It is a
multicore system
with coarse-grain reconfiguration capabilities for supporting
diverse computing models, such as speculative multithreading and
streaming architectures. These features
allow the system to run a broad range of applications efficiently.
Research in this area involves VLSI circuits,
computer architecture, compilers, operating systems,
computer graphics and computer networking.
Smart Memories is a project in the
Computer Systems Laboratory,
a joint laboratory of the
Electrical Engineering
and Computer Science
departments at Stanford University.
Project overview
The Stanford Smart Memories Project aims to design a
single-chip computing element that provides configurable hardware
support for diverse computing models, and that maps efficiently to future
wire-limited VLSI technologies.
The Smart Memories chip architecture exploits the fact that wire-delay
limitations in future VLSI chips will impose a fine-grained
partitioning of processors, memories, and interconnects. Adding
programmable wires and logic to this inherently modular organization
allows on-chip memories and communication paths to be customized to
the particular computing problem at hand. This allows performance
competitive with application-specific architectures, but with lower
cost and increased flexibility. This fine-grained partitioning of
processing and memory resources also enables substantial hardware
parallelism. Effectively exploiting this parallelism in the face of
global wiring delays requires aggressive methods for reducing on-chip
communication overhead between the various processing and memory
structures.
To develop a configurable micro-architecture, the Smart Memories group
is studying diverse
classes of computing problems (such as ray tracing, multimedia and
DSP, speech and voice recognition, and probabilistic reasoning) and the
specialized architectures that have been optimized for these problems.
This will provide insight into the hardware primitives and
configurable mechanisms required to implement a universal computing
substrate. The group is mainly interested in the requirements that such
classes of applications place on the memory system of a multiprocessor
environment, and is investigating strategies for building a
reconfigurable memory system.
Architecture overview
Smart Memories is a multiprocessor system with coarse-grain
reconfiguration capabilities. Processing units in this system take
the form of tiles which, when put together in groups of four,
form quads. These elements connect in a
hierarchical manner: a set of intra-quad connections provides
communication among the tiles inside a quad, while a mesh
interconnection network connects quads together. Tiles inside a quad
share a network interface to connect to the outside world.
Each tile in the Smart Memories system consists of four major parts:
two processor cores, a set of configurable memory mats, a crossbar
interconnect and a load/store unit (LSU). Either or both of the processors
inside a tile can be turned off when the extra processing power is not
needed, allowing the tile to serve purely as a memory resource and
saving power.
Tile: Processors, memory mats, crossbar and LSU
Processors
Smart Memories leverages
Xtensa LX
commercial configurable processing cores from Tensilica.
The cores are 32-bit RISC machines with a flexible 16/24-bit instruction length.
They have been configured as 3-way issue VLIW machines with flexible instruction
formats. The Xtensa LX has a seven-stage pipeline, with two stages for memory
access. It has 64 general-purpose registers, a 32-bit floating-point unit and 32
floating-point registers.
Processors are configured and extended using the TIE (Tensilica Instruction Extension)
Language. The Smart Memories group has defined new interfaces to the
memory, plus state registers and custom instructions for supporting
different programming models.
Memory mats
Block diagram of a reconfigurable memory mat
Each memory mat has 1024 32-bit words in its main data array. Each word is
associated with six control bits in a separate control array. A programmable
PLA performs a read-modify-write operation on the control bits after each access
to the memory word. The mat can perform read, write and compare operations on
each 32-bit data word.
Each memory mat also has two pairs of pointer/stride registers,
which can be used to implement two separate hardware FIFOs inside the mat. Mats are
connected via a two-bit inter-mat communication network, which allows them to
exchange control information. They can be configured for use as caches,
FIFOs or scratchpads.
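As a rough illustration of the FIFO mode, the following plain-C sketch models the behavior of a single mat used as a hardware FIFO, with one pointer/stride register pair acting as the write pointer and the other as the read pointer over the 1024-word data array. The structure, field names and occupancy counter are illustrative assumptions, not the hardware's actual programming interface.

    /* Minimal behavioral sketch (not the actual hardware interface): a memory
     * mat configured as a hardware FIFO, wrapping over its 1024-word array. */
    #include <stdint.h>
    #include <stdbool.h>

    #define MAT_WORDS 1024u

    typedef struct {
        uint32_t data[MAT_WORDS];   /* main data array: 1024 32-bit words    */
        uint32_t head;              /* read pointer register                 */
        uint32_t tail;              /* write pointer register                */
        uint32_t stride;            /* stride register (1 for a simple FIFO) */
        uint32_t count;             /* occupancy, derived from the pointers  */
    } mat_fifo_t;

    static bool mat_fifo_push(mat_fifo_t *m, uint32_t word)
    {
        if (m->count == MAT_WORDS)                  /* FIFO full */
            return false;
        m->data[m->tail] = word;
        m->tail = (m->tail + m->stride) % MAT_WORDS;
        m->count++;
        return true;
    }

    static bool mat_fifo_pop(mat_fifo_t *m, uint32_t *word)
    {
        if (m->count == 0)                          /* FIFO empty */
            return false;
        *word = m->data[m->head];
        m->head = (m->head + m->stride) % MAT_WORDS;
        m->count--;
        return true;
    }

In the real mat the same pointer/stride pairs, control bits and PLA provide this behavior in hardware, so a producer and a consumer can exchange data through the mat without software queue management.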
Crossbar
A crossbar inside each tile connects the memory mats to the two
processor cores and to the tile's quad interface.
The crossbar has four ports at the processor (LSU) side, two ports to
the quad interface and 16 ports to the memory mats.
Load/Store Unit (LSU)
A Load/Store unit interfaces the two Tensilica cores to the rest of the
memory system. It provides basic interfacing and support for the custom
memory operations that were defined using the TIE language. The LSU also
communicates with the quad's cache controller to request cache refills, access
off-tile memory and report other special events, such as synchronization misses.
Quad: Four tiles, cache controller, network interface
Each group of four tiles forms one quad. Each quad has a shared
cache or protocol controller, which provides support for the processors
inside. It also has a network interface, which sends/receives/routes
packets on the mesh-like network, and provides an interface to
the outside world.
Cache (protocol) controller
The protocol controller is the heart of the quad. It
can perform a variety of actions to support the processors'
memory access needs under different programming models. Briefly,
the protocol controller services cache evictions/refills,
provides access to memory mats in one tile for a processor in
another tile (off-tile accesses), enforces cache coherence invariants
(MESI protocol), acts as a DMA engine to move data in and out of the
quad, and provides support for transactions.
Network interface
The network interface is a simple router that connects each quad to
its neighbors via a set of wires. It receives packets from the protocol
controller or other neighbors and routes them to appropriate destinations.
Programming models / software
Smart Memories is designed to efficiently support different programming models,
allowing an application to be programmed and run in the model that gives the
best performance and/or programming ease. Smart Memories can reconfigure its
memory system to meet the distinct memory access requirements of each of
three major models: shared memory, streaming, and transactional coherence and consistency.
Shared memory / multi-thread mode
This programming model gives the programmer a cache-coherent shared-memory
environment. Multi-threaded programs are supported using
different APIs, such as pthreads or ANL macros. There are ongoing efforts to
map different application classes to the Smart Memories architecture using
this programming model, including probabilistic reasoning applications,
global illumination and data structure pre-fetching.
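As a concrete illustration, the sketch below is an ordinary pthreads program of the kind that runs under this cache-coherent shared-memory configuration. It is a generic example rather than code from the project, and the worker computation is a placeholder.

    /* Minimal pthreads sketch of the shared-memory model: several worker
     * threads update a shared counter protected by a mutex. */
    #include <pthread.h>
    #include <stdio.h>

    #define NUM_THREADS 4

    static long shared_sum = 0;
    static pthread_mutex_t sum_lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg)
    {
        long id = (long)arg;
        long local = 0;

        for (long i = 0; i < 1000000; i++)   /* placeholder computation */
            local += id + 1;

        pthread_mutex_lock(&sum_lock);       /* coherent shared memory keeps  */
        shared_sum += local;                 /* this update visible to all    */
        pthread_mutex_unlock(&sum_lock);     /* threads, on whichever tile    */
        return NULL;                         /* they run                      */
    }

    int main(void)
    {
        pthread_t threads[NUM_THREADS];

        for (long t = 0; t < NUM_THREADS; t++)
            pthread_create(&threads[t], NULL, worker, (void *)t);
        for (long t = 0; t < NUM_THREADS; t++)
            pthread_join(threads[t], NULL);

        printf("shared_sum = %ld\n", shared_sum);
        return 0;
    }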
Probabilistic reasoning applications
Probabilistic reasoning is an influential approach in
artificial intelligence, where it has been shown
to successfully tackle difficult problems in growing fields such as data
mining, image analysis, robotics, and genetics. Given the increasingly complex
models and large data sets used in these emerging applications, the
performance of reasoning algorithms is likely to become important for
future computing systems. These algorithms tend to be
inherently parallel, but are demanding in compute, memory and bandwidth
resources. By mapping these algorithms onto the Smart Memories architecture,
the group can evaluate the effectiveness of various reconfigurable components
in the design.
Global illumination on parallel architectures
Monte-Carlo ray tracing to generate scenes with global illumination is an
application that demands a lot from a memory system. The
application
has been coded using pthreads and simulated on the Smart Memories simulator.
Real-time performance on a single Smart Memories chip is achieved, and
higher performance than current processors is possible.
Related publications: C. Burns,
Global Illumination on Parallel Architectures,
Senior Thesis, University of Texas Department of Computer Sciences, Dec. 2004.
Data structure pre-fetching
Hardware-based or compiler-assisted pre-fetching techniques work well for
array-based programs but are less effective in hiding memory latency for
pointer-intensive programs. By using a data-structure-centric approach to
pre-fetching (as opposed to control-flow-centric approaches), the Smart
Memories project exploits libraries of data structures to help with
pre-fetching the data stored in those structures. Taking advantage of
chip multiprocessors, the approach lets an idle or under-utilized
processor pre-fetch data using a dedicated pre-fetch thread.
A library is modified by adding code for the pre-fetch thread as well as a
few lines to communicate information from the library code to the pre-fetch
thread. The pre-fetch thread uses the knowledge about data structures in the
library to identify the memory traversal patterns and issues pre-fetches
accordingly, in contrast to issuing pre-fetches for individual load
instructions independently. This approach can obtain performance improvements
without the assistance of a profiling compiler or costly hardware, even while
restricted to the paradigm of sequential programming languages. Furthermore,
this approach makes pre-fetching transparent to the programmer using the
library, since the application code need not be modified at all.
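The sketch below illustrates the idea for a hypothetical linked-list library: a helper thread launched by the library walks the list on a spare core and issues prefetch hints ahead of the main traversal. The node layout, the launch point, and the use of GCC's __builtin_prefetch are assumptions for illustration, not the project's implementation.

    /* Illustrative sketch of data-structure-centric pre-fetching for a
     * linked list, assuming an otherwise idle core runs the helper thread. */
    #include <pthread.h>
    #include <stddef.h>

    typedef struct node {
        struct node *next;
        int payload[15];            /* pad each node past a cache line */
    } node_t;

    /* Helper thread: chases next pointers and hints upcoming nodes. */
    static void *prefetch_thread(void *arg)
    {
        const node_t *n = (const node_t *)arg;

        while (n != NULL) {
            __builtin_prefetch(n->next, 0, 1);  /* read, low temporal locality */
            n = n->next;
        }
        return NULL;
    }

    /* Library traversal: the application calling it is unchanged. */
    long list_sum(node_t *head)
    {
        pthread_t helper;
        long sum = 0;

        pthread_create(&helper, NULL, prefetch_thread, head);

        for (node_t *n = head; n != NULL; n = n->next)
            sum += n->payload[0];

        pthread_join(helper, NULL);
        return sum;
    }

Because the helper knows it is walking a list, it can issue hints for whole nodes ahead of the consumer, which per-load hardware prefetchers cannot infer from pointer-chasing code.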
Streaming
Streaming is the second programming model supported
in the Smart Memories system. For data-parallel applications such as those in the
multimedia and DSP domains, the stream programming model gives high performance.
By separating a program's computation and communication into kernels and
streams of data, a compiler can perform many static optimizations. A
high-level compiler such as
Reservoir Labs R-Stream
maps compute kernels to stream co-processors and manages the transfer of
data to software-managed local memories. It generates SVM
(Stream Virtual Machine) code, C with SVM API calls, which is then
compiled by the Tensilica XCC compiler. The SVM runtime implements the SVM
API calls to allow a stream program to run on Smart Memories.
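The following plain-C sketch illustrates the kernel/stream decomposition in spirit only; it does not use the actual SVM API. A small local buffer stands in for a software-managed memory mat, memcpy stands in for the DMA transfers a stream compiler would schedule, and the kernel touches only the locally staged data.

    /* Conceptual kernel/stream decomposition: stage a chunk in, compute,
     * stream the chunk out. */
    #include <string.h>
    #include <stddef.h>

    #define CHUNK 256                       /* words per local-memory block */

    /* Kernel: pure computation on locally staged data. */
    static void scale_kernel(float *local, size_t n, float gain)
    {
        for (size_t i = 0; i < n; i++)
            local[i] *= gain;
    }

    /* Stream loop: explicit data movement separated from the kernel. */
    void scale_stream(const float *in, float *out, size_t n, float gain)
    {
        float local[CHUNK];                 /* stand-in for a scratchpad mat */

        for (size_t off = 0; off < n; off += CHUNK) {
            size_t len = (n - off < CHUNK) ? (n - off) : CHUNK;
            memcpy(local, in + off, len * sizeof(float));    /* gather  */
            scale_kernel(local, len, gain);                  /* compute */
            memcpy(out + off, local, len * sizeof(float));   /* scatter */
        }
    }

Because the communication is explicit and regular, a stream compiler can schedule the transfers to overlap with kernel execution, which is the main source of the performance benefit described above.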
Smart Memories is an active participant in the
Morphware Forum, which develops standards such as the
Stream Virtual Machine.
Related publications: F. Labonte, P. Mattson, I. Buck, C. Kozyrakis
and M. Horowitz,
The Stream Virtual Machine,
PACT, September 2004.
Transactional Coherence and Consistency (TCC)
The last major programming model in the Smart Memories system is
transactions. By executing all
code as transactions on the memory system, TCC offers a simpler
way to parallelize applications than lock-based threading.
For more details about TCC, please refer to the
Stanford TCC website.
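A conceptual sketch of the transactional style follows. The tcc_begin/tcc_commit markers are hypothetical placeholders (empty stubs here), not the real TCC interface; the point is that shared updates inside a transaction need no explicit locks, since under TCC the iterations of a parallel loop like this would run as concurrent transactions and conflicting ones would be rolled back and re-executed.

    /* Conceptual sketch only: transaction boundaries are placeholders. */
    #include <stddef.h>

    static void tcc_begin(void)  { }   /* hypothetical: start a transaction   */
    static void tcc_commit(void) { }   /* hypothetical: commit, or retry on a */
                                       /* detected conflict                   */

    void histogram(const int *data, size_t n, int *bins)
    {
        for (size_t i = 0; i < n; i++) {
            tcc_begin();               /* under TCC, iterations execute as    */
            bins[data[i]]++;           /* speculative transactions; no lock   */
            tcc_commit();              /* protects the shared bins            */
        }
    }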
Smart Memories test chips
Memory test chip
In February 2003, the Smart Memories group taped out
a reconfigurable memory test chip on the TSMC 0.18 µm process.
The test chip consisted of four memory blocks, a low-swing
crossbar, and testing infrastructure circuits.
The chips were successfully tested in the lab, operating at a 1.1 GHz
clock frequency at the nominal supply voltage of 1.8 V (Figure 1).
Results were published at the 2004 ISSCC conference (K. Mai, R. Ho,
E. Alon, D. Liu, Y. Kim, D. Patil, and M. Horowitz.
Architecture and Circuit Techniques for a Reconfigurable Memory Block.
ISSCC, February 2004).
Interconnect test chip
In April 2002, Smart Memories taped out a low-swing interconnect test chip on
the TSMC 0.18 µm process (Figure 2). The test chip consisted of
multiple low-swing bus topologies as well as some full-swing buses
for comparison. The test chip also contained a sense-amplifier offset
measurement block (later re-spun on a National Semiconductor
0.25 µm process). The chips have been tested, and a paper was presented at
the 2003 VLSI Circuits Symposium (R. Ho, K. Mai, M. Horowitz.
Efficient On-Chip Global Interconnects.
IEEE Symposium on VLSI Circuits, June 2003).
 