ECE506 DSM Protocols and Races Wiki Supplement

Distributed shared memory (DSM) multiprocessors built on point-to-point interconnection networks provide a highly scalable organization. Each processor communicates by sending requests and data over an interconnect channel, and it reaches main memory the same way. Each node, or processor, keeps its own cache, which reduces access time by limiting how often data must be transferred between nodes.
DSM Protocols
Directory-based
In a directory-based protocol, a node in a point-to-point system sends its requests to a centralized or distributed directory, which tracks the access state of each block in the system. Because every request goes through the directory, the system can have higher cache-to-cache latency. A variation of the directory-based protocol, described in Lightweight Directory Architecture, tries to reduce this latency by incorporating a directory structure in the cache at each node. The Cray XT1 system uses a directory-based cache coherence protocol. The Dawning 6000, rated in the top 10 fastest computers in 2008, also uses a directory-based cache coherence protocol.
The figure above describes the Lightweight directory-based cache coherence architecture.
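The directory lookup described in this section can be illustrated with a minimal sketch. It is not the Lightweight Directory Architecture or any vendor's protocol; the class names, the three states (U, S, EM), and the message tuples are all assumptions made for illustration, and a write to a block owned by another node is simplified to a plain invalidation.

```python
from dataclasses import dataclass, field


@dataclass
class DirectoryEntry:
    # Hypothetical states: U = uncached, S = shared, EM = exclusive/modified.
    state: str = "U"
    sharers: set = field(default_factory=set)


class Directory:
    """Toy home directory: one entry per block, tracking state and sharers."""

    def __init__(self):
        self.entries = {}

    def _entry(self, block):
        return self.entries.setdefault(block, DirectoryEntry())

    def read(self, node, block):
        """Return the messages the home node sends for a read request."""
        e = self._entry(block)
        if e.state == "EM":
            # Another cache owns the block: ask the owner to supply it.
            owner = next(iter(e.sharers))
            msgs = [("intervention", owner, block)]
        else:
            # Memory is up to date: the home replies directly.
            msgs = [("reply_data", node, block)]
        e.sharers.add(node)
        e.state = "S"
        return msgs

    def write(self, node, block):
        """Return the messages the home node sends for a write request."""
        e = self._entry(block)
        # Only the caches listed in the directory are contacted (unicast).
        msgs = [("invalidate", s, block) for s in sorted(e.sharers) if s != node]
        msgs.append(("reply_data", node, block))
        e.sharers = {node}
        e.state = "EM"
        return msgs
```

The point of the sketch is that the home node contacts only the caches recorded in the entry, at the cost of an extra hop through the directory on every request.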
Snoopy
A snoopy protocol is broadcast-based: any change to a shared block results in messages being sent through the interconnect to update or invalidate the copies held in other nodes. As a result, although cache-to-cache transfers have low latency, bandwidth usage and network workload are much higher. Such a protocol is therefore most effective in smaller systems. The snoopy scheme does not have the overhead of directory look-up and maintenance, so memory accesses incur less latency. For example, on a read miss to a shared block, a request can be put on the bus, and the owner of the requested block will provide the data; additionally, the state of the block in all caches will be changed accordingly.
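The read-miss example above can be sketched as a toy bus in which every request is seen by every other cache. The `SnoopyBus` class, its two states (M and S, with an absent entry meaning invalid), and the message counter are assumptions for illustration rather than a full MSI or MESI implementation.

```python
class SnoopyBus:
    """Toy broadcast bus: every request is snooped by all other caches."""

    def __init__(self, num_caches):
        # One dict per cache mapping block -> "M" or "S"; absent means invalid.
        self.caches = [dict() for _ in range(num_caches)]
        self.messages = 0  # number of snoops delivered to other caches

    def read_miss(self, requester, block):
        """Broadcast a read; return who supplies the data."""
        supplier = "memory"
        for cid, cache in enumerate(self.caches):
            if cid == requester:
                continue
            self.messages += 1
            if cache.get(block) == "M":
                # The owner supplies the data and downgrades to shared.
                cache[block] = "S"
                supplier = cid
        self.caches[requester][block] = "S"
        return supplier

    def write(self, requester, block):
        """Broadcast a write; all other copies are invalidated."""
        for cid, cache in enumerate(self.caches):
            if cid == requester:
                continue
            self.messages += 1
            cache.pop(block, None)
        self.caches[requester][block] = "M"
```

Each request costs one snoop per other cache regardless of how many caches actually hold the block, which is why the message count grows with system size.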
Compiler-Directed Coherence
Compiler-Directed Coherence is a type of cache coherence protocol that usually relies on cache line prefetching and uses compiler analysis methods to determine potentially stale, or out-of-date, references. Three main program analysis techniques are used in stale reference analysis: stale reference detection, array dataflow analysis, and interprocedural analysis. Much of this analysis takes place at loop boundaries, and its main purpose is to detect interprocess, or between-iteration, locality so that accesses can be made safely.
Performance evaluation
The effectiveness of the snooping and directory protocols depends on the amount of bandwidth available. As seen in the graph below, when bandwidth is limited, directories are much more effective. With plenty of bandwidth, however, snooping is far superior to directories. The hybrid protocol BASH (Bandwidth Adaptive Snooping Hybrid) combines the two to create a more effective protocol.
BASH behaves like a directory protocol when bandwidth is limited and imitates snooping when there is excess bandwidth. It uses snooping-like broadcast requests as well as directory-like unicast requests. BASH estimates the available bandwidth and adjusts the rate of broadcasts based on that estimate. It determines where link utilization lies relative to a static threshold, which lets it adapt between imitating snooping and imitating a directory protocol.
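The adaptation described above can be sketched as a small controller that compares an observed link utilization against a static threshold and nudges the fraction of requests sent as broadcasts. This is only an illustration of the idea, not the actual BASH mechanism: the class name, the threshold of 0.75, the step size, and the probabilistic choice are all assumptions.

```python
import random


class BandwidthAdaptiveChooser:
    """Toy policy: broadcast less when links are busy, more when they are idle."""

    def __init__(self, threshold=0.75, step=0.1, seed=0):
        self.threshold = threshold   # static utilization threshold (assumed value)
        self.step = step             # adjustment per observation (assumed value)
        self.p_broadcast = 1.0       # start out behaving like snooping
        self._rng = random.Random(seed)

    def observe(self, link_utilization):
        """Update the broadcast rate from a utilization sample in [0, 1]."""
        if link_utilization > self.threshold:
            # Bandwidth is scarce: shift toward directory-like unicasts.
            self.p_broadcast = max(0.0, self.p_broadcast - self.step)
        else:
            # Bandwidth is plentiful: shift toward snooping-like broadcasts.
            self.p_broadcast = min(1.0, self.p_broadcast + self.step)

    def choose(self):
        """Pick how to send the next request."""
        if self._rng.random() < self.p_broadcast:
            return "broadcast"
        return "unicast"
```

With `p_broadcast` at 1.0 the controller always broadcasts, and at 0.0 it always unicasts; in between it mixes the two in proportion to the current estimate.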
DSM Race Conditions
Busy State Race
The home node is in a busy state because P2 has issued a request for a block held by node P1. However, P1 is in the process of writing that block back. The intervention triggered by the request and the writeback cross each other, so home is stuck: it can neither drop the writeback nor NACK it. If home were to NACK, the writeback would be retried later and P2 would receive an outdated value from P1. This is seen in the image below.