Mill architecture

The Mill architecture is a novel belt machine-based computer architecture for general-purpose computing. It has been under development since about 2003 by Ivan Godard and his startup Mill Computing, Inc., formerly named Out Of The Box Computing, in East Palo Alto, California. Mill Computing claims it has a "10x single-thread power/performance gain over conventional out-of-order superscalar architectures" but "runs the same programs, without rewrite".
Mill Computing was founded by persons who formerly worked together on a family of digital signal processors (DSPs), the Philips Trimedia.
Approach
The designers claim that the power and cost improvements come by adapting a DSP-like deeply-pipelined processor to general-purpose code. The timing hazards from branches and memory access are said to be handled not via speculative execution, pipelining, and other late binding, but statically-scheduled logic. The claimed improvements in power and area are said to come from eliminating dynamic optimizing hardware: register-renaming, out-of-order execution hazard management, and dynamic cache optimizing. To replace that hardware, each Mill processor is designed to have timing and memory-access behavior predictable to single cycles, so that all the scheduling is subject to a highly-optimizing compiler.
Very long instruction words and split stream instructions
Mill uses a very long instruction word (VLIW)-style encoding to place up to 33 simple operations, a.k.a. opcodes in wide instruction words. These words are not contiguous in memory, but are split into two instruction bundles and placed into two data streams. Each stream is handled by a mostly independent decoder with its own top-level instruction cache.
Instructions are arranged in extended basic blocks, and decoding for both halves begin at the same address in the middle of the block. As the instruction bundles are decoded the program counter in one stream increases and the counter in the other decoder decreases
For a typical instruction like <code>add</code>, both argument operands come from explicitly named positions on the belt, and the result is dropped on the front, ready for the next instruction. Operations with multiple results simply drop more values at the belt front. For example, division can produce a quotient and a remainder. Operations with overflow can produce double-wide results. Most belt instructions are encoded as just an operation code (opcode) and two belt positions, with no added fields to specify a result register, memory address, or literal constant. This encoding is easily extended to richer operations with more than two inputs or more than one result. Constant operands are dropped by separate <code>load immediate</code> instructions. All access of program variables in main random-access memory (RAM) is segregated into separate <code>load</code> or <code>store</code> instructions containing one memory address, or some way to calculate that address from belt operands.
All belt machines have variants of the load/store opcodes to access local variables and the heap. This can be by offsets, from a pointer on the belt, or from various special-purpose base registers. Similarly, there will be instructions to branch to an address taken from the belt, along with branches relative to the program counter.
Temporal addressing
Because each drop of a result moves the prior belt content along to later positions in the queue, a given operand continually changes its position (and hence address) as a result of later execution. In effect, an access to the operand at position zero is a request for the most recent value dropped to the belt, while (for example) a reference to position five is to the sixth most recent drop. Thus the addresses of belt operands reflect the belt history over time. This is temporal addressing. It is hard for human programmers to keep track of belt contents, and hence operand addresses, when writing assembly code for a belt machine. However, it is easy for a compiler to track the changing contents and emit the correct position addresses in generated code.
Spill and fill
The belt is fixed length and may be too short to hold all live transient operands before they are pushed off the end. If an operand is needed for longer than its belt lifetime, it must be saved while still on the belt (spill) and later restored to the belt when needed again (fill). This situation is equivalent to the need to spill registers to memory when a program runs out of registers in a general-register machine. Spilled operands may be written to memory using normal store instructions, and restored using normal load instructions, or spill and fill may use special-purpose storage and associated operations that are faster or offer other advantages over load and store.
Freedom from hazard
The operands on the belt are read-only. New results do not overwrite prior values. The belt is thus a structure, and is immune to the that must be dealt with by modern out-of-order general-register machines.
Compact object code
Dense machine code was very valuable in the 1960s, when main memory was very costly and limited, even on mainframe computers. It became important again on the initially-tiny memories of minicomputers, and then microprocessors. Density remains important today, for applications for smartphone, or downloaded into browsers over slow Internet connections, and in read-only memory (ROM) for embedded applications. A more general advantage of increased density is improved effectiveness of caches and instruction prefetch.
Belt machines have smaller instructions than register-based machines, due to not needing a destination address for results. This saving can make a significant difference for fixed-length instruction formats, which normally use power-of-two instruction widths. If there are thirty-two addressable elements (registers on a general-register machine, belt positions on a belt machine), then each element address occupies five bits in the instruction, needing 15 bits for the three-address format of a general-register machine, but only 10 bits using the two-address format of a belt machine. Because bits are also needed for opcode and other information in the instruction, the (power-of-two constrained) instruction width often determines the maximum number of addressable elements possible in a design. Typically a belt machine instruction can support the encoding of double the number of addressable elements compared to a general-register machine of the same instruction width. There are similar gains in variable-length instruction encodings.
Overall, belt machine code is less compact than for stack machines, which use no operand addresses, but often must introduce stack-manipulation instructions unneeded on a belt machine. The instructions for accumulator machines are not padded out with multiple register fields, instead, they use the return stack and need no extra memory reference instructions.
Implementation
While a belt machine presents an operand queue as the program model, the mill architecture doesn't implement the belt as a physical queue (shift register) in the implemented hardware. Instead it is a semantic representation of the bypass network present in most fast computers, which intercepts pipelined accesses to registers, routing them directly to the execution units that need the result. The number of registers is reasonably small: those needed to pipeline the output of each functional unit, and one for each possible belt item. The small number of registers reduces the size, power and complexity of the network to access the registers. Live data values are kept in conveniently addressable physical resources (individual registers, register files, static random-access memory (SRAM), or operand forwarding from functional units) and generally not moved for the duration of their belt lifetime. Instruction decoder maps logical belt positions to physical locations. The mapping is updated to reflect the changes of logical position arising from newly dropped results.
A patent on the belt was granted in 2016.
Metadata use
Depending on the type and success of load operations, the Mill also assigns metadata to each belt item, including status, width, and vectorization count. Operations operate on the item described. Thus, the width and vector count are not part of the instruction coding. If an operation fails, the failure information is hashed, and placed in the destination, with its metadata, for use in debugging.
The Mill also uses the metadata to assist speculative execution and pipelining. For example, if a vector load operation fails (e.g., part of it leaves a protection boundary) those parts of that belt entry will be marked as <code>not a result</code> (NaR) in the metadata. This allows speculatively-executed vector code to emulate per-vector-item fault behavior. The NaR items create a fault only if an attempt occurs to store them or perform other non-speculative code on them. If they are never used, no fault is ever created.
The Mill's architecture appears able to reduce the size and complexity of pipelined loop code. In the pipeline video, every operation was required to cope with a special operand value called <code>None</code> (not to be confused with <code>not a number</code> in floating point formats), which has special semantics for this purpose: operations where at least one argument is a <code>None</code> generally produce a <code>None</code> as output, and when a <code>None</code> is attempted to be stored to memory, that store (or portion of a store for vectors where only some elements are <code>None</code>) is ignored, leaving that memory location undisturbed. This special <code>None</code> value is not implemented as a reserved bit pattern, but by using the extra metadata bits that are associated with each belt item. In the first few iterations of a pipelined loop, the code drops a group of <code>Nones</code> on the belt using a special <code>retire</code> operation, which tells the CPU how many items should drop in that cycle (that is, however many real items drop onto the belt from previous scheduled operations, <code>retire</code> drops only enough additional <code>Nones</code> to bring the total number of drops that cycle to the requested amount - once steady state is reached, generally no <code>Nones</code> will be generated). This way, the belt always has the expected number of new items for the usual steady-state loop body to operate on, with the <code>Nones</code> acting as placeholders for data that isn't ready. As the operations scheduled from previous iterations of the loop begin dropping results, the belt therefore starts each new loop iteration with more real data items and fewer placeholder <code>Nones</code> (the ordering rules for simultaneous drops from different-latency operations and <code>retire</code> ensure that as more real results appear in new iterations, they always occupy a position that had a <code>None</code> in all previous iterations, so that each operation can use the same input belt number for all iterations). Meanwhile, the store operations in the loop body receive <code>None</code> values until the loop's steady state is reached, and therefore have no effect until real results are available for storing. Thus, the loop body that handles the steady-state of the pipeline code, which includes appropriate <code>retire</code> operations, acts as its own prologue code. The processing of the final elements through the pipeline can generally be finished by having this same loop deliberately execute extra iterations, such that the remaining scheduled operations have time to finish and be stored to memory, because nearly all operations have no side effects (attempted invalid memory reads merely produce a <code>NaR</code> value on the belt, which does not cause a fault unless it is then used by a store or flow control operation).
To pipeline nested loops, the Mill treats each loop almost like a subroutine call, with automatic saves and restores of appropriate state (belt and scratchpad).
Lockstep phased execution
Mill instructions are phased, with the up to 33 operations in a single instruction word being issued over three clock cycles. Mill phasing can capture very short traces and data-flows in a single instruction, and is claimed to improve the available instruction-level parallelism, particularly around control flow. Each phase is overlapped with different phases from neighboring instructions. The phases are also closely tied to the decode bundle arrangement, and allow the decode hardware to be simpler and pipe-lined.
Within an instruction, the reader phase occurs first. These are operations that require no inputs and create output that is available the next cycle. These drop values directly from the instruction stream or from reading static byte addresses of the scratchpad.
Next is the op or compute phase that takes inputs from the belt and drops values onto the belts, such as loads or arithmetic operations. Outputs from the compute phase may take several cycles to retire, dropping onto the belt with hardware-enforced static latency and order.
Then writer phase reads value from the belt and changes global state, but not does not create belt values. Stores and branches occur here, as well as writes to scratchpad addresses.
There are a few other conceptual phases that aren't part of the 3-cycle skew. The pick operation is equivalent to the ternary if operator (?:) and is implemented on bypass network control between the compute and writer phase and adds no computational delay. The Call phase is implemented in the same place, and the hardware saves/restores state such that from the program model no cycles have occurred in the callee, even though many real cycles may have passed until the return.
Family traits
There are several versions of the Mill processor in development, spanning Tin (low-end uses) to Gold (high-performance uses). The company estimates that dual-core Gold chips implemented with 28 nm lithography may work at 1.2 GHz with a typical thermal design power (TDP) of 28watts and performance of 79 billion operations per second (GIPS).
Different versions of the Mill are intended for different markets, and are said to have different instruction set architectures, different numbers of execution units, different pipeline timings, and thus, very different binaries. To accommodate these, compilers are required to emit a specification which is then recompiled into an executable binary by a recompiler supplied by the Mill Computing company. In this way, code that can be distributed is adapted to specifics of the exact model's pipeline, binary coding, etc.
The development of so many tool sets and processor designs may be impractically costly. Ivan Godard said that Mill's plan is to develop software tools that accept a specification for a Mill processor, and then write the software tools (assembler, compiler backend, and simulator), and the Verilog describing the CPU. In a demo video, Mill claimed to show early versions of the software to create an assembler and simulator. The bulk of the compiler is said to be a port of LLVM. , it is incomplete.
 
< Prev   Next >