Mill Computing, Inc.
The Mill architecture is a novel belt machine-based computer architecture for general purpose computing. It has been under development since about 2003 by Ivan Godard and his startup Mill Computing, Inc., formerly named Out Of The Box Computing, in East Palo Alto, California. Mill Computing claims it has a "10x single-thread power/performance gain over conventional out-of-order superscalar architectures" but "runs the same programs, without rewrite".
The designers claim that the power and cost improvements come by adapting a DSP-like deeply-pipelined processor to general-purpose code. The timing hazards from branches and memory access are said to be handled via speculative execution, pipelining, and other late-binding but statically scheduled logic. The claimed improvements in power and area are said to come from eliminating dynamic optimizing hardware: register renaming, out-of-order execution hazard management, and dynamic cache optimizing. To replace that hardware, each Mill processor is designed to have timing and memory-access behavior predictable to single cycles, so that all scheduling can be done by a highly optimizing compiler.
Mill uses a very long instruction word (VLIW)-style encoding to store up to 33 simple operations in wide instruction words, termed opcodes. Mill uses two program counters, and every wide instruction is split into two parts. One of the program counters counts backward, so the code of every linear instruction block is executed from its middle outward by two almost independent decoders. Unused operations are deleted by a small fixed-format data item in the center of each instruction. This helps maintain code density by reducing the incidence of no-operation codes in Mill code. It also allows each functional unit to start speculatively executing its instruction field, and then discard its result if it has no instruction.
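The middle-outward fetch order can be illustrated with a small Python sketch. It is a simplified model, not the actual Mill encoding: it assumes each instruction half occupies exactly one memory slot, and the names decode_block, memory, and entry are hypothetical.

```python
def decode_block(memory, entry, n_instructions):
    """Yield (backward_half, forward_half) pairs, walking outward from entry.

    Assumption: each half of a wide instruction fills one slot of `memory`.
    """
    forward_pc = entry        # this program counter counts forward
    backward_pc = entry - 1   # this program counter counts backward
    for _ in range(n_instructions):
        yield memory[backward_pc], memory[forward_pc]
        forward_pc += 1
        backward_pc -= 1


# One instruction block: halves b0..b2 lie below the entry point, a0..a2 above.
memory = ["b2", "b1", "b0", "a0", "a1", "a2"]
pairs = list(decode_block(memory, entry=3, n_instructions=3))
print(pairs)
```

In this model the two halves of the first instruction sit next to each other at the entry point, and later instructions lie progressively farther from the middle of the block.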
The Mill uses a novel temporal register addressing scheme, the belt, proposed by Ivan Godard to greatly reduce the complexity of processor hardware, specifically the number of internal registers. It aids understanding to view the belt as a moving conveyor belt where the oldest values drop off the belt and vanish.
The relative-addressing aspects of the mill's machine code and assembly language may be harder to read and debug than the more conventional register name paradigm, but few large projects are written in such low-level programming languages. Eliminating registers avoids complex register renaming schemes.
Mill instructions do not need to specify a location to store a result. Thus, they are smaller by that amount.
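A minimal Python sketch of this addressing scheme follows. It models the belt as a fixed-length queue; the class name Belt, the helper add, and the belt length of 4 are illustrative assumptions, not details of any real Mill model.

```python
from collections import deque


class Belt:
    """Toy belt: new results enter at position 0, the oldest item drops off."""

    def __init__(self, length):
        # A bounded deque discards from the far end when a full belt is pushed.
        self.items = deque(maxlen=length)

    def push(self, value):
        self.items.appendleft(value)

    def __getitem__(self, position):
        return self.items[position]


def add(belt, a, b):
    """An 'add' names only its two source positions; the result has no destination field."""
    belt.push(belt[a] + belt[b])


belt = Belt(length=4)
belt.push(3)
belt.push(4)
add(belt, 0, 1)      # 4 + 3, result lands at position 0
belt.push(10)
belt.push(20)        # the belt is full, so the oldest item (3) drops off
print(list(belt.items))
```

Because every result is implicitly placed at the front, the position of an older value changes with each new result, which is the relative addressing described above.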
Godard says that the belt is not a shift register. Instead it is a semantic representation of the bypass network present in most fast computers, which intercepts pipelined accesses to registers, routing them directly to the execution units that need the result. The number of registers is reasonably small: those needed to pipeline the output of each functional unit, and one for each possible belt item. The small number of registers reduces the size, power and complexity of the network to access the registers.
Belt items are accessed by belt position, and move by changing their names in a pipeline-safe way. The names are not only belt positions, but also tags for function frames. By only incrementing the frame tag counter, the belt appears empty to a newly called function.
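The frame-tag idea can be sketched in Python as follows. This is a simplified model under stated assumptions: it ignores argument passing, return values, and belt length, and the names FramedBelt, call, and ret are hypothetical.

```python
class FramedBelt:
    """Toy belt whose items are named by (frame tag, position)."""

    def __init__(self):
        self.frame = 0
        self.entries = []  # (frame_tag, value) pairs, newest first

    def push(self, value):
        self.entries.insert(0, (self.frame, value))

    def visible(self):
        # Only items tagged with the current frame are addressable.
        return [value for tag, value in self.entries if tag == self.frame]

    def call(self):
        # Incrementing the tag is enough to make the belt look empty.
        self.frame += 1

    def ret(self):
        # Discard the callee's items and expose the caller's belt again.
        self.entries = [(t, v) for t, v in self.entries if t != self.frame]
        self.frame -= 1


belt = FramedBelt()
belt.push(1)
belt.push(2)
belt.call()
in_callee_before = belt.visible()   # empty: the caller's items are hidden
belt.push(9)
in_callee_after = belt.visible()
belt.ret()
in_caller = belt.visible()          # the caller's items reappear unchanged
print(in_callee_before, in_callee_after, in_caller)
```

No caller values are copied or saved at the call in this model; they simply stop matching the current frame tag.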
The length of the belt is designed so that residence time in the belt equals the time to access the scratchpad, a random-access memory (RAM) area used to spill belt items to be reused.
The belt is the fast, CPU end of a hardware caching system called the spiller, which moves belt items between subroutines, the scratchpad, static random-access memory (SRAM) buffer, and the reserved spiller memory area (backed by L2 cache) associated with each functional iteration's data area. If the bandwidth of the spiller is exceeded, the mill stalls, waiting for the belt to become consistent.
A patent on the belt, US 9513921, was granted in 2016.
Depending on the type and success of load operations, the mill also assigns metadata to each belt item, including status, width, and vectorization count. Operations operate on the item described. Thus, the width and vector count are not part of the instruction coding. If an operation fails, the failure information is hashed, and placed in the destination, with its metadata, for use in debugging.
The Mill also uses the metadata to assist speculative execution and pipelining. For example, if a vector load operation fails (e.g., part of it leaves a protection boundary) those parts of that belt entry will be marked as not a result (NaR) in the metadata. This allows speculatively-executed vector code to emulate per-vector-item fault behavior. The NaR items create a fault only if an attempt occurs to store them or perform other non-speculative code on them. If they are never used, no fault is ever created.
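The deferred-fault behavior can be sketched in Python. This is an illustrative model only: the names NaR, Fault, vector_load, add, and store, and the limit parameter standing in for a protection boundary, are assumptions and not Mill operations.

```python
class NaR:
    """Marker for 'not a result'; carries the reason the operation failed."""

    def __init__(self, reason):
        self.reason = reason


class Fault(Exception):
    pass


def vector_load(memory, start, count, limit):
    """Load count items; elements at or past limit become NaR instead of faulting."""
    return [
        memory[i] if i < limit else NaR(f"load past boundary at index {i}")
        for i in range(start, start + count)
    ]


def add(a, b):
    # Speculable operation: a NaR input simply propagates to the output.
    for operand in (a, b):
        if isinstance(operand, NaR):
            return operand
    return a + b


def store(target, address, value):
    # Non-speculable operation: only here does a NaR turn into a fault.
    if isinstance(value, NaR):
        raise Fault(value.reason)
    target[address] = value


memory = [1, 2, 3, 4]
loaded = vector_load(memory, start=2, count=4, limit=4)  # two valid items, two NaRs
sums = [add(x, 10) for x in loaded]                      # no fault so far

results = {}
store(results, 0, sums[0])
store(results, 1, sums[1])
# store(results, 2, sums[2]) would raise Fault; if it is never stored, nothing faults.
print(results)
```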
The Mill's architecture appears able to reduce the size and complexity of pipelined loop code. It uses metadata and speculation to eliminate pipeline set-up and teardown. In the pipeline video, every operation was required to cope with an argument of not a number (NaN) in a sensible way: arithmetic and bit-wise logical operations produce a NaN if any input is a NaN. Stores and other non-speculable operations do nothing. To run a pipelined loop, the code pushes a group of NaNs on the belt, and then starts to execute the steady-state loop body. As live data iterates in the loop body, the pipeline is initialized. Teardown happens in a parallel way by feeding NaNs to the loop. A crucial invention was to allow operations to insert NaNs on the belt, for pipelined loops only.
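The priming and draining scheme can be sketched in Python for a three-stage loop (load, multiply, store). The sentinel EMPTY stands in for the placeholder value the text calls NaN; it and the names mul2, store, and run_pipeline are hypothetical, and the two-iteration drain simply matches this toy pipeline's depth.

```python
EMPTY = object()  # stands in for the placeholder value described in the text


def mul2(x):
    # Arithmetic on a placeholder yields a placeholder.
    return EMPTY if x is EMPTY else x * 2


def store(out, x):
    # A store of a placeholder does nothing.
    if x is not EMPTY:
        out.append(x)


def run_pipeline(data):
    out = []
    loaded = EMPTY    # primed with placeholders: no separate prologue code
    doubled = EMPTY
    # Teardown: feed placeholders until the last live value has been stored.
    stream = list(data) + [EMPTY, EMPTY]
    for item in stream:
        # Steady-state body: all three stages run on every iteration.
        store(out, doubled)       # stage 3, value loaded two iterations ago
        doubled = mul2(loaded)    # stage 2, value loaded one iteration ago
        loaded = item             # stage 1, current item
    return out


print(run_pipeline([1, 2, 3]))
```

The same loop body serves for start-up, steady state, and teardown; only the values flowing through it differ.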
To pipeline nested loops, the mill treats each loop almost like a subroutine call, with saves and restores of appropriate state.
Another improvement said to open up instruction-level parallelism is that Mill instructions are phased. Instructions may span several clock cycles, and hold up to 33 operations. Within an instruction, math operations finish first, data rearrangements in the middle, and stores to memory last. Also, both the operations and even multiple cores operate with statically predictable, prioritized timings.
There are several versions of the Mill processor in development, spanning Tin (low-end uses) to Gold (high-performance uses). The company estimates that a dual-core Gold chip implemented with 28 nm lithography may work at 1.2 GHz with a typical thermal design power (TDP) of 28 watts and performance of 79 billion operations per second.
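As a rough consistency check, the quoted figures can be converted to operations per clock cycle. The sketch below assumes that the 79 billion figure is the aggregate for both cores and that both run at the full 1.2 GHz; the text does not state either point.

```python
ops_per_second = 79e9   # estimated performance quoted above
clock_hz = 1.2e9        # estimated clock rate quoted above
cores = 2               # dual-core chip (assumption: the figure covers both cores)

ops_per_cycle_per_core = ops_per_second / (clock_hz * cores)
print(round(ops_per_cycle_per_core, 1))
```

Under those assumptions the result is about 32.9 operations per cycle per core, close to the maximum of 33 operations per wide instruction mentioned earlier, which suggests the estimate is a peak rather than a sustained rate.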
Different versions of the mill are intended for different markets, and are said to have different instruction set architectures, different numbers of execution units, different pipeline timings, and thus, very different binaries. To accommodate these, compilers are required to emit a specification which is then recompiled into an executable binary by a recompiler supplied by the Mill Computing company. In this way, code that can be distributed is adapted to specifics of the exact model's pipeline, binary coding, etc.
The development of so many tool sets and processor designs may be impractically costly. Ivan Godard said that Mill's plan is to develop software tools that accept a specification for a Mill processor and then generate the model-specific software tools (assembler, compiler backend, and simulator) and the Verilog describing the CPU. In a demo video, Mill claimed to show early versions of the software to create an assembler and simulator. The bulk of the compiler is said to be a port of LLVM. As of 2014, it was incomplete.