The Alpha 21264 is a four-issue superscalar microprocessor with out-of-order execution and speculative execution. It has a peak execution rate of six instructions per cycle and could sustain four instructions per cycle. It has a seven-stage instruction pipeline.
Out of order executionEdit
At any given stage, the microprocessor could have up to 80 instructions in various stages of execution, surpassing any other contemporary microprocessor.
Decoded instructions are held in instruction queues and are issued when their operands are available. The integer queue contained 20 entries and the floating-point queue 15. Each queue could issue as many instructions as there were pipelines.
The Ebox executes integer, load and store instructions. It has two integer units, two load store units and two integer register files. Each integer register file contained 80 entries, of which 32 are architectural registers, 40 are rename registers and 8 are PAL shadow registers. There was no entry for register R31 because in the Alpha architecture, R31 is hardwired to zero and is read-only.
Each register file served an integer unit and a load store unit, and the register file and its two units are referred to as a "cluster". The two clusters were designated U0 and U1. This scheme was used as it reduced the number of write and read ports required to serve operands and receive results, thus reducing the physical size of the register file, enabling the microprocessor to operate at higher clock frequencies. Writes to any of the register files thus have to be synchronized, which required a clock cycle to complete, negatively impacting performance by one percent. The reduction of performance resulting from the synchronization was compensated in two ways. Firstly, the higher clock frequency achievable offset the loss. Secondly, the logic responsible for instruction issue avoided creating situations where the register file had to be synchronized by issuing instructions that were not dependent on data held in other register file where possible.
The clusters are near identical except for two differences: U1 has a seven-cycle pipelined multiplier while U0 has a three-cycle pipeline for executing Motion Video Instructions (MVI), an extension to the Alpha Architecture defining single instruction multiple data (SIMD) instructions for multimedia.
The load store units are simple arithmetic logic units used to calculate virtual addresses for memory access. They are also capable of executing simple arithmetic and logic instructions. The Alpha 21264 instruction issue logic utilized this capability, issuing instructions to these units when they were available for use (not performing address arithmetic).
The Fbox is responsible for executing floating-point instructions. It consists of two floating-point pipelines and a floating-point register file. The pipelines are not identical, one executes the majority of instructions and the other only multiply instructions. The adder pipeline has two non-pipelined units connected to it, a divide unit and a square root unit. Adds, multiplies and most other instructions have a 4-cycle latency, a double-precision divide has 16-cycle latency and a double-precision square root has a 33-cycle latency. The floating point register file contains 72 entries, of which 32 are architectural registers and 40 are rename registers.
The primary cache is split into separate caches for instructions and data ("modified Harvard architecture"), the I-cache and D-cache, respectively. Both caches have a capacity of 64 KB. The D-cache is dual-ported by transferring data on both the rising and falling edges of the clock signal. This method of dual-porting enabled any combination of reads or writes to the cache every processor cycle. It also avoided duplication the cache so there are two, as in the Alpha 21164. Duplicating the cache restricted the capacity of the cache, as it required more transistors to provide the same amount of capacity, and in turn increased the area required and power consumed.
The secondary cache, termed the B-cache, is an external cache with a capacity of 1 to 16 MB. It is controlled by the microprocessor and is implemented by synchronous static random access memory (SSRAM) chips that operate at two thirds, half, one-third or one-fourth the internal clock frequency, or 133 to 333 MHz at 500 MHz. The B-cache was accessed with a dedicated 128-bit bus that operates at the same clock frequency as the SSRAM or at twice the clock frequency if double data rate SSRAM is used. The B-cache is direct-mapped.
Branch prediction is performed by a tournament branch prediction algorithm. The algorithm was developed by Scott McFarling at Digital's Western Research Laboratory (WRL) and was described in a 1993 paper. This predictor was used as the Alpha 21264 has a minimum branch misprediction penalty of seven cycles. Due to the instruction cache's two cycle latency and the instruction queues, the average branch misprediction penalty is 11 cycles. The algorithm maintains two history tables, Local and Global, and the table used to predict the outcome of a branch is determined by a Choice predictor.
The local predictor is a two-level table which records the history of individual branches. It consists of a 1,024-entry by 10-bit branch history table. A two-level table was used as the prediction accuracy is similar to that of a larger single-level table while requiring fewer bits of storage. It has a 1,024-entry branch prediction table. Each entry is a 3-bit saturating counter. The value of the counter determines whether the current branch is taken or not taken.
The global predictor is a single-level, 4096-entry branch history table. Each entry is a 2-bit saturating counter; the value of this counter determines whether the current branch is taken or not taken.
The choice predictor records the history of the local and global predictors to determine which predictor is the best for a particular branch. It has a 4,096-entry branch history table. Each entry is a 2-bit saturating counter. The value of the counter determines if the local or global predictor is used.
The external interface consisted of a bidirectional 64-bit double data rate (DDR) data bus and two 15-bit unidirectional time-multiplexed address and control buses, one for signals originating from the Alpha 21264 and one for signals originating from the system. Digital licensed the bus to Advanced Micro Devices (AMD), and it was subsequently used in their Athlon microprocessors, where it was known as the EV6 bus.
Alpha 21264 CPU supports 48-bit or 43-bit virtual address (256 TiB or 8 TiB virtual address space respectively), selectable under IPR control (using VA_CTL control register). Alpha 21264 supports a 44-bit physical address (up to 16 TiB of physical memory). This is an increase from previous Alpha CPUs (43-bit virtual and 40-bit physical for Alpha 21164, and 43-bit virtual and 34-bit physical for Alpha 21064).
The Alpha 21264 contained 15.2 million transistors. The logic consisted of approximately six million transistors, with the rest contained in the caches and branch history tables. The die measured 16.7 mm by 18.8 mm (313.96 mm²). It was fabricated in a 0.35 μm complementary metal–oxide–semiconductor (CMOS) process with six levels of interconnect.
The Alpha 21264 was packaged in a 587-pin ceramic interstitial pin grid array (IPGA).
Alpha Processor, Inc. later sold the Alpha 21264 in a Slot B package containing the microprocessor mounted on a printed circuit board with the B-cache and voltage regulators. The design was intended to use the success of slot-based microprocessors from Intel and AMD. Slot B was originally developed to be used by AMD's Athlon as well, so that API could obtain materials for the Slot B at commodity prices in order to reduce the cost of the Alpha 21264 to gain a wider market share. This never materialized as AMD chose to use Slot A for their slot-based Athlons.
The Alpha 21264A, code-named EV67 was a shrink of the Alpha 21264 introduced in late 1999. There were six versions: 600, 667, 700, 733, 750, 833 MHz. The EV67 was the first Alpha microprocessor to implement the count extension (CIX), which extended the instruction set with instructions for performing population count. It was fabricated by Samsung Electronics in a 0.25 μm CMOS process that had 0.25 μm transistors but 0.35 μm metal layers. The die had an area of 210 mm². The EV68 used a 2.0 V power supply. It dissipated a maximum of 73 W at 600 MHz, 80 W at 667 MHz, 85 W at 700 MHz, 88 W at 733 MHz and 90 W at 750 MHz.
The Alpha 21264B is a further development for increased clock frequencies. There were two models, one fabricated by IBM, code-named EV68C, and one by Samsung, code-named EV68A.
The EV68A was fabricated in a 0.18 μm CMOS process with aluminium interconnects. It had a die size of 125 mm², a third smaller than the Alpha 21264A, and used a 1.7 V power supply. It was available in volume in 2001 at clock frequencies of 750, 833, 875 and 940 MHz. The EV68A dissipated a maximum of 60 W at 750 MHz, 67 W at 833 MHz, 70 W at 875 MHz and 75 W at 940 MHz.
The EV68C was fabricated in a 0.18 μm CMOS process with copper interconnects. It was sampled in early 2000 and achieved a maximum clock frequency of 1.25 GHz.
In September 1998, Samsung announced they would fabricate a variant of the Alpha 21264B in a 0.18 μm fully depleted silicon-on-insulator (SOI) process with copper interconnects that was capable of achieving a clock frequency of 1.5 GHz. This version never materialized.
The Alpha 21264C, code-named EV68CB was a derivative of the Alpha 21264. It was available at clock frequencies of 1.0, 1.25 and 1.33 GHz. The EV68CB contained 15.5 million transistors and measured 120 mm². It was fabricated by IBM in a 0.18 μm CMOS process with seven levels of copper interconnect and low-K dielectric. It was packaged in a 675-pad flip-chip ceramic land grid array (CLGA) measuring 49.53 by 49.53 mm. The EV68CB used a 1.7 V power supply, dissipating a maximum of 64 W at 1.0 GHz, 75 W at 1.25 GHz and 80 W at 1.33 GHz.
The Alpha 21264D, code-named EV68CD is a faster derivative fabricated by IBM.
The Alpha 21264E, code-named EV68E, was a cancelled derivative developed by Samsung first announced on 10 October 2000 at Microprocessor Forum 2000 slated for introduction at around mid-2001. Improvements were a higher operating frequency of 1.25 GHz and the addition of an on-die 1.85 MB secondary cache. It was to be fabricated in a 0.18 micrometre CMOS process with copper interconnects.
Digital and Advanced Micro Devices (AMD) both developed chipsets for the Alpha 21264.
The Digital 21272, also known as the Tsunami, and the 21274, also known as the Typhoon, were the first chipset for the Alpha 21264. The 21272 chipset supported one- or two-way multiprocessing and up to 8GB of memory, while the 21274 supported one-, two-, three- or four-way multiprocessing, up to 64GB of memory, and both supported one or two 64-bit 33 MHz PCI buses. They had 128- to 512-bit memory bus which operated at 83 MHz, yielding a maximum bandwidth of 5,312 MB/s. The chipset supported 100 MHz registered ECC SDRAM.
The chipset consisted of three devices, a C-chip, a D-chip and a P-chip. The number of devices which made up the chipset varied as it was determined by the configuration of the chipset. The C-chip is the control chip containing the memory controller. One C-chip was required for every microprocessor.
The P-chip is the PCI controller, implementing a 33 MHz PCI bus. The 21272 could have one or two P-chips.
The D-chip is the DRAM controller, implementing access to/from the CPUs, and to/from the P-chip. The 21272 could have two or four D-chips and the 21274 could have two, four, or eight D-chips.
The 21272 and 21274 were used extensively by Digital, Compaq and Hewlett Packard in their entry-level to mid-range AlphaServers and in all models of the AlphaStation. It was also used in third-party products from Alpha Processor, Inc. (later known as API NetWorks) such as their UP2000+ motherboard.
AMD developed two Alpha 21264-compatible chipsets, the Irongate, also known as the AMD-751, and its successor, Irongate-2, also known as the AMD-761. These chipsets were developed for their Athlon microprocessors but due to AMD licensing the EV6 bus used in the Alpha from Digital, the Athlon and Alpha 21264 were compatible in terms of bus protocol. The Irongate was used by Samsung in their UP1000 and UP1100 motherboards. The Irongate-2 was used by Samsung in their UP1500 motherboard.
- The Alpha 21264 Microprocessor Architecture, p. 5.
- "Alpha 21264 Microprocessor Data Sheet" (PDF). Compaq Computer Corporation. Retrieved 2020-06-03.
- Gronowski, "High Performance Microprocessor Design", p. 676.
- Compaq, "21264/EV68A Microprocessor Hardware Reference Manual".
- Compaq, "21264/EV68CB and 21264/EV68DC Hardware Reference Manual".
- Compaq Computer Corporation (July 1999). Alpha 21264 Microprocessor Hardware Reference Manual.
- Compaq Computer Corporation (June 2001). 21264/EV68CB and 21264/EV68DC Hardware Reference Manual.
- Compaq Computer Corporation (March 2002). 21264/EV67 Microprocessor Hardware Reference Manual.
- Compaq Computer Corporation (March 2002). 21264/EV68A Microprocessor Hardware Reference Manual.
- Gronowski, Paul E. et al. (1998). "High Performance Microprocessor Design". IEEE Journal of Solid-State Circuits, Volume 33, Number 5, pp. 676–686.
- Gwennap, Linley (28 October 1996). "Digital 21264 Sets New Standard". Microprocessor Report, Volume 10, Number 14. MicroDesign Resources.
- Kessler, R. E.; McLellan, E. J. and Webb, D. A. (1998) "The Alpha 21264 Microprocessor Architecture". Proceedings of the International Conference on Computer Design: VLSI in Computers and Processors. pp. 90–95.
- Kessler, R. E. (1999). "The Alpha 21264 Microprocessor". IEEE Micro, March–April 1999. pp. 24–36.
- Leibholz, Daniel and Razdan, Rahul (1997). "The Alpha 21264: A 500 MHz Out-of-Order Execution Microprocessor". Proceedings of Compcon '97. pp. 28–36.
- Matson, M. et al. "Circuit Implementation of a 600MHz Superscalar RISC Microprocessor". Proceedings of the International Conference on Computer Design: VLSI in Computers and Processors. pp. 104–110.
- Benschneider, B.J. et al. (2000). "A 1 GHz Alpha microprocessor". ISSCC Digest of Technical Papers, pp. 86–87.
- Clouser, J. et al. (July 1999). "A 600-MHz superscalar floating-point processor". IEEE Journal of Solid-State Circuits 34 (7): pp. 1026–1029.
- Fischer, T.; Leibholz, D. (1998). "Design trade offs in stall-control circuits for 600 MHz instruction queues". ISSCC Digest of Technical Papers, pp. 232–234, 444.
- Gieseke, B.A. et al. (1997). "A 600 MHz superscalar RISC microprocessor with out-of-order execution". ISSCC Digest of Technical Papers, pp. 176–177, 451.
- Gronowski, Paul E. et al. (May 1998). "High-performance microprocessor design". IEEE Journal of Solid-State Circuits 33 (5): pp. 676–686.
- Hokinson, R. et al. (2001). "Design and migration challenges for an Alpha microprocessor in a 0.18 μm copper process". ISSCC Digest of Technical Papers, pp. 320–321, 460.