Runahead

Runahead is a technique that allows a computer processor to speculatively pre-process instructions during cache miss cycles. The pre-processed instructions are used to generate instruction and data stream prefetches by executing instructions leading to cache misses (typically called long latency loads) before they would normally occur, effectively hiding memory latency. In runahead, the processor uses the idle execution resources to calculate instruction and data stream addresses using the available information that is independent of a cache miss. Once the processor has resolved the initial cache miss, all runahead results are discarded, and the processor resumes execution as normal. The primary use case of the technique is to mitigate the effects of the memory wall. The technique may also be used for other purposes, such as pre-computing branch outcomes to achieve highly accurate branch prediction.^[1]

The principal hardware cost is a means of checkpointing the register file state. Typically, runahead processors will also contain a small additional cache, which allows runahead store operations to execute without modifying actual memory. Certain implementations also use dedicated hardware acceleration units to execute specific slices of pre-processed instructions.^[2]^[3]

Runahead was initially investigated in the context of an in-order microprocessor;^[4] however, this technique has been extended for use with out-of-order microprocessors.^[5]

Triggering

In principle, any event can trigger runahead, though typically the entry condition is a last level data cache miss that makes it to the head of the re-order buffer.^[5] In a normal out-of-order processor, such long latency load instructions block retirement of all younger instructions until the miss is serviced and the load is retired.

When a processor enters runahead mode, it checkpoints all architectural registers and records the address of the load instruction that caused entry into runahead. All instructions in the pipeline are then marked as runahead. Because the value returned from a cache miss cannot be known ahead of time, it is possible for pre-processed instructions to be dependent upon unknown or invalid data. Registers containing such data, or data dependent on it, are denoted by adding an "invalid" or INV bit to every register in the register file. Instructions that use or write such invalid data are also marked with an INV bit. If the instruction that initiated runahead was a load, it is issued a bogus result and marked as INV, allowing it to mark its destination register as INV and drain out of the pipeline.

Pre-processing instructions

In runahead mode, the processor continues to execute instructions after the instruction that initiated runahead. However, runahead is considered a speculative state in which the processor only attempts to generate additional data and instruction cache misses which are effectively prefetches. The designer can opt to allow runahead to skip instructions that are not present in the instruction cache with the understanding that the quality of any prefetches generated will be reduced since the effect of the missing instructions is unknown.

Registers that are the target of an instruction that has one or more source registers marked INV are marked INV. This allows the processor to know which register values can (probably) be trusted during runahead mode. Branch instructions that cannot be resolved due to INV source registers are simply assumed to have been predicted correctly. In case the branch was mispredicted, the processor continues executing wrong-path instructions until it reaches a branch independent point, potentially executing wrong-path loads that pollute cache with useless data entries. Valid branch instruction outcomes can be saved for later use as highly accurate predictions during normal operation.

Since runahead is a speculative state, store instructions cannot be allowed to modify memory. In order to communicate store results to dependent loads, a very small cache only accessed by runahead loads and misses, called a runahead cache, can be used.^[5] This cache is functionally similar to a normal cache, but contains INV bits to track which data is invalid. INV stores set the INV bit of their corresponding target cache line, while valid stores reset the INV bit of the cache line. Any runahead load instruction must check both real and runahead cache. If the load hits in runahead cache, it will discard the real cache result and use the runahead cache data, potentially becoming invalid if the cache line was marked with a INV bit. Because the runahead cache is separate from the memory hierarchy, there is no place to evict old data to. Therefore, in case of a cache conflict, the old data is simply dropped from the cache. Note that because of the limited size of the runahead cache, it is not possible to perfectly track INV data during runahead mode (as INV data may be overwritten by valid data in a cache conflict). In practice, this is not crucial since all results computed during runahead mode are discarded.

Exiting

As with entering runahead, any event can in principle be cause for exiting runahead. Though in the case of a runahead period initiated by a cache miss, it is typically exited once the cache miss has been serviced.

When the processor exits runahead, all instructions younger than and including the instruction that initiated runahead are squashed and drained out of the pipeline. The architectural register file is then restored from the checkpoint. A predetermined register aliasing table (RAT) is then copied into both the front- and backend RAT. Finally, the processor is redirected to the address of the instruction that initiated runahead. The processor then resumes execution in normal mode.

Register file checkpoint options

The simplest method of checkpointing the architectural register file (ARF) is to simply perform a full copy of the entire physical register file (PRF) (because the PRF is a superset of the ARF) to a checkpoint register file (CRF) when the processor enters runahead mode. When runahead is exited, the processor can then perform a full copy from the CRF to the PRF. However, there are more efficient options available.

One way to eliminate the copy operations is to write to both the PRF and CRF during normal operation, but only to the PRF in runahead mode. This approach can eliminate the checkpointing overhead that would otherwise be incurred on initiating runahead if the CRF and PRF are written to in parallel, but still requires the processor to restore the PRF when runahead is exited.

Because the only registers that need to be checkpointed are the architectural registers, the CRF only needs to contain as many registers as there are architectural registers, as defined by the instruction set architecture. Since processors typically contain far more physical registers than architectural registers, this significantly shrinks the size of the CRF.

An even more aggressive approach is to rely only upon the operand forwarding paths of the microarchitecture to provide modified values during runahead mode.^{[citation needed]} The register file is then "checkpointed" by disabling writes to the register file during runahead.

Optimizations

While runahead is intended to increase processor performance, pre-processing instructions when the processor would otherwise have been idle decreases the processor's energy efficiency due to an increase in dynamic power draw. Additionally, entering and exiting runahead incurs a performance overhead, as register checkpointing and particularly flushing the pipeline may take many cycles to complete. Therefore, it is not wise to initiate runahead at every opportunity.

Some optimizations that improve the energy efficiency of runahead are:

Only entering runahead if the processor is expected to execute long latency loads during runahead, thereby reducing short, unproductive runahead periods.^[6]
Limiting the length of runahead periods to only run as long as they are expected to generate useful results.^[7]
Only pre-processing instructions that eventually lead to load instructions.^[8]
Only using free processor resources to pre-process instructions.^[9]
Buffering micro-operations that were decoded during runahead for reuse in normal mode.^[9]

Side effects

Runahead has been found to improve soft error rates in processors as a side effect. While a processor is waiting for a cache miss, the entire state of the processor is vulnerable to soft errors while the cache miss is outstanding. By continuing execution, runahead unintentionally reduces the amount of time the processor state is vulnerable to soft errors, thereby reducing soft error rates.^[10]

References

^ Pruett, Stephen; Patt, Yale (October 2021). "Branch Runahead: An Alternative to Branch Prediction for Impossible to Predict Branches". MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture. MICRO '21. New York, NY, USA: Association for Computing Machinery. pp. 804–815. doi:10.1145/3466752.3480053. ISBN 978-1-4503-8557-2. S2CID 239011545.
^ Hashemi, Milad; Mutlu, Onur; Patt, Yale N. (October 2016). "Continuous runahead: Transparent hardware acceleration for memory intensive workloads". 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). pp. 1–12. doi:10.1109/MICRO.2016.7783764. ISBN 978-1-5090-3508-3. S2CID 439575.
^ Pruett, Stephen; Patt, Yale (October 2021). "Branch Runahead: An Alternative to Branch Prediction for Impossible to Predict Branches". MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture. MICRO '21. New York, NY, USA: Association for Computing Machinery. pp. 804–815. doi:10.1145/3466752.3480053. ISBN 978-1-4503-8557-2. S2CID 239011545.
^ Dundas, James D. and Mudge, Trevor N. (September 1996). "Using stall cycles to improve microprocessor performance". Technical report. Department of Electrical Engineering and Computer Science, University of Michigan.
^ ^a ^b ^c Mutlu, O.; Stark, J.; Wilkerson, C.; Patt, Y.N. (February 2003). "Runahead execution: An alternative to very large instruction windows for out-of-order processors". The Ninth International Symposium on High-Performance Computer Architecture, 2003. HPCA-9 2003. Proceedings. pp. 129–140. doi:10.1109/HPCA.2003.1183532. ISBN 0-7695-1871-0. S2CID 9016814.
^ Van Craeynest, Kenzo; Eyerman, Stijn; Eeckhout, Lieven (2009), Seznec, André; Emer, Joel; O’Boyle, Michael; Martonosi, Margaret (eds.), "MLP-Aware Runahead Threads in a Simultaneous Multithreading Processor", High Performance Embedded Architectures and Compilers, vol. 5409, Berlin, Heidelberg: Springer Berlin Heidelberg, pp. 110–124, doi:10.1007/978-3-540-92990-1_10, ISBN 978-3-540-92989-5, retrieved 2023-06-02
^ Van Craeynest, Kenzo; Eyerman, Stijn; Eeckhout, Lieven (2009). "MLP-Aware Runahead Threads in a Simultaneous Multithreading Processor". In Seznec, André; Emer, Joel; O’Boyle, Michael; Martonosi, Margaret; Ungerer, Theo (eds.). High Performance Embedded Architectures and Compilers. Lecture Notes in Computer Science. Vol. 5409. Berlin, Heidelberg: Springer. pp. 110–124. doi:10.1007/978-3-540-92990-1_10. ISBN 978-3-540-92990-1.
^ Hashemi, Milad; Patt, Yale N. (2015-12-05). "Filtered runahead execution with a runahead buffer". Proceedings of the 48th International Symposium on Microarchitecture. MICRO-48. New York, NY, USA: Association for Computing Machinery. pp. 358–369. doi:10.1145/2830772.2830812. ISBN 978-1-4503-4034-2. S2CID 2897777.
^ ^a ^b Naithani, Ajeya; Feliu, Josué; Adileh, Almutaz; Eeckhout, Lieven (February 2020). "Precise Runahead Execution". 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA). pp. 397–410. doi:10.1109/HPCA47549.2020.00040. hdl:1854/LU-8668193. ISBN 978-1-7281-6149-5. S2CID 215817567.
^ Naithani, Ajeya; Eeckhout, Lieven (April 2022). "Reliability-Aware Runahead". 2022 IEEE International Symposium on High-Performance Computer Architecture (HPCA). IEEE. pp. 772–785. doi:10.1109/HPCA53966.2022.00062. ISBN 978-1-6654-2027-3. S2CID 248865294.

[1] Pruett, Stephen; Patt, Yale (October 2021). "Branch Runahead: An Alternative to Branch Prediction for Impossible to Predict Branches". MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture. MICRO '21. New York, NY, USA: Association for Computing Machinery. pp. 804–815. doi:10.1145/3466752.3480053. ISBN 978-1-4503-8557-2. S2CID 239011545.

[2] Hashemi, Milad; Mutlu, Onur; Patt, Yale N. (October 2016). "Continuous runahead: Transparent hardware acceleration for memory intensive workloads". 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). pp. 1–12. doi:10.1109/MICRO.2016.7783764. ISBN 978-1-5090-3508-3. S2CID 439575.

[3] Pruett, Stephen; Patt, Yale (October 2021). "Branch Runahead: An Alternative to Branch Prediction for Impossible to Predict Branches". MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture. MICRO '21. New York, NY, USA: Association for Computing Machinery. pp. 804–815. doi:10.1145/3466752.3480053. ISBN 978-1-4503-8557-2. S2CID 239011545.

[4] Dundas, James D. and Mudge, Trevor N. (September 1996). "Using stall cycles to improve microprocessor performance". Technical report. Department of Electrical Engineering and Computer Science, University of Michigan.

[:0-5] Mutlu, O.; Stark, J.; Wilkerson, C.; Patt, Y.N. (February 2003). "Runahead execution: An alternative to very large instruction windows for out-of-order processors". The Ninth International Symposium on High-Performance Computer Architecture, 2003. HPCA-9 2003. Proceedings. pp. 129–140. doi:10.1109/HPCA.2003.1183532. ISBN 0-7695-1871-0. S2CID 9016814.

[6] Van Craeynest, Kenzo; Eyerman, Stijn; Eeckhout, Lieven (2009), Seznec, André; Emer, Joel; O’Boyle, Michael; Martonosi, Margaret (eds.), "MLP-Aware Runahead Threads in a Simultaneous Multithreading Processor", High Performance Embedded Architectures and Compilers, vol. 5409, Berlin, Heidelberg: Springer Berlin Heidelberg, pp. 110–124, doi:10.1007/978-3-540-92990-1_10, ISBN 978-3-540-92989-5, retrieved 2023-06-02

[7] Van Craeynest, Kenzo; Eyerman, Stijn; Eeckhout, Lieven (2009). "MLP-Aware Runahead Threads in a Simultaneous Multithreading Processor". In Seznec, André; Emer, Joel; O’Boyle, Michael; Martonosi, Margaret; Ungerer, Theo (eds.). High Performance Embedded Architectures and Compilers. Lecture Notes in Computer Science. Vol. 5409. Berlin, Heidelberg: Springer. pp. 110–124. doi:10.1007/978-3-540-92990-1_10. ISBN 978-3-540-92990-1.

[8] Hashemi, Milad; Patt, Yale N. (2015-12-05). "Filtered runahead execution with a runahead buffer". Proceedings of the 48th International Symposium on Microarchitecture. MICRO-48. New York, NY, USA: Association for Computing Machinery. pp. 358–369. doi:10.1145/2830772.2830812. ISBN 978-1-4503-4034-2. S2CID 2897777.

[:1-9] Naithani, Ajeya; Feliu, Josué; Adileh, Almutaz; Eeckhout, Lieven (February 2020). "Precise Runahead Execution". 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA). pp. 397–410. doi:10.1109/HPCA47549.2020.00040. hdl:1854/LU-8668193. ISBN 978-1-7281-6149-5. S2CID 215817567.

[10] Naithani, Ajeya; Eeckhout, Lieven (April 2022). "Reliability-Aware Runahead". 2022 IEEE International Symposium on High-Performance Computer Architecture (HPCA). IEEE. pp. 772–785. doi:10.1109/HPCA53966.2022.00062. ISBN 978-1-6654-2027-3. S2CID 248865294.

[1]

[2]

[3]

[4]

[5]

[6]

[7]

[8]

[9]

[10]