Part II, Part III and ACS Projects
The following project suggestions are starting points for possible Part II, Part III and ACS projects. ACS and Part III projects require more of a research emphasis to be successful, whereas Part II projects might focus more on computer engineering and reproducibility of results, though more research-oriented projects are possible and can attract top marks if successful.
Please contact the proposer(s) by email if you are interested in any of the projects below. In addition, some of the projects from previous years may still be suitable and interesting. Please remember, these are just starting points that suggest possible directions for the research. Do check back here over the coming weeks for more projects. We would also be happy to consider any project ideas of your own.
Project suggestions for 2025/26
Here are the project suggestions for this year. Feel free to contact us about other project ideas you have too.
Approximate recovery of relative execution counts for profiling
Contact: Alexandra W. Chadwick
OptiWISE [https://github.com/CompArchCam/optiwise] is a profiling tool developed here at the lab which calculates a detailed per-instruction cost, measured by running a program and analysing its performance. Such profiling information can help performance engineers to optimise code. However, a big constraint of OptiWISE is that the program must be run twice to obtain results, using different profiling strategies. This means it is presently unsuitable for profiling programs with random or emergent dynamic behaviour, particularly multi-threaded programs. The aim of this project would be to approximate the execution count data that is obtained in the second run of OptiWISE by using the sampling data from the first run. This approximation could be based on one or more of many different strategies: processor-specific cost models, architecture-specific performance counters such as Intel LBR, or perhaps some machine learning approach. The project would then validate its accuracy against the current OptiWISE results to see whether accurate results can indeed be obtained through approximation.
Optimising 'matrix multiplication' calculations in the tropical semiring
Contact: Alexandra W. Chadwick
The Tropical Semiring [https://en.wikipedia.org/wiki/Tropical_semiring] is an alternative to conventional linear algebra in which the operations + and 'max' are used in place of the conventional operators × and +. It can be used to efficiently represent some mathematical problems; notably, finding the longest path in a directed acyclic graph can be represented as a series of 'matrix multiplications' in the tropical semiring. This project would aim to create a library for performing such matrix calculations efficiently. Much of the related work on optimising conventional matrix multiplication would apply here, but existing libraries cannot be used directly as they are specialised to linear algebra. The approach used would likely consider vectorisation, tiling, and sparsity. Notably, however, techniques such as Strassen's algorithm do not apply in a semiring. A benchmark suite of matrix multiplications encountered in our research can serve as the evaluation of this project, with the aim of computing the results efficiently on either CPU or GPU. As an extension, the project can also consider optimising the matrix representation; the matrices we encounter are somewhat sparse, and this sparsity could be used to speed up calculations.
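As a minimal illustration of the operation being optimised, the sketch below computes a max-plus 'matrix multiplication' with plain Python lists. The function name, the use of negative infinity to mark a missing edge, and the example graph are assumptions made for illustration; a real library would use the vectorised, tiled and sparse techniques described above.

```python
NEG_INF = float("-inf")  # identity element for 'max', used here to mean "no edge"


def tropical_matmul(a, b):
    """Max-plus product: c[i][j] = max over k of (a[i][k] + b[k][j])."""
    n, m, p = len(a), len(b), len(b[0])
    assert all(len(row) == m for row in a), "inner dimensions must match"
    c = [[NEG_INF] * p for _ in range(n)]
    for i in range(n):
        for k in range(m):
            aik = a[i][k]
            if aik == NEG_INF:
                continue  # a missing edge contributes nothing: a simple sparsity shortcut
            for j in range(p):
                cand = aik + b[k][j]
                if cand > c[i][j]:
                    c[i][j] = cand
    return c


# Hypothetical DAG with edges 0 -> 1 (weight 2), 0 -> 2 (weight 5), 1 -> 2 (weight 4).
w = [
    [NEG_INF, 2, 5],
    [NEG_INF, NEG_INF, 4],
    [NEG_INF, NEG_INF, NEG_INF],
]
two_step = tropical_matmul(w, w)  # longest paths that use exactly two edges
```

Here two_step[0][2] is 6, the weight of the two-edge path 0 -> 1 -> 2, while entries with no two-edge path remain at negative infinity.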
SystemVerilog Floating Point Library
Contact: Jonathan Woodruff
The Ariane/CVA6 processor is one of the most-used open source RISC-V processors. Unfortunately, the floating point unit (github.com/openhwgroup/cvfpu) is both quite old and not well verified, particularly for the 64-bit, double-precision ISA extension. This project would factor out the key functions that express floating point operations into an independently useful SystemVerilog library, together with a set of properties formally verified using the open-source SymbiYosys tool to ensure that the functions implement IEEE-standard floating point operations. This project would then modify the CVA6 processor to use the verified library to implement the “f” and “d” extensions (float and double). An extension of this project would implement some timing or area optimisations of the library functions, using the formal verification tooling to efficiently arrive at a correct implementation.
Move Elimination for Toooba
Contact: Jonathan Woodruff or Karl Mose
“Move elimination” is a microarchitectural feature in processors that detects effective move operations and implements them in the Rename stage of the pipeline, thus bypassing Execute and skipping that link in the dependency chain. Toooba is an open-source superscalar, out-of-order RISC-V research processor written in Bluespec SystemVerilog. This project would implement move elimination in this core, recognising common types of move operations, and augment the Rename engine to be able to map the result of such an instruction to the physical register of the operand. This project would design a verification strategy to verify correctness, and then evaluate the performance improvement in self-designed microbenchmarks, as well as larger benchmarks in simulation. An extension might eliminate moves through memory (stack push/pops). Another extension might evaluate area overhead in hardware, and the performance benefit for larger benchmarks running in hardware.
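The renaming idea can be sketched in software as a toy rename table. This is only an illustrative model, not Toooba's Rename engine: the class and method names are hypothetical, and for brevity it frees a physical register as soon as its last mapping is overwritten, whereas a real out-of-order core would defer this until commit. The reference counts are the key addition, since an eliminated move leaves two architectural registers sharing one physical register.

```python
class RenameTable:
    """Toy rename stage mapping architectural registers to physical registers."""

    def __init__(self, num_arch, num_phys):
        self.map = list(range(num_arch))   # architectural reg -> physical reg
        self.refs = [0] * num_phys         # number of mappings per physical reg
        for phys in self.map:
            self.refs[phys] += 1
        self.free = list(range(num_arch, num_phys))

    def _release(self, phys):
        # Simplification: free immediately once no architectural register maps here.
        self.refs[phys] -= 1
        if self.refs[phys] == 0:
            self.free.append(phys)

    def rename_dest(self, rd):
        """Ordinary instruction: allocate a fresh physical register for rd."""
        new = self.free.pop(0)
        old = self.map[rd]
        self.map[rd] = new
        self.refs[new] += 1
        self._release(old)
        return new

    def rename_move(self, rd, rs):
        """Eliminated move: rd shares the physical register of rs; nothing executes."""
        src = self.map[rs]
        old = self.map[rd]
        self.refs[src] += 1   # increment first so that 'mv rd, rd' is handled safely
        self.map[rd] = src
        self._release(old)
        return src


rt = RenameTable(num_arch=4, num_phys=8)
p1 = rt.rename_dest(1)      # e.g. add x1, ...  -> x1 gets a new physical register
p2 = rt.rename_move(2, 1)   # mv x2, x1         -> x2 aliases the same register
rt.rename_dest(1)           # x1 is overwritten, but x2 still needs the old value
```

After the final step the shared physical register must not appear on the free list, because x2 still maps to it.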
Content-directed Prefetcher for Toooba
Contact: Jonathan Woodruff
Cache prefetching is a crucial technology for overcoming the effects of memory latency. Toooba is an open-source superscalar, out-of-order RISC-V research processor written in Bluespec SystemVerilog. Toooba has several basic prefetcher options, including a stride prefetcher and a Markov chain prefetcher. Content-directed prefetching identifies pointers in memory and attempts to learn which pointers are likely to be dereferenced so that it can dereference them early, potentially following a chain of pointers through memory before the program does, with the result that the program perceives no cache misses from that path. This project would implement a Content-directed Prefetcher for the Toooba core, following the algorithm originally proposed in the academic paper as closely as possible. The project would design a verification strategy, and evaluate performance using benchmarks in simulation, such as the Olden benchmark suite. An extension might analyse performance inefficiencies and perform optimisations on the content-directed prefetching design that diverge from those proposed in the original publication.
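The pointer-identification step can be sketched as a scan over the words of a fetched cache line. The heuristic below, which treats a word as a likely pointer if it is non-zero, aligned, and shares its upper address bits with the line that was fetched, is a simplified assumption for illustration; the function name, parameters and default values are hypothetical and are not taken from the original paper or from Toooba.

```python
import struct


def candidate_pointers(line_bytes, line_addr, compare_bits=32, addr_bits=64, align=8):
    """Return the 64-bit words in a cache line that look like pointers."""
    shift = addr_bits - compare_bits
    candidates = []
    for offset in range(0, len(line_bytes) - 7, 8):
        (value,) = struct.unpack_from("<Q", line_bytes, offset)  # little-endian word
        same_region = (value >> shift) == (line_addr >> shift)
        aligned = value % align == 0
        if value != 0 and same_region and aligned:
            candidates.append(value)  # a prefetcher would issue a request for this address
    return candidates


# Hypothetical 64-byte line fetched from a heap-like address.
line_addr = 0x00007F0012340000
line = struct.pack(
    "<8Q",
    0x00007F0012345678,  # plausible pointer: same upper bits, aligned
    42,                  # small integer: upper bits differ
    0,                   # null
    0x00007F00DEADBEE8,  # plausible pointer
    0x00005500AAAA0000,  # different region of the address space
    0x00007F0000000001,  # same region but unaligned
    0,
    0,
)
found = candidate_pointers(line, line_addr)
```

Only the first and fourth words are reported as candidates here. Applying the same scan to each prefetched line in turn is what allows a chain of pointers to be followed ahead of the program.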
Older project suggestions
Here are some project suggestions from previous years that may still be viable for this year, or might provide inspiration for your own ideas.
Cache Zeroing Extension for Toooba
Contact: Jonathan Woodruff or Peter Rugg
RiscyOO (currently called Toooba) is a parameterisable superscalar, out-of-order RISC-V implementation in Bluespec SystemVerilog. The CHERI research group is using Toooba for security-extension research. However, Toooba lacks support for the new cbo.zero instruction, which zeros an entire cache block/line. This extension is very useful to enforce security primitives, such as zeroing heap allocations on free, or zeroing the stack before return. This project would implement cbo.zero for CHERI-Toooba, plumbing the special memory operation into the cache where an entire line can be written with zeros in a single cycle, taking care to update appropriate state in the load/store queues and store buffer, if necessary. Testing would be done with the TestRIG framework. This project would then perform a thorough evaluation of the performance improvement for various state zeroing protections with the new instruction. As an extension, this project may evaluate on FPGA with large-scale applications, or may explore further hardware zero-cache-line optimisations, such as storing zeroed cache lines more efficiently in cache.
Parameterising a superscalar, out-of-order core down to a scalar, in-order pipeline
Contact: Jonathan Woodruff or Peter Rugg
RiscyOO (currently called Toooba) is a parameterisable superscalar, out-of-order RISC-V implementation in Bluespec SystemVerilog. This project would extend the Toooba project with custom modules and further parameterisation of current modules to allow a reasonably efficient single-issue, in-order core. This would greatly extend the usable range of implementations that can be produced from the single Toooba code base, aiming to test the hypothesis that, with proper engineering, it may be possible to maintain a single, open-source processor design to meet a wide range of performance/area targets. Recent Konata visualisation support in Toooba will enable visualisation of the pipeline performance. Performance would be evaluated in simulated MiBench benchmarks, as well as CoreMark. An extension would measure area and timing on FPGA.
Parameterising a superscalar, out-of-order core down to a dual-issue, in-order pipeline
Contact: Jonathan Woodruff or Peter Rugg
RiscyOO (currently called Toooba) is a parameterisable superscalar, out-of-order RISC-V implementation in Bluespec SystemVerilog. This project would extend the Toooba project with custom modules and further parameterisation of current modules to allow a reasonably efficient dual-issue, in-order core. This would extend the usable range of implementations that can be produced from the single Toooba code base, aiming to test the hypothesis that, with proper engineering, it may be possible to maintain a single, open-source processor design to meet a wide range of performance/area targets. Recent Konata visualisation support in Toooba will enable visualisation of the pipeline performance. Performance would be evaluated in simulated MiBench benchmarks, as well as CoreMark. An extension would measure area and timing on FPGA.
Extending a superscalar, out-of-order core to allow multiple memory requests per cycle
Contact: Jonathan Woodruff or Peter Rugg
RiscyOO (currently called Toooba) is a parameterisable superscalar, out-of-order RISC-V implementation in Bluespec SystemVerilog. Toooba currently only allows a single memory pipeline, though the numbers of integer pipelines and floating point pipelines are parameterisable. This project would extend the load/store queue to have a vector of interfaces to allow multiple memory pipelines to execute in the same cycle. This project would also bank the L1 cache so that multiple loads could execute per cycle, perhaps only to interleaved subsets of cache lines. This would relieve a major bottleneck, and dramatically extend a single parameterised design further into high-performance configurations. Recent Konata visualisation support in Toooba will enable visualisation of the pipeline performance. Performance would be evaluated in simulated MiBench benchmarks, as well as CoreMark. An extension would measure timing and area on FPGA.
Perceptron Branch Predictor for Toooba
Contact: Jonathan Woodruff or Peter Rugg
RiscyOO (currently called Toooba) is a parameterisable superscalar, out-of-order RISC-V implementation in Bluespec SystemVerilog. Toooba already supports a small suite of branch predictors, including GSelect, GShare, and a tournament predictor. This project would develop a modern “perceptron” predictor, based on published literature and open publications. This project would develop a branch predictor module in Bluespec SystemVerilog, and then integrate the hardware simulation into the ChampSim framework in order to study behaviour and performance in comparison to state-of-the-art simulated branch predictors. This project would then integrate the new perceptron branch predictor into Toooba and measure performance improvement in simulation, including MiBench benchmarks and CoreMark. An extension would synthesise for FPGA, evaluating area and timing, and performance on SPEC benchmarks.
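For orientation, a basic perceptron predictor can be modelled in a few lines of software. The sketch below follows the commonly described scheme of a table of signed weight vectors indexed by the branch address, a dot product with the global history, and training on a misprediction or a low-confidence output. The table size, history length, weight width and class name are assumptions chosen for illustration, and modern perceptron-based designs are considerably more elaborate.

```python
class PerceptronPredictor:
    """Software model of a simple global-history perceptron branch predictor."""

    def __init__(self, num_entries=64, history_len=8, weight_bits=8):
        self.theta = int(1.93 * history_len + 14)   # training threshold
        self.w_max = (1 << (weight_bits - 1)) - 1   # weights saturate like hardware counters
        self.w_min = -(1 << (weight_bits - 1))
        self.weights = [[0] * (history_len + 1) for _ in range(num_entries)]
        self.history = [-1] * history_len           # +1 = taken, -1 = not taken

    def _output(self, pc):
        w = self.weights[pc % len(self.weights)]
        return w[0] + sum(wi * hi for wi, hi in zip(w[1:], self.history))

    def predict(self, pc):
        return self._output(pc) >= 0                # True means "predict taken"

    def update(self, pc, taken):
        y = self._output(pc)
        t = 1 if taken else -1
        # Train on a misprediction or when the output is not yet confident.
        if (y >= 0) != taken or abs(y) <= self.theta:
            w = self.weights[pc % len(self.weights)]
            inputs = [1] + self.history             # constant bias input, then history
            for i, x in enumerate(inputs):
                w[i] = max(self.w_min, min(self.w_max, w[i] + t * x))
        self.history = self.history[1:] + [t]


# A branch that alternates taken / not taken depends only on its previous
# outcome, so it is linearly separable and should be learned after a warm-up.
bp = PerceptronPredictor()
correct = 0
for i in range(400):
    taken = (i % 2 == 0)
    if i >= 200:                                    # score only after warm-up
        correct += bp.predict(0x40) == taken
    bp.update(0x40, taken)
```

A hardware version would replace the Python sum with an adder tree and would have to account for the latency between prediction and update, which this model ignores.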
Memory Renaming Limit Study
Contact: Jonathan Woodruff
X86 processors have long supported limited “memory renaming” to accelerate stack operations. In the decode stage of the pipeline, it can be known that memory operations will alias even if the full address is not known. For example, if a store at an immediate offset of 32 is followed several instructions later by a load from the same offset, and if the stack pointer is not modified, it can be known that the loaded value will be the same value that was stored, and the pipeline may simply assume the original physical register holds the value that will be loaded, breaking the dependency through memory. Stated more clearly, the value in the sp[32] memory location is renamed in the pipeline to a physical register. This idea can be further generalised by tracking immediate pointer arithmetic in decode to identify aliasing memory locations even as pointers in registers are changing.
This project would perform a study of how applicable this technique is to RISC-V programs. Both static RISC-V binaries and dynamic RISC-V instruction traces would be analysed to determine the opportunities for memory renaming in these programs. For example, how many load values can be statically known from previous stores given various instruction windows? As an extension, this project could look at address aliasing prediction, to explore to what extent aliasing addresses can be predicted without perfect knowledge, potentially leading to flushes if a register value was forwarded in error.
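One possible shape for the trace analysis is sketched below. It counts loads whose base register and immediate offset match an earlier store within a given instruction window, with no intervening write to the base register. The trace format and function name are hypothetical, and the sketch deliberately ignores access sizes, immediate pointer arithmetic, and stores that alias the same location through a different base register, all of which a real study would need to handle.

```python
def count_renamable_loads(trace, window):
    """Return (renamable_loads, total_loads) for a simplified instruction trace.

    Each trace entry is a tuple in a hypothetical format:
      ("store", base_reg, offset)
      ("load", base_reg, offset)
      ("write", reg)   # any instruction that overwrites reg
    """
    known = {}       # (base_reg, offset) -> index of the most recent store
    renamable = 0
    loads = 0
    for i, (op, reg, *rest) in enumerate(trace):
        if op == "store":
            known[(reg, rest[0])] = i
        elif op == "load":
            loads += 1
            src = known.get((reg, rest[0]))
            if src is not None and i - src <= window:
                renamable += 1
        elif op == "write":
            # The base changed, so offsets from it no longer name the same memory.
            known = {k: v for k, v in known.items() if k[0] != reg}
    return renamable, loads


trace = [
    ("store", "sp", 32),
    ("write", "a0"),
    ("load", "sp", 32),   # matches the store two instructions earlier
    ("write", "sp"),      # stack pointer modified: earlier stores are forgotten
    ("load", "sp", 32),   # no longer known to alias
    ("store", "sp", 8),
    ("load", "sp", 8),    # matches the store immediately before it
]
wide = count_renamable_loads(trace, window=4)
narrow = count_renamable_loads(trace, window=1)
```

With a window of 4, two of the three loads are renamable; shrinking the window to 1 leaves only the last load, which shows how the instruction window bounds the opportunity.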
Vector runahead
Contact: Timothy Jones
To address the widening performance gap between CPU cores and main memories, designers have implemented prefetchers into various levels of the cache hierarchy, so as to bring data close to the processor before it is needed, meaning it is available in fast storage at the point of use. There are a wide variety of data prefetchers available, but few that can accurately identify data that is accessed through complex data structures.
An alternative scheme is Runahead execution. Here, when the processor is stalled, it continues speculatively fetching and executing instructions from the future so as to perform their memory accesses, then discards them and re-executes them correctly once the pipeline starts up again. This provides an accurate form of prefetching within the core, rather than as separate logic beside the cache. Until recently though, Runahead techniques couldn't deal with complex access logic either. However, a new scheme, called Vector Runahead, can effectively prefetch these access patterns, providing significant performance increases for certain workloads. The aim of this project is to implement Vector Runahead in the gem5 simulator to reproduce the impressive results obtained, with more advanced extensions possible too.
This project is fairly involved and should only be tackled by someone with strong C++ coding skills.