The alignment of large language models (LLMs) can often be brittle when faced with the complexities of real-world deployment. In this talk, I share our investigations on two scenarios where special care is required to ensure robust alignment.
The first scenario is multi-objective alignment, where balancing competing objectives is particularly challenging. Our recent work, **Robust Multi-Objective Decoding (RMOD),** an inference-time alignment algorithm, adaptively adjusts the weights of different objectives during response generation to ensure none are neglected. RMOD provides principled robustness with minimal overhead, consistently outperforming existing methods across several alignment benchmarks.
In the second part of the talk, I will address preference model misspecification in self-play alignment. While self-play is a promising alignment approach, naive implementations are vulnerable to inaccuracies in the preference model. To address this, our **Regularized Self-Play Policy Optimization (RSPO)** framework offers a versatile and modular method for regularizing the self-play alignment process. RSPO’s ability to combine various regularizers results in strong performance gains on multiple evaluation sets, such as AlpacaEval-2 and Arena-Hard.
As a bonus, I will briefly introduce our recent investigation into the robustness of **Mixture-of-Agent (MoA)** systems, a popular multi-agent paradigm. We show that even a single malicious agent introduced into the mixture can nullify the benefits of the entire system.