In this talk, I will present CodeScaler, a novel framework designed to overcome the scalability bottlenecks of Reinforcement Learning from Verifiable Rewards (RLVR) in code generation. Traditional RLVR relies heavily on high-quality unit tests, which are often scarce or unreliable; CodeScaler instead introduces an execution-free reward model that enables scaling of both training and test-time inference. By leveraging carefully curated preference data, syntax-aware code extraction, and validity-preserving reward shaping, CodeScaler achieves significant performance gains, improving the Qwen3-8B-Base model by an average of +11.72 points across five benchmarks. CodeScaler also functions as a highly efficient test-time scaling method, delivering performance comparable to execution-based approaches while reducing latency by 10$\times$. I will discuss how this approach enables robust optimization on synthetic datasets without test cases, as well as its broader implications for enhancing reasoning capabilities in general domains.
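To make the execution-free test-time scaling idea concrete, here is a minimal best-of-N sketch: sample several completions, extract code from each, and return the candidate with the highest score from a learned reward model rather than from unit-test execution. All names (`Candidate`, `extract_code`, `reward`, `best_of_n`) and the toy scoring heuristic are hypothetical illustrations, not CodeScaler's actual API.

```python
"""Minimal sketch of execution-free best-of-N test-time scaling.

Hypothetical illustration: a learned reward model scores each sampled
completion without running it, and the highest-scoring candidate wins.
"""

import ast
from dataclasses import dataclass


@dataclass
class Candidate:
    prompt: str
    code: str


def extract_code(completion: str) -> str:
    """Toy stand-in for syntax-aware extraction: pull the fenced block.

    A real implementation would parse the completion more carefully and
    validate syntax before scoring.
    """
    if "```" in completion:
        body = completion.split("```", 2)[1]
        lines = body.splitlines()
        # Drop an optional language tag on the first line of the fence.
        if lines and lines[0].strip() in {"python", "py", ""}:
            lines = lines[1:]
        return "\n".join(lines)
    return completion


def reward(candidate: Candidate) -> float:
    """Placeholder for a trained execution-free reward model.

    In practice this would be a forward pass of a preference-trained
    scorer; here a trivial heuristic keeps the sketch runnable.
    """
    score = 0.0
    try:
        ast.parse(candidate.code)  # validity check: parseable code scores higher
        score += 1.0
    except SyntaxError:
        score -= 1.0
    score += min(len(candidate.code) / 200.0, 1.0)  # weak proxy for substance
    return score


def best_of_n(prompt: str, completions: list[str]) -> str:
    """Score N sampled completions and return the highest-reward one."""
    candidates = [Candidate(prompt, extract_code(c)) for c in completions]
    return max(candidates, key=reward).code
```

Because selection here is a single scoring pass per candidate instead of compiling and executing each one against a test suite, this style of reranking is where the latency advantage over execution-based approaches comes from.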
