**Abstract:** Modern large-scale language model training runs often use several orders of magnitude more tokens in the final training run than in the experiments in the lead-up to the run. How can we confidently extrapolate the results of these small-scale experiments to predict the likely final outcome of our large-scale run?
In this talk I will present two methods I've worked on to solve this problem. The first assumes that model performance on binary-outcome tasks can be modelled as a sigmoid regression on the model's loss over some validation data; long-horizon performance can then be read off the parameters of this regression. This method also lets us vet the validation sets we use to measure model progress, based on how accurately they predict long-horizon outcomes. The second aims to resolve the mismatch between the infinite- and finite-data regimes by artificially inducing finite-data effects at short token horizons through subsampling the training data. We show that such 'repeat-aware' experiments let us more accurately determine the optimal data mixture for long-horizon runs from short-horizon experiments.
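The first method can be sketched as follows. This is a minimal illustration, not the speaker's actual implementation: it assumes a three-parameter logistic form (asymptotic accuracy, slope, and loss midpoint — all names hypothetical) linking validation loss to task accuracy, fits it on synthetic short-horizon checkpoints, and then reads off the predicted accuracy at the lower loss a long-horizon run is expected to reach.

```python
import numpy as np
from scipy.optimize import curve_fit

def sigmoid(loss, acc_max, slope, midpoint):
    # Task accuracy as a logistic function of validation loss:
    # accuracy rises toward acc_max as the loss falls below the midpoint.
    return acc_max / (1.0 + np.exp(slope * (loss - midpoint)))

# Synthetic short-horizon checkpoints: (validation loss, task accuracy).
# The "true" parameters here are invented purely for the demonstration.
rng = np.random.default_rng(0)
losses = np.linspace(3.4, 2.2, 12)          # losses observed early in training
true_params = (0.9, 8.0, 2.4)               # assumed ground-truth curve
accs = sigmoid(losses, *true_params) + rng.normal(0.0, 0.005, losses.size)

# Fit the sigmoid regression on the small-scale points...
params, _ = curve_fit(sigmoid, losses, accs, p0=(1.0, 5.0, 3.0))

# ...then extrapolate: predict task accuracy at the lower loss the
# long-horizon run is expected to reach (here, 2.0).
predicted = float(sigmoid(2.0, *params))
print(predicted)
```

The fitted `acc_max` parameter also gives an estimate of the task's ceiling accuracy, which is one way a validation set's predictive usefulness could be judged.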
**Speaker Biography:** Kris Cao is a Member of Technical Staff at Cohere, working on model pretraining, tokenization, trustworthy evals, signal at small scales, model optimization, and data infrastructure. Kris previously completed his undergraduate and postgraduate studies at the University of Cambridge, before taking up a position as a researcher at Google DeepMind. Kris completed his PhD in the Natural Language & Information Processing (NLIP) Group, with a thesis on "Learning meaning representations for text generation with deep generative models", supervised by Dr Stephen Clark.
