reasoning models

September 27, 2025


from my experiments using gpt-5 to label data, even at higher reasoning effort, the model can still get obvious things wrong, even with explicit instructions. these models were trained to optimize for right answers, and it's possible they reason badly but still land on the correct answer, which could explain the behaviour here. under current RL training methods, the model gets a 1.0 score just by stumbling onto the right answer, regardless of how it got there.
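
here's a toy sketch of what i mean by outcome-only scoring. the function names and the "answer:" extraction format are made up for illustration, not taken from any real training stack, but the shape is the point: the reasoning trace never touches the reward.

```python
# minimal sketch of outcome-only reward assignment (hypothetical names,
# not any real training library)

def extract_final_answer(trace: str) -> str:
    """take whatever follows the last 'answer:' marker in the model's output."""
    marker = "answer:"
    idx = trace.lower().rfind(marker)
    return trace[idx + len(marker):].strip() if idx != -1 else trace.strip()

def reward(trace: str, gold: str) -> float:
    """outcome-only reward: 1.0 if the final answer matches, else 0.0.
    the reasoning steps in `trace` never enter the score."""
    return 1.0 if extract_final_answer(trace) == gold.strip() else 0.0

# a trace with obviously broken reasoning still gets full credit
flawed_trace = "2 + 2 = 5, and 5 - 1 = 4, so answer: 4"
sound_trace = "2 + 2 = 4, so answer: 4"

assert reward(flawed_trace, "4") == 1.0  # same reward as the sound trace
assert reward(sound_trace, "4") == 1.0
```

nothing in that scoring pushes back on the broken intermediate steps, which is why "right answer, bad reasoning" can survive training.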