Nathan Lambert: Reverse engineering OpenAI's o1

Oct. 24, 2024 • quotes, generative AI, software engineering

OpenAI's new o1 and o1-preview models show remarkable progress in complex reasoning tasks such as mathematics and programming.

To achieve this, the models use an internal chain of thought that spends additional computation at inference time. OpenAI has been tight-lipped about the technical details, but I was intrigued to find out more.

I found Nathan Lambert's blog post Reverse engineering OpenAI's o1 very helpful. Here are some of the things that stood out to me.

Rather than a linear "chain of thought", o1 likely uses "tree search", as well as a mechanism to find the best reasoning path within that tree.

For each reasoning step shown to the user through the vague summary, o1 models generate multiple candidates that they then rate after an end-of-step token.

The blog post contains a helpful illustration of this concept.

The tree search approach reminds me of divergent and convergent thinking, a well-known concept in design thinking. It lets the model explore a wider area of the solution space first, and then select the most suitable option.
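
To make the generate-then-rate idea more concrete, here is a rough Python sketch of such a search. The helpers `generate_candidate_steps` and `score_step`, the scoring heuristic and the beam width are all placeholders I made up; the point is the shape of the search (diverge, rate, converge), not OpenAI's actual implementation.

```python
import heapq

# Placeholder generator and scorer, standing in for calls to the model.
def generate_candidate_steps(prefix: list[str], n: int = 4) -> list[str]:
    """Divergent phase: sample n possible next reasoning steps."""
    return [f"step {len(prefix) + 1}, variant {i}" for i in range(n)]

def score_step(prefix: list[str], step: str) -> float:
    """Rate how promising a candidate step looks (a process reward)."""
    return len(step) % 7 / 7.0  # arbitrary stand-in heuristic

def tree_search(max_depth: int = 3, beam_width: int = 2) -> list[str]:
    """Expand several candidate steps per level, then keep only the
    highest-rated partial reasoning paths (convergent phase)."""
    beams: list[tuple[float, list[str]]] = [(0.0, [])]
    for _ in range(max_depth):
        expansions = []
        for total, prefix in beams:
            for cand in generate_candidate_steps(prefix):
                expansions.append((total + score_step(prefix, cand), prefix + [cand]))
        beams = heapq.nlargest(beam_width, expansions, key=lambda e: e[0])
    return max(beams, key=lambda e: e[0])[1]

print(tree_search())
```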

"Two" models

Although OpenAI claims that o1 is one model, I find it helpful to think of it as performing two separate functions for each reasoning step.

Essentially, it is one model, but the model is a generative model and a process reward model all in one.
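
Here is a minimal sketch of what "two functions in one model" could look like: a single transformer backbone with a language-modelling head that generates tokens and a scalar head that rates the step just produced. The architecture, sizes and names are my own illustration, not a description of o1.

```python
import torch
import torch.nn as nn

class GeneratorWithProcessReward(nn.Module):
    """One backbone, two heads: a language-modelling head for generating
    the next tokens, and a scalar head for rating the current step.
    Sizes and layer counts are arbitrary; this is an illustration only."""

    def __init__(self, vocab_size: int = 32000, d_model: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab_size)  # generative function
        self.reward_head = nn.Linear(d_model, 1)       # process reward function

    def forward(self, token_ids: torch.Tensor):
        h = self.backbone(self.embed(token_ids))
        next_token_logits = self.lm_head(h)
        # Read the reward off the last hidden state, i.e. where an
        # end-of-step token would sit.
        step_reward = self.reward_head(h[:, -1])
        return next_token_logits, step_reward

model = GeneratorWithProcessReward()
logits, reward = model(torch.randint(0, 32000, (1, 16)))
```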

Back-tracking

By rating multiple future reasoning steps before proceeding, the model has opportunities to fix its errors.

The ability to "back-track" and try an alternative path is a huge shift in model behavior. I found this most apparent in the crossword puzzle example in OpenAI's o1 blog post.
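
A toy depth-first search shows the mechanics of back-tracking: rate each candidate step before committing, and if no continuation looks promising, abandon the branch and try an alternative. Again, the generator, scorer and threshold are placeholders of my own, not OpenAI's mechanism.

```python
import random

# Placeholder generator and scorer, standing in for the actual model calls.
def generate_candidate_steps(prefix: list[str], n: int = 3) -> list[str]:
    return [f"step {len(prefix) + 1}.{i}" for i in range(n)]

def score_step(prefix: list[str], step: str) -> float:
    return random.random()

def solve(prefix: list[str], depth: int, threshold: float = 0.5) -> list[str] | None:
    """Depth-first search with back-tracking: rate each candidate step
    before committing; if no candidate clears the threshold, give up on
    this branch and let the caller try an alternative."""
    if depth == 0:
        return prefix  # a full-depth path counts as a finished reasoning trace
    for cand in generate_candidate_steps(prefix):
        if score_step(prefix, cand) < threshold:
            continue  # looks like a dead end, skip it
        result = solve(prefix + [cand], depth - 1, threshold)
        if result is not None:
            return result
    return None  # back-track: none of the candidates led anywhere

print(solve([], depth=3))
```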

Reinforcement Learning

o1 uses Reinforcement Learning (RL) to train the reward model, starting from human-labelled data.

Over a year ago, OpenAI likely paid high-skill annotators to create complex forward reasoning paths, likely with different paths for single problems. These can be rated and turned into initial labeled trajectories. It is likely that contrastive examples were needed too [...]
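
If I had to guess what training on such annotations looks like, it would be something like fitting a process reward model on labelled steps, with the contrastive examples as negatives. A deliberately simplified sketch (fake embeddings, made-up labels):

```python
import torch
import torch.nn.functional as F

# Stand-in for a process reward model: here just a linear layer over
# pretend step embeddings. The labels mimic human annotation, with
# good steps labelled 1 and contrastive (bad) steps labelled 0.
reward_model = torch.nn.Linear(256, 1)
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

step_embeddings = torch.randn(8, 256)                    # fake annotated steps
labels = torch.tensor([1., 1., 0., 1., 0., 0., 1., 0.])  # fake human ratings

for _ in range(200):
    scores = reward_model(step_embeddings).squeeze(-1)
    loss = F.binary_cross_entropy_with_logits(scores, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```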

In addition, it is likely that OpenAI is also using "LLM as a judge".

At least at training time, it is almost a sure thing that more models are used as heuristics to rate the highest-quality reasoning traces.
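
An "LLM as a judge" setup could be as simple as asking another model to rate a candidate step. The prompt, the 1-10 scale and the choice of gpt-4o as the judge below are my assumptions, not anything OpenAI has confirmed.

```python
from openai import OpenAI

client = OpenAI()

def judge_step(problem: str, partial_trace: str, candidate_step: str) -> int:
    """Ask a judge model to score a candidate reasoning step."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": "Rate the candidate reasoning step from 1 (wrong or useless) "
                        "to 10 (correct and clearly moving toward a solution). "
                        "Answer with the number only."},
            {"role": "user",
             "content": f"Problem:\n{problem}\n\nReasoning so far:\n{partial_trace}\n\n"
                        f"Candidate next step:\n{candidate_step}"},
        ],
    )
    return int(response.choices[0].message.content.strip())
```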

I have to admit that I'm struggling to wrap my head around the RL part of the story. When training ML models on (computer) games, there is usually a straightforward reward function, such as a score. But how do you get that to scale for reasoning trajectories?

I find it hard to imagine that one LLM could be used to improve the reasoning capabilities of another LLM. But maybe it is possible, because choosing the best option from several candidates is easier than coming up with the right solution in the first place? I guess we will find out.
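
If that speculation holds, the loop might look roughly like this: sample several full solutions, let a verifier (possibly another LLM) pick the best one, and feed the winner back as training data. Both `generate_solution` and `verify` are hypothetical placeholders.

```python
import random

def generate_solution(problem: str) -> str:
    return f"solution draft {random.randint(0, 999)} for {problem}"

def verify(problem: str, solution: str) -> float:
    return random.random()  # a judge model or an automated checker would go here

def best_of_n(problem: str, n: int = 8) -> str:
    """Verification as selection: keep only the highest-rated of n samples."""
    candidates = [generate_solution(problem) for _ in range(n)]
    return max(candidates, key=lambda s: verify(problem, s))

# The selected solutions could then become new training examples.
new_training_examples = [(p, best_of_n(p)) for p in ["problem A", "problem B"]]
```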

