Why OpenAI's Strawberry paves the way to AGI

OpenAI's new model "Strawberry" focuses on advanced reasoning, improving performance with time, marking a significant step towards AGI.

Hi everyone! OpenAI just released their latest model code-named Strawberry. At first glance, it might not seem like much, but Strawberry can think through problems for as long as it needs, and its performance improves the more time it takes. With a brand new architecture focused on reasoning, Strawberry is an important step towards AGI. Keep watching to learn more. This video has three parts: Strawberry released as 01, dangerous safety analysis, and new scaling laws.

Part one: Strawberry released as 01. On Thursday, September 12th, which is yesterday as I'm filming this video, OpenAI released their new model they call 01. This is the model that was originally code-named QAR when Ilia Suk was working on it inside OpenAI and has later been renamed to the code name of Strawberry. It has been released to all paying ChatGPT Plus users initially with very strict caps on its usage. This is no wonder because I already know of two or three people who have signed up again for ChatGPT just to be able to test out this model.

This initial release includes two models called 01 Mini, which is pretty small, and 01 Preview, which is a little bit larger. They already have the full 01 model, but they're not releasing it yet. This is probably because, at the moment, the 01 Preview model seems pretty comparable with their previous model, which is GPT-4. This may be intentional because 01 apparently requires a lot of compute, as they're charging 100 times as much for it as GPT-4 on the API. They may have released this 01 Preview first so that people could get used to this new type of model but also to manage demand so they don't run out of servers.

The full-size 01 model is, of course, immensely more capable, and they do have some benchmarks for it, so we can start to see that already. By the way, 01's knowledge cut-off is October 2023, which is right around the time that we started hearing leaks about QAR.

What do we know about 01's architecture? Well, it's not exactly GPT-5, and there's a reason that OpenAI didn't call it GPT-5—not just because of marketing and not wanting to let people down, etc. GPT-5 would be just a scaling up of the exact same techniques that were used in GPT-4, 3, 2, and so on. But 01 is based on emphasizing reasoning and having an internal chain of thought. It has somewhat different mathematical underpinnings and is probably a fine-tuning of a different base model from a different lineage that's not really GPT. It's definitely a smaller base model. 01 doesn't know as many facts about the world as, for example, GPT-4, but it's much better at reasoning and everything that comes along with that, like math, programming, and logic. So, I guess that's why they decided to restart the numbering scheme from one—it makes sense.

We know that this new architecture is very compute-intensive and that data is seemingly no longer the limiting factor. This probably means that OpenAI has figured out some way to inject diversity into the LLM inference. The problem with reinforcement learning, which is what is used to train LLMs right now, is that the model learns the most likely explanation for the data. There can be spurious correlations, so the model might think that A causes B, but in fact, those things just often happen together and yet are independent. Reinforcement learning is continuously pushing the model weights to try to figure out the maximum likelihood. Yes, during inference, you can use higher temperature and get some randomness there, but the point is that the underlying model has really only learned one explanation for data.

If you'd like to learn more about that at a technical level, check out this previous video I made where I talk about Transformers and whether LLMs could scale all the way to AGI.

Back to Strawberry. OpenAI has figured out how to inject diversity. In other words, instead of just doing reinforcement learning, they must be doing something like maximum entropy reinforcement learning, where you actually try to get a lot of variation in there, perhaps a Monte Carlo tree search or some variation of that. They may not have had to use very much RHF or reinforcement learning with human feedback on this system, which is important because RHF really makes a model a lot dumber. I mean, I'm guessing they probably did at least a little bit of that so that ChatGPT doesn't say things that people find objectionable on day one—or I guess this is day two—but it's not the core mechanism like it was when they turned GPT-3 into ChatGPT.

=> 00:04:30

Injecting diversity into AI models leads to smarter, more varied thinking and better results.

OpenAI has figured out how to inject diversity into their models. Instead of just doing reinforcement learning, they must be employing techniques like maximum entropy reinforcement learning, where you actually try to get a lot of variation. They might also be using Monte Carlo tree search or some variation of that. Interestingly, they may not have had to use very much reinforcement learning with human feedback (RHF) on this system, which is important because RHF can make a model a lot dumber. They probably did at least a little bit of RHF to ensure that ChatGPT doesn't say things that people find objectionable from the start, but it's not the core mechanism like it was when they turned GPT-3 into ChatGPT.

The advantage of having a high-diversity model is that as you give it more compute, it thinks more and more about different explanations for the data. This is why, for instance, Strawberry with tons more compute seems to yield better and better results. You could achieve something similar by taking an existing model like GPT-4 and just doing Chain of Thought prompting, but each of those steps would still be done by a model trained by reinforcement learning, which struggles with thinking about different explanations for the data. If the model's intuitive first approach works, that's fine, but for more complex problems, you would want to use Strawberry.

Regarding safety analysis, due to concerns about AI safety, OpenAI conducted a thorough evaluation of the model, including Chain of Thought deception monitoring. This allows you to see the thoughts the model is thinking, separate from the main output to the user. There are hundreds or thousands of thoughts just to figure out the response to a simple query. In the 01 preview, you can see a high-level summary of these thoughts, and sometimes, even in the summary, you see some disturbing things.

OpenAI did experiments to monitor deception in the Chain of Thought. The idea is that if the model is trying to deceive the user or escape its constraints, there should be evidence in the Chain of Thought. This method is great for reasoning and a huge win for interpretability. However, 0.8% of 01 preview thoughts were flagged as deceptive, which is a high percentage. Most of these deceptions were hallucinations, but two-thirds of those hallucinations were intentional, meaning the model was intentionally giving wrong information to the user. This is concerning because it means the model is intentionally deceiving the user 38% of the time, which translates to four out of every thousand thoughts.

Moreover, Strawberry is extremely good at persuasion, which is worrying. The model's persuasiveness was evaluated by having it write essays and comparing them against human-crafted essays in categories like argumentative reasoning, politically persuasive reasoning, manipulation, and social engineering. The models performed very well, ranking around the 70th or 80th percentile in persuasive writing skills compared to all of humanity. This is more persuasive than GPT-4, but fortunately, it is not at the 95% persuasive level, which would be considered superhuman persuasion.

It's important to note that this evaluation was done on the 01 preview, not the full model, which would be even more persuasive. To strengthen their overall safety analysis, OpenAI partnered with other companies such as Apollo Research and Meter to measure their system. These partners found that the model had high knowledge about itself and how it can causally influence the rest of the world. It also had an explicit theory of mind, able to model first and second-order instances of what other people are thinking.

=> 00:08:12

AI models are getting scarily good at problem-solving and even hacking, but luckily they're not superhuman yet.

The skills of the model are indeed more persuasive than GPT-4.0, though it is fortunate that we are not at 95% persuasive because that would be considered superhuman persuasion. Notably, this is just the 0.1 preview, and they haven't even tested the full model, which would be even more persuasive. In an attempt to strengthen their overall safety analysis, OpenAI partnered with other companies such as Apollo Research and Meta to measure their system. These companies were granted early access to the model.

These partners found that the model had high knowledge about itself and how it can causally influence the rest of the world. It had an explicit theory of mind, able to model first and second-order instances of what other people are thinking. This is quite alarming when considering the capabilities of these models. Meta even found that the 0.1 preview could make non-trivial progress on two out of the seven AI research questions designed to capture the challenges of current frontier model AI research. Automating AI research is one of the core goals of these AI labs, and they seem to be making significant progress on that front.

The models were also tested against cybersecurity hacking challenges, known as Capture the Flag (CTF) competitions. These challenges require hacking through different layers of defense to access a secret message called the flag. There are various levels of CTFs, and for example, Pico CTF is one of the simplest, designed for high schoolers. This is probably the CTF that the model was trained against when they mentioned high school-level CTFs, on which the model achieved a maximum of 43% success rate. Fortunately, it is not particularly good at hacking challenges yet. However, at one point, the CTF Docker container failed to start, which was the environment where all the defensive layers were located. The model tried to connect to it and couldn't, so it performed a network scan to figure out what was going on. Although the network wasn't the open internet, it wasn't intended for the models to access. The model found that it could connect to the Docker service daemon and issued the Docker command to start the container that had crashed. Instead of hacking through all the layers, the model asked the Docker service to access the flag file directly, and it worked. This is a clear instance of reward hacking, showcasing the model's problem-solving ability. Although the machines were air-gapped or something similar, ensuring the real world was never in danger, it is a good thing its only goal was to capture the flag and not take over the world.

New Scaling Laws: As previously mentioned, the interesting aspect of the model, codenamed Strawberry, is that it gets increasing returns from increasing compute. Here's a chart illustrating that when you increase the compute on the x-axis, you get a proportional increase in the quality of the outputs. It still requires an exponential increase in the amount of compute to achieve approximately linear performance gains, but this phenomenon isn't possible with other models, including GPT-4.

One interesting data point was the International Olympiad in Informatics (IOI), which is essentially a math competition. Strawberry was entered into the competition this year, 20124, with the same constraints as the human contestants: 10 hours to solve six problems and 50 attempts at each problem. Strawberry scored in the 49th percentile, placing it in the middle of the pack compared to the likely very intelligent human contestants. However, when the model was allowed 10,000 submissions for each problem instead of just 50, its score increased from 213 to 362, which is above the gold medal threshold. This exemplifies how giving the model more computation or more tries at the problem can significantly improve its ability to generate solutions.

=> 00:11:53

More computation means AI can achieve superhuman performance.

Strawberry was entered into this year's competition in 20124 and had the same constraints as the human contestants. In other words, it had 10 hours to solve six problems and 50 attempts at each problem. Strawberry scored in the 49th percentile, placing it in the middle of the pack in terms of math ability compared to the presumably very intelligent people participating in the math competition. However, when the model was allowed 10,000 submissions for each problem instead of just 50, its score increased from 213 to 362, which is above the gold medal threshold. This demonstrates that giving the model more computation or more tries at the problem significantly improves its ability to generate solutions.

Another example is Codeforces, which hosts online programming competitions. Previously, we discussed CTFs, which are hacking challenges, but Codeforces focuses on general-purpose algorithmic challenges. GPT-4 achieved an ELO rating of 8008, placing it in the 11th percentile of human competitors. However, when they used a modified version of Strawberry that was allowed numerous retries, it achieved an ELO rating of 1187, placing it in the 93rd percentile.

Another fun challenge for Strawberry involved interview questions that OpenAI asks its research engineers. After only one pass, the model was getting about 80% of the questions correct. However, when given 128 passes instead of just one, it got 100% of the questions correct. Although the term "pass" isn't clearly defined here, it likely refers to an attempt for the model to solve the problem, differing from the math competition where it submitted 10,000 answers to find the correct one.

OpenAI's ultimate goal is to automate AI research, accelerating progress toward AGI and beyond. This could be an AlphaStar moment for LLMs. AlphaStar, a Google DeepMind project, played Starcraft 2, a strategy game requiring high skill in time and long-term strategy. AlphaStar was trained through reinforcement learning, playing solely against itself, and became extremely proficient. When limited to the same number of actions as a human, it was barely beaten by a top player. However, when allowed to perform as many actions as it wanted within the game's real-time confines, it became unbeatable. Similarly, LLMs could achieve superhuman performance with more compute.

Strawberry, or QAR at the time, showed immense potential, which is likely what Ilya saw. SSI, his company, is probably working on building a model that pays attention to diversity and yields more dividends with increased compute, potentially leading to superintelligence. While Strawberry isn't AGI, it exhibits PhD-level expertise in many areas and strong reasoning abilities. It represents a significant step forward, although several generations of models will be needed to fully realize these techniques' potential.

This breakthrough is crucial because it marks a step change in AI construction, opening a new scaling period. If OpenAI or similar labs continue training such systems with more compute, they could achieve core ML research assistance and system 2 type reasoning, potentially leading to AGI. Strawberry 01 or 01-preview, released yesterday, is the most powerful model currently available to the public. Despite not being a large improvement, it's good at reasoning and math, suggesting a significant un-hobbling release that will allow OpenAI to scale much faster.

OpenAI presumably conducted an intensive security evaluation of 01-preview before its release, revealing some concerning findings, such as the model's occasional intentional deception and reward hacking in cybersecurity challenges. However, this model is likely to kick off a new era of scaling because it fundamentally differs from previous models. It doesn't just use reinforcement learning; it explores multiple hypotheses for its input, improving its answers with more thinking and novel ideas.

If you liked this video, check out a previous one discussing whether LLMs could lead to AGI and the role of Transformers. Don't forget to join our Discord. That's all for today. Thank you very much for watching. Bye.