OpenAI plays its "trump card" on the final day of its livestream series: next-generation reasoning model o3 debuts!
21/12/2024
GMT Eight
OpenAI saved its most important cutting-edge product for the last day of its 12-day technical livestream event!
On Friday, OpenAI unveiled its next-generation reasoning model o3, an upgraded version of the o1 reasoning model released earlier this year. More specifically, o3 is a series of models - like o1, there are o3 and o3-mini versions, with the latter being a smaller, streamlined version tailored for specific tasks.
OpenAI claims that in some conditions, the o3 model can come close to achieving AGI.
AGI stands for "artificial general intelligence," referring to artificial intelligence that can perform any task that a human can. OpenAI has its own definition of this term: "highly autonomous systems that outperform humans at most economically valuable work."
Claiming to have achieved AGI would be a bold declaration. For OpenAI, it also has real-world implications. According to the terms of the agreement between OpenAI and its close partner and investor Microsoft, once OpenAI achieves AGI, it is no longer obligated to allow Microsoft to use its most advanced technology (i.e., technology that meets OpenAI's AGI definition).
OpenAI CEO Sam Altman announced that OpenAI plans to officially release o3-mini by the end of January, followed by the full version of o3. The company anticipates that more powerful large language models can surpass existing models and attract new investments and users.
In a blog post, OpenAI stated that the o1 model was already able to reason through complex tasks, solving more challenging problems in science, coding, and mathematics than previous models could. The newly introduced o3 and o3-mini models are currently undergoing internal safety testing and are expected to be more powerful than the previously released o1 model.
Two years ago, OpenAI released ChatGPT, kicking off an AI arms race. ChatGPT is a chatbot powered by a large language model with an initial release version of GPT-3.5. OpenAI later released GPT-4 in 2023, claiming it to be more accurate and creative. Recently, OpenAI introduced its first reasoning model, o1.
A spokesperson for the company stated that OpenAI decided not to name the next generation model o2, "out of respect for the similarly named UK telecommunications operator O2." Altman also jokingly stated during the live event, "In keeping with OpenAI's great tradition of being very, very bad at naming things, it shall be named o3."
How powerful is o3?
So, how powerful is o3 specifically?
According to OpenAI, the o3 model achieved a record-breaking score on the ARC-AGI benchmark. Developed by Keras creator François Chollet, ARC-AGI primarily tests a model's reasoning abilities through visual logic puzzles. With a maximum score of 100%, the ARC-AGI evaluation results show that o3 scored 75.7% in the low-compute setting and reached 87.5% in the high-compute setting.
This signifies that the top performance of o3 exceeded the threshold of 85% that marks human-level achievement. For comparison, the currently available o1 model's score ranges from 25% to 32%. The performance of o3 is almost three times that of o1.
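The "almost three times" figure can be checked directly from the scores quoted above. The following is a minimal sketch that simply divides o3's top score by each end of o1's reported range; the variable names are illustrative and not taken from OpenAI.

```python
# ARC-AGI scores quoted in the article, in percent.
o3_high_compute = 87.5
o1_low, o1_high = 25.0, 32.0  # reported range for o1

# Ratio against o1's best and worst reported scores.
ratio_vs_o1_best = o3_high_compute / o1_high
ratio_vs_o1_worst = o3_high_compute / o1_low

print(f"o3 vs. o1 at 32%: {ratio_vs_o1_best:.2f}x")
print(f"o3 vs. o1 at 25%: {ratio_vs_o1_worst:.2f}x")
```

Measured against o1's best score, the ratio is a little under three; measured against its lowest score, it is higher.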
In other benchmark tests, o3 also performed significantly better.
In terms of programming ability measured by Codeforces Elo ratings, o3 scored 2727, while o1 scored only 1891. In fact, o3-mini already surpasses o1 at its medium reasoning-time setting.
In the SWE-bench Verified code generation benchmark introduced by OpenAI in August, o3 achieved an accuracy rate of 71.7%, which is 22.8 percentage points higher than o1.
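Note that the gap is stated in percentage points, which is not the same as a relative improvement. A short sketch of the difference, using only the two figures in the article (the o1 score below is derived from them rather than quoted):

```python
# SWE-bench Verified figures quoted in the article, in percent.
o3_accuracy = 71.7
gap_points = 22.8  # o3's lead over o1, in percentage points

# Implied o1 accuracy: derived from the two figures above, not quoted directly.
o1_accuracy = o3_accuracy - gap_points

# Relative improvement of o3 over o1, as a percentage of o1's score.
relative_gain = gap_points / o1_accuracy * 100

print(f"Implied o1 accuracy: {o1_accuracy:.1f}%")
print(f"Relative improvement over o1: {relative_gain:.1f}%")
```

On these numbers, o1's implied accuracy is a little under 49%, so the relative improvement is considerably larger than the 22.8-point gap suggests.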
In the 2024 American Invitational Mathematics Examination (AIME), o3 achieved a high accuracy rate of 96.7%, missing only one question, and scored 87.7% accuracy on GPQA Diamond (a set of graduate-level biology, physics, and chemistry questions).
It is worth noting that o3 set a new record on Epoch AI's "FrontierMath" benchmark, solving 25.2% of the problems, with no other model exceeding 2% on this test.
Epoch AI launched FrontierMath, a new mathematics benchmark, in collaboration with over sixty mathematicians worldwide, including professors, IMO organizers, and Fields Medalists. Its problems range from Olympiad-level difficulty to the forefront of current mathematics, encompassing all major branches of current mathematical research - from computation-intensive problems in number theory and real analysis to abstract problems in algebraic geometry and group theory.
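The 96.7% figure is consistent with a single miss if the score is taken over 30 questions. That total is an assumption here (the two 15-question AIME papers combined); the article itself does not state how many questions were counted.

```python
# Assumption: 30 questions in total (AIME I and AIME II, 15 each).
# The article states only the accuracy and the single missed question.
total_questions = 30
missed = 1

accuracy = (total_questions - missed) / total_questions * 100
print(f"AIME accuracy with one miss: {accuracy:.1f}%")
```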
Industry Competition and Risks
Undoubtedly, the performance of the o3 model in the aforementioned tests is impressive. Whether in software engineering and coding, competitive mathematics, or doctoral-level knowledge of the natural sciences, o3 clearly outperforms o1.
OpenAI President Greg Brockman said, "Our latest reasoning model o3 is a breakthrough, with step-change improvements on our hardest benchmarks. We are now beginning safety testing and red team exercises."
However, making a significant leap towards human-like intelligence is also likely to raise concerns about AI safety.
There may indeed be risks. AI safety testers have found that o1's reasoning abilities lead it to attempt to deceive human users more often than traditional "non-reasoning" models do, and more often than other leading AI models from Meta, Anthropic, and Google.
The proportion of attempts to deceive users may be higher for o3 than for its predecessor; once OpenAI releases the results of its red team testing, the specifics should become clearer. Altman also said that, before OpenAI releases a new reasoning model, he would prefer to have a federal testing framework in place to guide monitoring and mitigate the risks of these models.
Before the public release of the o3 model, OpenAI will also open up an application process for external researchers to test it. The deadline for applications is January 10th.
Recently, after the release of OpenAI's first reasoning model, o1, some of the company's major competitors have also introduced their own reasoning models. Earlier this month, Google released a new version of its flagship model Gemini, which is said to be twice as fast as the previous generation model and can "think, remember, plan, and even take action on your behalf." Meta CEO Mark Zuckerberg recently revealed plans to release Llama 4 next year.
These developments indicate that competition in the field of artificial intelligence is becoming increasingly fierce, with all parties striving to create more intelligent models that can solve complex problems.
The latest unveiling of the o3 model by OpenAI on Friday marked the end of its 12-day live product launch event. In previous live streams, the startup introduced a more expensive new ChatGPT Pro subscription option (for $200 per month) and officially launched the AI video generation model Sora Turbo and other new products. The ChatGPT search feature has also been upgraded with new features such as map integration and real-time search, which are now available to all users.
This article is reprinted from Cailian Press, written by Xiao Xiang. GMTEight Editor: Wen Wen.