CITIC SEC: OpenAI o1 Reasoning Upgrade Focuses on New Opportunities in Reinforcement Learning.

Date: 18/09/2024
By GMT Eight
CITIC SEC released a research report stating that OpenAI's o1 model upgrades its chain of thought and reinforcement learning, focusing on improving reasoning performance. Capabilities in logic-heavy fields such as coding, mathematics, and science have been greatly enhanced, continuing the exploration of new paths toward Artificial General Intelligence (AGI). The new model significantly increases demand for computing power on both the training and inference sides, lifting the prosperity of the computing power industry chain. It further reduces development costs for applications across fields, expands the coverage of strongly logical scenarios, and accelerates application deployment in various fields. CITIC SEC suggests continuing to monitor the top AI companies in related fields. The main points are as follows:

Event: In the early morning of September 13, Beijing time, OpenAI released the o1 model. o1 possesses complex reasoning capabilities, reaching top levels in coding, mathematics, and science. The model can break a task down into multiple simple sub-tasks and optimize them into a complete chain of thought, improving the logic, comprehensiveness, and accuracy of its answers. The preview version is already open to tier 5 API users and will be prioritized for enterprise and academic users next week. According to OpenAI's website, in programming the model outperforms 83% of professional competitors in Codeforces competitions. In mathematics, using the 2024 American Invitational Mathematics Examination (AIME) as the test set, o1 solves 74% of problems in a single generation, rising to 83% across multiple generations, while GPT-4o solves only 12%. In science, the model scores 78% on the GPQA Diamond test set, surpassing the 70% level of human experts.
Technical analysis: Reinforcement learning plus the LLM searches for the optimal path; the reward model's generalization remains to be verified. Referring to OpenAI's website and the DeepMind paper "Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters" (Charlie Snell, Jaehoon Lee, Kelvin Xu, et al.), CITIC SEC speculates that o1 breaks a complex task into a chain of sub-tasks forming a chain of thought, and adopts a pattern similar to reinforcement learning (RL), optimizing the behavior at each node to seek the optimal path. The reward model determines the optimization direction, so domains with clear evaluation criteria have an advantage: o1 performs better in coding, mathematics, and science, while for now it trails GPT-4o on tasks such as writing and editing. Whether the reward model can accurately evaluate other kinds of output is one of the core open questions for this technical route.

Computing power investment: Reinforcement learning drives a multiple-fold increase in computing power investment and inference costs. According to OpenAI, computing power invested in reinforcement learning on both the training and inference sides remains proportional to model performance, giving large models a new and effective axis of compute scaling that is expected to require several times the computing power of traditional large models. Demand on the inference side has risen markedly: per OpenAI's website, the current o1-preview model takes minutes to generate content, and its API is priced at $15 per million input tokens and $60 per million output tokens, several times GPT-4o's $5 per million input tokens and $15 per million output tokens.
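The pricing gap quoted above can be made concrete with a small cost calculation; the per-million-token rates come from the report, while the request sizes in the example are illustrative only. Note also that o1-style models bill hidden reasoning tokens as output, so real output counts can far exceed the visible answer.

```python
# Per-million-token API rates (input, output) in USD, as quoted above.
PRICES = {
    "o1-preview": (15.0, 60.0),
    "gpt-4o": (5.0, 15.0),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD for one request at the quoted per-million-token rates."""
    in_rate, out_rate = PRICES[model]
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

# Illustrative request: 2,000 input tokens, 1,000 output tokens.
cost_o1 = request_cost("o1-preview", 2_000, 1_000)  # 0.03 + 0.06 = $0.09
cost_4o = request_cost("gpt-4o", 2_000, 1_000)      # 0.01 + 0.015 = $0.025
print(f"o1-preview: ${cost_o1:.3f}, gpt-4o: ${cost_4o:.3f}, "
      f"ratio: {cost_o1 / cost_4o:.1f}x")
```

At these rates a request pays 3x per input token and 4x per output token, before accounting for the extra reasoning tokens, which is where the "several times" inference-cost multiple comes from.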
According to NVIDIA CEO Jensen Huang's remarks at the Goldman Sachs Communacopia + Technology Conference, overseas demand for the Blackwell series is strong, pointing to continued growth in the industry's computing power demand.

Application outlook: Cost optimization accelerates, and the B-end benefits first from improved Agent capability. In the short term the focus will be on strongly logical fields such as coding, mathematics, and science, where AI code generation boosts efficiency across industries. According to Microsoft's financial report, GitHub Copilot had over 1.8 million paid users in Q1, and the coding assistant tool at the Industrial and Commercial Bank of China's software development center already accounts for over 32% of total code volume. o1's coding ability is expected to further raise AI-assisted development efficiency. Longer term, as the reward model generalizes, the model is expected to expand to more industries and accelerate coverage of edge industries and scenarios; chains of thought combined with tools, knowledge bases, and other capabilities would form stronger Agents to meet enterprises' demand for strongly logical tasks such as summarizing, analysis, alerting, forecasting, and management.

Risk factors: Core AI technology developing below expectations; misuse of AI causing serious social impact; risks to enterprise data security and information security; intensifying industry competition.

Investment strategy: OpenAI's o1 model focuses on upgrading its chain-of-thought capability, combining reinforcement learning to strengthen coding, mathematics, science, and other strongly logical fields, continuing the exploration of paths toward AGI. The new model drives a multiple-fold increase in computing power requirements on both the training and inference sides, lifting the prosperity of the computing power industry chain.
On the application side, development costs fall further across fields and coverage of strongly logical scenarios expands, so both C-end and B-end applications are expected to accelerate. CITIC SEC advises continuing to monitor the top AI companies in relevant fields.

Contact: contact@gmteight.com