Alibaba releases flagship reasoning model Qwen3-Max-Thinking
Alibaba officially launches Qwen3-Max-Thinking, the flagship reasoning model in its Tongyi Qianwen (Qwen) series.
On January 26th, Alibaba officially launched Qwen3-Max-Thinking, the flagship reasoning model in its Tongyi Qianwen (Qwen) series. According to the announcement, Qwen3-Max-Thinking shows significant gains across several key dimensions, including factual knowledge, complex reasoning, instruction following, alignment with human preferences, and overall intelligence. Across 19 authoritative benchmarks, its performance is comparable to top models such as GPT-5.2-Thinking, Claude-Opus-4.5, and Gemini 3 Pro.
Qwen3-Max-Thinking introduces two core innovations:
(1) Adaptive tool invocation capabilities, allowing on-demand access to search engines and code interpreters, now available in Qwen Chat;
(2) Test-time scaling technology, significantly improving reasoning performance, surpassing Gemini 3 Pro on critical reasoning benchmarks.
The table below shows the more comprehensive evaluation scores:
Adaptive Tool Invocation Capability
Unlike earlier approaches that required users to select tools manually, Qwen3-Max-Thinking can autonomously choose and use its built-in search, memory, and code-interpreter functions during a conversation. This capability stems from a purpose-built training process: after initial fine-tuning on tool usage, the model undergoes further training on diverse tasks using rule-based and model-based feedback. Experiments show that the search and memory tools effectively reduce hallucinations, provide real-time information access, and support more personalized responses, while the code interpreter allows code snippets to be executed and computational reasoning to be applied to complex problems. Together, these functions deliver a smooth and powerful conversation experience.
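The "adaptive" part of this design is that the model, not the user, decides when a tool call is needed. A minimal sketch of such a dispatch loop is below; the tool names, turn format, and stub implementations are illustrative assumptions, since the article does not publish Qwen3-Max-Thinking's actual interface.

```python
# Hypothetical tool registry mirroring the built-in search, memory, and
# code-interpreter functions described in the article (all stubs).
def web_search(query: str) -> str:
    return f"[search results for: {query}]"          # stub

def recall_memory(key: str) -> str:
    return f"[memory entry for: {key}]"              # stub

def run_code(source: str) -> str:
    # Toy interpreter: evaluates arithmetic expressions only, no builtins.
    return str(eval(source, {"__builtins__": {}}))

TOOLS = {"search": web_search, "memory": recall_memory, "code_interpreter": run_code}

def handle_model_turn(turn: dict) -> str:
    """Dispatch one model turn: either plain text, or an autonomous tool call.

    The model emits `tool_call` on its own when it judges a tool is needed --
    the user never selects a tool explicitly.
    """
    call = turn.get("tool_call")
    if call is None:
        return turn["content"]
    return TOOLS[call["name"]](call["arguments"])

# Example: the model chose to invoke the code interpreter on its own.
print(handle_model_turn({"tool_call": {"name": "code_interpreter",
                                       "arguments": "17 * 23"}}))
```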
Test-Time Scaling Technology
Test-time scaling refers to allocating additional computational resources during inference to enhance model performance. The Qwen team proposes an experience-accumulative, multi-round iterative test-time scaling strategy. Instead of simply increasing the number of parallel reasoning paths (which often leads to redundant reasoning), it caps the parallel paths and reallocates the saved compute to iterative self-reflection guided by an "experience extraction" mechanism. This mechanism distils key insights from earlier reasoning rounds, so the model avoids re-deriving known conclusions and instead focuses on unresolved uncertainties. Crucially, compared with feeding back the raw reasoning trajectory, the mechanism uses context more efficiently, integrating historical information within the same context window. At roughly the same token consumption, the method consistently outperforms standard parallel sampling-and-aggregation methods: GPQA (90.3 → 92.8), HLE (34.1 → 36.5), LiveCodeBench v6 (88.0 → 91.4), IMO-AnswerBench (89.5 → 91.5), and HLE (w/ tools) (55.8 → 58.3).
Qwen3-Max-Thinking is now available on Qwen Chat, where users can interact directly with the model and its adaptive tool invocation capabilities. Additionally, the Qwen3-Max-Thinking API (model name qwen3-max-2026-01-23) is now open for use.
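For illustration, a chat request to the released model might be built as below. The model name is the one given in the announcement; the assumption that the API accepts an OpenAI-style `messages` payload is based on Alibaba Cloud's compatible-mode endpoints, and the actual base URL and API key depend on your account, so the network call itself is omitted here.

```python
import json

def build_chat_request(prompt: str) -> dict:
    """Build an OpenAI-compatible chat payload for qwen3-max-2026-01-23.

    Endpoint URL and authentication are account-specific and intentionally
    left out; check the Alibaba Cloud Model Studio console for yours.
    """
    return {
        "model": "qwen3-max-2026-01-23",   # model name from the announcement
        "messages": [{"role": "user", "content": prompt}],
    }

req = build_chat_request("What is 17 * 23?")
print(json.dumps(req, indent=2))
```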