Haitong: DeepSeek's theoretical profit margin reaches 545%; 2025 is expected to be a breakout year for large models and their applications
04/03/2025
GMT Eight
Haitong released a research report stating that throughout February, China's domestically developed large models continued to iterate rapidly, and the industry remains in a phase of sustained, high-speed development. The release of OpenAI's GPT-4.5 likewise confirms that the overseas AI industry has not stagnated and is still actively exploring. DeepSeek's releases during its Open Source Week fully showcased the AI-infrastructure and foundational-technology innovations behind its advanced models, providing valuable inspiration to other developers in the industry and likely accelerating development and innovation across the entire AI industry. With a theoretical profit margin as high as 545%, the foundation for AI commercialization is clearly in place, and large AI models have genuinely become a business capable of turning a substantial profit. The industry expects 2025 to be a breakout year for domestically developed large models and their applications.
Key points from Haitong are as follows:
- Tencent Hunyuan's next-generation fast-thinking model TurboS is officially released
On February 27th, Tencent Hunyuan's next-generation fast-thinking model TurboS was officially released. Unlike slow-thinking models such as DeepSeek R1, which must "think before answering", TurboS delivers "instant replies": it answers faster, doubles word-output speed, and cuts first-token latency by 44%. TurboS also performs well on knowledge, math, and creative tasks. Slow thinking resembles deliberate reasoning, working through a problem by decomposing its logic; fast thinking, like human "intuition", gives large models rapid responses in everyday scenarios. Combining and complementing fast and slow thinking lets large models solve problems more intelligently and efficiently. By integrating long and short chains of thought, TurboS significantly improves scientific reasoning while preserving the fast-thinking experience on arts and humanities questions.
- TurboS introduces an upgraded architecture; the deep-thinking reasoning model T1 launches alongside it
Architecturally, TurboS innovatively adopts a Hybrid-Mamba-Transformer fusion design, which reduces the computational complexity of the traditional Transformer structure and cuts KV-Cache memory usage, lowering both training and inference costs. The fusion design overcomes the long-text training and high inference costs that challenge large models built on a pure Transformer structure: it leverages Mamba's efficient handling of long sequences while retaining the Transformer's strength at capturing complex context, yielding a hybrid architecture that is efficient in both memory and compute. This is the industry's first successful production application of the Mamba architecture to a super-large MoE model. As the flagship model, TurboS will become the core foundation of future models in the Tencent Hunyuan series, providing base capabilities for reasoning, long-text, code, and other derivative models. Built on TurboS, Hunyuan also introduced the deep-thinking reasoning model T1, which can grasp the multiple dimensions of a problem and their latent logical relationships, making it well suited to complex tasks.
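To see why cutting KV-Cache pressure matters for long text, a back-of-the-envelope comparison helps: a Transformer's KV-Cache grows linearly with context length, while a state-space (Mamba-style) layer keeps a fixed-size state. The sketch below uses assumed, illustrative dimensions, not Hunyuan TurboS's actual configuration.

```python
# Back-of-the-envelope memory comparison: Transformer KV-cache vs. Mamba SSM state.
# All dimensions here are illustrative assumptions, not the real TurboS config.

def transformer_kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    # K and V are each cached for every layer, head, and position,
    # so memory grows linearly with sequence length.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

def mamba_state_bytes(n_layers, d_inner, d_state, bytes_per_elem=2):
    # A state-space layer keeps a fixed-size recurrent state per layer,
    # independent of how many tokens have been processed.
    return n_layers * d_inner * d_state * bytes_per_elem

seq_len = 128_000  # long-context inference
kv = transformer_kv_cache_bytes(n_layers=64, n_kv_heads=8, head_dim=128, seq_len=seq_len)
ssm = mamba_state_bytes(n_layers=64, d_inner=8192, d_state=16)
print(f"KV-cache at {seq_len} tokens: {kv / 1e9:.1f} GB")  # grows with seq_len (~33.6 GB)
print(f"Mamba state (any length):  {ssm / 1e9:.3f} GB")    # constant (~0.017 GB)
```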
- Alibaba's video generation large model Wan2.1 officially open source, significantly ahead of competitors like Sora
On February 25th, Wan2.1, the video generation model from Alibaba's Tongyi division, was officially open-sourced in two versions, 14B and 1.3B. The professional 14B version offers high performance with industry-leading expressiveness, serving scenarios with demanding video-quality requirements; the fast 1.3B version runs on consumer-grade graphics cards, generating high-quality 480P video within 8.2 GB of VRAM, which suits secondary model development and academic research. The newly open-sourced Wan2.1 shows clear advantages in handling complex motion, reproducing real-world physics, improving cinematic texture, and following instructions. Creators, developers, and enterprise users alike can pick the model and features that fit their needs to achieve high-quality video generation with ease. Wan2.1 also supports industry-leading generation of Chinese and English text effects, meeting creative needs in advertising, short video, and other fields. On the authoritative evaluation benchmark VBench, Wan2.1 topped the leaderboard with a total score of 86.22%, significantly outperforming domestic and foreign competitors such as Sora, Minimax, Luma, Gen3, and Pika.
- GPT-4.5 released, with higher "emotional intelligence"
OpenAI officially released GPT-4.5, its largest and best chat model to date. GPT-4.5 takes an important step in scaling up both pre-training and post-training. By scaling unsupervised learning, GPT-4.5 improves its ability to recognize patterns, draw connections, and generate creative insights without relying on explicit reasoning. Early tests show that interacting with GPT-4.5 feels more natural. Its broader knowledge base, better understanding of user intent, and higher "emotional intelligence" make it excel at tasks such as improving writing, programming, and solving practical problems. OpenAI also expects it to hallucinate less. Unlike reasoning models such as OpenAI o1, GPT-4.5 does not think before responding, so its strengths are of a different kind: compared with OpenAI o1 and OpenAI o3-mini, GPT-4.5 is a more general-purpose, innately smarter model. OpenAI believes reasoning will be a core capability of future models, and the two scaling approaches, pre-training and reasoning, will complement each other: as models like GPT-4.5 grow smarter and more knowledgeable through pre-training, they will provide a more solid foundation for reasoning and tool-using agents.
DeepSeek Open Source Week Day 1
Open-sourced FlashMLA, an efficient MLA decoding kernel optimized for NVIDIA's Hopper GPUs. According to the official Weibo of Interface News, DeepSeek's "Open Source Week" officially kicked off on February 24, with plans to open-source multiple code repositories and share its research progress toward Artificial General Intelligence (AGI) with the global developer community in a fully transparent way. Looking back over the five days, the first release was FlashMLA, an efficient MLA decoding kernel optimized for NVIDIA's Hopper GPUs and designed to handle variable-length sequences. In tasks such as natural language processing, sequence lengths vary widely, and traditional processing methods waste computation. FlashMLA acts like an intelligent traffic dispatcher, dynamically allocating compute according to sequence length: when processing long and short texts in the same batch, it assigns appropriate compute to each, avoiding both over-allocation and starvation. Within 6 hours of release, the repository exceeded 5,000 stars on GitHub, and it is considered significant for improving the performance of domestic GPUs.
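FlashMLA's real kernels are CUDA code tuned for Hopper; purely as an illustration of the scheduling idea (this is not FlashMLA's API), the sketch below shows why variable-length-aware batching beats naive padding.

```python
# Illustrative only: why variable-length scheduling saves compute.
# Not FlashMLA's interface; it just demonstrates the padding waste it avoids.

def padded_batch_cost(seq_lens):
    # Naive batching pads every sequence to the longest one,
    # so compute scales with batch_size * max_len.
    return len(seq_lens) * max(seq_lens)

def packed_cost(seq_lens):
    # A variable-length-aware kernel spends compute only on real tokens.
    return sum(seq_lens)

batch = [37, 512, 64, 4096, 128, 9, 2048, 256]  # mixed long and short requests
padded = padded_batch_cost(batch)
packed = packed_cost(batch)
print(f"padded token-slots: {padded}")                    # 8 * 4096 = 32768
print(f"useful tokens:      {packed}")                    # 7150
print(f"wasted fraction:    {1 - packed / padded:.1%}")   # ~78% wasted by padding
```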
DeepSeek Open Source Week Day 2
Open-sourced DeepEP, an EP communication library for MoE training and inference. Day two brought DeepEP, the first open-source EP (expert parallelism) communication library for training and inference of MoE (Mixture of Experts) models. In MoE training and inference, different expert models must collaborate efficiently, which places heavy demands on communication. DeepEP supports an optimized all-to-all communication pattern, like building a smooth highway so data moves efficiently between nodes. It also natively supports FP8 low-precision dispatch, cutting compute consumption, and supports both NVLink within nodes and RDMA between nodes, with high-throughput kernels for training and inference prefill and low-latency kernels for inference decoding. In short, it lets the different parts of an MoE model communicate faster at lower cost, improving overall efficiency.
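The all-to-all pattern DeepEP optimizes can be sketched without any GPU code: each rank holds tokens whose routed experts live on other ranks, and every rank sends and receives simultaneously. Below is a single-process simulation of that bookkeeping (illustrative, not DeepEP's interface).

```python
# Single-process simulation of MoE all-to-all dispatch (not DeepEP's API).
# Each "rank" hosts some experts; tokens travel to whichever rank owns their expert.

NUM_RANKS = 4
EXPERTS_PER_RANK = 2  # expert e lives on rank e // EXPERTS_PER_RANK

# (source_rank, token_id, expert_id): tokens as produced by each rank's router.
routed = [
    (0, "t0", 5), (0, "t1", 0),
    (1, "t2", 3), (1, "t3", 7),
    (2, "t4", 1), (2, "t5", 6),
    (3, "t6", 2), (3, "t7", 4),
]

# Dispatch phase: group tokens by destination rank (the "send buffers").
send_buffers = {src: {dst: [] for dst in range(NUM_RANKS)} for src in range(NUM_RANKS)}
for src, token, expert in routed:
    dst = expert // EXPERTS_PER_RANK
    send_buffers[src][dst].append((token, expert))

# All-to-all exchange: every rank receives one buffer from every other rank.
recv_buffers = {dst: [] for dst in range(NUM_RANKS)}
for src in range(NUM_RANKS):
    for dst in range(NUM_RANKS):
        recv_buffers[dst].extend(send_buffers[src][dst])

for rank, items in recv_buffers.items():
    print(f"rank {rank} now holds: {items}")
# After the experts run, a symmetric all-to-all ("combine") returns outputs to
# each token's original rank; DeepEP fuses and overlaps these exchanges on GPUs.
```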
DeepSeek Open Source Week Day 3
Open-sourced DeepGEMM, a matrix multiplication acceleration library. On day three, DeepGEMM, a matrix multiplication acceleration library supporting V3/R1 training and inference, was open-sourced. General matrix multiplication (GEMM) sits at the core of many high-performance computing tasks, and optimizing it is key to cutting the cost and raising the efficiency of large models. DeepGEMM adopts the fine-grained scaling technique proposed in DeepSeek-V3 and achieves clean, efficient FP8 GEMM in only about 300 lines of code. It supports both ordinary GEMM and grouped GEMM for Mixture-of-Experts (MoE) models, reaching peak performance of 1350+ FP8 TFLOPS (trillions of floating-point operations per second) on Hopper GPUs. Across a range of matrix shapes it matches expert-tuned libraries, and in some cases beats them, while requiring no compilation at install time: all kernels are compiled at runtime by a lightweight JIT module.
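The fine-grained scaling idea can be shown with plain NumPy: instead of one scale for the whole tensor, each small block gets its own scale, so an outlier in one block does not crush precision everywhere else. The sketch uses int8 as a stand-in for FP8 (NumPy has no FP8 type) and illustrates the DeepSeek-V3 technique conceptually; it is not DeepGEMM's code.

```python
import numpy as np

# Fine-grained (per-block) scaling, sketched with int8 as an FP8 stand-in.
BLOCK = 4  # real implementations use larger blocks, e.g. 128 elements

def quantize_blockwise(x, block=BLOCK):
    x = x.reshape(-1, block)
    scales = np.abs(x).max(axis=1, keepdims=True) / 127.0  # one scale per block
    q = np.round(x / scales).astype(np.int8)
    return q, scales

def dequantize_blockwise(q, scales):
    return (q.astype(np.float32) * scales).reshape(-1)

rng = np.random.default_rng(0)
x = rng.normal(size=64).astype(np.float32)
x[10] = 50.0  # one outlier: with a single global scale it degrades everything

q, s = quantize_blockwise(x)
err_blockwise = np.abs(dequantize_blockwise(q, s) - x).mean()

g_scale = np.abs(x).max() / 127.0  # single tensor-wide scale, for comparison
err_global = np.abs(np.round(x / g_scale).astype(np.int8) * g_scale - x).mean()

print(f"mean error, per-block scales: {err_blockwise:.4f}")  # small
print(f"mean error, one global scale: {err_global:.4f}")     # much larger
```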
DeepSeek Open Source Week Day 4
Open-sourced DualPipe and EPLB, optimizing parallelism strategies. DualPipe is a bidirectional pipeline parallelism algorithm that overlaps computation and communication in V3/R1 training. Traditional pipeline parallelism suffers from "bubbles": idle waits between computation and communication phases that waste resources. DualPipe addresses this by overlapping the forward and backward compute-communication stages, raising hardware utilization by over 30%. EPLB is an expert-parallel load balancer for V3/R1 built for the Mixture of Experts (MoE) architecture. It duplicates heavily loaded experts via a redundant-expert strategy and uses heuristic allocation algorithms to optimize load distribution across GPUs, reducing GPU idle time.
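The redundant-experts idea can be sketched as a greedy heuristic: replicate the hottest experts so their load is split across copies, then place replicas on the least-loaded GPUs. This is a toy illustration of the concept, not EPLB's actual algorithm.

```python
import heapq

# Toy sketch of redundant-expert load balancing (not EPLB's code).
expert_load = {"e0": 90, "e1": 10, "e2": 55, "e3": 15, "e4": 70, "e5": 8}
NUM_GPUS = 4
NUM_REDUNDANT = 2  # how many extra replicas we may add

# 1) Replicate the hottest expert; its load splits evenly across its replicas.
replicas = [(load, name) for name, load in expert_load.items()]
for _ in range(NUM_REDUNDANT):
    replicas.sort(reverse=True)
    load, name = replicas[0]
    n = sum(1 for _, e in replicas if e == name) + 1  # replica count after adding one
    total = load * (n - 1)                            # expert's total load so far
    replicas = [(l, e) for l, e in replicas if e != name]
    replicas += [(total / n, name)] * n

# 2) Greedy placement: heaviest replica first, onto the emptiest GPU.
replicas.sort(reverse=True)
gpus = [(0.0, g) for g in range(NUM_GPUS)]
heapq.heapify(gpus)
placement = {g: [] for g in range(NUM_GPUS)}
for load, name in replicas:
    g_load, g = heapq.heappop(gpus)
    placement[g].append(name)
    heapq.heappush(gpus, (g_load + load, g))

for g, experts in placement.items():
    print(f"GPU {g}: {experts}")  # loads end up close to the 62-per-GPU average
```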
DeepSeek Open Source Week Day 5
Open-sourced the parallel file system 3FS to improve the efficiency of AI model training and inference. On the fifth day, DeepSeek introduced 3FS, a high-performance parallel file system designed to exploit the full bandwidth of modern SSDs and RDMA networks to speed up data access for AI model training and inference. DeepSeek also open-sourced Smallpond, a data processing framework built on 3FS that further strengthens data management, making data processing more convenient and efficient. Developers worldwide can build on and improve these open-source projects, potentially advancing the application of AI technology in more fields.
DeepSeek Open Source Week Day 6
Introduced the DeepSeek-V3/R1 inference system, with a theoretical profit margin of up to 545%. According to the official WeChat account of Machine Heart, on March 1 DeepSeek's official X account posted again, announcing that "Open Source Week" was still ongoing. On the sixth day, rather than open-sourcing a new library, DeepSeek described the DeepSeek-V3/R1 inference system. The system uses cross-node expert-parallel (EP) driven batch scaling, computation-communication overlap, and load balancing to optimize throughput and latency. DeepSeek also published statistics from its online service: each H800 node delivers 73.7k input and 14.8k output tokens per second, and the theoretical profit margin reaches 545%. Tallying all user requests from web, app, and API over a day, if every token were billed at DeepSeek-R1 pricing (0.14 USD per million input tokens on cache hits, 0.55 USD per million input tokens on cache misses, 2.19 USD per million output tokens), total daily revenue would be 562,027 USD, for a profit margin of 545%. DeepSeek noted, however, that actual revenue is far lower, for several reasons: DeepSeek-V3 is priced well below R1, only some services are monetized (web and app access remain free), and night-time discounts apply automatically during off-peak hours.
Risk Notice: Technology development may fall short of expectations; the company's business expansion may fall short of expectations.
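The margin arithmetic can be reproduced from the figures stated above; note that the daily cost below is backed out from the published revenue and margin, not a separately reported number, and the input-side revenue per node is omitted because it depends on the cache hit/miss mix, which is not restated here.

```python
# Reproducing the theoretical-margin arithmetic from the published figures.

PRICE_OUT = 2.19          # USD per million output tokens (R1 pricing)
out_tok_per_sec = 14_800  # stated per-H800-node output throughput

# Output-side revenue of a single node over one day.
node_out_rev_per_day = out_tok_per_sec * 86_400 / 1e6 * PRICE_OUT
print(f"output revenue per node-day: ${node_out_rev_per_day:,.0f}")  # ~$2,800

daily_revenue = 562_027   # USD, stated theoretical total at R1 pricing
stated_margin = 5.45      # 545%

# margin = (revenue - cost) / cost  =>  cost = revenue / (1 + margin)
implied_daily_cost = daily_revenue / (1 + stated_margin)
print(f"implied daily cost:  ${implied_daily_cost:,.0f}")             # ~$87,000
print(f"margin check: {(daily_revenue - implied_daily_cost) / implied_daily_cost:.0%}")
```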