CITIC SEC: Optimistic about the super-node server system segment; suggests paying attention to related companies in the industry chain.

08:45 19/12/2025
GMT Eight
CITIC Securities is optimistic about the future development of the super-node server system segment and recommends paying attention to related companies in the industry chain.
CITIC SEC released a research report stating that super-node solutions are expected to ramp up quickly. As the basic computing unit of future AI infrastructure, super-nodes built on the scale-up domain offer efficient communication bandwidth and native memory semantics, making them a natural fit for mainstream MoE-architecture models. The "decoupling" of super-nodes at the system level raises the value of the whole system, but it also brings greater challenges in multi-chip power consumption, heat dissipation, and cabinet reliability. CITIC SEC believes super-nodes are expected to raise the value of the whole system through higher technical value-added, is optimistic about the future development of super-node server systems, and recommends paying attention to related companies in the industry chain.

The main viewpoints of CITIC SEC are as follows:

The MoE architecture imposes new hardware requirements, giving rise to scale-up super-nodes. Driven by scaling laws, mainstream large AI models pursue larger parameter counts and higher operating efficiency, and commonly adopt the MoE (Mixture of Experts) architecture. Thanks to its distinctive expert-network structure, MoE is naturally suited to expert-parallel computation, which effectively relieves compute and memory-access bottlenecks. It also introduces new communication challenges, however, which have given rise to super-nodes built on the scale-up network. Compared with traditional eight-GPU servers, super-nodes face more complex system-level challenges, such as heat-dissipation pressure from large numbers of chips working together, stability issues from hybrid optical-copper interconnects between chips, and long-term reliability risks across many components.
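To make the MoE routing idea above concrete: each token is sent to only a few "experts" chosen by a gating network, so tokens from the same batch scatter to different experts, which on real hardware sit on different chips and must exchange activations over the scale-up network. The following is a toy, single-device NumPy sketch of top-k expert routing, not taken from the report; all function and variable names are illustrative.

```python
import numpy as np

def moe_forward(x, gate_w, expert_ws, k=2):
    """Toy Mixture-of-Experts layer: route each token to its top-k experts.

    x: (tokens, d) input activations
    gate_w: (d, n_experts) gating weights
    expert_ws: list of (d, d) per-expert weight matrices
    """
    logits = x @ gate_w                                  # (tokens, n_experts)
    # Softmax over experts (numerically stabilized).
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    topk = np.argsort(probs, axis=-1)[:, -k:]            # top-k expert ids per token
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        for e in topk[t]:
            # Each selected expert processes the token; outputs are combined
            # weighted by the renormalized gate probabilities. In a multi-chip
            # super-node, this scatter/gather is the all-to-all traffic that
            # the scale-up network has to carry.
            w = probs[t, e] / probs[t, topk[t]].sum()
            out[t] += w * (x[t] @ expert_ws[e])
    return out, topk

rng = np.random.default_rng(0)
tokens, d, n_experts = 4, 8, 4
x = rng.normal(size=(tokens, d))
gate_w = rng.normal(size=(d, n_experts))
expert_ws = [rng.normal(size=(d, d)) for _ in range(n_experts)]
y, routing = moe_forward(x, gate_w, expert_ws, k=2)
print(y.shape, routing.shape)  # (4, 8) (4, 2)
```

Because only k of the experts run per token, compute stays roughly constant as the total parameter count grows, which is the "optimize compute and memory-access bottlenecks" point above; the price is the cross-chip communication that super-nodes are designed to absorb.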
Such issues often require deep collaboration between server makers and upstream suppliers at every stage to find the optimal solution, which significantly strengthens the bargaining power of the system integrator in the industry chain.

Domestic and overseas super-node solutions are competing, with domestic offerings pulling ahead in some technical areas. Overseas, NVIDIA's NVL72 is the primary solution, while Google's Ironwood rack uses its self-developed TPUv7 chip and supports cluster expansion to as many as 9,216 chips. Domestic super-node solutions such as Huawei CloudMatrix 384, Alibaba Panjiu, and Dawning ScaleX640 have all emerged recently. We believe super-node development is still at an early stage, and that super-nodes, as the fundamental unit of future AI infrastructure, will gradually converge toward a limited number of directions.

On computing density, there is not yet a clear conclusion on how large the scale-up domain should be. Larger scale-up domains should bring performance gains in model training and inference, but the answer will depend on technological progress and cost-effectiveness. On network topology, architectures such as fat-tree and 3D-torus each have advantages and disadvantages. We believe fat-tree topologies, being more general-purpose, may hold a larger market share in the near term, while large vendors with both hardware and software development capabilities are expected to explore alternatives such as 3D-torus for their own deployments. On physical connections, we believe backplane-free orthogonal interconnects have advantages in connection simplicity and cabinet compactness, and may become the mainstream technical solution for future super-nodes.
On heat dissipation, as the computing density of a single cabinet keeps rising, liquid-cooling solutions with a PUE closer to 1 may have greater room to grow; if approaches such as phase-change immersion cooling can resolve stability and related issues, they may see wider adoption at scale.

The "decoupling" of super-nodes raises system value and further highlights technical value-added. In the past, AI servers, dominated by the eight-GPU form factor, had a clear division of labor in the industry chain, with mature and stable processes at each stage: server makers mainly assembled standardized components efficiently for delivery, and the technical barriers lay at the level of individual components. The technical complexity of super-node servers, however, represents a qualitative leap: challenges such as power control across many cooperating chips, heat dissipation under high-density integration, and system-level long-term reliability assurance are unprecedented. This evolution turns server makers into the core "system integrators" of the AI computing industry, because a super-node is in essence an integrated computing system whose chip, cooling, and interconnect coupling must be considered together from the design stage. This demand for systemic, integrated design significantly raises the technical threshold for super-node servers, strengthens the bargaining power of the system integrator in the industry chain, makes it the core hub for setting technical direction and system performance, and should allow its technical value-added to emerge gradually.
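The report's point that liquid cooling pushes PUE toward 1 can be illustrated with the metric's definition: PUE (Power Usage Effectiveness) is total facility power divided by IT equipment power, so a value of exactly 1 would mean zero cooling and power-delivery overhead. The numbers below are illustrative assumptions, not figures from the report.

```python
def pue(it_power_kw, overhead_kw):
    """PUE = total facility power / IT equipment power (ideal value is 1.0)."""
    return (it_power_kw + overhead_kw) / it_power_kw

# Hypothetical 1 MW IT load. Liquid cooling removes heat at the cold plate,
# shrinking the chiller/fan overhead relative to air cooling.
air_cooled = pue(1000, 500)     # 1.5
liquid_cooled = pue(1000, 100)  # 1.1
print(air_cooled, liquid_cooled)
```

The closer a cooling solution gets to PUE 1, the larger the share of facility power that actually performs computation, which is why dense super-node cabinets favor liquid cooling.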
Risk factors: supply-chain disruption risks arising from computing-chip technology; insufficient chip production capacity; capital expenditure by internet giants falling short of expectations; industry policy falling short of expectations; slower-than-expected development of AI applications; slower-than-expected iteration of computing chips; intensifying competition among domestic GPU makers; and so on.

Investment strategy: Super-node technology is flourishing, and the MoE architecture is expected to become the mainstream architecture for large models. Its distinctive characteristics impose new adaptation requirements on hardware, and scale-up super-nodes are expected to deliver better solutions through efficient network communication and native memory semantics; we expect super-nodes to become the fundamental computing unit of future AI infrastructure. At present, domestic and overseas super-node solutions are competing, and while they differ in network topology and communication protocol, we see high certainty in the direction of development in computing density, heat-dissipation capability, stability, and reliability. These technologies place new requirements on the production and manufacture of server systems, and server makers with custom development and supply-chain management capabilities are expected to gain greater development opportunities. It is recommended to pay attention to related companies in the industry chain.