KNOWLEDGE ATLAS (02513) Makes First Disclosure of Its GLM-5 Coding Agent Inference Engineering Practice

09:09 30/04/2026
GMT Eight
The KNOWLEDGE ATLAS (02513) public account published an article systematically disclosing, for the first time, breakthroughs in the underlying inference technology of the GLM-5 series models in ultra-large-scale Coding Agent call scenarios. The disclosure covers the identification and fixing of two critical bugs, one performance-optimization innovation, and an unexpected breakthrough in monitoring mechanisms. To address the redundant KV Cache storage under the Context Parallel strategy, KNOWLEDGE ATLAS designed and implemented LayerSplit, a layered storage scheme for the KV Cache; the optimization directly raises the company's service capacity ceiling in Coding scenarios. The company's inference optimization also continues to accelerate, significantly improving token throughput per unit of compute and reducing inference cost.

KNOWLEDGE ATLAS stated that once intelligence truly enters the high-concurrency, long-context world of Coding Agent workloads, the challenge for inference infrastructure is no longer only throughput, latency, and availability; maintaining output quality becomes equally critical. Every pursuit of the Scaling Law must be backed by equally strong systems engineering. After weeks of analysis, troubleshooting, and stress testing, the company identified and fixed several independent low-level race-condition bugs and made targeted optimizations to the system bottlenecks they exposed, significantly improving the stability and efficiency of the inference system.

The engineering work disclosed this time has clear technical depth: the team not only identified and fixed a cross-node KV Cache reuse race condition under the PD (Prefill-Decode) separation architecture in its own inference pipeline, but also discovered and fixed, at the source-code level, a missing load-ordering guarantee (read-before-ready) in the HiCache module of the mainstream open-source inference framework SGLang. The fix was adopted by the SGLang open-source community, so the company's underlying infrastructure capabilities now serve not only its own models but are also becoming part of the shared infrastructure of the large-model industry.

From Offline Reproduction to Anomaly Recognition

Since March, three types of anomalies had been observed in GLM-5's online monitoring and user feedback: garbled output, repetition, and rare characters. On the surface these resemble the familiar "intelligence degradation" seen in long-context scenarios, but since KNOWLEDGE ATLAS had applied no optimization that trades away model accuracy, the more critical question was: do the anomalies originate from the model itself or from the inference pipeline? If from the model, they would appear as stable, reproducible behavior for specific inputs; if instead they correlate with system pressure or runtime state, they more likely point to transport or state-management issues in the inference infrastructure.

In the initial troubleshooting stage, the company replayed user-reported bad cases locally, running the same batch of requests hundreds of times, but could not reproduce the anomalies, indicating the problem was likely not in the model itself. To further simulate the pressure of the online environment, the company desensitized the online logs and replayed them in full locally, preserving the original concurrency distribution and request ordering as closely as possible.
Even then, the anomalies at first failed to reproduce. Only after further adjusting the PD separation ratio and continuously raising the system load, to simulate peak-period Prefill queuing and Decode-side KV Cache pressure, did the anomalies finally reproduce stably, at roughly 3-5 occurrences per ten thousand requests. This signature of being unrelated to request content but related to system pressure indicated that the problem likely originated in inference management under high load. At the same time, the offline reproduction rate was still lower than the rate of online feedback, suggesting that existing detection methods were missing cases or that some scenarios remained uncovered. Reliably identifying abnormal output became a new challenge in its own right.

Among the three anomaly types, repetition is relatively easy to detect, while garbled output and rare characters are trickier. The company tried heuristics such as regular expressions and character-set matching, as well as model-based discrimination, but the former produced significant misses and false positives, and the latter could not meet the efficiency requirements of large-scale eradication experiments. These limitations made anomaly detection itself a bottleneck in the troubleshooting process.

Figure 1: Speculative decoding metrics as an important reference for anomaly detection

After repeatedly analyzing the inference logs, KNOWLEDGE ATLAS found an unexpected entry point: speculative decoding metrics can serve as an important reference for anomaly detection. Speculative decoding is originally a performance-optimization technique: a draft model generates candidate tokens, which the target model then verifies and accepts, improving decoding efficiency without changing the final output distribution. As shown in Figure 1, two metrics (spec_accept_length, the length of the draft-token prefix consecutively accepted by the target model, and spec_accept_rate, the proportion of draft tokens accepted) show stable patterns when anomalies occur:

- Garbled output and rare characters: usually accompanied by a very low spec_accept_length, meaning the draft model's candidate tokens are almost all rejected by the target model; this indicates a significant deviation between the KV Cache state the target model actually sees and the state the draft model expects.
- Repetition: usually accompanied by an abnormally high spec_accept_rate, suggesting that a corrupted KV Cache may degrade the attention pattern into a high-confidence repetitive loop.

Based on these observations, the company implemented an online anomaly-monitoring strategy: when spec_accept_length stays below 1.4 while the generation length exceeds 128 tokens, or when spec_accept_rate exceeds 0.96, the system proactively stops the current generation and asks the load balancer to retry the request. This turns speculative decoding from a pure performance-optimization technique into a real-time signal of output quality, and it became a key tool in the subsequent eradication experiments.
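A minimal sketch of what such a monitor could look like, using the thresholds quoted in the article; the class and function names (SpecDecodeStats, should_abort_and_retry) are illustrative assumptions, not SGLang APIs:

```python
from dataclasses import dataclass

# Thresholds quoted in the article; everything else is illustrative.
MIN_ACCEPT_LENGTH = 1.4   # abort if mean accepted draft prefix stays below this
MIN_GEN_TOKENS = 128      # ...but only once generation is long enough to judge
MAX_ACCEPT_RATE = 0.96    # abnormally high acceptance hints at a repetition loop

@dataclass
class SpecDecodeStats:
    """Running speculative-decoding statistics for one request."""
    accepted: int = 0          # total draft tokens accepted by the target model
    drafted: int = 0           # total draft tokens proposed by the draft model
    verify_steps: int = 0      # number of verification rounds so far
    generated_tokens: int = 0  # tokens emitted for this request so far

    def update(self, accepted_prefix: int, drafted: int) -> None:
        """Record one verification round (accepted_prefix <= drafted)."""
        self.accepted += accepted_prefix
        self.drafted += drafted
        self.verify_steps += 1
        # Each round emits the accepted prefix plus one token from the target.
        self.generated_tokens += accepted_prefix + 1

    @property
    def spec_accept_length(self) -> float:
        return self.accepted / max(self.verify_steps, 1)

    @property
    def spec_accept_rate(self) -> float:
        return self.accepted / max(self.drafted, 1)

def should_abort_and_retry(stats: SpecDecodeStats) -> bool:
    """Stop the current generation and let the load balancer retry elsewhere."""
    if (stats.generated_tokens > MIN_GEN_TOKENS
            and stats.spec_accept_length < MIN_ACCEPT_LENGTH):
        return True  # drafts almost always rejected: likely garbled/rare characters
    if stats.spec_accept_rate > MAX_ACCEPT_RATE:
        return True  # near-total acceptance: likely a high-confidence repetition loop
    return False
```

In production the acceptance-rate check would presumably also be gated on a minimum sample size; the sketch omits such details.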
BugFix#1: KV Cache Race Condition under the PD Separation Architecture

Having observed a clear correlation between abnormal output and concurrency pressure, the company analyzed the cause. By examining the request lifecycle against the PD-separated execution timeline in the inference engine, it found that the problem stemmed from an inconsistency between the request lifecycle and the timing of KV Cache reclamation and reuse, producing a KV Cache reuse conflict.

1. Cause Analysis: KV Cache Race Condition Caused by Asynchronous Abort

To bound tail latency, the company had introduced a timeout-based request-termination mechanism in the inference engine: if the Prefill phase does not complete within a specified time, the Decode side aborts the request and reclaims the KV Cache resources it occupies. However, the abort signal was not correctly propagated to the Prefill side, and the Decode side lacked sufficient information to judge whether the KV Cache could be safely reclaimed and reused. As a result, after the Decode side aborted a request and allocated the corresponding KV Cache space to a new one, previously initiated RDMA writes and in-flight Prefill computation continued to execute without being synchronized or cancelled.

2. Fix: Consistency Guarantees on KV Cache Release Timing

To eliminate the race condition, stricter timing constraints were introduced in the inference engine, establishing an explicit synchronization relationship between request termination and KV Cache write completion. Specifically, after an abort is triggered on the Decode side, a notification is sent to the Prefill side. The Prefill side returns a "releasable" signal only when one of the following holds: the RDMA writes have not started, or all submitted writes have completed. Only after receiving this confirmation does the Decode side allow the corresponding KV Cache slots to be reclaimed and reused. This mechanism ensures that KV writes never cross a memory-reuse boundary and prevents KV Cache overwrites between requests.

Fix effect: after deployment, the occurrence rate of abnormal output dropped from roughly one in ten thousand to below three in a hundred thousand. The result shows that under a PD separation architecture, cross-node data transfer and memory reuse need explicit consistency constraints to avoid this class of issue.
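The release-timing constraint can be sketched as a small handshake. A minimal sketch under stated assumptions: the PrefillSide/DecodeSide classes and the WriteState tracking are hypothetical scaffolding that illustrates the protocol, not KNOWLEDGE ATLAS's actual implementation:

```python
import threading
from enum import Enum, auto

class WriteState(Enum):
    NOT_STARTED = auto()  # no RDMA write submitted for this request yet
    IN_FLIGHT = auto()    # writes submitted but not all completed
    COMPLETED = auto()    # every submitted write has completed

class PrefillSide:
    """Tracks per-request RDMA write state and answers the Decode side's
    question: may this request's KV Cache slots be reclaimed yet?"""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._state: dict[str, WriteState] = {}
        self._aborted: set[str] = set()
        self._releasable: dict[str, threading.Event] = {}

    def releasable_event(self, req_id: str) -> threading.Event:
        return self._releasable.setdefault(req_id, threading.Event())

    def try_begin_writes(self, req_id: str) -> bool:
        """Called before submitting RDMA writes; refuses if already aborted,
        so a NOT_STARTED verdict stays final."""
        with self._lock:
            if req_id in self._aborted:
                return False
            self._state[req_id] = WriteState.IN_FLIGHT
            return True

    def on_abort_notify(self, req_id: str) -> None:
        """Decode aborted req_id: releasable immediately if no write can
        still land, otherwise deferred to the completion callback."""
        with self._lock:
            self._aborted.add(req_id)
            state = self._state.get(req_id, WriteState.NOT_STARTED)
            if state in (WriteState.NOT_STARTED, WriteState.COMPLETED):
                self.releasable_event(req_id).set()

    def on_writes_complete(self, req_id: str) -> None:
        """Completion callback from the RDMA transfer engine."""
        with self._lock:
            self._state[req_id] = WriteState.COMPLETED
            self.releasable_event(req_id).set()

class DecodeSide:
    """Reclaims KV Cache slots only after Prefill confirms that no
    in-flight write can cross the memory-reuse boundary."""

    def __init__(self, prefill: PrefillSide) -> None:
        self._prefill = prefill

    def abort_on_timeout(self, req_id: str) -> None:
        event = self._prefill.releasable_event(req_id)
        self._prefill.on_abort_notify(req_id)  # propagate the abort
        event.wait()                           # block reuse until confirmed
        print(f"KV Cache slots of {req_id} reclaimed for reuse")
```

The essential property is that the Decode side's wait() cannot return while any RDMA write targeting the old slots is still in flight, which is exactly the overwrite window the bug opened.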
BugFix#2: Sequential Loading of HiCache

Coding Agent workloads significantly increase input length (averaging over 70K tokens) and bring high prefix-reuse rates. Under such loads, HiCache (a multi-level KV Cache) becomes a key optimization in online serving. However, where KV Cache swap-in and computation overlap, the existing implementation failed to guarantee that data had been fully loaded before being accessed, leading to reads of not-yet-ready KV Cache.

1. Cause Analysis: Read-before-Ready Due to Missing Pipeline Synchronization

By analyzing HiCache's execution order, the company located the issue on the DSA HiCache cache-read path. The system asynchronously swaps historical prefix caches in from CPU memory and overlaps execution across a Load Stream and a Forward Stream for higher throughput. As shown in Figure 3(a), the Load Stream loads the KV Cache and the Indexer Cache, while the Forward Stream sequentially performs the Indexer computation and the subsequent Sparse Attention. Ideally, the Indexer computation on the Forward Stream should start only after the corresponding Indexer Cache has finished loading. In the original implementation, however, this dependency was never made explicit: the Indexer operator established no synchronization constraint on the completion of the Indexer Cache load at launch (the red dashed area in Figure 3). The Forward Stream could therefore start executing before loading finished, a read-before-ready access pattern in which data is read before it is fully loaded. The Indexer computation could then run on incomplete or uninitialized data, corrupting the subsequent Sparse Attention results and ultimately surfacing as output anomalies.

2. Fix: Refactoring the Operator Pipeline with Explicit Synchronization

To address this, the company modified the HiCache read pipeline (Figure 3(b)) and introduced explicit synchronization constraints between data loading and computation: a synchronization point with the Load Stream is inserted before the Indexer operator starts, ensuring that the corresponding level of the Indexer Cache has been fully loaded; the Forward Stream launches computation only after the data is ready, eliminating read-before-ready accesses. After this fix was deployed, anomalies caused by out-of-order execution disappeared completely under the same load conditions, and system behavior became stable. The fix has been submitted to the SGLang community as Pull Request #22811.
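In CUDA-stream terms, the constraint amounts to recording an event on the load stream and having the forward stream wait on it before the indexer kernel launches. A minimal PyTorch sketch, where the buffer shapes and the mean() stand-in for the Indexer operator are illustrative assumptions, not SGLang internals:

```python
import torch

assert torch.cuda.is_available()
device = torch.device("cuda")

load_stream = torch.cuda.Stream(device)     # swaps caches in from CPU memory
forward_stream = torch.cuda.Stream(device)  # runs Indexer + Sparse Attention

# Placeholder buffers standing in for one level of the Indexer Cache.
host_cache = torch.randn(4096, 128, pin_memory=True)
gpu_cache = torch.empty(4096, 128, device=device)

load_done = torch.cuda.Event()

with torch.cuda.stream(load_stream):
    # Asynchronous host-to-device copy on the Load Stream.
    gpu_cache.copy_(host_cache, non_blocking=True)
    load_done.record(load_stream)  # marks "this cache level is fully loaded"

with torch.cuda.stream(forward_stream):
    # The fix: wait on the load-completion event *before* launching the
    # Indexer. Without this wait, the kernel could read gpu_cache while
    # the copy is still in flight (the read-before-ready bug).
    forward_stream.wait_event(load_done)
    index_scores = gpu_cache.mean(dim=1)  # stand-in for the Indexer operator
    # ...Sparse Attention would consume index_scores here...

torch.cuda.synchronize()
```

The event wait is a device-side dependency, so loading still overlaps with unrelated forward work; only the Indexer's start is ordered after its cache load.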
Optimization: LayerSplit, Layered Storage for the KV Cache

The two race conditions above exposed a common system bottleneck: in long-context Coding Agent serving, the Prefill phase dominates system performance. To control the TTFT inflation caused by Prefill queuing, KNOWLEDGE ATLAS had introduced the timeout abort; to relieve Prefill-side KV Cache capacity pressure, it had introduced HiCache. With the consistency issues fixed, the company returned to the bottleneck itself: how to raise Prefill throughput and reduce Prefill-side KV Cache memory pressure. To that end, it designed and implemented LayerSplit, a layered storage scheme for the KV Cache.

Coding Agent scenarios typically combine long contexts with high Prefix Cache hit rates. The Prefill phase is therefore usually the system's main performance bottleneck, and Context Parallel (CP) became the primary parallel strategy on online Prefill nodes. The existing SGLang open-source implementation, however, stores the KV Cache redundantly across CP ranks, making the limited KV Cache capacity the factor that caps GPU compute utilization. Under LayerSplit, each GPU no longer holds the KV Cache for all layers, but only for a subset of layers (Figure 4(a)), significantly reducing per-card memory footprint. During computation, the CP ranks cooperate to complete Prefill as shown in Figure 4(b): the rank holding a given layer's KV Cache broadcasts that layer's cache to the other participating ranks before the Attention computation executes. To reduce communication overhead, the KV Cache broadcast was further designed to overlap in time with the Indexer computation. The only additional cost the scheme introduces end to end is the Indexer Cache broadcast, roughly 1/8 the size of the KV Cache, so the overall communication cost is low and its performance impact is negligible.

Figure 5: Throughput improvement of GLM-5.1 + LayerSplit at different context lengths

Figure 5 shows the performance improvement from this optimization at a 90% cache hit rate over request lengths from 40K to 120K. Experimental results show system throughput gains ranging from 10% to 132%, with the benefit growing as the context lengthens. Overall, the optimization significantly raises the system's serving capacity in the Coding Agent scenario.
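A toy sketch of the per-layer ownership and broadcast pattern, using torch.distributed as a stand-in for the actual collectives; the round-robin layer-to-rank assignment, the shapes, and all names here are illustrative assumptions rather than the article's implementation:

```python
import torch
import torch.distributed as dist

# Toy parameters; real values depend on the model and the CP group size.
NUM_LAYERS = 8
CP_SIZE = 4               # context-parallel group size
TOKENS, HEAD_DIM = 1024, 128

def owner_of(layer: int) -> int:
    """Round-robin layer-to-rank assignment (an illustrative choice)."""
    return layer % CP_SIZE

def prefill_with_layersplit(rank: int) -> None:
    # Each rank materializes KV only for the layers it owns (Figure 4(a)),
    # shrinking per-card KV memory by roughly a factor of CP_SIZE.
    owned_kv = {
        layer: torch.randn(TOKENS, HEAD_DIM, device="cuda")
        for layer in range(NUM_LAYERS) if owner_of(layer) == rank
    }

    for layer in range(NUM_LAYERS):
        src = owner_of(layer)
        # The owner broadcasts this layer's KV to the other CP ranks just
        # before the layer's Attention runs (Figure 4(b)). In the article's
        # design the broadcast overlaps with the Indexer computation; the
        # overlap is omitted here for clarity.
        buf = owned_kv[layer] if rank == src else torch.empty(
            TOKENS, HEAD_DIM, device="cuda")
        dist.broadcast(buf, src=src)
        _attn_out = buf.mean(dim=0)  # stand-in for the Attention kernel
        # Non-owned buffers can be freed once the layer finishes, so only
        # the owned subset of layers persists in GPU memory.

if __name__ == "__main__":
    # Typically launched with: torchrun --nproc-per-node=4 layersplit_sketch.py
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())
    prefill_with_layersplit(dist.get_rank())
    dist.destroy_process_group()
```

The broadcast replaces the fully replicated KV storage of the stock CP implementation with transient, per-layer replication, trading memory capacity pressure for a small, overlappable communication cost.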