Qwen officially open-sources FlashQLA, which reduces the computational cost of the attention layer in both training and inference.

19:37 29/04/2026
GMT Eight
On April 29th, the Qwen team announced the official open-source release of FlashQLA, a high-performance linear attention operator library built on TileLang. FlashQLA fuses and optimizes the forward and backward computation of the GDN Chunked Prefill operators, achieving a 2-3x forward speedup and roughly 2x backward speedup over the FLA Triton kernels on NVIDIA Hopper across multiple scenarios. This significantly improves efficiency in pre-training and in agentic inference at the edge.

The Qwen team stated that since the release of Qwen3-Next, the Gated Delta Network (GDN) has become the main attention layer across the Qwen series, extending from Qwen3-Next-80B-A3B to the subsequent Qwen3.5/Qwen3.6 models. As model scale grows to 397A17B, 122A10B, 35B, and 27B, the share of GDN in end-to-end training and inference cost has become significant.

The core highlights of this release are:

Gate-driven automatic intra-card sequence parallelism. By exploiting the exponential decay property of the GDN gate, FlashQLA automatically enables sequence parallelism within a single GPU in tensor-parallel (TP), long-sequence, and small-head scenarios, raising GPU SM utilization.

Hardware-friendly algebraic rewriting. The forward and backward passes of GDN Chunked Prefill are rewritten to reduce the load on Tensor Cores, CUDA Cores, and SFUs without affecting numerical precision. A simplified reference of the underlying computation is sketched below.
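To make the announcement concrete, the following is a minimal, unoptimized PyTorch sketch of a gated delta-rule recurrence of the kind a GDN attention layer computes during prefill, processed chunk by chunk. The shapes, gate parameterization, function name, and chunk loop are illustrative assumptions for exposition; this is not the FlashQLA/TileLang kernel, which fuses these steps and expresses intra-chunk work with Tensor Core-friendly matmuls.

```python
# Minimal reference sketch of a gated delta-rule (GDN-style) prefill.
# All names and shapes here are assumptions for illustration only;
# the real FlashQLA kernels are fused, chunk-parallel TileLang code.
import torch

def gdn_prefill_reference(q, k, v, alpha, beta, chunk_size=64):
    """q, k: [T, d_k]; v: [T, d_v]; alpha, beta: [T], gates in (0, 1)."""
    T, d_k = k.shape
    d_v = v.shape[1]
    S = torch.zeros(d_k, d_v, dtype=q.dtype, device=q.device)  # recurrent state
    out = torch.empty(T, d_v, dtype=q.dtype, device=q.device)
    # Process the sequence chunk by chunk; only the small [d_k, d_v] state S
    # is carried across chunk boundaries, which is what a fused
    # "chunked prefill" kernel exploits.
    for start in range(0, T, chunk_size):
        end = min(start + chunk_size, T)
        for t in range(start, end):
            k_t = k[t]
            # Gated delta rule: decay the state, erase the old association
            # stored under k_t, then write the new (k_t -> v_t) association.
            S = alpha[t] * (S - beta[t] * torch.outer(k_t, k_t @ S)) \
                + beta[t] * torch.outer(k_t, v[t])
            out[t] = q[t] @ S  # read the state out with the query
    return out

# Tiny smoke test with made-up sizes.
T, d_k, d_v = 128, 16, 32
q, k, v = (torch.randn(T, s) for s in (d_k, d_k, d_v))
alpha = torch.rand(T) * 0.1 + 0.9   # decay gates close to 1
beta = torch.rand(T) * 0.5          # writing strengths
print(gdn_prefill_reference(q, k, v, alpha, beta).shape)  # torch.Size([128, 32])
```

Note how the decay gates alpha_t enter the recurrence multiplicatively: since each gate is below 1, the influence of state carried from far-earlier chunks shrinks exponentially with distance. This is the property the announcement says FlashQLA uses to split long sequences across SMs within one GPU, and it also explains why the sequential dependency between chunks is cheap relative to the intra-chunk matmul work that the rewritten forward and backward kernels target.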