Alibaba Qwen proposes a new reinforcement learning algorithm GSPO.
According to the paper "Qwen", in order to continuously expand and enhance learning, the Group Sequence Policy Optimization (GSPO) algorithm has been proposed. Different from previous reinforcement learning algorithms, GSPO defines the importance ratio at the sequence level and performs pruning, rewarding, and optimization at the sequence level.
Latest