A score above 60 on HLE ("Humanity's Last Exam") for the first time! Eigen-1, built on DeepSeek V3.1, significantly outperforms Grok 4 and GPT-5.

28/09/2025

Eigen-1, a multi-agent system, has recently achieved a historic breakthrough. It was developed by a team made up of Tang Xiangru and Wang Yujie from Yale University, Xu Wanghan from Shanghai Jiao Tong University, Wan Guancheng from UCLA, Yin Zhenfei from Oxford University, and Jin Di and Wang Hanrui from Eigen AI. On the HLE Bio/Chem Gold test set, its Pass@1 accuracy reached 48.3% and its Pass@5 accuracy rose to 61.74%, passing the 60-point mark for the first time. This result far exceeds those of Google's Gemini 2.5 Pro, OpenAI's GPT-5, and Grok 4. Most exciting of all, it does not depend on a closed-source large model: the system is built entirely on the open-source DeepSeek V3.1.
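
For readers unfamiliar with the Pass@1 and Pass@5 figures, the sketch below shows one common way such numbers are computed: the unbiased pass@k estimator widely used in code-generation benchmarks. This is only an illustration. The article does not say how the Eigen-1 team computed its scores, and the function names and sample counts here are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one question.

    n: number of answers sampled for the question
    c: number of those answers that were correct
    k: number of attempts allowed
    Returns the probability that at least one of k answers,
    drawn without replacement from the n samples, is correct.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k wrong answers exist, so any k draws include a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over a benchmark; results holds (n, c) per question."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical data: three questions, five sampled answers each.
results = [(5, 0), (5, 1), (5, 5)]
print(benchmark_pass_at_k(results, 1))
print(benchmark_pass_at_k(results, 5))
```

With k = 1 the estimate reduces to the plain fraction of correct answers, which is why Pass@5 is always at least as high as Pass@1 on the same samples.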