Soochow: How far are we from a true embodied-intelligence large model?
02/03/2025
GMT Eight
On February 20th, Figure AI released its Helix VLA large model, drawing market attention. In our view, however, the market's understanding of embodied-intelligence large models remains incomplete. This article explains in plain terms what kind of embodied-intelligence large model we need and how far we are from achieving one.
Question 1: What is an embodied-intelligence large model (VLA)?
A VLA (Vision-Language-Action) large model lets a robot understand its environment and language commands, and output actions through its execution modules.
Question 2: What is the difference between the layered and end-to-end modes in VLA large models? What is the industry's current choice?
A VLA model generally executes in three steps: 1) receive and understand speech and image inputs; 2) reason over the received information and make a decision; 3) generate action commands from that decision and control the robot's movement. Simply put, if all three steps are completed inside one model, it is an end-to-end large model; if they are completed by three separate models, it is a layered model.
Advantages and disadvantages of the end-to-end mode: 1) The advantages are fast response, scalability, and the potential for intelligent emergence; 2) The disadvantages are high technical difficulty, the need for large amounts of training data, and difficulty deploying in the short term.
Conclusion and reality: In the short term, many domestic humanoid-robot start-ups mainly adopt the layered model so they can commercialize their products quickly. Only a few companies, such as Tesla and Xingdong Era, adhere to the end-to-end model. In the long term, however, the end-to-end mode is a necessary condition for true embodied intelligent emergence.
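The distinction can be sketched in a few lines of Python. This is a toy illustration only: the `Observation` type, the string-based "scene" and "motor" formats, and every function name are hypothetical stand-ins, not any vendor's actual interface. The layered policy chains three separately replaceable stages, while the end-to-end policy exposes a single learned mapping from observation to motor commands.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Observation:
    image: str    # stand-in for camera pixels: comma-separated object names
    command: str  # stand-in for a spoken or typed instruction


# Layered mode: three separate models, each with its own interface.
def perceive(obs: Observation) -> dict:
    """Step 1: turn raw inputs into a symbolic scene description."""
    return {"objects": obs.image.split(","), "goal": obs.command}


def decide(scene: dict) -> List[str]:
    """Step 2: plan sub-tasks; the last word of the command is the target."""
    target = scene["goal"].split()[-1]
    if target not in scene["objects"]:
        return ["search " + target]
    return ["reach " + target, "grasp " + target, "lift " + target]


def act(plan: List[str]) -> List[str]:
    """Step 3: translate each sub-task into a low-level motor command."""
    return ["motor:" + step.replace(" ", "_") for step in plan]


def layered_policy(obs: Observation) -> List[str]:
    return act(decide(perceive(obs)))


# End-to-end mode: one model maps the observation straight to motor commands.
class EndToEndPolicy:
    def __init__(self, model: Callable[[Observation], List[str]]):
        self.model = model  # in practice, a single trained network

    def __call__(self, obs: Observation) -> List[str]:
        return self.model(obs)


obs = Observation(image="cup,box", command="pick up the box")
print(layered_policy(obs))
```

In the layered version each stage can be built, tested, and swapped independently, which is one reason it is quicker to commercialize; the end-to-end version has no hand-written intermediate format, so everything must be learned from data.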
Question 3: What makes it hard to train a useful end-to-end large model? The bottleneck lies in the data.
1) Huge gap in data volume: VLM large models train on data at the billion scale, whereas the real training data available for a robot in a single scenario is only in the thousands to tens of thousands, a gap of roughly 100,000 times or more.
2) Robot data is hard to acquire: VLM large models can be trained on the ordinary language corpora available on the Internet, but training data for robots is extremely difficult to obtain. There are currently three methods of data acquisition:
1. Real-data collection by teleoperation: The problem is the extremely high cost. A set of motion capture devices currently costs in the hundreds of thousands, which is a heavy burden for start-ups that rely on motion capture for data collection.
2. Virtual data generation: For example, GraspVLA, released by Galaxy Universal, is trained on data generated through virtual simulation, but the sim-to-real gap remains hard to close. In simple terms, robots trained on simulated data perform poorly in the real world. For simple pick-and-place scenarios, virtual data is relatively feasible. For scenarios involving soft, deformable objects such as clothing and blankets, it is hard to apply, because the deformation of soft objects is inherently difficult to model physically.
3. Human data mapping: UMI and DexCap (Stanford robotics team) are exploring human data mapping (that is, collecting real human data and converting it into robot data through a mapping relationship), but this work is still in its early stages.
3) Teleoperated data is inherently "toxic" to training, for two reasons. First, human operators add extraneous motion: when simply moving a box, for example, an operator may pause for a few seconds because of an outside distraction during recording. That pause is toxic for the robot, which cannot understand why the person stopped. Second, human and robot motion trajectories differ: many robots on the market today are built around rotational joints, while human limbs act as linear joints. So even for the same box-moving action the trajectories do not match, which makes human data toxic for training robots.
4) Data cannot be pooled because robot body designs differ: For example, data collected on Tesla's robot body is difficult to use for training a robot developed by another company, because the body designs are different.
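To make the "pause" example in point 3 concrete, here is a minimal sketch of one possible cleaning heuristic. It assumes the recorded trajectory is a list of 3D end-effector positions and that `min_step` (a hypothetical threshold) separates real motion from standing still. It illustrates the problem rather than any method described in this article, and it does nothing for the joint-mismatch or body-design issues.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float, float]


def drop_pauses(trajectory: List[Point], min_step: float = 1e-3) -> List[Point]:
    """Keep a waypoint only if it moved at least `min_step` from the last kept one."""
    if not trajectory:
        return []
    kept = [trajectory[0]]
    for point in trajectory[1:]:
        # Repeated, near-identical positions are treated as an operator pause.
        if math.dist(point, kept[-1]) >= min_step:
            kept.append(point)
    return kept


# The operator moves, hesitates for two frames, then continues.
recorded = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.1, 0.0, 0.0),
            (0.1, 0.0, 0.0), (0.2, 0.0, 0.0)]
print(drop_pauses(recorded))
```

A distance threshold like this cannot tell an accidental pause from a deliberate one (for example, waiting for a gripper to close), which is part of why such data remains hard to clean automatically.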
Question 4: With so many problems on the data side, how is the industry solving them?
The reality is that the industry cannot yet solve the data problems. Instead, each company is collecting data for its own solution, aiming first for a degree of generalization within a single scenario so that more humanoid robots can be put into practical use. We believe it may take 3-5 years before the market holds enough humanoid-robot data and hardware designs gradually converge; only then can embodied-intelligence foundation models show a degree of intelligent emergence, making a truly end-to-end embodied-intelligence large model possible.
Question 5: Can the DeepSeek paradigm be used to accelerate the development of embodied-intelligence large models?
DeepSeek uses a pre-training + post-training (reinforcement learning) recipe and feeds in high-quality data to cut the compute and data a large model needs. In principle this paradigm is also right for embodied-intelligence large models, but it fundamentally depends on high-quality data, and the basic conditions are not yet in place. On the one hand, embodied intelligence lacks a strong foundation model; on the other, it lacks a mature reinforcement learning process. Academia has long advocated an imitation learning + post-training reinforcement learning route (similar to DeepSeek's): get from 0 to 1 with imitation learning, then from 1 to 10 with reinforcement learning. For now, though, the necessary conditions have not been met.
Question 6: Detailed explanation and limitations of the Figure Helix model
Helix's defining feature is a quasi-layered architecture: an open-source VLM with 7 billion parameters serves as the brain, and beneath it sits a fast system, a Transformer-based action policy. This fast system has only 80 million parameters and needs only about 500 hours of data to generalize well.
PS: Simply put, the "thinking" is handed entirely to the VLM large model. Because the Internet holds abundant video and data of home settings, the VLM already generalizes well when analyzing such scenes. The fast system then executes the instructions the VLM produces.
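The division of labor between the slow "brain" and the fast action policy can be sketched as a control loop. This is a toy model under stated assumptions: the class names, the string "latent", and the `slow_every` ratio are all hypothetical, and the real systems are neural networks rather than string formatters. The point is only the scheduling pattern: the expensive model refreshes its instruction occasionally, while the cheap model acts on every tick using the most recent instruction.

```python
from typing import List, Tuple


class SlowSystem:
    """Stand-in for the large VLM 'brain': expensive, so it runs rarely."""

    def __init__(self) -> None:
        self.calls = 0

    def plan(self, image: str, command: str) -> str:
        self.calls += 1
        return f"{command}|{image}"  # placeholder for a latent instruction


class FastSystem:
    """Stand-in for the small Transformer action policy: cheap, runs every tick."""

    def act(self, latent: str, tick: int) -> str:
        return f"t{tick}:{latent}"


def run(command: str, frames: List[str], slow_every: int = 4) -> Tuple[List[str], int]:
    slow, fast = SlowSystem(), FastSystem()
    latent = ""
    actions = []
    for tick, image in enumerate(frames):
        if tick % slow_every == 0:
            # Refresh the high-level instruction only occasionally.
            latent = slow.plan(image, command)
        # The fast policy always acts on the most recent instruction.
        actions.append(fast.act(latent, tick))
    return actions, slow.calls


actions, slow_calls = run("fold", ["f0", "f1", "f2", "f3", "f4"])
print(actions, slow_calls)
```

Between refreshes the fast policy works from a slightly stale instruction; that is the trade-off such a split accepts in exchange for quick reactions from a small model.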
Analysis of advantages and disadvantages: Helix's advantage is that it can be commercialized quickly, reaching a good level of generalization with a small amount of data. Its disadvantages are: 1) Helix relies purely on imitation learning, with no reinforcement learning yet; 2) It cannot handle unexpected situations such as collisions and obstacles; 3) The massive data on the Internet is concentrated in daily-life scenes while industrial data is scarce, so in the short term Helix may suit home scenarios better than industrial ones.
Risk warning: Competition in the robotics industry may intensify; the pace of new product launches may fall short of expectations.
Source of this article: WeChat public account "New Perspective on Advanced Manufacturing", authors Zhou Ershuang, Qian Yaotian, GMTEight editor: Chen Qiuda.