Conversational AI is about to explode.
10/03/2025
GMT Eight
In the era of generative AI, the industry broadly believes that multimodal large models are the key to achieving AGI. The latest Voice AI report from the well-known investment firm a16z likewise argues that, as large models continue to advance, speech will become a critical entry point for conversational AI.
As conversational AI technology matures, its applications are experiencing explosive growth. Chatbots, as a flagship application of conversational AI, are widely used in customer service, education, healthcare, entertainment, and many other fields.
So, in which fields and scenarios will conversational AI explode first?
Recently, at a conference on voice-enabled AI engines, Xin Xiaojian, Senior Product Architect at Alibaba Cloud Intelligence Group; Feng Wen, Senior Director of Solutions at Minimax; Cao Chao, AI Product Architect at Tencent Cloud; and Yao Guanghua, head of the AI RTE product line at Lingling, joined a panel discussion.
Several panelists believe that conversational AI may first take off in scenarios such as desktop assistants, mobile assistants, smart devices, and companion robots.
Cao Chao, AI Product Architect at Tencent Cloud, said that the unique advantage of conversational AI lies in its ability to convey emotion and warmth through voice interaction, and that as models are upgraded it will enable richer emotional communication.
"In terms of application scenarios, conversational AI is not suitable for visual scenes, so it is more focused on voice and auditory interaction scenarios. For example, some elderly people may have visual impairments, and using WeChat involves speaking by long-pressing and listening by placing the phone close to the ear. These people also need tools to communicate and solve problems. Conversational AI opens up new opportunities and possibilities for these groups. Currently, the hardware perspective of conversational AI is largely based on mobile phones."
Xin Xiaojian, Senior Product Architect at Alibaba Cloud Intelligence Group, added: "Education learning machines are another good scenario. Annual shipments of learning machines nationwide are approximately 60 million units, and with the support of large models, unit prices have risen significantly. Learning machines used to sell for around three to four thousand yuan, but the average online price of a decent model has now reached over eight thousand yuan. That is the premium that conversational AI brings."
It is understood that conversational AI products currently on the market include Amazon Alexa+, Zhejiang Jinke Tom Culture Industry's AI emotional companion robot, Apple's Siri, Manus, and others.
Recently, Lingling released the world's first conversational AI engine. Built on five core capabilities, including 650 ms ultra-low-latency response, graceful interruption handling, and full model adaptation, the engine can quickly upgrade any text-based large model into a "talkative" conversational multimodal large model.
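As a generic illustration of what "graceful interruption" (barge-in) usually involves, not a description of Lingling's actual implementation, the sketch below runs the agent's speech as a cancellable task and stops it the moment user speech is detected; the timing values and event sources are simulated stand-ins:

```python
# Generic barge-in sketch: TTS playback runs as a cancellable task,
# and a (simulated) voice-activity event cuts it off mid-utterance.
import asyncio
import contextlib

async def play_tts(sentences: list[str]) -> None:
    for s in sentences:
        print(f"[agent speaking] {s}")
        await asyncio.sleep(0.5)  # stands in for streaming audio playback

async def wait_for_user_speech() -> None:
    await asyncio.sleep(0.8)      # stands in for a VAD "speech started" event
    print("[user starts talking]")

async def speak_interruptibly(sentences: list[str]) -> None:
    playback = asyncio.create_task(play_tts(sentences))
    barge_in = asyncio.create_task(wait_for_user_speech())
    await asyncio.wait({playback, barge_in}, return_when=asyncio.FIRST_COMPLETED)
    if not playback.done():
        playback.cancel()         # yield the floor as soon as the user speaks
        with contextlib.suppress(asyncio.CancelledError):
            await playback
        print("[agent stops and listens]")
    barge_in.cancel()

asyncio.run(speak_interruptibly(
    ["Here is your briefing.", "First, the weather.", "Second, your schedule."]
))
```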
Yao Guanghua, head of the Lingling AI RTE product line, said: "After a period of refinement with customers and research into real-world scenarios, our statistics show that each dialogue between a user and the AI averages about 3 rounds of Q&A and lasts about 21.1 seconds, at a cost of only 3 cents per session. At 15 dialogues per month, the monthly cost is under 50 cents, and the annual cost is only about 5 yuan."
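The arithmetic behind those figures checks out, assuming "3 cents" means 0.03 yuan (3 fen) per session; a quick sanity check:

```python
# Sanity check of the quoted cost figures.
# Assumption: "3 cents" denotes 0.03 yuan (3 fen) per session.
cost_per_session = 0.03   # yuan, quoted per-session cost
sessions_per_month = 15   # quoted usage level

monthly = cost_per_session * sessions_per_month
annual = monthly * 12

print(f"monthly: {monthly:.2f} yuan")  # 0.45 yuan -> "under 50 cents"
print(f"annual:  {annual:.2f} yuan")   # 5.40 yuan -> roughly the quoted 5 yuan
```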
With the Lingling conversational AI engine, developers can quickly build conversational AI applications such as intelligent assistants, virtual companions, spoken-language practice partners, intelligent customer service, and smart hardware. In the intelligent assistant scenario, for example, natural language interaction can help people with schedule management, information retrieval, and task execution.
On the key issues in moving large models from text to multimodal interaction, the panelists noted that multimodal model architectures and training paradigms have not changed much; the main gains come from data quality and quantity. The key to multimodal interaction is converting information from different modalities into the same context, and advances in ASR (automatic speech recognition) help make this possible. To improve the interaction experience, however, models need faster inference, and engineering problems such as multi-role long- and short-term memory and role distinction must be solved, along with complexities specific to each modality, such as the gap between speech and its semantics, and video processing.
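To make the "same context" idea concrete, here is a minimal sketch, using hypothetical stand-in functions rather than any real ASR or LLM API, of how speech input can be normalized into a role-tagged text context that a text-only model then consumes, with a sliding window standing in for short-term memory:

```python
# Minimal sketch of the pipeline the panelists describe: inputs from
# different modalities are reduced to text in one shared, role-tagged
# context, which a text-only large model then consumes. All function
# bodies are hypothetical placeholders, not real service calls.
from dataclasses import dataclass, field

@dataclass
class Turn:
    role: str   # role distinction: "user", "assistant", or a named speaker
    text: str   # every modality ends up as text in the shared context

@dataclass
class Conversation:
    turns: list[Turn] = field(default_factory=list)

    def add(self, role: str, text: str) -> None:
        self.turns.append(Turn(role, text))

    def context(self, max_turns: int = 8) -> str:
        # Sliding window as a stand-in for short-term memory; long-term
        # memory would summarize or retrieve the older turns instead.
        return "\n".join(f"{t.role}: {t.text}" for t in self.turns[-max_turns:])

def transcribe(audio: bytes) -> str:
    """Hypothetical ASR stage: speech in, text out."""
    return "What's on my schedule today?"  # placeholder transcript

def generate_reply(prompt: str) -> str:
    """Hypothetical call to a text-only large model."""
    return "You have a design review at 3 pm."  # placeholder reply

conv = Conversation()
conv.add("user", transcribe(b"...audio frames..."))  # speech enters as text
conv.add("assistant", generate_reply(conv.context()))
print(conv.context())
```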
In addition, the attendees generally agreed that DeepSeek's breakout is a good thing: it pushes the boundaries of AI technology and draws more attention to AI. Its open-source nature is significant for technological development, promoting technical exchange and innovation and bringing more people into AI exploration. Technically, DeepSeek offers the industry new thinking, such as reducing reliance on massive training data, iterating through reinforcement learning so that models evolve on their own, and lowering compute requirements to make AI more accessible. It also validates the business model of model APIs and advances application development paradigms.
Feng Wen, Senior Director of Solutions at Minimax, said that DeepSeek's breakout is a good thing for every practitioner in the AI industry; compared with before, AI has quietly reached a much larger user base. "Open source will indeed greatly help the technology break through, thanks to DeepSeek's open-source nature. In the technical reports we have released recently, we actively showcase our latest achievements to the public."
This article was reprinted from CaiLiShen. GMTEight editor: Chen Wenfang.