Under limited computing power, why are the videos generated by a domestic large model comparable to Sora's?
Recently, a number of multimodal large models developed by Shanghai-based Xiyu Technology were released at Xuhui Binjiang. Dr. Yan Junjie, the company's founder, also spoke as an entrepreneur representative at the Global Venture Capital Conference of the 2024 Pujiang Innovation Forum. The videos generated by the large model, which he played during his speech, were impressive: whether a magical skit in the style of the Harry Potter films or a sci-fi clip of astronauts voyaging through space aboard a spacecraft, the experience they gave the audience was comparable to that of Sora, developed by OpenAI.
With limited computing power, how can domestic large models generate high-quality text, images, video, music and speech? Yan Junjie shared his views.
Yan Junjie graduated from the Institute of Automation of the Chinese Academy of Sciences, served as a vice president of SenseTime Group, and founded Xiyu Technology at the end of 2021. In his view, there are currently three important optimization directions for large AI models. The first is to keep reducing the model's error rate: most models still err often, performing brilliantly at times and unreliably at others, which has become a major bottleneck for handling complex tasks. The second is to achieve effectively unlimited input and output, a capability humans possess; because a large model's compute requirement grows with the square of the input-output length, it will soon hit a ceiling that available computing power cannot afford, and breaking this bottleneck requires innovation at the architectural level. The third is multimodality, meaning the model can generate text, sound, images, video and other modalities, interacting with users through various kinds of information.
Video generated by MiniMax large model
"How do we overcome technical difficulties in these three areas? We believe that within the same capabilities, faster is better," said Yan Junjie. "Among two models with similar performance, the one with faster training and reasoning can more effectively use computing resources to iterate more data, thereby obtaining better model capabilities. So we believe that faster is better. This is a simple but easily overlooked philosophical concept."
In pursuit of "speed", the MiniMax team has made a number of technical innovations to the large model. MoE is one of the innovations. When this architecture was not yet recognized by most experts, they decided to be the first in China to complete a breakthrough in the core MoE algorithm technology route.
It is reported that the design idea of the mixture-of-experts model is "every expert has a specialty": tasks are classified and then assigned to multiple "experts" to solve. Its counterpart is the dense model, which takes the "generalist" approach. Compared with one generalist, a group of experts can complete complex tasks more efficiently and professionally, and can also greatly increase model capacity without significantly increasing compute cost, making trillion-parameter large models feasible. In the abab-text-6.5s large language model developed by Xiyu Technology, the MoE model runs 3 to 5 times faster than a comparable dense model. This large model handles billions of interactions every day, and MoE plays a key role in that.
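The routing idea described above can be sketched in a few lines. This is a minimal toy illustration of top-k expert routing, not MiniMax's actual architecture; all sizes and names here are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions -- hypothetical, chosen only for illustration.
d_model, n_experts, top_k = 8, 4, 2

# Each "expert" is a small feed-forward weight matrix; a gating matrix scores them.
experts = [rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(n_experts)]
gate_w = rng.standard_normal((d_model, n_experts)) * 0.1

def moe_forward(x):
    """Route token x to its top-k experts; the other experts stay idle."""
    logits = x @ gate_w
    top = np.argsort(logits)[-top_k:]       # indices of the k highest-scoring experts
    weights = np.exp(logits[top])
    weights /= weights.sum()                # softmax over the chosen experts only
    # Only top_k of n_experts matrices are multiplied, so per-token compute
    # scales with k, not with the total parameter count.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

x = rng.standard_normal(d_model)
y = moe_forward(x)
print(y.shape)  # (8,)
```

The key property is visible in `moe_forward`: capacity grows with `n_experts`, but each token pays only for `top_k` experts, which is how MoE decouples model size from per-token compute.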
The linear attention mechanism is another technical innovation by the MiniMax team. Through algorithmic optimization, it turns the quadratic relationship between input length and computational complexity in the traditional model architecture into a linear one, a key step toward "effectively unlimited input and output".
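The quadratic-to-linear shift can be illustrated with a generic kernelized-attention sketch. This is an assumption-laden toy (a simple positive feature map standing in for whatever kernel MiniMax actually uses), meant only to show why associativity makes the cost linear in sequence length:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 4  # toy sequence length and head dimension
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))

def softmax_attention(Q, K, V):
    """Standard attention: the n x n score matrix makes cost O(n^2 * d)."""
    scores = Q @ K.T / np.sqrt(d)
    scores = np.exp(scores - scores.max(axis=1, keepdims=True))
    return (scores / scores.sum(axis=1, keepdims=True)) @ V

def linear_attention(Q, K, V, phi=lambda x: np.maximum(x, 0) + 1e-6):
    """Kernelized attention: associativity lets us form the d x d matrix
    phi(K)^T V once, so cost is O(n * d^2) -- linear in sequence length."""
    Qp, Kp = phi(Q), phi(K)
    kv = Kp.T @ V                       # (d, d), independent of n
    z = Kp.sum(axis=0)                  # normalizer, shape (d,)
    return (Qp @ kv) / (Qp @ z)[:, None]

print(linear_attention(Q, K, V).shape)  # (6, 4)
```

Because `kv` and `z` have shapes that do not depend on `n`, doubling the input length doubles the work instead of quadrupling it, which is the property the article attributes to the team's linear attention.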
Yan Junjie introduced the models and products developed by MiniMax.
Supported by technologies such as the mixture-of-experts model and the linear attention mechanism, the video model abab-video-1 features a high compression rate, good responsiveness to text prompts, and native support for high-resolution, high-frame-rate video with a texture approaching that of film. The music model abab-music-1 supports versatile end-to-end music generation: it can synthesize pure instrumental music, a cappella works and other forms, and can generate accompaniment and vocals simultaneously. It is expected to greatly simplify music recording and creation, letting laypeople take part in making music. Readers can log in to the web version of "Conch AI" to experience creating videos and music.
Video generated by MiniMax large model
Xiyu Technology has also updated its speech model abab-speech-1, which can generate synthesized speech in multiple languages, including Mandarin, Cantonese, Japanese, Korean and Spanish, with highly human-like voices and delicate, natural emotional variation.
Yan Junjie said that the MiniMax large model currently interacts with end users 3 billion times a day, processes more than 3 trillion tokens of text daily, and generates 20 million images and 70,000 hours of speech.
Video generated by MiniMax large model
The 3 billion daily interactions come both from the company's own products, such as "Conch AI" and "Xingye", and from partners on its open platform. For example, Kingsoft Office has cooperated with MiniMax: using chain-of-thought prompting, WPS can display the large model's reasoning steps when generating document summaries and answering user questions, improving the transparency and credibility of its answers. The mobile office platform DingTalk gained the ability to generate copy and follow specified formats, raising users' productivity. The online literature site Yuewen gained the ability to quickly grasp the overall context of a text, so its audiobook productions of novels maintain emotional consistency and can accurately analyze characters' emotions for stylized narration. The human resources platform Zhaopin fine-tuned the model with vertical-industry and occupation-wide data, greatly improving the accuracy of AI interview evaluation, job-description information extraction and resume matching.
With the release of its video, music and speech models, Xiyu Technology has built a complete suite of multimodal large-model products. Yan Junjie revealed that in the coming weeks the company will release the multimodal large model abab7, which is expected to be comparable to GPT-4o in speed and effect and will be tested by partners and end users.