Eight domestic and international AI companies have successively launched new, publicly available products or models capable of generating videos longer than 10 seconds. Some claim to support ultra-long generation of up to 2 minutes, igniting a heated "2.0 battle" in the AI video generation field.
On one side, ByteDance took the lead with its AI video generation product Jimeng, extending generation length from the common 3-4 seconds to 12 seconds. Kuaishou, quiet for a long stretch, then released the Keling large model; its stunning results sparked heated discussion across the internet, with the waitlist peaking at nearly 1 million users.
On the other side, startup Luma AI "abandoned 3D for video," making a high-profile entrance with Dream Machine. Veteran player Runway also stepped up, releasing the new Gen-3 model and pushing physical simulation to new heights.
The financing battlefield is equally intense. Domestically, Aisi Technology and Shengdata Technology have each closed financing at the hundred-million-yuan level since March. Overseas, Pika raised $80 million in June, doubling its valuation to $500 million, while Runway is reportedly preparing a round of up to $450 million.
Sora hit the AI video generation world like a bombshell. After 5 months of intense catch-up, how far have domestic and international AI video generation products come? Can they rival Sora? What challenges lie ahead? Through side-by-side testing of the available products and conversations with practitioners and creators, Zhidongxi has dug into these questions.
In hands-on tests, the progress was palpable: generation is faster, "failed" outputs are far less frequent, and motion has evolved from simple "PowerPoint-style" panning to shifting camera angles and character action. Overall, among the freely available products, Jimeng and Keling lead on duration, stability, and physical simulation.
In terms of financing, both the frequency and the size of AI-video-related deals have risen sharply since Sora's release, attracting over RMB 4.4 billion in 5 months. This has also drawn capital toward "upstream and downstream" products in the video production pipeline, such as AI editing and AI lighting. Multiple new players have entered the field as well, some securing hundred-million-yuan-level funding before releasing any product or technology.
I. Technology War: Competing on Duration, High Definition, and Physical Simulation
On February 16th, OpenAI released Sora, upending the AI video generation field overnight. Yet 5 months later, Sora remains a "futures" product, with no clear timeline for general availability.
During this period, domestic and international tech giants and startups have raced to release new products or model upgrades, most of which are already open to all users. Some deliver stunning results, once again reshaping the AI video generation landscape. After all, no matter how good Sora is, what value does it have if no one can use it?
According to Zhidongxi's incomplete statistics, at least 8 companies have released new products or models since Sora's launch, all of which are publicly available except for Shengdata Technology's Vidu.
On February 21st, Stability AI launched the web version of its AI video generation product Stable Video, open to all users. Its underlying model, Stable Video Diffusion, had been open-sourced the previous November, but as a raw model it still carried deployment and usage barriers; packaging it as a web product makes it simple and convenient for far more users.
On April 27th, Shengdata Technology, in collaboration with Tsinghua University, released Vidu, a large model for long-duration, highly consistent, highly dynamic video. It is said to generate clips up to 16 seconds long at 1080P and to mimic the real physical world.
The released demos suggest Vidu has indeed achieved strong results in clarity, range of motion, and physical simulation. Unfortunately, like Sora, Vidu is not yet publicly available. Zhidongxi inquired with Shengdata Technology and learned that internal testing will begin soon.
On May 9th, Dreamina, the AI creation platform under ByteDance's Jianying, was renamed "Jimeng" and gained AI image and video generation features, supporting videos up to 12 seconds long.
On June 6th, Kuaishou released the Keling AI video large model and launched it in the Kuaiying App, where users can apply for access simply by filling out a questionnaire. Keling emphasizes faithful simulation of real-world physics, tackling cases like the "eating noodles" problem that has stumped many AI models, as shown in its official demo videos.
Currently, Keling generates videos at fixed durations of 5 or 10 seconds. According to its official website, the model can produce videos up to 2 minutes long at 30fps and 1080P, with features such as video continuation to be launched in the future.
On June 13th, Luma AI, a startup previously focused on AI-generated 3D, launched its video generation tool Dream Machine, which turns text or images into 5-second videos and offers an extension feature that lengthens a generated video by 5 seconds at a time.
On June 17th, Runway released its new-generation model Gen-3 in Alpha, opening it to all users as a paid service on July 2nd with subscriptions starting at $15 per month. Gen-3 currently supports text-to-video at 5 and 10 seconds; image-to-video and other controllability tools are not yet available.
On July 6th, HiDream released HiDream Large Model 2.0 at WAIC, offering 5-, 10-, and 15-second generation and adding capabilities such as rendering embedded text in video, script-based multi-shot generation, and IP coherence and consistency.
On July 17th, Haiper AI, a UK AI startup previously focused on AI 3D reconstruction, announced that its AI video generation product Haiper has been upgraded to v1.5, extending the duration to 8 seconds and providing functions such as video extension and image quality enhancement.
Judging by the specs, these products have made the most visible progress on duration: baseline generation has moved from the previous 2-4 seconds to 5 seconds, more than half support over 10 seconds, and several offer extension features. Among the freely available products, Jimeng currently offers the longest single generation at 12 seconds.
In terms of visual quality, resolution and frame rate have improved markedly: more products now support 720P and above, with frame rates approaching 24/30fps. Previously, most products generated video at around 1024×576 and 8-12fps.
II. Product War: Hands-On with 6 Free "Spot" Products, with "Douyin and Kuaishou" in the Lead
When Sora first came out, Zhidongxi ran an in-depth test of 8 AI video generation tools available in China; at the time, quality varied widely and "failures" were common. (The First "Chinese Version of Sora" Comparison! 15 Companies Face Off, ByteDance Leads)
So after several months of iteration, how do the players that have already handed in new answers actually perform? Zhidongxi tried out each newly released or upgraded AI video generation product. For fairness, only free features were tested, and the first video generated was used in every case.
It should be noted that video generation carries a luck factor akin to "drawing cards," and results depend heavily on how prompts are written, so a small number of cases cannot fully represent a model's capabilities.
For the first test, I chose a still life scene, with the prompt: A close-up of tulips bathed in warm sunset light.
Stable Video handled this prompt with high stability, sharp imagery, and rich color, though its motion came mainly from camera movement.