Vidu

Chinese startup unveils promising text-to-video generator Vidu

They are all set to take on OpenAI's Sora

Chinese startup Shengshu Technology has unveiled its own text-to-video generator, Vidu, in a bid to take on OpenAI’s hotly anticipated Sora. Developed in collaboration with Tsinghua University, Vidu can generate 16-second video clips at 1080p resolution with a single click. While this is significantly shorter than the 60-second clips that Sora will offer, it is still a huge step for China, and we can expect to see growth in the coming months.

Can Vidu take on Sora?

“Vidu is the latest achievement of self-reliant innovation, with breakthroughs in many areas,” explained Zhu Jun, chief scientist at Shengshu and deputy dean at Tsinghua’s Institute for AI. “It is imaginative, can simulate the physical world, and produces 16-second videos with consistent characters, scenes, and timeline,” Zhu added.

One of the things that sets Vidu apart is its ability to comprehend Chinese elements, something that was demonstrated at the unveiling event, where attendees were treated to AI-generated clips of a panda playing a guitar on grass and a puppy swimming in a pool.

 


Powering Vidu is a seemingly home-grown model architecture called Universal Vision Transformer (U-ViT), which combines two AI model designs: the Diffusion model and the Transformer. According to the developers responsible for creating Vidu, this combination will allow users to create videos that feature dynamic camera movements, detailed facial expressions, and natural lighting and shadows. “After the release of Sora, we found that it closely aligned with our technical roadmap, which further motivated us to advance our research with determination,” Zhu added.

While Vidu is a promising offering from China, limited computational power has been the main obstacle to the country taking the fight to OpenAI. The market has seen an influx of ChatGPT clones, but none have been quite as good as the original. The same, it seems, is the case for Vidu. To put things into perspective, Sora requires eight NVIDIA A100 graphics processing units (GPUs) to run for over three hours to produce a one-minute clip. That is an enormous amount of processing power for a single one-minute clip, and the creators of Vidu are currently looking for ways to get their hands on better hardware.

However, this will not be easy, as the US government recently banned the export to China of advanced chips manufactured by companies like NVIDIA. These chips, including NVIDIA’s A100 and H100 GPUs, have become highly sought-after components for training AI systems. It remains to be seen if Shengshu Technology can get its hands on better hardware and improve Vidu, but for now, it seems that Sora is going to turn out to be the ultimate winner.