Stability AI gets into the video-generating game
AI startups that aren’t OpenAI are plugging away this week, it’d seem — sticking to their product roadmaps even as coverage of the chaos at OpenAI dominates the airwaves.
See: Stability AI, which this afternoon announced Stable Video Diffusion, an AI model that generates videos by animating existing images. Based on Stability’s existing Stable Diffusion text-to-image model, Stable Video Diffusion is one of the few video-generating models available in open source — or commercially, for that matter.
But not to everyone.
Stable Video Diffusion is currently in what Stability’s describing as a “research preview.” Those who wish to run the model must agree to certain terms of use, which outline Stable Video Diffusion’s intended applications (e.g. “educational or creative tools,” “design and other artistic processes,” etc.) and non-intended ones (“factual or true representations of people or events”).
Given how other such AI research previews, including Stability’s own, have gone historically, this writer wouldn’t be surprised to see the model begin to circulate on the dark web in short order. If it does, I’d worry about the ways in which Stable Video Diffusion might be abused, given that it doesn’t appear to have a built-in content filter. When Stable Diffusion was released, it didn’t take long before actors with questionable intentions used it to create nonconsensual deepfake porn, and worse.
But I digress.
Stable Video Diffusion comes in the form of two models, actually: SVD and SVD-XT. The first, SVD, transforms still images into 576×1024 videos of 14 frames. SVD-XT uses the same architecture but ups the frame count to 24. Both can generate videos at between three and 30 frames per second.
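For a rough sense of what those numbers mean in practice, clip length is simply frame count divided by frame rate. The short Python sketch below works that out using the frame counts and frame-rate range quoted above; the function and variable names are this writer’s own illustration, not anything from Stability’s code.

```python
# Frame counts as quoted in this article; names here are illustrative only.
FRAME_COUNTS = {"SVD": 14, "SVD-XT": 24}

# Frame-rate range quoted in this article (frames per second).
MIN_FPS = 3
MAX_FPS = 30


def clip_seconds(frames: int, fps: float) -> float:
    """Return playback duration in seconds for a fixed number of frames."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return frames / fps


def duration_range(frames: int) -> tuple[float, float]:
    """Return (shortest, longest) duration across the quoted fps range."""
    return clip_seconds(frames, MAX_FPS), clip_seconds(frames, MIN_FPS)


if __name__ == "__main__":
    for name, frames in FRAME_COUNTS.items():
        shortest, longest = duration_range(frames)
        print(f"{name}: {shortest:.2f}s to {longest:.2f}s")
```

By this arithmetic, a 24-frame clip played back at six frames per second runs four seconds; at the upper end of the frame-rate range, the same clip lasts under a second.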
According to a whitepaper released alongside Stable Video Diffusion, SVD and SVD-XT were initially trained on a dataset of millions of videos and then “fine-tuned” on a much smaller set of hundreds of thousands to around a million clips. Where those videos came from isn’t immediately clear — the paper implies that many were from public research datasets — so it’s impossible to tell whether any were under copyright. If they were, it could open Stability and Stable Video Diffusion’s users to legal and ethical challenges around usage rights. Time will tell.
Whatever the source of the training data, both SVD and SVD-XT generate fairly high-quality four-second clips. By this writer’s estimation, the cherry-picked samples on Stability’s blog could go toe-to-toe with outputs from Meta’s recent video-generation model as well as AI-produced examples we’ve seen from Google and AI startups Runway and Pika Labs.
But Stable Video Diffusion has limitations. Stability’s transparent about this, writing on the models’ Hugging Face pages, where researchers can apply to access Stable Video Diffusion, that the models can’t generate videos without motion or slow camera pans, be controlled by text, render text (at least not legibly) or consistently generate faces and people “properly.”
Still — while it’s early days — Stability notes that the models are quite extensible and can be adapted to use cases like generating 360-degree views of objects.
So what might Stable Video Diffusion evolve into? Well, Stability says that it’s planning “a variety” of models that “build on and extend” SVD and SVD-XT as well as a “text-to-video” tool that’ll bring text prompting to the models on the web. The ultimate goal appears to be commercialization — Stability rightly notes that Stable Video Diffusion has potential applications in “advertising, education, entertainment and beyond.”
Certainly, Stability’s gunning for a hit as investors in the startup turn up the pressure.
In April, Semafor reported that Stability AI was burning through cash, spurring an executive hunt to ramp up sales. According to Forbes, the company has repeatedly delayed or failed to pay wages and payroll taxes, leading AWS, which Stability uses for compute to train its models, to threaten to revoke Stability’s access to its GPU instances.
Stability AI recently raised $25 million through a convertible note (i.e. debt that converts to equity), bringing its total raised to over $125 million. But it hasn’t closed new funding at a higher valuation; the startup was last valued at $1 billion. Stability was said to be seeking quadruple that within the next few months, despite stubbornly low revenues and a high burn rate.
Stability suffered another blow recently with the departure of Ed Newton-Rex, who had been VP of audio at the startup for just over a year and played a pivotal role in the launch of Stability’s music-generating tool, Stable Audio. In a public letter, Newton-Rex said that he left Stability over a disagreement about copyright and how copyrighted data should — and shouldn’t — be used to train AI models.