The wonders and dangers of artificial intelligence-generated videos
Madhavankutty Pillai | 23 Feb, 2024
(Illustration: Saurabh Singh)
OPENAI IS a unique company in the history of business. Capitalism relies entirely on the lure of profit to make new goods and services available, and in doing so it advances humanity itself in a way that other systems can’t. OpenAI’s profits are capped and its mission is not just the creation of artificial intelligence (AI) products but also ensuring that the technology itself is safe. Despite not being a Meta, Apple, or Alphabet with their billions in profits, it has managed to be right at the front of the AI revolution. It more or less unveiled the era when ChatGPT was launched; for users, it almost seemed like there was a human at the other end answering queries, precisely the illusion AI is expected to create. It also came out with DALL-E, which could create images from whatever prompts were given to it, and again, the user was hard put to differentiate between what the system threw up and what a human mind at the other end could have done with the same prompt. On February 15 came the next big gamechanger: generative video. It was an inevitability, but the world of technology was still sufficiently shocked at the quality of what was being churned out.
The name OpenAI has given it is Sora, which means sky in Japanese. It was only shown as a demonstration, and the company said that a select few, including creative professionals who work with video, have been given Sora to test. A public launch is imminent. On the OpenAI website, there are videos created by Sora, and they indicate the length of the leap that is happening. For instance, there is this prompt: “A stylish woman walks down a Tokyo street filled with warm glowing neon and animated city signage. She wears a black leather jacket, a long red dress, and black boots, and carries a black purse. She wears sunglasses and red lipstick. She walks confidently and casually. The street is damp and reflective, creating a mirror effect of the colorful lights. Many pedestrians walk about.” The video created from it shows exactly that in vivid detail. Over one minute, the woman walks in slow motion while a Tokyo street and its lights gleam in the background. In between, there is a close-up of her face, and there is little to suggest that a programme imagined all of it. It is still not as good as the real thing but could pass for a video shot on a phone with a filter applied. Another clip, made from a prompt that sought a “drone view of waves crashing against the rugged cliffs along Big Sur’s garay point beach”, seemed as authentic as what any videographer would have taken.
The technology, by the company’s own admission, is far from perfect. A series of its videos shows such examples. In one of them, a prompt asking for five wolf pups has the animals sometimes emerging out of each other’s bodies. The website states: “The current model has weaknesses. It may struggle with accurately simulating the physics of a complex scene, and may not understand specific instances of cause and effect. For example, a person might take a bite out of a cookie, but afterward, the cookie may not have a bite mark. The model may also confuse spatial details of a prompt, for example, mixing up left and right, and struggle with precise descriptions of events that take place over time, like following a specific camera trajectory.”
But what is on offer today must be compared to what existed in generative video just in the recent past. Meta had come out with a text-to-video generation model a year ago, and the end result was visibly clumsy. Another famous example of the time was a Reddit user using ModelScope with a text prompt that asked for Hollywood actor Will Smith to eat spaghetti. It produced a likeness, but the face was rubbery and twisting into itself, and the spaghetti wasn’t exactly going into the mouth. This was an example that tech YouTuber Marques Brownlee showed on his stream while talking about Sora, saying: “It is absolutely ridiculous how far we have come in one year.”
Generative videos are a double-edged sword. The biggest concern by far is what will happen once the technology becomes very good, opening a whole new Pandora’s box on safety. Deepfakes, in which images of people are morphed into videos, have already set alarm bells ringing over the consequences they can have. But deepfakes require some level of technical ability. Generative video sweeps that condition away in a blink: any literate person, including criminals, could potentially make a high-quality deepfake just by giving a prompt.
The demonstration of Sora, even before it is publicly available, has led people to ask what impact it can have on politics, especially elections. As The New York Times wrote: “It could also become a quick and inexpensive way of creating online disinformation, making it even harder to tell what’s real on the internet. ‘I am absolutely terrified that this kind of thing will sway a narrowly contested election,’ said Oren Etzioni, a professor at the University of Washington who specializes in artificial intelligence.” Many experts think that even at Sora’s current level of refinement, people could still be fooled, because not everyone is aware that videos can be manipulated the way images can.
The Brookings Institution published a study recently on how 2024 is going to be a crucial year in politics, with a record number of countries holding elections involving just over 40 per cent of the global population. This includes India and the US. Generative AI could have an outsized influence, with disinformation across the spectrum from text to audio to video. One of the study’s recommendations was for election officials to make voters aware of how they might be misled. It said: “For video, this could include questions such as: Does the audio look like it is synced to the movements of the person’s mouth? Does the person depicted ever pause? What are the eyes doing during the video? Do gestures and movements seem natural? For images, the questions could include: Do the hands have an unnatural number of fingers? What does the background look like? Are accessories distorted? And do reflections in mirrors converge at a single point? These types of signals are far from infallible, and the best models are quickly learning how to address some of these issues. But, in a historic election year, they may still be useful clues as voters encounter information online.” Imagine, however, the efficacy of such voter education in a country like India, with its levels of illiteracy and with phones and videos ubiquitous. Is it even possible for such a lesson in awareness to reach a villager in the interior of the country? And what are the odds that they will doubt fake videos attempting to alter their political decisions? The more likely scenario is that the technology will advance faster than voter awareness, and only after it does some damage will awareness grow.
OpenAI has a number of competitors who will all come out with equally advanced models. They will also be forced to hasten the launch of their products to gain first-mover advantage in the new market being created. Old hands like Meta, Alphabet, and Amazon, and newer companies like Runway and Stability AI, are all in the generative video race. That means a lot of development will happen, and also that it will become harder to ensure safeguards because so many are vying for the same pie. There are also companies in countries like China whose privacy and safety concerns don’t align with the Western model, and they can’t be regulated or policed by anyone except their own governments.
A big market for AI videos will be pornography, and it is already taking unusual forms. In January this year, Fortune did a story after speaking to 18 AI developers and founders to find out the impact of AI on pornography. They found a wide range of online AI sex companions on offer and in use. One example of such a product was Chub AI, which makes erotic chatbots available to users. Fortune wrote: “On Chub AI, a website where users chat with artificially intelligent bots, people can indulge their wildest fantasies. For as little as $5 a month, users can get teased by a ‘fat lazy goth’ anthropomorphic cat, or flirt with a ‘tomboy girlfriend’ who works at a truck-stop café. They can also visit a brothel staffed by girls under 15. The brothel, advertised by illustrated girls in spaghetti strap dresses and barrettes, promises a chat-based ‘world without feminism’ where ‘girls offer sexual services’. Chub AI offers more than 500 such scenarios, and a growing number of other sites are enabling similar AI-powered child pornographic role-play.” While big companies like OpenAI and Meta have policies in place that prevent the use of the technology for pornography, that has not stopped startups from using those technologies to make their porn offerings. Chub AI is planning to introduce images to go along with its chatbots. It will then only be a matter of time before videos become a mainstay. When pornography goes from text to images to videos, and the technology allows it to be interposed onto real people, then not just celebrities but even ordinary people might find themselves turned into fantasies.
Collateral damage of generative video will also be an enormous number of jobs lost in the world of video content, which ranges from advertising to filmmaking. For example, even the trailer of a movie has an ecosystem of people commissioned and paid to make it. One of the demonstrations of Sora was a trailer for a science-fiction movie from a mere prompt, and it was slick. At some point, when all you have to do is feed in movie footage and ask the AI to make a trailer out of it, there is really no need for too many humans to be around. Or, in advertising films, why incur the expense of shooting stock footage of locations or products when all that can be done with text prompts? Last November, at a summit hosted by Bloomberg TV, Jeffrey Katzenberg, the founder of DreamWorks Animation, which made movies like Shrek and Kung Fu Panda, said that he thought AI would be a creative tool like a new brush or camera. But that still means jobs sacrificed. According to him, once upon a time 500 animators were needed to make a movie, and he thought it would soon be done with fewer than 50.
The benefits of the technology are the flip side of the same coin: it would make production immeasurably cheaper and much faster, and video content creation easier. Everyone could show off their filmmaking abilities in their Instagram reels. It would spawn entirely new businesses, as every technology does, even if it didn’t set out to do so. After video, generative AI will move into fields like virtual reality, and that will trigger a new slew of applications. The only question is how easily societies adapt to the new realities being forced on them by AI. Some will resist, but it will be futile. Once, there were labour strikes against the introduction of computers; anyone suggesting that now would be looked upon as delusional.