ON MARCH 18, the Delhi High Court held yet another hearing in a case filed by the news agency ANI against OpenAI, which owns ChatGPT. ANI is not alone. A number of other Indian media organisations have intervened in the case. Abroad, it is the same story: The New York Times has a similar lawsuit against OpenAI. These cases will decide how much ownership humans can retain over their work when it is taken up by artificial intelligence (AI). Clarity is needed because the sudden explosion of the technology, and its very nature, presents novel and perplexing issues. For instance, if a person uses someone else’s work without permission, it is usually a copyright violation. But if the work is in the public domain and is used only as inspiration for something new and unlike the original, it is considered fair use. That, however, applies to humans.
AI products like ChatGPT are based on what are termed Large Language Models (LLMs). LLMs come up with responses to prompts while mimicking how humans would phrase them. They are not thinking up a new answer as a human being would; instead, they predict what the answer ought to be based on probability. To do that, they must be trained on phenomenal amounts of data available online so they can find patterns. A part of this data comes from what mass media have put out. Because the technology was new and not public, media organisations, and anyone else who had data online, were unaware this training was happening. Once the launch of ChatGPT ushered in the generative AI era, it quickly became clear that the models had been using for free what others had spent money to create. Companies like OpenAI, Google and Meta, which own these LLMs, are in the business of profit. But they had been getting their raw material, which is online information, gratis.
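To see what "predicting by probability" means in the simplest possible terms, here is a minimal, hypothetical sketch in Python. It is a toy word-pair counter, not how LLMs actually work internally (they use vast neural networks trained over tokens), and the tiny corpus is invented purely for illustration; but it captures the principle that the model emits the statistically likeliest continuation rather than a reasoned answer.

```python
# Toy illustration of next-word prediction by probability.
# Real LLMs are neural networks with billions of parameters; this
# bigram counter only demonstrates the core idea that the model
# picks the statistically likeliest continuation, not a "thought-up"
# answer. The corpus below is invented for the demo.
from collections import Counter, defaultdict

corpus = "the court heard the case . the court reserved the order".split()

# Count how often each word follows each preceding word.
follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def predict_next(word: str) -> str:
    """Return the most frequently observed next word from training data."""
    candidates = follows[word]
    return candidates.most_common(1)[0][0] if candidates else "<unknown>"

print(predict_next("the"))  # -> 'court' (the word seen most often after 'the')
```

Scale that counting idea up to trillions of words, and it becomes clear why LLMs need such enormous volumes of training data, and why the provenance of that data is now being fought over in court.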
Organisations like ANI and the New York Times argue that they put in money and effort to create their content and that, just because it is available online, it cannot be used by other businesses without an agreement. OpenAI’s position is that the material was public and the output is something entirely different. Its statement on the issue usually says: “We build our AI models using publicly available data, in a manner protected by fair use and related principles, and supported by long-standing and widely accepted legal precedents.”
The Delhi High Court, in the first hearing, noted that the suit raised new legal issues because of technological advancements. It spelt out the questions before it: whether ANI’s copyright was infringed, either in training the AI or in the responses it generated for users; and whether OpenAI’s use amounted to ‘fair use’. There was also the question of whether an Indian court had jurisdiction at all, since OpenAI did not operate here as a company even though Indians used ChatGPT. Incidentally, once ANI had filed the case last year, OpenAI stopped using its data to train its models. But, as the hearing this week indicated, that did not mean the practice had stopped, because ANI, as a news agency, supplies its reports to publications that are its customers, and OpenAI could still be training on those publications’ data, an indirect breach. The legal webzine Bar and Bench quoted ANI’s lawyer stating at the hearing, “While they crawl from my (ANI’s) website, they also crawl material from my subscribers. While they say that they don’t take content from me, they are taking from my subscribers. I do not cease to have control over copyright. I don’t divest control… Merely because I have licensed content for public viewing by a person who has paid subscription or a license fee, I do not cease to have control over that content.”
The issue also includes unexpected elements. AI models are prone to a phenomenon called hallucination, in which they make up facts. Sometimes they do so while attributing the invention to a source, and that is one of the arguments the New York Times has used in its copyright infringement case. As an article the newspaper published after filing the case said, “The lawsuit also highlights the potential damage to The Times’s brand through so-called A.I. ‘hallucinations,’ a phenomenon in which chatbots insert false information that is then wrongly attributed to a source. The complaint cites several cases in which Microsoft’s Bing Chat provided incorrect information that was said to have come from The Times, including results for ‘the 15 most heart-healthy foods,’ 12 of which were not mentioned in an article by the paper.”
The New York Times also says ChatGPT sometimes reproduced its articles verbatim in its answers. Its complaint gave the example of an investigative story on which it had spent huge resources. “For example, in 2019, The Times published a Pulitzer-prize winning, five-part series on predatory lending in New York City’s taxi industry. The 18-month investigation included 600 interviews, more than 100 records requests, large-scale data analysis, and the review of thousands of pages of internal bank records and other documents, and ultimately led to criminal probes and the enactment of new laws to prevent future abuse. OpenAI had no role in the creation of this content, yet with minimal prompting, will recite large portions of it verbatim,” it said.
It is not just in AI’s text output that the issue of copyright is felt. AI models generate images and videos too, based on prompts. Photographers have long complained that their images were used in training such models. Last year, they came together to make an app called Overlai, which protects their work from being used for AI training. Paul Nicklen, one of the world’s leading conservation photographers and a cofounder of the app, put up an Instagram post that said, “I am not willing to just hand over the last 30 years of hard work so giants can make an even bigger profit. Overlai addresses this by protecting our work with C2PA and IPTC assertions and creator credentials.” The photo agency Getty Images has filed a case against Stability AI, accusing it of using 12 million photos for training. A group of photographers has also filed a class action lawsuit in the US against four companies—Stability AI, DeviantArt, Midjourney and Runway AI—that specialise in image generation. While joining the suit, Singaporean artist Jingna Zhang wrote on her website, “Copyright and its protections made the professional pursuit of my craft possible. But the rapid commercialization of generative AI models, built upon the unauthorized use of billions of images—both from artists and everyday individuals—violates that protection. This should not be allowed to go unchecked.”
The copyright issue also inverts itself for works made with AI. If it is not legally clear whether AI models flouted the copyright of data owners, neither is it certain that what they generate enjoys copyright protection. In India, an interesting case around this arose even before the AI era took off in 2022 with ChatGPT’s launch. In 2020, an Indian lawyer specialising in intellectual property commissioned a machine learning tool that trained on famous artworks. It could then come up with its own paintings based on uploaded photographs. The tool was called RAGHAV. The lawyer uploaded a photograph of a sunset and got RAGHAV to produce an artwork called Suryast, similar in style to Vincent van Gogh’s work. He applied to register it with the copyright office in India and succeeded, with himself and RAGHAV listed as joint authors. This became controversial because there was no precedent, and the next year the copyright office withdrew the registration. Whether anything generated through AI—texts, artworks, video clips, music—can be owned and protected against duplication by others remains a nebulous area.
Lawsuits across the world will eventually address these concerns and provide a legal framework. The era of freely lifting data, however, is probably on its way out. AI companies now strike partnerships with media organisations. What might be irking Indian organisations about OpenAI is that it has arrangements with numerous media organisations abroad but none here; the company says such deals are in the works. Even so, that only addresses organisations with the wherewithal to enforce their rights. Millions of individuals, such as freelance photographers, whose creative works were used to train AI models in the past, will receive nothing; it is not even feasible for companies to fund compensation at such volumes. On the flip side, as laws bring in clarity, new players who want to build AI models will find it difficult, because they will have to pay for training data that the early birds got for free. Or they might have to do what OpenAI, ironically, accuses the Chinese company DeepSeek, which burst on the scene recently, of doing—using its data without permission to become a competitor.