AI language models are a double-edged sword as they help users both learn and cheat
(Illustration: Saurabh Singh)
LATE LAST YEAR, The New York Times sued Microsoft-backed, Sam Altman-headed OpenAI, the creator of ChatGPT and other generative Artificial Intelligence (AI) platforms, for training its Large Language Models (LLMs) on the newspaper's published material and, in the process, violating copyright. LLMs, the models that power such chatbots, must be trained on existing data, including text, images, and ideas, before they can write essays and respond to queries based on prompts by users. Early in March this year, several authors sued Nvidia Corp, the US-based chip major that has become the poster child of the AI revolution, arguing that its conversational AI toolkit, NeMo, used their copyrighted content. According to a Reuters report, authors Brian Keene, Abdi Nazemian, and Stewart O'Nan claimed Nvidia used their copyrighted books without permission to train NeMo.
The two legal cases mentioned above are among numerous lawsuits that the makers of generative AI, which essentially creates new content from inputs such as text, images, sounds, and ideas, continue to face as they grapple with what researchers describe as multi-modal copyright infringement. With global and local players churning out LLMs and tools that generate images from text prompts, the demand is that original content creators be acknowledged or compensated, or both, perhaps through embedded codes or other means. The foremost goal, on this view, must be to protect creative, journalistic, and academic output from multi-modal copyright infringement by AI apps that otherwise have zero accountability to the reader, unlike journalists, writers, or artists. Simply put, according to several pundits, LLMs should be held to the same standard of accountability as every other industry, from medicine to the media. Any use or simulation of existing content has to be monitored and regulated, or it can lead to grave outcomes, a section of AI watchers contends.
Indeed, regulations are being brought in Europe as well as in India, especially in the wake of concerns about a massive rise in deepfakes, which text-to-image AI tools can produce with tremendous ease, and amid the growth of regional-language LLMs and AI tools. European Union (EU) lawmakers have passed the world's first comprehensive regulation for AI, titled the AI Act. Many of its rules are yet to take effect, and India is working on a draft regulatory framework of its own, which the government says it plans to unveil shortly after the elections this year. Copyright infringement thus remains an evolving subject, and, for now, researchers, AI experts, and industry insiders insist that AI companies promptly comply with certain transparency and ethical requirements. There is no denying that AI can be used in multiple sectors, from infrastructure to advanced surgery, but generative AI apps in particular call for transparency requirements, suggest officials who work on such policies in India and Europe. According to an article published on the European Parliament's website titled 'EU AI Act: First Regulation on AI', generative AI, like ChatGPT, will have to comply with transparency requirements: disclosing that the content was generated by AI; designing the model to prevent it from generating illegal content; and publishing summaries of copyrighted data used for training.
It adds, “High-impact general-purpose AI models that might pose systemic risk, such as the more advanced AI model GPT-4, would have to undergo thorough evaluations and any serious incidents would have to be reported to the European Commission.”
The response to LLMs cuts both ways. There are those, like linguist and public intellectual Noam Chomsky, who call the technology high-tech plagiarism, and others in academia who believe it is here to stay and can complement research and learning. Regardless, there seems to be a sensible consensus around compliance with transparency laws as envisaged in the AI Act. After all, it is unfair to creators of original content that high-tech companies indulge in what AI circles call scraping, the automated collection of data, without compensation or acknowledgement.
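What scraping means in practice can be sketched in a few lines of code. The snippet below is a minimal illustration, not how any particular AI company operates: it fetches a single web page and strips out its visible text. The URL is a placeholder; industrial crawlers repeat this across billions of pages.

# A toy sketch of "scraping": automatically download a web page
# and keep only its visible text. The URL is a placeholder.
import urllib.request
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the visible text fragments from an HTML page."""
    def __init__(self):
        super().__init__()
        self.chunks = []
    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())

url = "https://example.com/"  # placeholder page
with urllib.request.urlopen(url) as response:
    html = response.read().decode("utf-8", errors="replace")

parser = TextExtractor()
parser.feed(html)
print(" ".join(parser.chunks))

Text harvested this way, multiplied across the open web, is the raw material the lawsuits described above are fighting over.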
In her report for the Canada-based think-tank Centre for International Governance Innovation (CIGI), Daria Bielik looks at the pros and cons of LLMs on "student learning outcomes and academic performance in higher education settings". She concedes that LLMs offer a unique opportunity to bridge linguistic gaps, assisting non-native English speakers in achieving academic writing fluency and linguistic proficiency at scale while minimising universities' costs, but writes, "Despite the potential benefits, ethical concerns arise regarding LLMs' fair use in aiding students, particularly concerning plagiarism implications, the risk of over-dependence, and the reliability of AI detection tools". Her argument is cogent: "Plagiarism involves not only copying text verbatim but also appropriating someone else's ideas or work without acknowledgment. LLMs, while not engaging in literal copying, have raised multiple concerns about the potential for plagiarism due to their capacity to summarise and present others' work without proper credit. The inherent nature of LLMs blurs the distinction between original content creation and the re-presentation of existing information, potentially complicating proper sourcing and acknowledgment."
The goal must be to protect creative, journalistic, or academic output from copyright infringement by AI apps that otherwise have zero accountability to the reader, unlike writers or artists
She favours the integration of LLMs with academia but, citing previous studies, offers a caveat: by prioritising education, preventive measures, and clear communication regarding the ethical use of technology, universities should aim to maintain academic standards and integrity while navigating the evolving landscape of emerging technologies.
Mumbai-based AI expert Ritesh Bhatia, founder of V4WEB Cybersecurity, weighs in with a suggestion, especially for Indian LLM makers still taking baby steps in the segment: "LLMs will have to be trained by their creators to not reproduce content. As of now, LLMs are not that smart that they can create content all by themselves, but in future they will be able to create content that is not just unique but also humanised in its writing style." Notably, AI watchers say India will also see lawsuits once writers and content creators realise that their works have been used to train LLMs without compensation or acknowledgement. Globally, the likes of David Baldacci and John Grisham have sued generative AI behemoths for flouting copyright norms, a euphemism for plagiarism. In India, Tech Mahindra has taken the lead in developing LLMs through its Indus project, which largely focuses on the Hindi language and its dialects. Other options are available across several Indian regional languages, including Tamil, Telugu, Malayalam, and Odia; most of these models are based on Meta's LLaMA. Then there are AI chatbots like BharatGPT and Sarvam AI.
Amid charges of high-tech plagiarism, there are AI watchers who counsel that this is a passing phase and that LLMs will become synonymous with learning and research. Bengaluru-based writer and former MNC banker Sreejith Sreedharan, author of Future of Work: AI in HR, says, “At present, AI is in its developmental phase, with intense competition driving innovation. Given that AI development is predominantly driven by the private sector, the race for supremacy is fierce, with everyone vying for a piece of the pie. In this landscape, plagiarism persists as a means to claim one’s rightful share in an unequal world. Managing plagiarism in the context of AI presents a complex challenge. While direct replication of content will be met with condemnation and penalties, current AI capabilities enable the synthesis and reinterpretation of human knowledge stored within its systems. I believe these initial hurdles are surmountable, as the benefits of this transformative technology become increasingly apparent to society.”
Sreedharan argues that since AI represents a groundbreaking shift in technology, unparalleled in its transformative potential, it serves as the harbinger of a new era for humanity, one characterised by greater equity at its core. “Attempting to gauge the capabilities of this technology through existing frameworks will only impede its adoption. Established ideas surrounding copyright must adapt to accommodate this new paradigm. We are currently in the nascent stages of this transition, marked by upheaval as deeply entrenched beliefs and philosophies are challenged,” he notes, adding that the capacity to generate text and images represents just a fraction of AI’s potential. He elaborates, “Currently, there exist distinct categories of GenAI applications. One particularly promising aspect is its ability to produce synthetic data that mirrors the statistical attributes of real-world data. This development has sparked excitement within the pharmaceutical industry, offering new avenues for drug discovery. Additionally, AI’s predictive analytics capabilities stand out as a game-changer, capable of processing vast amounts of data to uncover hidden patterns, facilitating easier analysis and problem-solving. For instance, improved weather forecasting resulting from predictive analytics could greatly benefit climate scientists and, subsequently, farmers responsible for food production. While I am not inclined to endorse a dystopian view, it is evident that AI technology will continue to evolve to address the transitional challenges we face, stemming largely from our struggle to adapt to the rapidly changing realities it presents. Undoubtedly, this represents a profound challenge to established philosophies and sensibilities. Looking ahead, the future iterations of advanced AI are poised to surpass the capabilities we currently observe.”
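The synthetic-data capability Sreedharan mentions can be illustrated with a toy example. The sketch below is a minimal illustration, nowhere near a pharma-grade pipeline: it fits a simple statistical model (mean and covariance) to stand-in "real" measurements and then samples new rows that share those attributes. All numbers are invented for demonstration.

# A toy sketch of synthetic data: fit a multivariate Gaussian to
# real tabular data, then sample new rows that share its mean and
# covariance. Real pipelines use far richer generative models.
import numpy as np

rng = np.random.default_rng(seed=42)

# Stand-in for real measurements (e.g. 500 subjects x 3 variables).
real = rng.normal(loc=[120.0, 80.0, 70.0],
                  scale=[15.0, 10.0, 8.0], size=(500, 3))

# Estimate the statistical attributes we want to preserve.
mean = real.mean(axis=0)
cov = np.cov(real, rowvar=False)

# Draw synthetic rows from the fitted distribution.
synthetic = rng.multivariate_normal(mean, cov, size=500)

print("real mean:     ", np.round(mean, 1))
print("synthetic mean:", np.round(synthetic.mean(axis=0), 1))

The appeal for industries like pharmaceuticals is that such data can be shared and analysed without exposing the original, possibly sensitive, records.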
Bhatia, too, shares this view. According to him, LLMs have been designed based on the content available on the internet. “It’s at its nascent stage. Even the creators are improvising on it. They are learning from mistakes. I would say that AI has become so intelligent that it is learning from its own mistakes.”
While some experts blame the new AI chatbots for high-tech plagiarism, others state the technology is here to stay and can complement research. There is, however, consensus on compliance with transparency laws
WHATEVER EXPERTS SAY, transparency is central to the laws being formalised across various parts of the world to regulate AI and give people their due for original work. After all, organisations like The New York Times and authors like John Grisham will otherwise be forced to compete with people who use synthesised versions of their content. "Compliance with laws that insist on not violating copyrights while training LLMs will have to be met," says a government official who works on AI policymaking. At the same time, fair use must be legally allowed for training, he adds.
To check for such non-compliance and plagiarism, several apps are already available and zealously used by researchers and publishers to look for stolen content and ideas. Tools such as Plag.ai, GPTZero, and Turnitin flag similarities in content and other anomalies, although they also produce false positives. Technologists expect such applications to become far more robust going forward.
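These vendors do not publish their exact methods, but the basic idea behind similarity checking can be sketched in a few lines. The toy function below is a simplification, not how Turnitin or any named tool actually works: it treats each document as a set of overlapping word triples and scores the overlap between two texts. Real detectors layer far more machinery on top.

# A toy illustration of how similarity checkers flag overlapping text:
# compare documents as sets of word n-grams and score their overlap
# (Jaccard similarity). Commercial tools are far more sophisticated.

def ngrams(text: str, n: int = 3) -> set[tuple[str, ...]]:
    """Lowercase the text and return its set of word n-grams."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def similarity(doc_a: str, doc_b: str, n: int = 3) -> float:
    """Jaccard similarity between the n-gram sets of two documents."""
    a, b = ngrams(doc_a, n), ngrams(doc_b, n)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

original = "Large language models are trained on vast amounts of existing text."
suspect = "Large language models are trained on vast amounts of copyrighted text."

print(f"similarity: {similarity(original, suspect):.2f}")  # substantial overlap

A near-verbatim copy scores close to 1.0, while unrelated texts score near zero; false positives creep in because common phrases overlap even between honest documents.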
Transparency will be key. Interestingly, Pulitzer Prize entrants this year were asked to disclose any use of AI in the journalism award categories. Although the winners will be announced only in May, five of the 45 finalists in this year's Pulitzer Prizes for journalism used AI in the process of researching, reporting, or telling their submissions, according to a report quoting Pulitzer administrator Marjorie Miller. The idea here is not to shun or restrict the use of LLMs but to ensure disclosure of their use.
The concerns about LLMs go beyond mere plagiarism or copyright infringement. These models also exhibit inherent biases. In February this year, Google earned the wrath of the Indian government when Gemini, its LLM positioned as a competitor to OpenAI's GPT-4, offered preposterous responses to a question about Prime Minister Narendra Modi. Open, similarly, tried to get information out of Ola's Krutrim AI chatbot, which has made news for its inaccurate responses, including stating that Hillary Clinton had won the 2014 US presidential election. To the prompt about who the prime minister of India is, it simply gave this response: "I'm sorry, but my current knowledge is limited on this topic. I'm constantly learning, and I appreciate your understanding. If there is another question or topic you would like assistance with, feel free to ask!" The chatbot did, however, give a satisfactory answer to the prompt: Who is the president of the US?
When not occasionally generating amusement of this sort, LLMs, like other AI products, are caught in a war of words over ethical concerns, business rivalry, and environmental costs. Altman and Elon Musk are sparring over business and over what AI means for humanity. But the most worrying aspect is the environmental cost of generative AI products like ChatGPT. In its latest environmental report, Microsoft said its global water consumption rose 34 per cent from 2021 to 2022, a sharp rise that experts link to AI research. Google also reported a 20 per cent growth in water usage during the same period, according to an Associated Press report. The Yale School of the Environment published a report saying that "generative artificial intelligence uses massive amounts of energy for computation and data storage and millions of gallons of water to cool the equipment at data centers", and some scholars expect social tensions to rise because of water scarcity. Meanwhile, AI companies have promised to look at more energy-efficient ways of developing LLMs.
Yet among these myriad concerns, plagiarism remains the most talked about, perhaps for obvious reasons.