
IN THE HALLS OF Parliament, where microphones crackle, dialects collide, and debates are often unruly, the act of transcription can feel Sisyphean. For years, the official record has struggled to keep pace. But when BharatGen’s automatic speech recognition was integrated into the Sansad TV system, it resulted in a 30 per cent relative reduction in word error rate, better handling of dialects, and fewer phrases lost in the melee. It was, in some ways, proof of concept for BharatGen. If it could build a system that survived the chaos of Parliament, then perhaps its voice AI solutions could decipher grievance calls made amidst kitchen clatter and read between the silences of a rural health survey where female interviewees barely speak above a whisper.
That is the story of BharatGen, a government-backed, academically driven effort incorporated as a non-profit at the Indian Institute of Technology Bombay (IIT Bombay) and run by a consortium of nine institutions—IIT Bombay, IIT Madras, IIIT Hyderabad, IIT Kanpur, IIT Hyderabad, IIT Mandi, IIM Indore, IIIT Delhi, and IIT Kharagpur—that is building India’s national sovereign AI ecosystem and the first government-funded Multimodal Large Language Model (MLLM). It aims to deliver models that are accurate, affordable, and deployable at scale, seeding an ecosystem that startups, system integrators, and government departments can adapt for real-world use.
“What UPI did for payments, voice AI must do for services—work everywhere, for everyone, in every Indian language,” says Ganesh Ramakrishnan, principal investigator at BharatGen and a professor in the Department of Computer Science and Engineering at IIT Bombay. BharatGen operates as a professional company with returning diaspora leadership—executive vice president Rishi Bal, who joined after stints at Microsoft and Google Research, and Maneesh Singh, a senior AI researcher with over two decades of experience in the US—alongside academic leads. The effort is funded at a scale that is meaningful, even if not yet comparable to that of the US or China: ₹235 crore from the Department of Science and Technology, and another ₹988.6 crore in September 2025 from the Ministry of Electronics and Information Technology through the IndiaAI Mission, making BharatGen the largest single beneficiary of the Union government’s ₹1,500 crore allocation in this year’s Budget.
31 Oct 2025 - Vol 04 | Issue 45
Earlier this year, BharatGen launched Param-1, a prototype on the way to multilingual and multimodal models across 22 scheduled languages. With 2.9 billion parameters, the bilingual model is trained on five trillion tokens of English and Hindi—small by global standards, but a visible stake in the ground. Param-1 serves as the foundation for a range of downstream applications and smaller models, positioning BharatGen not simply as a vendor of AI services but as the national platform upon which voice and text intelligence can be built, adapted and scaled for Indian languages and contexts.
BharatGen employs a two-tier strategy—first train powerful teacher models on broad corpora, then distil these into leaner ‘student’ variants for various domains that startups, banks and government bodies can deploy with low latency and cost. By releasing models openly and offering tools for fine-tuning them, BharatGen hopes to catalyse an ecosystem of developers who do more than consume AI—they contribute to it, adapt it and publish it. The idea is to provide a layered portfolio where a user might pick a compact model for a bank’s chat agent, or engage the full context-rich stack for a multilingual call-centre. “Small models distilled from large ones are always more effective than small models trained from scratch. We want to empower users so that they are not starting from scratch but from the shoulders of a model already trained on the linguistic diversity of India,” says Ramakrishnan.
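The teacher-student recipe Ramakrishnan describes has a standard mathematical core: the small model is trained to match the softened predictions of the large one rather than raw labels. Below is a minimal NumPy sketch of that distillation objective; the loss form, temperature, and numbers are illustrative, and BharatGen's actual training setup is not public.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    # Temperature-scaled softmax: higher T yields a softer distribution,
    # exposing the teacher's "dark knowledge" about non-top classes.
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # KL divergence from the teacher's softened distribution to the
    # student's: the core objective of knowledge distillation.
    p_t = softmax(teacher_logits, temperature)
    p_s = softmax(student_logits, temperature)
    kl = (p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12))).sum(axis=-1)
    # Scale by T^2 so gradient magnitudes stay comparable across temperatures.
    return float((temperature ** 2) * kl.mean())

teacher = np.array([[4.0, 1.0, 0.5]])
# A student that matches the teacher exactly incurs (near) zero loss;
# an uninformed student is penalised.
assert distillation_loss(teacher.copy(), teacher) < 1e-9
assert distillation_loss(np.zeros((1, 3)), teacher) > 0.0
```

A student minimising this loss starts, as Ramakrishnan puts it, "from the shoulders" of the teacher rather than from scratch.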
The use cases that have found traction so far are not the glamorous ones. They are grievance redress calls, where citizens complain in half-sentences and overlapping speech; healthcare, where discharge notes and prescriptions must be turned into something a patient can actually read; education, where lectures need live transcriptions and later revisions. Ramakrishnan frames the use cases of AI as a cycle—pre, during, post—where pre means preparing material, during is live support, and post entails follow-up. So far, governance has gravitated to the “during” phase, education to “during” and “post”, and healthcare to “pre” and “post”, he says.
At the edge of BharatGen’s endeavour lies Decile, an open-source toolkit built for the most common limitation in AI—not having enough data. Its premise is deceptively simple: Why drown models in oceans of redundant examples when you can teach them with carefully chosen ones? The aim is to sift, select, and sequence data so that each sample carries weight, allowing models to learn faster with fewer labels on slimmer budgets of compute. In the Indian context, where annotated corpora in certain languages are rare and expensive, this allows BharatGen to turn fragments of speech scattered across dialects and geographies into a coherent training ground. What BharatGen is trying to show is that intelligence emerges as much from curation as from brute force, and that sovereignty in AI will be won not by excess but by efficiency.
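The flavour of such selection can be conveyed in a toy example. The greedy "facility location" routine below repeatedly picks whichever sample best improves how well the chosen subset covers the full dataset; it is a generic stand-in for the submodular selection methods that toolkits like Decile implement, not Decile's actual API, and the similarity measure and data are invented for illustration.

```python
import numpy as np

def select_subset(features, k):
    # Greedy facility-location selection: at each step, add the sample
    # that most improves the subset's "coverage" of the whole dataset,
    # measured here by cosine similarity.
    normed = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = normed @ normed.T            # pairwise cosine similarities
    coverage = np.zeros(len(features)) # each point's best similarity to the subset
    chosen = []
    for _ in range(k):
        # Marginal gain of adding each candidate to the subset.
        gains = np.maximum(sim, coverage).sum(axis=1) - coverage.sum()
        gains[chosen] = -np.inf        # never pick the same sample twice
        best = int(np.argmax(gains))
        chosen.append(best)
        coverage = np.maximum(coverage, sim[best])
    return chosen

# Two near-duplicate samples and one outlier: with a budget of two,
# the greedy picker keeps one of the duplicates and the outlier,
# rather than wasting the budget on redundancy.
feats = np.array([[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]])
picks = select_subset(feats, 2)
assert 2 in picks
```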
A NEW PAPER BY the BharatGen team proposes a framework to improve text classification when inputs are incomplete or underspecified, as often happens in real-world tasks like symptom reporting or customer complaints. Instead of forcing a model to guess from fragments, “GUIDEQ” trains a classifier to identify top candidate labels, extracts keywords tied to those labels, and then uses a large language model (LLM) to generate targeted follow-up questions that elicit the missing details. Those answers are added back into the input, and the classifier makes a final prediction. Tested across six datasets, this guided questioning approach consistently outperformed both partial-input baselines and naïve question generation, with gains of over 20 percentage points in some cases. The work shows that strategically asking for the right information can dramatically improve classification, and it suggests a pathway for building AI systems that operate effectively in settings where users rarely provide clean or complete data.
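The shape of that loop—narrow to top candidate labels, surface label-specific keywords, ask a targeted question, re-classify with the answer folded in—can be seen in a toy walkthrough. Everything below is an illustrative stand-in: the keyword table, the question template, and the keyword-counting classifier; in the paper, an LLM generates the follow-up questions.

```python
# Hypothetical label -> cue-word table for a customer-complaint desk.
LABEL_KEYWORDS = {
    "billing": ["refund", "charge", "invoice"],
    "outage": ["down", "offline", "connection"],
}

def classify(text):
    # Toy scorer: count keyword hits per label.
    words = text.lower().split()
    return {label: sum(w in words for w in kws)
            for label, kws in LABEL_KEYWORDS.items()}

def follow_up_question(text, top_k=2):
    # Take the top candidate labels and fold their keywords into one
    # targeted question (the step GUIDEQ delegates to an LLM).
    scores = classify(text)
    top = sorted(scores, key=scores.get, reverse=True)[:top_k]
    cues = ", ".join(kw for label in top for kw in LABEL_KEYWORDS[label])
    return f"Does your issue involve any of the following: {cues}?"

def classify_with_answer(text, answer):
    # Append the elicited answer and take the best-scoring label.
    scores = classify(text + " " + answer)
    return max(scores, key=scores.get)

complaint = "My internet stopped working"        # underspecified input
question = follow_up_question(complaint)         # targeted elicitation
final = classify_with_answer(complaint, "yes, it has been down since morning")
```

The underspecified complaint scores zero against every label; only after the elicited answer supplies the word "down" does the classifier land on "outage"—a miniature version of the gains the paper reports.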
The BharatGen team has assembled roughly 13,000 hours of Indian-language speech and built an explicit quality-control regimen, enforcing sampling-rate consistency, flagging “silence-heavy” clips, and catching metadata errors such as the same speaker being mislabelled as different people. Collection of this data is distributed across five independent vendors—“to avoid both monopoly and monocultures”—and accompanied by capacity building so those vendors learn to apply the same checks themselves. The philosophy is to curate for three things at once: representation (cover the breadth of languages and sub-varieties, especially under-represented ones like Santali or Mundari), diversity (capture variation within populous languages such as Hindi), and targeting (if a client needs Assamese-Malayalam-English robustness, select data that best spans that triangle).
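The first two of those checks—sampling-rate consistency and "silence-heavy" flagging—are simple enough to sketch. The thresholds below (16 kHz target rate, an RMS energy floor, a 60 per cent silence cut-off) are illustrative assumptions, not BharatGen's published numbers.

```python
import numpy as np

EXPECTED_RATE = 16_000       # assumed target sampling rate for the corpus
SILENCE_RMS = 0.01           # assumed per-frame energy floor
MAX_SILENCE_FRACTION = 0.6   # flag clips that are mostly silence

def qc_clip(samples, rate, frame=1600):
    # Run two of the checks described above on a single clip and
    # return the list of issues found (empty means the clip passes).
    issues = []
    if rate != EXPECTED_RATE:
        issues.append("sample_rate_mismatch")
    # Fraction of fixed-size frames whose RMS energy falls below the floor.
    frames = [samples[i:i + frame] for i in range(0, len(samples), frame)]
    rms = [float(np.sqrt(np.mean(f ** 2))) for f in frames if len(f)]
    silent = sum(r < SILENCE_RMS for r in rms) / max(len(rms), 1)
    if silent > MAX_SILENCE_FRACTION:
        issues.append("silence_heavy")
    return issues

# A one-second sine tone at the right rate passes; a silent clip is flagged.
tone = 0.5 * np.sin(np.linspace(0, 100, 16_000))
assert qc_clip(tone, 16_000) == []
assert "silence_heavy" in qc_clip(np.zeros(16_000), 16_000)
```

The metadata checks—catching the same speaker filed under two identities—would, in practice, require comparing speaker embeddings across clips, which a sketch this size cannot do honestly.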
Ramakrishnan says India’s diversity is not an obstacle but a design constraint, one that demands script-aware training and careful bootstrapping across language families. In a curious way, it is also an advantage: the subject–object–verb structure of most Indian languages, the continuum of dialects from, say, Delhi to Gorakhpur, the phonetic overlaps that link Hindi and Marathi, all offer models a kind of continuity that English lacks. “Many Western text-to-speech models assume the availability of phoneme duration annotations—labels that specify how long each sound is held during training. For Indian languages, such data is prohibitively expensive to obtain. Instead, we have been experimenting with ways to bypass this requirement by exploiting the phonetic continuity and prosodic patterns that stretch across Indian tongues,” he says.
At the level of models, BharatGen does not prescribe a single architecture, instead maintaining a portfolio calibrated to trade-offs. “There is no free lunch,” Ramakrishnan likes to remind: what you gain in fidelity you may lose in speed or efficiency. For everyday uses such as grievance redressals, a compact 120-million-parameter Automatic Speech Recognition (ASR) model, boosted with Language Model (LM) rescoring, is sufficient and light enough for low-power devices. But more demanding, context-rich tasks—parsing parliamentary transcripts, disambiguating accents, detecting intent—require a speech-to-LLM pipeline in which a mixture-of-experts projector feeds into models running into the billions of parameters. Text-to-Speech (TTS), likewise, comes in tiers: a 16-million-parameter, speaker-conditioned model that can clone a voice from a 10-second sample, and a 150-million-parameter transcript-conditioned model that captures rhythm and prosody more faithfully, with the option of adding LLM guidance to infer tone or emotion.
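LM rescoring, the trick that lets a compact ASR model punch above its weight, amounts to re-ranking the recogniser's candidate transcripts with a language model's opinion of their fluency. The sketch below uses an invented word-frequency "language model" and made-up scores; it shows the mechanism, not BharatGen's models.

```python
LM_WEIGHT = 0.5  # illustrative interpolation weight

def toy_lm_score(text):
    # Stand-in language model: reward sentences containing common
    # function words. A real system would use a trained LM's log-probability.
    common = {"the", "is", "in", "a"}
    words = text.lower().split()
    return sum(w in common for w in words) / max(len(words), 1)

def rescore(nbest):
    # nbest: list of (transcript, acoustic_score) pairs from the ASR model.
    # Pick the hypothesis with the best combined acoustic + LM score.
    return max(nbest, key=lambda h: h[1] + LM_WEIGHT * toy_lm_score(h[0]))[0]

# The acoustically favoured hypothesis is ungrammatical; the LM tips
# the balance towards the fluent one.
hyps = [("the house is red", 0.90), ("the mouse his red", 0.92)]
assert rescore(hyps) == "the house is red"
```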
During the course of our conversation, Ramakrishnan lingers on the anatomy of the machine, sketching out its organs. At the top sits the LLM, the voracious brain that eats up most of the electricity and silicon, billions of parameters humming in chorus. Beneath it lies the speech encoder, the lungs and vocal cords, heavy in their own way, converting the chaos of sound into something the brain can read. And then, like a narrow bridge, the projector links one to the other—lighter, cheaper, a sliver of code compared to the weight above it.
The elegance is in the decoupling. You don’t have to buy the whole cathedral if all you need is the chapel. A small company can borrow the national brain and train only its bridge; a university lab can tinker with the encoder without touching the rest. Some may bypass the brain entirely and make do with a compact recogniser or a voice clone. In this layered design, computation is a menu. The choice of module becomes a kind of sovereignty in itself, letting India’s AI builders pick the scale they can afford, without giving up on ambition.
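That decoupling has a concrete software shape: the heavy encoder and LLM stay frozen, and only the thin projector, a small map from encoder features into the LLM's embedding space, needs training. The sketch below uses a single linear layer and toy dimensions (BharatGen's projector is a mixture-of-experts, and real dimensions differ); the point is the parameter arithmetic, not the architecture.

```python
import numpy as np

# Toy dimensions echoing the article's anatomy: encoder features on one
# side, the LLM's embedding space on the other. Both are illustrative.
ENC_DIM, LLM_DIM = 512, 1024

class Projector:
    """The thin, trainable bridge: encoder features -> LLM embeddings."""
    def __init__(self, rng):
        self.W = rng.standard_normal((ENC_DIM, LLM_DIM)) * 0.01
    def __call__(self, feats):
        # (frames, ENC_DIM) -> (frames, LLM_DIM), ready for the LLM's input.
        return feats @ self.W
    def num_params(self):
        return self.W.size

rng = np.random.default_rng(0)
bridge = Projector(rng)
frames = rng.standard_normal((10, ENC_DIM))  # stand-in encoder output
embeddings = bridge(frames)
assert embeddings.shape == (10, LLM_DIM)
# Half a million parameters to train, against the billions frozen on
# either side of the bridge.
assert bridge.num_params() == ENC_DIM * LLM_DIM
```

A startup fine-tuning only this bridge trains roughly half a million weights while borrowing billions; that asymmetry is what makes the "menu" of computation affordable.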
BharatGen is still at the beginning of its arc, yet its wager is clear: India cannot afford to be a latecomer in voice and multimodal AI, and the only way to catch up is to build a sovereign ecosystem that is open, shared, and relentlessly practical.