Introducing Sarvam-1 for Indian Languages

Transform your hiring with Flipped.ai – the hiring Co-Pilot that's 100X faster. Automate hiring, from job posts to candidate matches, using our Generative AI platform. Get your free Hiring Co-Pilot.

Flipped.ai’s weekly newsletter is read by more than 75,000 professionals, entrepreneurs, decision makers, and investors around the world.

In this newsletter, we spotlight Sarvam AI Datalabs’ new Sarvam-1, a 2-billion-parameter model optimized for 10 Indian languages, including Hindi and Tamil. Designed to address token inefficiency and data quality, Sarvam-1 enhances AI processing for Indian languages, making it faster and more accessible.

Before we dive into the newsletter, check out our sponsor for this edition.

Remember the days of paper lists and mall marathons? Now it's Amazon Prime time. While free shipping and exclusive shows are a given, these 10 hidden perks can enhance your membership in unexpected ways.

Sarvam AI introduced Sarvam-1: India's first Indic language LLM

The rapid evolution of artificial intelligence (AI) is reshaping numerous industries, and large language models (LLMs) are at the forefront of this transformation. For Indian AI startup Sarvam AI, their newly launched Indic language model, Sarvam-1, promises to bridge a long-standing gap in AI accessibility for millions of Indian language speakers. Built with 2 billion parameters, Sarvam-1 offers advanced support across 10 major Indian languages and English. This milestone marks an important shift, not only in multilingual AI but in creating AI solutions designed for cultural and linguistic diversity.

Sarvam-1 goes beyond existing LLMs, boasting unique capabilities like optimized token efficiency for Indian languages and substantial advancements in training data quality through their proprietary corpus, Sarvam-2T. Unlike traditional language models that prioritize English and high-resource languages, Sarvam-1 prioritizes Indian languages without compromising performance. In this article, we delve into the technical aspects, performance metrics, unique features, and real-world applications of Sarvam-1, examining how it compares to competitors like Llama-3.2 and Gemma-2.

Background: The need for an Indic language LLM

LLMs have historically been trained with an English-centric approach, leaving Indian languages underrepresented. Despite a rising number of bilingual speakers, millions of Indians communicate primarily in regional languages, creating a digital divide in AI-driven technologies. While there are ongoing efforts to incorporate Indic languages into existing LLMs, such adaptations often fall short of addressing the unique needs of Indian languages, including script complexities, grammatical structures, and unique cultural contexts.

For instance, existing multilingual models like Llama-3 and Gemma-2 exhibit high token fertility (the number of tokens required to represent each word) when processing Indian languages, resulting in inefficient language processing. Sarvam-1 directly addresses these inefficiencies, marking a pioneering approach to building an Indic-focused model that offers both accuracy and computational efficiency.

Technical overview: Sarvam-1's architecture and training

Sarvam-1’s development, funded and supported by Sarvam AI, is rooted in innovative architectural decisions that optimize it for multilingual support without compromising processing speed or accuracy. This section outlines the technical features that set Sarvam-1 apart from other models:

Model specifications
Sarvam-1 operates on a 2-billion-parameter framework, which, while smaller than competitors like Llama-3’s 8 billion parameters, delivers competitive performance through efficient architecture. Sarvam-1’s “deeper and thinner” configuration uses a hidden size of 2048, an intermediate size of 11,008, and 28 hidden layers. This structure facilitates cross-lingual tasks, such as question answering and summarization, across Indian languages.
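The published specifications can be summarized in a small configuration sketch. The field names below mimic common Hugging Face-style decoder configs and are illustrative assumptions, not Sarvam AI's official configuration format:

```python
# Hypothetical config sketch for Sarvam-1, based only on the numbers quoted
# above (~2B parameters, "deeper and thinner" layout). Field names follow
# HF-style conventions and are assumptions, not an official API.
sarvam1_config = {
    "num_parameters": 2_000_000_000,  # ~2 billion total parameters
    "hidden_size": 2048,              # width of each transformer layer
    "intermediate_size": 11008,       # feed-forward (MLP) expansion size
    "num_hidden_layers": 28,          # depth: the "deeper" part of the design
}

# The MLP expansion ratio implied by the quoted sizes (~5.4x) is unusually
# wide relative to the narrow hidden size, trading width for depth.
expansion_ratio = sarvam1_config["intermediate_size"] / sarvam1_config["hidden_size"]
print(round(expansion_ratio, 2))  # 5.38
```

A narrower hidden size with more layers keeps the parameter count small while preserving representational depth, which is consistent with the cross-lingual focus described above.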

Efficient tokenization
A standout feature of Sarvam-1 is its custom tokenizer, designed for Indian scripts. Tokenization is critical for language models because it breaks text down into units the model can interpret. Sarvam-1’s tokenizer achieves a fertility rate of 1.4–2.1 tokens per word, making it notably more efficient than models like Llama-3.1, which often require up to 8 tokens per word for Indic scripts. This lower fertility rate means that Sarvam-1 can process text more rapidly and with fewer computational resources, making it viable for deployment on devices with limited computational capacity.
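Token fertility, as defined earlier, is easy to measure for any tokenizer. The sketch below assumes a generic `tokenize` callable; the toy chunk-based tokenizer is a purely illustrative stand-in, not Sarvam-1's actual tokenizer:

```python
def token_fertility(tokenize, texts):
    """Average number of tokens per whitespace-separated word."""
    total_tokens = sum(len(tokenize(t)) for t in texts)
    total_words = sum(len(t.split()) for t in texts)
    return total_tokens / total_words

# Stand-in tokenizer: splits on whitespace, then breaks each word into
# 2-character chunks (real tokenizers use learned subword merges instead).
def toy_tokenize(text):
    return [w[i:i + 2] for w in text.split() for i in range(0, len(w), 2)]

# "namaste" -> 4 chunks, "duniya" -> 3 chunks; 7 tokens over 2 words.
print(token_fertility(toy_tokenize, ["namaste duniya"]))  # 3.5
```

Plugging in a real tokenizer (e.g. one loaded via a tokenization library) and a representative Indic-language corpus would reproduce the kind of fertility comparison quoted above.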

Training Corpus: Sarvam-2T
Sarvam-1’s effectiveness is further enhanced by Sarvam-2T, an Indic-focused training dataset comprising approximately 2 trillion tokens. Unlike web-crawled Indic datasets like Sangraha, Sarvam-2T includes curated, high-quality content distributed across 10 Indian languages and English, with Hindi making up 20% of the total data. The dataset spans technical, scientific, and general knowledge content, allowing Sarvam-1 to handle complex reasoning and domain-specific tasks better than previous models.

Benchmark performance: How Sarvam-1 stacks up against competitors

Sarvam-1 is open-source and supports ten Indian languages. Source: Sarvam.ai

Sarvam-1 has demonstrated impressive results across several benchmarks, achieving strong scores on the TriviaQA and IndicGenBench benchmarks. Here’s an analysis of its performance metrics and comparisons to other models:

Indic language benchmarking
Sarvam-1 achieved an accuracy of 86.11 on the TriviaQA benchmark, significantly surpassing Llama-3.1’s score of 61.47 on Indic languages. TriviaQA measures knowledge retrieval, demonstrating Sarvam-1’s strong comprehension and accuracy in Indian languages.

IndicGenBench results
On the IndicGenBench, which includes cross-lingual tasks such as translation and summarization, Sarvam-1 consistently outperformed Llama-3.2 3B and Gemma-2 in translation tasks, scoring an average chrF++ score of 46.81 on the Flores dataset. Sarvam-1’s ability to perform well across languages within a relatively small parameter count shows the importance of curated training data and efficient model design.
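chrF++ scores like the 46.81 quoted above are typically computed with the sacrebleu library. To show the underlying idea, here is a simplified character n-gram F-score (plain chrF, without the word n-grams that chrF++ adds), written from the metric's definition:

```python
from collections import Counter

def char_ngrams(text, n):
    """Character n-gram counts, ignoring whitespace (as chrF does)."""
    text = text.replace(" ", "")
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified chrF: F-beta over averaged char n-gram precision/recall."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if sum(hyp.values()) == 0 or sum(ref.values()) == 0:
            continue
        overlap = sum((hyp & ref).values())
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    # beta=2 weights recall twice as heavily as precision, per the chrF paper.
    return 100 * (1 + beta**2) * p * r / (beta**2 * p + r)

print(chrf("hello world", "hello world"))  # 100.0 (perfect match)
```

Because it matches character n-grams rather than whole words, chrF is well suited to morphologically rich Indic languages, where near-miss word forms still share most of their characters.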

Inference speed
Sarvam-1’s computational efficiency further distinguishes it from its peers. The model performs 4 to 6 times faster than larger models like Llama-3.1 8B, offering a balance of speed and accuracy that is vital for practical applications, especially on edge devices with limited resources.

Innovations in data quality and token efficiency

Data quality is critical in training LLMs, particularly for low-resource languages. Sarvam-2T sets a new standard by curating high-quality, balanced content for Indian languages. Here’s how this innovation impacts Sarvam-1’s performance:

Comprehensive dataset composition
Sarvam-2T includes long-form documents, technical content, and scientific material, enhancing Sarvam-1’s capacity to understand context and complex ideas. The dataset contains roughly eight times more scientific and technical content than comparable datasets, making Sarvam-1 well suited for applications requiring detailed comprehension.

Token efficiency: A key to better performance

Comparison of tokenizer fertility between Sarvam-1 and other popular LLMs. Source: Sarvam.ai

The Sarvam-1 tokenizer’s reduced fertility rate enhances efficiency, allowing for more streamlined text processing. This optimization means that Indian languages, which often have longer words and more complex scripts, can be processed with fewer tokens, reducing the computational burden and making Sarvam-1 a practical choice for resource-constrained settings.

Real-world applications and societal impact

Sarvam-1’s design and capabilities open up numerous practical applications for Indian users and enterprises looking to engage in their native languages. Here are some of the areas where Sarvam-1 could have significant impact:

Public services and e-governance
Sarvam-1’s multilingual capabilities make it an ideal tool for public service applications, where communication in regional languages is essential. The model can assist with translation, content generation, and summarization, enabling government agencies to provide better service across linguistic barriers.

Education and knowledge access
In educational contexts, Sarvam-1 offers considerable value by providing accurate translations, summaries, and explanations in regional languages. It can enhance learning resources, making technical and scientific knowledge more accessible to students who prefer or require instruction in their native languages.

Healthcare
In healthcare, where effective communication is crucial, Sarvam-1 can assist with translation and interpretation for patients and professionals. This can improve patient outcomes by ensuring that medical advice and instructions are delivered clearly in a language patients understand.

Business and e-commerce
For Indian businesses aiming to reach wider audiences, Sarvam-1 can provide multilingual customer support, personalized marketing content, and interactive AI assistants. The model’s efficiency means that it can be used in high-traffic applications, such as chatbots and virtual assistants, without excessive resource demands.

Comparing with industry giants: Llama-3 and Gemma-2

Comparison of document length and quality: Sarvam-2T vs. Sangraha. Source: Sarvam.ai

While other LLMs like Meta’s Llama-3 and the open-source model Gemma-2 are advancing in multilingual AI, they fall short in certain areas that Sarvam-1 has optimized:

Parameter efficiency
Both Llama-3 and Gemma-2 are larger models that require more resources. Despite its smaller size, Sarvam-1’s targeted parameter distribution and optimized tokenization deliver comparable or superior results on key benchmarks, particularly in Indian languages.

Cost and accessibility
The large-scale models that lead global benchmarks often demand significant computational power, making them inaccessible for many applications within India. Sarvam-1’s efficiency, however, enables high performance on more modest hardware, making it accessible for small businesses, startups, and public sector applications.

Indic-centric design
Most international models are not inherently designed for Indian languages, often relying on web-crawled data that lacks linguistic nuance. Sarvam-1’s ground-up design for Indic languages allows it to capture these nuances, providing a more accurate and culturally relevant experience.

Translation accuracy vs. Inference time. Source: Sarvam.ai

The future of Indic LLMs: Prospects for Sarvam-1 and beyond

Sarvam-1’s success underscores the potential of Indic-centric AI. The launch of Sarvam-1 and its corpus, Sarvam-2T, marks a promising shift toward more inclusive and accessible AI models in India. Here’s what lies ahead:

Scaling the model
Future iterations of Sarvam-1 could involve larger parameter sizes and additional languages. This expansion would extend the model’s reach and enhance its capability to handle more complex and domain-specific tasks.

Improving data diversity
Sarvam AI may further enhance Sarvam-2T with more nuanced datasets, encompassing rural dialects, traditional knowledge, and niche domains like legal or agricultural content. This would deepen the model’s relevance for India’s diverse population.

AI policy and ethics
As Sarvam-1 gains traction, Sarvam AI and other stakeholders will need to address ethical concerns regarding data usage, bias, and AI safety. Collaborative efforts with government and academia could foster responsible development and deployment practices.

Conclusion

Sarvam-1 is a groundbreaking language model that exemplifies India's growing expertise in AI and natural language processing. Its release is a significant milestone for making AI accessible to millions of Indian language speakers and is a step forward in creating models that can support India’s linguistic and cultural diversity. With strong performance on key benchmarks, innovative tokenization techniques, and a high-quality dataset, Sarvam-1 is poised to reshape the landscape for Indic languages in AI. As AI adoption grows in India, Sarvam-1’s advancements promise to unlock new possibilities for public services, education, healthcare, and beyond, showcasing a vision of AI that is inclusive, efficient, and attuned to India's unique needs.

Get exclusive strategies, updates, and trends delivered right to your inbox. Join this newsletter today and stay informed!

Sponsored
Forests Over Trees: A Tech Strategy Newsletter
Hire top-quality Indian tech talent with SourceTalent.ai by Flipped.ai

Need to build a world-class tech team in India without breaking the bank? SourceTalent.ai offers an AI-powered, cost-effective hiring solution just for you!

Key Benefits:

  • Instant access: Tap into a vast pool of 24M+ Indian candidates with personalized recommendations.

  • AI-powered matching: Our advanced algorithms connect you with candidates that perfectly fit your job requirements.

  • Automated hiring: Simplify the process with AI-driven job descriptions, candidate screening, and tailored recommendations.

  • Seamless video interviews: Conduct unlimited interviews effortlessly and gain valuable insights.

  • Affordable excellence: Prices start at just Rs400 / $5 per job posting.

  • Top talent pool: Access a diverse selection of India’s best tech professionals.

  • Efficient hiring process: Enjoy a streamlined recruitment process with video assessments.

  • Global reach: US companies can also leverage India’s premier tech talent!

Get started today at SourceTalent.ai and take advantage of our exclusive launch offer: [Link]

For more information, reach out to us at [email protected].

Experience smarter, faster, and more affordable hiring with SourceTalent.ai!

Want to get your product in front of 75,000+ professionals, entrepreneurs, decision makers, and investors around the world? 🚀

If you are interested in sponsoring, contact us at [email protected].

Thank you for being part of our community, and we look forward to continuing this journey of growth and innovation together!

Best regards,

Flipped.ai Editorial Team