The Future of Indian Languages in the AI Era!

Recommended for Middle Grades

Hey there, young explorer! Have you ever wondered how Indian languages are doing in the age of artificial intelligence (AI) and language models? It’s quite an interesting topic! Let’s dive in and discover some important details together.

Now, you may have heard people talking about Sanskrit and its connection to computers and AI. The chairman of the Indian Space Research Organisation recently said that Sanskrit is a great language for learning about AI. But here’s the catch: no evidence or explanation has been offered to support this claim. So, it’s a bit of a mystery!

But let’s move beyond Sanskrit and explore how other Indian languages are faring in the world of AI. You see, AI has taken the world by storm with its language-based applications. It’s both exciting and challenging for Indian languages.

The situation is a mixed bag. On one hand, Indian languages face a kind of passive discrimination: AI tools tend to handle them less efficiently than English. On the other hand, people are working hard to research and innovate, supporting the growth of these languages in the realm of AI.

Indian Languages in AI: Opportunities and Challenges

  • To understand how AI works with languages, we need to learn about tokenization. When a machine works with a language, it breaks down sentences or words into smaller bits. These bits, called tokens, are what the machine processes. For example, the sentence “there’s a star” can be tokenized into “there,” “’s” (short for “is”), “a,” and “star.”
  • There are different ways to tokenize languages. One technique is called a treebank tokenizer, which follows the rules that linguists use to study words and sentences. Another technique is a subword tokenizer, which splits words into smaller pieces that show up often, so the model can learn a common word and its endings separately. It’s like learning “dust” once and then the different forms built on it, such as “dusty,” “dustier,” and “dustiest.”
  • OpenAI, the company behind ChatGPT, uses a subword tokenizer called byte-pair encoding (BPE). BPE builds its tokens by repeatedly joining the pairs of pieces that appear next to each other most often in the training text. This lets a single tokenizer handle many different languages.
  • Now, let’s talk about a fascinating study by an AI researcher. The researcher used a large dataset called MASSIVE. It contains utterances (simple queries or phrases) in 52 languages. This includes Indian languages like Hindi, Urdu, and Bengali. The researcher discovered that Hindi phrases were broken into more tokens than the matching English phrases. Since running a model on more tokens takes more computing power, the same message can cost more in Hindi than in English.
  • Furthermore, AI models like GPT and ChatGPT can handle only a fixed number of input tokens at a time. Because text in languages like Hindi, Bengali, and Tamil uses up more tokens than the same text in English, less of it fits within that limit. To ensure fair language processing across diverse communities, understanding these tokenization differences is essential.
  • But don’t worry! ChatGPT doesn’t struggle with languages other than English. GPT-4, the powerful model behind ChatGPT, has been trained on text in many languages from around the world. It can switch between languages fluently. However, there is a cost difference, as running the model with more tokens increases the operational cost.
  • Now, when it comes to training AI models, having a large amount of training data is crucial. Unfortunately, the training data available for Indian languages is much smaller than what is available for English. ChatGPT was trained using text from the internet, and a significant portion of online content is in English.
  • But there’s hope! Groups like AI4Bharat, an initiative by IIT Madras, are actively working on building language AI for Indian languages. They create datasets, models, and applications specifically designed for Indian languages. They aim to bridge the gap and bring innovation to Indian languages using AI.
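
Want to try this yourself? Here is a small Python sketch of the tokenization ideas above. It is only a toy, not the tokenizer that ChatGPT really uses. The function names (learn_merges, merge_pair, tokenize) are made up for this example, and the Hindi phrase is our own sample (roughly “there is a star”), not one taken from the MASSIVE study. Part 1 learns a few BPE-style merges from the “dusty” word family. Part 2 counts UTF-8 bytes, assuming a tokenizer that starts from single bytes before merging them: each Devanagari letter takes 3 bytes while each English letter takes 1, so Hindi text starts out with more pieces.

```python
from collections import Counter


def merge_pair(pieces, pair):
    """Join every neighbouring occurrence of `pair` into a single piece."""
    merged = []
    i = 0
    while i < len(pieces):
        if i + 1 < len(pieces) and (pieces[i], pieces[i + 1]) == pair:
            merged.append(pieces[i] + pieces[i + 1])
            i += 2
        else:
            merged.append(pieces[i])
            i += 1
    return merged


def learn_merges(words, num_merges):
    """Toy BPE training: repeatedly merge the most frequent neighbouring pair."""
    # Start with every word split into single characters.
    corpus = [list(word) for word in words]
    merges = []
    for _ in range(num_merges):
        # Count how often each pair of neighbouring pieces appears.
        pair_counts = Counter()
        for pieces in corpus:
            for left, right in zip(pieces, pieces[1:]):
                pair_counts[(left, right)] += 1
        if not pair_counts:
            break
        best_pair = max(pair_counts, key=pair_counts.get)
        merges.append(best_pair)
        corpus = [merge_pair(pieces, best_pair) for pieces in corpus]
    return merges


def tokenize(word, merges):
    """Split a word into characters, then apply the learned merges in order."""
    pieces = list(word)
    for pair in merges:
        pieces = merge_pair(pieces, pair)
    return pieces


# Part 1: learn three merges from a tiny "training text".
merges = learn_merges(["dusty", "dustier", "dustiest"], num_merges=3)
print(tokenize("dusty", merges))     # ['dust', 'y']
print(tokenize("dustiest", merges))  # ['dust', 'i', 'e', 'st']

# Part 2: compare how many bytes two short phrases need.
english = "there is a star"
hindi = "एक तारा है"  # illustrative sample phrase (an assumption, not study data)
for label, text in [("English", english), ("Hindi", hindi)]:
    byte_count = len(text.encode("utf-8"))
    print(label, "-", len(text), "characters,", byte_count, "bytes")
```

The Hindi phrase has fewer characters than the English one, yet it needs more bytes. Real tokenizers learn many thousands of merges from huge amounts of text, so their actual token counts will differ from these byte counts.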
