Artificial Intelligence means different things to different people. But it is all about making machines intelligent. From recommendation systems in e-commerce sites to translation systems from google to self driving cars run on some of AI either partially or entirely. For us, it is a means to understand language. Specifically machine learning methods for natural languages.
It can convert speech to text so that you can record your voice and send a text message or write an essay.
It can generate speech from the text so that you read your favorite books while traveling on a bus.
It can look at an image and convert it to text.
It can translate between a number of languages.
AI tries to solve various kinds of problems. What we are doing is just one piece of it, and it is called NLP.
Natural Language Processing is the field in AI that deals with problems arising in areas where systems have to process and/or respond to languages used by humans. Simply NLP is how we make computers understand language and interact with us in natural language.
But language is a complex beast. Nobody understands how we humans understand language. How can we make computers understand it when we do not know how we do?
To mitigate this paradox and make progress in baby steps, NLP researchers have come up with a list of simple tasks to solve before we solve language in general. Text classification, translation and named entity recognition are few examples.
Imagine a playground where children are playing a game, which you have never played or have seen. Just by observing them for a while, you can understand the rules of the game.
Machine learning research and its applications have become a vital part of our everyday life. These systems are not programmed the way software used to be. They are trained on data, they learn on their own by reading a large amount of data.
Existing datasets for Indian languages are not available for public or very tedious to obtain. They are behind the walled gardens of corporations and universities. Even the few who gains access to the dataset, are not allowed to share with fellow researchers. We aim to collect datasets that focus on regional indian languages and keep them in the public domain for everyone forever. There are hundreds of regional language and numerous dialects in India. Availability of datasets in the open domain encourages and helps you, I and everyone to learn and build AI applications. This will help everyone to focus on their actual work, easing access to data and prevent redundant data collection.
Our project, vaaku2vec model reads the news articles again and again and tries to make out the sense of the words. It does this not just by reading the words, but by understanding in which order they appear, which words appear together and other patterns. The rules of how words can be strung together. The rules of language are understood intuitively by humans, for example, the sentence
"I live in Tambaram and _______ to Adambakkam for work"
the blank can be filled completed in by anyone of using commute, bike, drive etc. Such a task would be herculian for normal systems/algorithms.