How to build your own AI-powered voice assistant

Chaitanya Kale
4 min read · Jan 29, 2021


Source: https://wallpapercave.com/jarvis-wallpapers

Back in 2008, I first met Jarvis; back then, for me, it was just an unrealistic sci-fi dream. I used to think, "Come on! We would need a Tony Stark to build it." But now, when I think of Google Assistant and Siri, that dream feels like it is coming true.

“Sometimes you gotta run before you can walk.”
~ Tony Stark (Iron Man)

Ever wondered how Google Assistant and Siri can speak with us almost like humans? This is the magic of deep learning.

So, without wasting time, let's jump directly into the topic.

Data Flow Diagram For Voice Assistant

The above diagram gives an overview of how data flows inside a voice assistant.
First, I will explain each process in depth, and at the end I will summarise the entire process with the help of an example.
So, let's begin our journey.

  • Speech to Text
    This is the process of converting speech to text, also called speech recognition. Through this process, the user's commands are converted into text, which the later stages use to extract information. Note that the speech-to-text model should be robust to variations in the user's accent and pronunciation.
  • Intent Classification
    Intent classification can be thought of as mapping a query given by a user to the action the voice assistant needs to perform for that query. For example, if the user query is ‘What is the trading news’, then the intent can be ‘get_news’.
    In intent classification we are essentially dealing with a text sequence. For that reason, RNNs, i.e. Recurrent Neural Networks, are well suited for this kind of work.
  • Entity Recognition
    In Entity Recognition, we try to extract the key terms present in the text. For example, consider the sentence ‘Send this mail to Vinay at 12:15 pm’. Here the entities are ‘Vinay’, which can be categorized as the name of a person, and ‘12:15 pm’, which comes under the category of time.
    To determine what kind of entity the current word is, we need to consider the words both before and after it in the sentence. For that purpose, we use a Bidirectional RNN for Entity Recognition.
  • Predict Response
    This part deals with the type of response that should be given to the user for the input query. The predicted response should be relevant to the user's command, and it is predicted based on the output of intent classification and entity recognition.
  • Text to Speech
    Text to speech is the conversion of text into audio, which acts as the output for the user. Before converting text to speech, care should be taken that the given text is free from ambiguity and repeated information. The generated speech should be clearly audible and understandable. Along with this, we also need to take care of the pronunciation of words and the pauses for punctuation.
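In production, intent classification is done by a trained model such as an RNN, but the query-to-intent mapping itself can be illustrated with a tiny rule-based sketch. The intent names and keyword lists below are invented for illustration only:

```python
# Toy intent classifier: maps a user query to an intent label.
# A real assistant trains an RNN on labelled queries; this keyword
# lookup only illustrates the query -> intent mapping.

INTENT_KEYWORDS = {
    "get_news": ["news", "headlines"],
    "send_mail": ["mail", "email"],
    "search": ["who is", "what is", "tell me about"],
}

def classify_intent(query: str) -> str:
    query = query.lower()
    for intent, keywords in INTENT_KEYWORDS.items():
        # Return the first intent whose keywords appear in the query.
        if any(kw in query for kw in keywords):
            return intent
    return "unknown"

print(classify_intent("What is the trading news"))  # -> get_news
print(classify_intent("Who is Shahrukh Khan"))      # -> search
```

A trained classifier generalises to phrasings it has never seen, which is exactly what this keyword lookup cannot do.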
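The entity recognition step can be sketched the same way. A real system uses a Bidirectional RNN so that words on both sides of a token inform its label; the regex patterns below are a hand-written stand-in that only handles the ‘Send this mail to Vinay at 12:15 pm’ shape of sentence:

```python
import re

# Toy entity recogniser: tags a time and a person-like name in a command.
# A Bidirectional RNN would label arbitrary sentences; these regexes
# only illustrate the kind of output such a model produces.

TIME_PATTERN = re.compile(r"\b\d{1,2}:\d{2}\s?(?:am|pm)\b", re.IGNORECASE)
NAME_PATTERN = re.compile(r"\bto\s+([A-Z][a-z]+)\b")

def extract_entities(text: str) -> dict:
    entities = {}
    time_match = TIME_PATTERN.search(text)
    if time_match:
        entities["time"] = time_match.group(0)
    name_match = NAME_PATTERN.search(text)
    if name_match:
        entities["person"] = name_match.group(1)
    return entities

print(extract_entities("Send this mail to Vinay at 12:15 pm"))
# {'time': '12:15 pm', 'person': 'Vinay'}
```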

To understand the coordination among each process and visualize the flow of data let's summarise the whole process with the help of an example.

Suppose you raise a query to the voice assistant: “Who is Shahrukh Khan?”. Before any processing can begin to understand your command, the first step is converting your voice, i.e. audio, into text; this is speech to text. After converting the speech to text, we perform intent classification and entity recognition on the generated text.

Intent classification can be thought of as mapping a query to the action the voice assistant needs to perform. In our case, the generated intent can be ‘Search’, which means we want to search for something.
The next step after intent classification is entity recognition. In this step, we find the entities in our sentence. In our example, ‘Shahrukh Khan’ is an entity that can be categorized as a person.
Now we know that our intent is ‘Search’ and our entity is ‘Shahrukh Khan’, so the voice assistant can figure out that we need to search for information about Shahrukh Khan and convey it to the user. This is what we do while predicting the response. In the final step, we convert our response from text to speech, and the audio is played to the user. This is how the entire process happens.
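The whole flow for the “Who is Shahrukh Khan?” example can be sketched end to end. The speech-to-text and text-to-speech stages are stubbed out with plain strings, and the one-entry knowledge table stands in for a real search backend; everything here is invented for illustration:

```python
# End-to-end sketch of the pipeline for "Who is Shahrukh Khan".
# Speech-to-text and text-to-speech are stubbed as plain strings;
# a real assistant would use trained models at every stage.

KNOWLEDGE = {  # tiny stand-in for a real search backend
    "Shahrukh Khan": "Shahrukh Khan is an Indian film actor.",
}

def speech_to_text(audio: str) -> str:
    return audio  # stub: pretend the audio was already transcribed

def classify_intent(text: str) -> str:
    return "search" if text.lower().startswith("who is") else "unknown"

def recognize_entities(text: str) -> dict:
    for name in KNOWLEDGE:
        if name.lower() in text.lower():
            return {"person": name}
    return {}

def predict_response(intent: str, entities: dict) -> str:
    if intent == "search" and "person" in entities:
        return KNOWLEDGE[entities["person"]]
    return "Sorry, I did not understand that."

def text_to_speech(text: str) -> str:
    return f"<audio: {text}>"  # stub: a real system synthesises audio

text = speech_to_text("Who is Shahrukh Khan")
reply = predict_response(classify_intent(text), recognize_entities(text))
print(text_to_speech(reply))
```

Note how the response depends on both outputs: the intent (‘search’) tells the assistant what to do, and the entity (‘Shahrukh Khan’) tells it what to do it with.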

I hope this blog helps you build an intuition for how voice assistants work. I would love to hear your suggestions and thoughts on this blog. If you find it helpful, share it with your friends.
