We talk to a lot of people about their IVR (Interactive Voice Response) systems, and we make a point of calling up all sorts of different companies to see how their IVR systems sound.  Something we’ve noticed is that more companies are asking about, and deploying, ‘Natural Language’ IVR solutions.  Another thing we noticed is that their reasons, expectations and outcomes are quite varied.  For some it’s driven by a desire to reduce customer effort, others want to keep up with their competitors, and some have been sold on a technology that actually makes little sense for their business. So, what’s a Natural Language IVR?  Why should you deploy one?  And how do you do it?


A Natural Language (NL) IVR is one that uses a particular type of automated speech recognition (ASR) technology that allows callers to say what they’re calling about in a wide variety of ways.  Instead of prompting them to say specific phrases, the system will typically just say something like: ‘Welcome to VoxGen, how can I help you today?’.

Sounds great, and pretty natural, but callers can be confused by such an open-ended prompt because they need to work out what they think an appropriate response will be. You often get better accuracy by providing a few examples, like: ‘you can say things like: what’s my balance or I want to pay my bill’.  Which starts to sound a little less natural.

There’s also a problem with callers giving very long explanations, which is something that I do a lot when speaking with a human agent, but it’s a bit too much for an automated system to understand.  That’s why some systems try to avoid this by saying: ‘in a few words, tell me how I can help you today?’.

So what’s going on under the covers?  The difference is in the way the speech recognition works.  Bear with me here, because it will help to understand the limitations, and also the process for building and tuning NL systems.  An NL system consists of two key elements: a statistical language model (SLM) and a statistical semantic model (SSM).  The language model helps the system recognise the sequence of words that the caller said, while the semantic model helps the system understand what the caller meant: the caller’s intent.  The key thing to understand is that they are just ‘pattern matchers’ that are trained on a large number of sentences that callers typically say when asked ‘how can I help you today?’.  The SLM simply learns what sequences of words are spoken, and the SSM learns how to match a bunch of words with a request type.  After training, the NL system can take an input from the caller, like: ‘I want to pay my bill’, and convert it into a meaning, like: Call_type=pay_bill.  Some systems will also try to capture extra information, so if the caller says: ‘I want to pay my gas bill’, the system might be able to capture a more complex meaning, like: Call_type=pay_bill, Product_type=gas.
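To make the ‘pattern matcher’ idea a bit more concrete, here is a minimal sketch of the semantic side in Python.  It is not how any particular vendor’s SSM works: it assumes the words have already been recognised, uses a toy naive Bayes word-count model, and the training sentences, the PRODUCTS list and all the function names are made up for illustration.

```python
import math
from collections import Counter, defaultdict

# Hypothetical tagged utterances: (transcription, intent tag).
# A real system would be trained on tens of thousands of these.
TRAINING = [
    ("i want to pay my bill", "pay_bill"),
    ("pay my gas bill please", "pay_bill"),
    ("i need to make a payment", "pay_bill"),
    ("what's my balance", "check_balance"),
    ("how much do i owe", "check_balance"),
    ("can you tell me my account balance", "check_balance"),
]

# Hypothetical keyword list used to fill the optional Product_type slot.
PRODUCTS = ("gas", "electricity", "water")


def train(examples):
    """Count how often each word appears under each intent tag."""
    word_counts = defaultdict(Counter)
    tag_counts = Counter()
    for text, tag in examples:
        tag_counts[tag] += 1
        word_counts[tag].update(text.split())
    vocab = {word for counts in word_counts.values() for word in counts}
    return word_counts, tag_counts, vocab


def classify(text, model):
    """Return the most probable intent tag (naive Bayes, add-one smoothing)."""
    word_counts, tag_counts, vocab = model
    total = sum(tag_counts.values())
    best_tag, best_score = None, -math.inf
    for tag in sorted(tag_counts):
        score = math.log(tag_counts[tag] / total)
        denom = sum(word_counts[tag].values()) + len(vocab)
        for word in text.lower().split():
            score += math.log((word_counts[tag][word] + 1) / denom)
        if score > best_score:
            best_tag, best_score = tag, score
    return best_tag


def interpret(text, model):
    """Turn recognised words into a meaning such as Call_type=pay_bill."""
    meaning = {"Call_type": classify(text, model)}
    words = text.lower().split()
    for product in PRODUCTS:
        if product in words:
            meaning["Product_type"] = product
    return meaning


MODEL = train(TRAINING)
print(interpret("I want to pay my gas bill", MODEL))
```

With this toy data, ‘I want to pay my gas bill’ comes out as Call_type=pay_bill, Product_type=gas.  The sketch ignores punctuation and recognition errors, which a production system would have to deal with.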


So why would you want to do this?  The idea is that it creates a better caller experience and gets callers to the right place with less effort.  That can be a win-win: the right place might be a well-designed automated system that lets them complete their task quickly and easily, saving callers time and your business money, or the right agent who can help them, avoiding annoying and costly internal transfers.  Get it right, and there’s a strong business case: lower costs, plus a better customer experience that drives loyalty and retention, ultimately increasing profits.  An NL solution is not right for every situation, but it’s definitely worth considering if you’ve got a large number of different destinations or agent skill groups that calls need to be directed to, and a high enough volume of calls to justify the initial investment and ongoing tuning costs.

Remember, the NL solution won’t understand every caller, and some callers will find it difficult to respond to the open question.  You could just route those calls to a general agent pool, but that might mean an extra transfer, which costs money and impacts customer experience, so you will probably still need a good ‘directed dialogue’ call routing menu as a backup.
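One common way to wire in that backup is to act on the recogniser’s confidence score.  The sketch below is only an illustration of the idea: the thresholds, the two-attempt limit, the NLResult type and the step names are all assumptions, and real values would come out of tuning.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical thresholds; real values would come from tuning on live data.
ACCEPT_THRESHOLD = 0.80
CONFIRM_THRESHOLD = 0.50
MAX_OPEN_ATTEMPTS = 2


@dataclass
class NLResult:
    intent: Optional[str]  # None when the caller said nothing or nothing matched
    confidence: float      # assumed to be a score between 0.0 and 1.0


def next_step(result: NLResult, attempt: int) -> str:
    """Decide what the IVR does after an open-question attempt (1-based)."""
    if result.intent is not None:
        if result.confidence >= ACCEPT_THRESHOLD:
            return f"route:{result.intent}"
        if result.confidence >= CONFIRM_THRESHOLD:
            return f"confirm:{result.intent}"
    if attempt < MAX_OPEN_ATTEMPTS:
        # Ask again, this time with a few examples of what callers can say.
        return "reprompt_with_examples"
    # Give up on the open question and fall back to a menu.
    return "directed_dialogue_menu"
```

For example, next_step(NLResult("pay_bill", 0.92), attempt=1) routes the call straight away, while a caller who isn’t understood twice in a row ends up in the directed dialogue menu rather than being transferred to a general agent pool.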


There are 8 key steps to implementing an NL solution:

Initial design: Design the prompting strategy and record the opening prompts.

Data collection: Build a simple IVR application that uses the prompts from the initial design.  A percentage of calls are then sent to this application and caller responses, known as ‘utterances’, are recorded and stored.  You’ll typically need at least 30,000 utterances, but the more the better, and some systems will need 100,000 or more to build a good model.

Transcription: The utterances need to be transcribed.  This starts with manual transcription, but you can speed things up by using the speech recogniser to come up with the initial transcription, and then letting human transcribers verify and correct the automated transcriptions.

Tagging: To train the Statistical Semantic Model (SSM), which interprets the words a caller says and determines their intent, the transcriptions need to be ‘tagged’ with an interpretation, or intent.  Again, this can be done with a combination of manual and semi-automatic tagging, and you’ll need to develop a ‘tagging guide’ that makes sure the transcribed sentences are tagged consistently, and in a way that suits the breakdown of call types and skill groups or routing destinations in your business.

Sentence generation: Often there will be certain utterances and tags that are not well represented in the tagged utterances, so the data can be augmented with additional tagged utterances that are created by hand or generated from a grammar.

Model build and test: The transcriptions are used to train the statistical language model (SLM), and the tagged transcriptions are used to train the statistical semantic model (SSM).  Typically, around 10% of the data is held back as a ‘test set’ and used to test the models that are built, which helps to give an indication of how well the system will perform in live service.

Deploy: Once built, tested and tuned ‘offline’, it’s time to integrate the NL model into your existing IVR application, and deploy it for live callers.  It’s a good idea to do this gradually, so you just send a small percentage of calls to the system to start with, to make sure everything’s working well before ramping up the volume of calls.

Tune: After go-live, you’ll have a lot more data to work with, so watch the results carefully, check the impact on call routing accuracy, and collect recordings from the live system that you can use for tuning.  Tuning is basically an additional cycle through these steps, including updates to the design of the prompting, which is sometimes necessary to improve performance, e.g. by providing some examples of what callers can say in the initial prompt.
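Two of the steps above are easy to picture in code: holding back around 10% of the tagged data as a test set, and sending only a small percentage of live calls to the new system.  The following Python is a rough sketch under our own assumptions, not a description of any particular toolkit; the function names, the fixed random seed and the idea of bucketing calls by a hash of a call ID are all illustrative choices.

```python
import hashlib
import random


def split_holdout(tagged, test_fraction=0.10, seed=0):
    """Shuffle tagged utterances and hold back a fraction as a test set."""
    shuffled = list(tagged)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]  # (training set, test set)


def accuracy(predict, test_set):
    """Fraction of held-back utterances whose predicted tag matches the tag."""
    correct = sum(1 for text, tag in test_set if predict(text) == tag)
    return correct / len(test_set)


def in_rollout(call_id: str, percent: int) -> bool:
    """Decide whether a call goes to the NL system during a gradual rollout.

    Hashing the (hypothetical) call ID gives a stable bucket from 0 to 99,
    so a call that is included at 10% is still included at 50%.
    """
    digest = hashlib.sha256(call_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < percent
```

In use, you would train on the first list returned by split_holdout, pass the model’s predict function and the second list to accuracy, and raise the percent argument of in_rollout step by step as confidence in the live system grows.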

Other Considerations

Quite often, the caller will not give enough information in their initial response, or the recogniser won’t pick up all the details, so in addition to the open-ended part of the NL solution, you’ll need some extra dialogue steps to ‘disambiguate’ a caller’s intent.  In a study we conducted that compared an NL call routing design with a more directed dialogue solution, we found that callers preferred the experience of a well-designed and accurate NL solution to the directed dialogue approach, but they liked it even more when the system used data to predict why they were calling and asked them a simple question, like: ‘I looked up your number and found you have a prescription due for renewal, is that what you’re calling about?’.
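Both ideas, the predictive opening question and the follow-up question that fills in missing detail, can be sketched in a few lines of Python.  Everything here is hypothetical: the CRM dictionary stands in for a real CRM lookup, the phone number is fictitious, and the slot names and prompt wording are just examples.

```python
# Hypothetical stand-in for a CRM lookup, keyed by the caller's number.
CRM = {
    "+15550100": {"pending_event": "prescription_renewal"},
}

# Hypothetical predictive prompts, keyed by the event found in the CRM record.
PREDICTIVE_PROMPTS = {
    "prescription_renewal": (
        "I looked up your number and found you have a prescription "
        "due for renewal, is that what you're calling about?"
    ),
}

OPEN_PROMPT = "In a few words, tell me how I can help you today."

# Hypothetical details each call type needs before the call can be routed.
REQUIRED_SLOTS = {"pay_bill": ["Product_type"]}
FOLLOW_UP = {"Product_type": "Which bill is that for: gas, electricity or water?"}


def opening_prompt(caller_number, crm=CRM):
    """Use a predictive question when the CRM suggests a reason for the call."""
    record = crm.get(caller_number, {})
    event = record.get("pending_event")
    return PREDICTIVE_PROMPTS.get(event, OPEN_PROMPT)


def follow_up_question(meaning):
    """Return a disambiguation question if a required detail is missing."""
    for slot in REQUIRED_SLOTS.get(meaning.get("Call_type"), []):
        if slot not in meaning:
            return FOLLOW_UP[slot]
    return None  # nothing missing, so the call can be routed
```

A caller with no matching record simply hears the open question, and a caller who says ‘I want to pay my bill’ without naming a product gets one extra question before being routed.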


NL solutions can improve customer experience by reducing customer effort and ensuring customers get to the right agent, or to self-service automation that lets them complete their tasks quickly and easily.  They can also reduce costs through fewer internal transfers and greater automation.  But NL is not right in every situation, and this type of solution is more expensive to build, run and maintain, so it’s important to run the numbers, take a careful look at the costs as well as the benefits, and check there’s a compelling ROI.  An NL solution is best when combined with directed dialogue for disambiguation and as a fall-back mechanism when callers don’t respond or aren’t understood.  And if you’re able to capture the caller’s number and find a matching record in your CRM system that lets you predict their reason for calling, that can be the ultimate experience.  It’s like going into a coffee shop and ordering your favourite beverage.  It’s great when you get a welcoming smile and an enthusiastic: ‘How can I help you?’, but it’s even better when you go to your local coffee shop and they say: ‘Regular decaf Americano as usual?’.  Yes please!
