Voice AI Technology Is More Advanced Than You Might Think

For Annie Brown for Forbes

Systems that can handle repetitive tasks have supported global economies for generations. But systems that can handle conversations and interactions? Those have felt impossible, due to the complexity of human speech. Any of us who regularly use Alexa or Siri can attest to the deficiencies of machine learning in handling human messages. The average person has yet to interact with the next generation of voice AI tools, but what this technology is capable of has the potential to change the world as we know it.

The following is a discussion of three innovative technologies are accelerating the pace of progress in this sector.

Conversational AI for Ordering

Experts in voice AI have prioritized technology that can alleviate menial tasks, freeing humans up to engage in high-impact, creative endeavors. Drive-through ordering was early identified by developers as an area in which conversational AI could make an impact, and one company appears to have cracked the code.

Creating a conversational AI system that can handle drive-through restaurant ordering may sound simple: load in the menu, use chat-based AI, and you’ve done it. The actual solutions aren’t quite so easy. In fact, creating a system that works in an outdoor environment—handling car noises, traffic, other speakers—and one that has sophisticated enough speech recognition to decipher multiple accents, genders, and ages, presents immense challenges.

The co-founders of Hi Auto, Roy Baharav and Eyal Shapira, both have a background in AI systems for audio: Baharav in complex AI systems at Google and Shapira in NLP and chat interfacing.

Baharav describes the difficulties of making a system like this work: “Speech handling in general, for humans, is hard. You talk to your phone and it understands you – that is a completely different problem from understanding speech in an outdoor environment. In a drive-through, people are using unique speech patterns. People are indecisive – they’re changing their minds a lot.”

That latter issue illustrates what they call multi-turn conversation, or the back-and-forth we humans do so effortlessly. After years of practice, model training, and refinement, Hi Auto has now installed their conversational AI systems in drive-throughs around the country, and are seeing a 90% level of accuracy.

Shapira forecasts, “Three years from now, we will probably see as many as 40,000 restaurant locations using conversational AI. It’s going to become a mainstream solution.” 

“AI can address two of the critical problems in quick-serve restaurants,” comments Joe Jensen, a Vice President at Intel Corporation, “Order accuracy which goes straight to consumer satisfaction and then order accuracy also hits on staff costs in reducing that extra time staff spends.” 

Conversation Cloud for Intelligent Machines

A second groundbreaking innovation in the world of conversational AI is using a technique that turns human language into an input.

The CEO of Whitehead AI, Diwank Tomer, illustrates the historical challenges faced by conversational AI: “It turns out that, when we’re talking or writing or conveying anything in human language, we depend on background information a lot. It’s not just general facts about the world but things like how I’m feeling or how well defined something is.

“These are obvious and transparent to us but very difficult for AI to do. That’s why jokes are so difficult for AI to understand. It’s typically something ridiculous or impossible, framed in a way that seems otherwise. For humans, it’s obvious. For AI, not so much. AI only interprets things literally.”

So, how does a system incapable of interpreting nuance, emotion, or making inferences adequately communicate with humans? The same way a non-native speaker initially understands a new language: using context.

Context aware AI is building models that can use extra information, beyond the identity of the speaker or other facts. Chatbots are one area which are inherently lacking, and could benefit from this technology. For instance, if a chatbot could glean contextual information from a user’s profile, previous interactions, and other data points, that could be used to frame highly intelligent responses.

Tomer describes it this way, “We are building an infrastructure for manipulating natural language. Something new that we’ve built is chit chat API – when you say something and it can’t be understood, Alexa will respond with, ‘I’m sorry, I can’t understand that.’ It’s possible now to actually pick up or reply with witty answers.”

Tomer approaches the future of these technologies with high hopes: “Understanding conversation is powerful. Imagine having conversations with any computer: if you’re stuck in an elevator, you could scream and it would call for help. Our senses are extended through technology.”

Data Process Automation

Audio is just one form of unstructured data. When collected, assessed, and interpreted, the output of patterns and trends can be used to make strategic decisions or provide valuable feedback.

super.AI was founded by Brad Cordova. The company uses AI to automate the processing of unstructured data. Data Process Automation, or DPA, can be used to automate repetitive tasks that deal with unstructured data, including audio and video files. 

For example, in a large education company, children use a website to read sentences aloud. super.AI used a process automation application to see how many errors a child made. This automation process has a higher accuracy and faster response time than when done by humans, enabling better feedback for enhanced learning.

Another example has to do with personal information (PI), which is a key point of concern in today’s privacy-conscious world, especially when it comes to AI. super.AI has a system of audio reduction whereby it can remove PI from audio, including name, address, and social security numbers. It can also remove copyrighted material from segments of audio or video, ensuring GDPR or CCPA compliance.

It’s clear that the supportive qualities of super.AI are valuable, but when it comes to the people who currently do everything from quality assurance on website product listings to note taking at a meeting, the question is this: are we going too far to replace humans?

Cordova would say no, “Humans and machines are orthogonal. If you see the best chess players: they aren’t human or machine, they’re humans and machines working together. We know intuitively as humans what we’re put on this earth for. You feel good when you talk with people, feel empathy, and do creative tasks.

“There are a lot of tasks where you don’t feel great: tasks that humans shouldn’t be doing. We want humans to be more human. It’s not about taking humans’ jobs, it’s about allowing humans to operate where we’re best and machines aren’t.”

Voice AI is chartering unprecedented territory and growing at a pace that will inevitably transform markets. The adoption rates for this kind of tech may change most industries as we currently know them. The more AI is integrated, the more humans can benefit from it. As Cordova succinctly states, “AI is the next, and maybe the last technology we will develop as humans.” The capacity of AI to take on new roles in our society has the power to let humans be more human. And that is the best of all possible outcomes.

Speech Rec Pros Newsletter Sign-up

Need more dictation or transcription supplies and accessories?


Visit our friends over at TranscriptionGear to get the rest of what you need! From headsets to foot pedals, they have you covered.

Visit TranscriptionGear
Leave a Reply

Your email address will not be published. Required fields are marked *