How this engineer is improving speech recognition for young tech users

By Michelle Wheeler for Create

Researchers are hoping to capture the voices of hundreds of Australian children in a bid to improve speech recognition for young tech users. 

Until now, the speech recognition software behind virtual assistants like Google Assistant, Alexa and Siri has relied on a database of adult voices. But AusKidTalk — a new joint project of five universities — aims to change that.

The project’s team of engineers, linguists, psychologists and speech pathologists are creating a unique database of Australian kids’ voices, and they say the benefits could extend to new learning and speech therapy tools for children.

University of New South Wales electrical engineer Dr Beena Ahmed researches the use of signal processing to understand speech.

She’s been studying children’s speech for more than a decade, coming up with new tools and technology to help Australian kids.

“The biggest issue that I have faced in my research is that we don’t have children’s speech databases that we can use in developing these tools,” Ahmed said.

“There are huge databases for adult speech but for children’s speech, there are not that many databases around the world. And with speech, accents make a huge difference … so something built for American accents might not necessarily work well with Australian ones.”

Unable to find a database of children’s voices to support her research, Ahmed decided to build one. She reached out to like-minded researchers in Sydney and Melbourne, and AusKidTalk was born.

Ahmed and her fellow engineers will use the database to develop new speech recognition systems for younger users. Linguists and psychologists, meanwhile, will use it to better understand how children develop their speech and language. 

Ahmed says young children can struggle with consonant clusters, so clown becomes cown and brick becomes bick. And certain sounds, such as th, typically don't emerge until children are five or six years old.

Children also don’t have a fully developed vocabulary and may not use correct grammar or sentence structure — something that Ahmed says technology doesn’t recognise.

One of the goals of AusKidTalk is to collect different kinds of speech. The researchers are aiming to record 750 Australian children between the ages of three and 12, including 50 with disordered speech. 

It sounds like a lot, but Ahmed says it’s not a huge amount of data for building speech recognition algorithms.  

“To develop a really robust model, you need thousands and thousands of hours of speech,” she said.

“We’re only getting 20 to 30 minutes per child.”

The team is having to develop techniques to get the most out of the limited speech they’re able to collect. One is known as “domain adaptation”. 

“Say you have a model for American speech, or adult speech, and then you use the children’s speech to improve the model so it works better with children’s speech,” Ahmed said.

“From an engineering perspective, it’s the AI algorithms that will be our focus.”
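The idea behind domain adaptation can be sketched with a toy model: a classifier "pretrained" on adult-like data is nudged toward a small sample of child-like data, improving accuracy on children without retraining from scratch. Everything below (the 2-D features, the nearest-centroid model, the interpolation weight) is an illustrative assumption, not the AusKidTalk pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "acoustic features" in 2-D: two sound classes as adults produce them.
adult_a = rng.normal([0.0, 0.0], 0.3, size=(200, 2))
adult_b = rng.normal([3.0, 3.0], 0.3, size=(200, 2))

# Children produce the same sounds shifted (e.g. higher pitch and formants),
# but only a small sample is available.
child_a = rng.normal([2.0, 2.0], 0.3, size=(20, 2))
child_b = rng.normal([5.0, 5.0], 0.3, size=(20, 2))

# "Pretrained" model: one centroid per class, fit on adult data only.
centroids = np.stack([adult_a.mean(0), adult_b.mean(0)])

def classify(x, c):
    # Nearest-centroid decision rule.
    return np.argmin(np.linalg.norm(x[:, None] - c[None], axis=2), axis=1)

# Domain adaptation, crudely: interpolate each centroid toward the mean of
# the small child sample, keeping part of the adult model.
alpha = 0.8
child_means = np.stack([child_a.mean(0), child_b.mean(0)])
adapted = (1 - alpha) * centroids + alpha * child_means

test_x = np.vstack([child_a, child_b])
labels = np.array([0] * 20 + [1] * 20)
acc_before = (classify(test_x, centroids) == labels).mean()
acc_after = (classify(test_x, adapted) == labels).mean()
print(f"child accuracy before adaptation: {acc_before:.2f}, after: {acc_after:.2f}")
```

In a real system the adaptation set and test set would be disjoint, and the model would be a neural network rather than centroids, but the principle is the same: a little in-domain child speech goes a long way when it updates a model already trained on plentiful adult speech.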

Right now, one of the biggest engineering challenges is annotation: marking up exactly what was said, and where, in each recording.

“Once you’ve developed something, you then need to manually validate it, to make sure it’s doing the correct annotation,” Ahmed said. “So it’s a lot of cost involved.

“Then to train those annotation tools, we need Australian speech somehow, which we don’t have already anyway. It’s sort of a Catch-22.”

Ahmed is also looking at recognising emotion in children’s speech — something that could be used to triage phone calls to kids’ helplines, for instance. And since going public with their plans, the AusKidTalk team has been approached by commercial companies interested in accessing their database of recordings.

“That’s something we’re still discussing,” Ahmed said. 

“At the moment, the major priority is our own research.” 

Building a database

Ahmed said the different speech samples AusKidTalk collects will first be evaluated and categorised. 

“At the moment, we’re developing automated annotation tools so that we can mark what is said where,” she said. 

“Then the next step is actually developing some algorithms to recognise some of the common sounds.”

The algorithms could feed into applications like automated reading or speech therapy tools. 

“In our algorithms, we develop strategies to perhaps use the new data with existing data and adapt what we call ‘acoustic models’ … which model the individual sounds and speech,” she said.

“Then we have language models as well … for the individual words.”
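The interplay Ahmed describes, with acoustic models scoring individual sounds and language models scoring words, can be sketched as a simple rescoring step. The candidate words and log-probabilities below are invented for illustration; real recognisers score phone sequences with neural acoustic models rather than whole words.

```python
# Hypothetical acoustic scores: log-likelihood of the audio under each
# candidate word's sound sequence (made-up numbers for illustration).
# A child saying "brick" as "bick" makes "bick" the best acoustic match.
acoustic_logp = {"bick": -3.0, "brick": -3.2, "big": -5.0}

# Hypothetical language model: log-probability of each word in context.
# "bick" is not a real word, so the language model scores it very low.
lm_logp = {"bick": -9.0, "brick": -2.0, "big": -2.5}

def decode(acoustic, lm, lm_weight=1.0):
    # Combine the two scores; the hypothesis with the highest total wins.
    return max(acoustic, key=lambda w: acoustic[w] + lm_weight * lm[w])

# The acoustic model alone prefers the child's pronunciation "bick"...
best_acoustic = max(acoustic_logp, key=acoustic_logp.get)
# ...but the language model pulls the final decision toward the real word.
best_combined = decode(acoustic_logp, lm_logp)
print(best_acoustic, "->", best_combined)
```

This is why both kinds of model matter for children: the acoustic models must tolerate immature pronunciations like bick, while the language models must tolerate the incomplete vocabulary and grammar the article describes.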

With children’s speech changing so quickly, the researchers will divide the recordings into four age groups for their initial analysis. 

Eventually, the recordings will be combined.
