The Problem With Voice Datasets

Editor's Note: This article was originally published by CMSwire, written by Erika Morphy, on October 15, 2018

Much of what NXP Semiconductors does is build and design chips. It then puts application software on top of these chips to enable customers to do things they ordinarily might not be able to do in-house, such as develop voice recognition applications. NXP has worked with the Amazon Alexa and Google Assistant development teams to launch development kits for both platforms. The company is also working on different modules that make it easier to train smart voice devices.

NXP does not rely on a large, diverse, high-quality voice dataset to do this work, said Steven Tateosian, director of secure IoT and industrial solutions. Instead, NXP trains its voice applications in two ways:

• It works with third-party IP providers that process the data and create the models for specific words.

• It enables its products to work within broader ecosystems, such as the Alexa Voice Service or Google Assistant.

“There are companies that build these tools and create the models as their core competency,” Tateosian explained. “They will either custom-create the data or buy it, or the client can send what data they have.”

Testfire Labs is a natural language processing, machine learning and artificial intelligence company that is building productivity tools for the workforce, such as for business meetings. It started collecting its own data before it even had an alpha of its product, said CEO Dave Damer. “Then we used datasets like the AMI Corpus, 2000 HUB5, VoxForge, and others to see if they improved our word error rate and ROUGE scores. If we found sets that benefited the model, we used those alongside our own data and started doing data augmentation, getting multiple recordings of the same meetings and corrupting the quality of other recordings to increase the size of the dataset.”
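For readers who want to experiment along the same lines, here is a minimal sketch, in Python, of the two ideas Damer mentions: corrupting clean recordings with noise to enlarge a training set, and checking whether extra data actually lowers word error rate. The waveform, transcripts and SNR values are illustrative assumptions, not Testfire Labs' actual pipeline or data.

```python
import numpy as np

def augment_with_noise(waveform: np.ndarray, snr_db: float = 20.0) -> np.ndarray:
    """Corrupt a clean recording with Gaussian noise at a target SNR.

    This mimics the idea of degrading copies of existing recordings so that
    one clean meeting yields several noisier training examples.
    """
    signal_power = np.mean(waveform ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = np.random.normal(0.0, np.sqrt(noise_power), size=waveform.shape)
    return waveform + noise

def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard word-level WER: edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    d = np.zeros((len(ref) + 1, len(hyp) + 1), dtype=int)
    d[:, 0] = np.arange(len(ref) + 1)
    d[0, :] = np.arange(len(hyp) + 1)
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i, j] = min(d[i - 1, j] + 1,        # deletion
                          d[i, j - 1] + 1,        # insertion
                          d[i - 1, j - 1] + cost) # substitution
    return d[len(ref), len(hyp)] / max(len(ref), 1)

# Stand-in for one second of 16 kHz audio; real code would load a recording.
clean = np.random.randn(16000)
noisy_copy = augment_with_noise(clean, snr_db=10.0)

# Did adding an external corpus help on a held-out meeting transcript?
reference = "schedule the follow up meeting for thursday at ten"
baseline_output = "schedule the follow up meeting on thursday at ten"
augmented_output = "schedule the follow up meeting for thursday at ten"
print("baseline WER:", word_error_rate(reference, baseline_output))
print("with extra data WER:", word_error_rate(reference, augmented_output))
```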

Ultimately, Testfire found that the best sources were the meetings from the beta period, combined with data from customers who opted to allow the company to use their meetings for continuous improvement. This route is not uncommon for companies needing datasets to train voice applications.

A Rapidly Moving Space

Training products against voice datasets is a methodology that is evolving rapidly, Tateosian said. For that reason, companies are taking different approaches based on their end customers’ use cases and the business models associated with their products. But the biggest challenge for almost all of these companies, Tateosian said, is having enough data of decent quality to create the models used to train these machines.

One issue is that many companies are behind the curve compared with Amazon and Google, which have been collecting and creating large datasets of different sounds and voices for years. Google makes some of its audio datasets publicly available, Tateosian noted. “My understanding, based on conversations I have heard in the market, is that they are an interesting place to start, but they are not adequate if you are developing a production-level product that’s going to go on the market,” he said. “There is just not enough data, or maybe it is not of the highest quality or diversity within the dataset. I have heard similar things about other public datasets: it’s a little bit like getting the Cliff Notes but not reading the book.”

Create Your Own Dataset

Another approach to finding the necessary data to train a product is to create your own dataset, a task that can be outsourced, which is what NXP does. Company business models very much dictate the approach taken here. Some of the outsourcers, for example, keep a few thousand people on retainer who can say words or phrases in different ways and with different accents. These recordings are then added to the provider’s already growing dataset.

Companies can build their own datasets from scratch if they are so inclined, Tateosian said. “There is also a huge amount of audio data available online, such as on YouTube. The challenge is that it is not categorized in any particular way. YouTube isn’t set up, at least publicly, for someone to search for specific types of audio and then to be able to extract that and create a model.”

Call centers are another source of datasets, he added. “Remember the frustration of 10 years ago, trying to talk to an automated call center. They have been learning and improving by recording and collecting all those calls. They have all the data within the context of whatever helpline that they are running.”

Design Principles

Whether you build your own dataset or use publicly available ones, there are some important design criteria to keep in mind. “Ideally you need to support the mainstream business languages that your customers speak,” said Ed Price, director of compliance at Devbridge. “Take a page from the big boys’ book (Amazon, Apple, Google, Microsoft): they don’t support all languages/voices with their digital assistants. Unless you have a specific customer need or your business depends on it, then don’t.”

Price also advised companies to focus on the jargon, abbreviations and industry speak their customers might use in the voice user interface. He suggested conducting a design heuristics study of the customers and how they ask questions. “You could set up a prototype with Alexa or another device and capture the asks (speech-to-text) and then use that data to manage your features,” he said.
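As a rough illustration of Price's suggestion, the sketch below tallies the most frequent terms in a log of captured "asks" so that domain jargon and abbreviations stand out. The log file name, the stop-word list and the export format (one utterance per line) are assumptions for the example, not a prescribed workflow.

```python
from collections import Counter
from pathlib import Path

# One captured utterance per line, e.g. exported from a prototype's speech-to-text logs.
UTTERANCE_LOG = Path("captured_asks.txt")  # placeholder path

# Generic words to ignore so domain jargon and abbreviations stand out.
STOP_WORDS = {"the", "a", "an", "to", "of", "for", "my", "me", "what", "is", "please"}

def top_terms(path: Path, n: int = 25) -> list[tuple[str, int]]:
    """Count non-stop-word terms across all captured utterances."""
    counts: Counter[str] = Counter()
    for line in path.read_text(encoding="utf-8").splitlines():
        for word in line.lower().split():
            word = word.strip(".,?!")
            if word and word not in STOP_WORDS:
                counts[word] += 1
    return counts.most_common(n)

if __name__ == "__main__":
    for term, freq in top_terms(UTTERANCE_LOG):
        print(f"{freq:5d}  {term}")
```

A frequency list like this is only a starting point, but it gives a quick, data-backed view of which phrasings and industry terms the voice interface actually needs to handle.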
