How to approach design and engineering in the voice-first era
Making voice a part of your digital strategy starts with appreciating the complexity of human language. Every digital voice experience is taking the place of what was once a conversation.
| Question to consider | Voice opportunity |
| --- | --- |
| Does your navigation rely heavily on search capabilities? | Users will soon expect voice search on apps and devices. |
| Do user searches involve complex or tiered queries? | Voice can be the most efficient mode of search for these queries. |
| Is your app used on a small-screen device or wearable? | Voice can be the most efficient interface for these devices. |
| Do users engage with your app or device on the go? | Voice is ideal for engaging in a hands-free environment. |
| Do users need detailed customer insights or analytics? | Voice-based analytics will provide powerful new customer insights. |
At its annual I/O Conference, Google debuted Duplex—a truly conversational VUI that is so finely tuned to a specific set of nuances of language, it can phone a restaurant and book a reservation for you. The sophistication of Duplex is an outlier. There are more approachable ways to find success.
To figure out how resilient your VUI needs to be, start by identifying the common questions it is likely to encounter. This does not require a large investment. List out potential questions on notecards and arrange them on a table to reflect conversational threads. Structure planned conversations and identify ways they could go off track. Next, rank these conversations by the utility or benefit they would provide and assess the overall complexity behind the conversational goals.
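The triage described above can be sketched in a few lines. In this hypothetical example, each candidate question gets a utility score and a complexity score (both invented for illustration), and questions are ranked by value per unit of effort:

```python
# A minimal sketch of the triage step: the questions, utility scores (1-10),
# and complexity scores (1-10) are all hypothetical examples.
candidate_questions = [
    # (question, utility, complexity)
    ("What is my account balance?", 9, 2),
    ("How is this stock performing?", 8, 5),
    ("Compare my portfolio to the S&P 500 over five years.", 6, 9),
    ("Transfer money between my accounts.", 7, 8),
]

# Rank by utility per unit of complexity: high value, low effort first.
ranked = sorted(candidate_questions,
                key=lambda q: q[1] / q[2],
                reverse=True)

for question, utility, complexity in ranked:
    print(f"{utility / complexity:4.1f}  {question}")
```

A simple ratio like this is only a starting point; the point is to make the prioritization explicit and debatable rather than implicit.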
FIGURE 3A: Identify goals, then the questions that address them
The complexity lies not only in the number of conversational branches, but also the number of ways to get at a question. In the case of a financial planning application, this could mean, “How is this stock performing?” or “How is this stock performing over the last three months?” Prioritize the questions worth investment, then validate by using research and analytics to identify questions or paths that might veer outside of the planned conversation. Regular conversation testing, using tools like Voxterity VUI Design Studio and Sayspring, supports rapid iteration through development.
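The "many ways to get at a question" problem amounts to mapping multiple phrasings onto one intent. A minimal sketch, using the stock-performance example above (the intent name, pattern, and slot names are assumptions for illustration):

```python
import re

# Hypothetical intent catalog: several phrasings resolve to one
# "GetStockPerformance" intent, with an optional time-range slot.
PATTERNS = [
    (re.compile(r"how is (?P<symbol>[\w ]+?) performing"
                r"(?: over the last (?P<range>[\w ]+))?\??$", re.I),
     "GetStockPerformance"),
]

def match_intent(utterance):
    """Return (intent, slots) for the first matching pattern, else (None, {})."""
    for pattern, intent in PATTERNS:
        m = pattern.search(utterance)
        if m:
            return intent, {k: v for k, v in m.groupdict().items() if v}
    return None, {}

# Both phrasings from the text resolve to the same intent.
print(match_intent("How is this stock performing?"))
print(match_intent("How is this stock performing over the last three months?"))
```

Production systems use trained language models rather than regular expressions, but the design question is the same: which variants collapse into one conversational branch, and which slots distinguish them.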
Conversation has both audible and visual elements. For a VUI, it’s important to also consider the visual feedback and conversational tone.
Empathy is at the core of quality design and successful conversation. Consider these different verbal interactions: “Could you please pick up bread at the store?” said while smiling has an entirely different feel than, “Hey, grab some bread tonight! We’re out!” shouted as someone rushes out the door. Both communicate the same request but have very different subtexts. The chosen tone should approach the conversation from an angle that is aligned with the purpose of the application and the messaging language of the larger organization.
Visual feedback confirming a device is listening or comprehending is crucial. Without visual cues on a screen or other indicators on the device, users may find themselves shouting at a heap of wires and plastic they believe isn’t listening, resulting in a frustrating user experience. These potential frustrations must be carefully considered when designing the user experience. If a voice interaction is focused and successful from the start, user adoption will quickly follow.

When responsive website design first took off, the biggest mistake was trying to make the big screen scale down to the little ones. Those lessons were hard-earned, but now mobile-first design is the default when creating a new digital experience. The same goes for voice. Attempting to create a catch-all solution that can sing any tune will leave the final solution a bit off-key.
Instead, be incredibly focused on a particular task or conversational path, prove value, and iterate from there. Constantly review how many steps are required to get an answer and look for a faster way. Reduce to the simplest conversational flow, and when onboarding, make it explicitly clear what is expected from the person using the interface.
Seven habits for highly successful VUI design
Understand: Have goals for the interaction and the outcome
Empathize: Put yourself in the user’s shoes and appreciate what they are trying to achieve
Simplify: Deliver information succinctly, avoid excess detail
Test: Embrace prototyping to quickly validate assumptions
Measure: Define metrics from the start and remember that not all tasks need voice interaction
Deliver: Ship the smallest, most meaningful element fast
Prioritize: Develop additional skills based on feedback
Another thing to remember is that people tend to write dialogue that is more verbose than conversational. Iterate through your voice interactions by practicing them aloud with others. Record the conversation and listen to it later, playing it back throughout development. The production value you bring to this rehearsal process will pay off in development.
Your design team should also help the business and engineering teams keep the essentials in mind when preparing the organization for voice-first design.
Begin by tuning your design stack for voice. Several new tools are available that were specifically created to aid artificial personality design. These frameworks and tools can abstract the nuances of digital assistants from different vendors and let you code 70-80 percent of an application that will work across platforms. The remaining 20-30 percent can then be coded to each digital assistant’s specs. This process is similar to how Cordova or React Native helps development for iOS and Android with 80-90 percent code reuse.
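The 70-80/20-30 split usually means platform-neutral business logic behind thin vendor adapters. A sketch of that shape, with hypothetical intent names and a simplified request format loosely modeled on an Alexa-style JSON payload:

```python
# Shared core: the reusable 70-80 percent. Knows nothing about any vendor.
def handle_intent(intent, slots):
    """Platform-neutral business logic (intent names are hypothetical)."""
    if intent == "GetFlashReport":
        return "Here is your flash report for today."
    return "Sorry, I can't help with that yet."

# Vendor shim: the 20-30 percent coded to one assistant's specs.
def alexa_adapter(alexa_request):
    """Maps a simplified Alexa-style request shape onto the shared core."""
    intent = alexa_request["request"]["intent"]["name"]
    slots = {name: slot["value"]
             for name, slot in alexa_request["request"]["intent"].get("slots", {}).items()}
    speech = handle_intent(intent, slots)
    return {"response": {"outputSpeech": {"type": "PlainText", "text": speech}}}
```

A second adapter for another assistant would reuse `handle_intent` unchanged, which is exactly the code-reuse argument made above.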
Manage users’ expectations while still creating an engaging interaction. Understand what’s under the hood: a general comprehension of machine learning is paramount when devising and designing human-facing AI.
Product personality should come out in the design, not be engineered into the agent. Chances are there are already distinct expectations, or even a formal style guide, for the conversational agent’s persona to adhere to. Focus on function, and the personality will naturally evolve through the product build.
Always remember that less is more; even less than less is best. Creating dialogue that is as brief as possible while still conversational and true to the brand’s voice is a tall order. After hearing the initial drafts of dialogue uttered by your AI agent for the first time, you’ll undoubtedly realize that even your shortest responses may be too long. Iterate, learn, measure, and build until you get it right before launching your minimum viable product (MVP).
Quality is paramount. Go pro when it comes to the voice-over. Substandard recording environments and equipment can introduce distracting artifacts, noise, and distortion to the recorded voice, all of which thwart usability and hinder adoption.
FIGURE 3B: U.S. households with smart home devices (in millions)
How does a VUI work?
At the simplest level, it all starts with some kind of smart speaker that has a built-in microphone for receiving commands. Since we live in a world of privacy concerns, manufacturers have built these devices to sleep until the user issues a “wake up” command followed by an “utterance.” The utterance is usually a word or phrase that the device’s back-end technology converts to a web-services query.
The diagram in figure 4 shows the interaction of components and the steps taken to obtain an answer once a request is made. This type of VUI is commonly referred to as a “conversational system.” Besides text and voice, conversational systems enable people and machines to use other modalities, like sight, sound, and touch, to communicate across the digital device mesh. This intelligent digital mesh requires changes to the architecture, technology, and tools used to develop these solutions.
The mesh app and service architecture (MASA) is a multichannel solution architecture that leverages cloud and serverless computing, containers, microservices, APIs, and events to deliver dynamic solutions. These solutions ultimately support many users in different roles, all using multiple devices and communicating over multiple networks. However, MASA is a long-term architectural shift that requires significant changes to development tooling and best practices.
FIGURE 4: Steps in a voice interaction request
Here’s the good news: most of this is already built into the latest digital assistants, so half the work is already done. The hard part is designing the actual user experience and making the conversation clean and easy to use. Thankfully, all of the most popular digital assistants use the same basic sentence structure which looks something like this:
Wake word: Wakes up the device to receive a request. The wake word is usually the name of the digital assistant (e.g., Alexa, Computer, Google, Siri) and must be followed by a launch word.
Launch word: This is a command such as “ask,” “order,” “find,” “tell,” or “dial.”
Invocation word: This tells the digital assistant what application to use. If the assistant has been set up to use a default app for certain requests, the invocation word can be omitted.
Utterance word/phrase: This is the specific desire that the user wants fulfilled (e.g., “Tell me the latest movie releases,” “Find a ride home,” or “Give me my flash report”).
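The four-part structure above can be illustrated with a toy parser. In reality the wake word is detected on-device in firmware, so this sketch is purely illustrative; the wake and launch word lists, and the sample invocation name, are assumptions:

```python
# Hypothetical sketch of the wake / launch / invocation / utterance structure.
WAKE_WORDS = {"alexa", "computer", "google", "siri"}
LAUNCH_WORDS = {"ask", "order", "find", "tell", "dial"}

def parse_request(spoken):
    """Split a spoken command into its four structural parts, or return None."""
    words = spoken.rstrip(".?!").split()
    wake = words[0].lower().rstrip(",")
    if wake not in WAKE_WORDS:
        return None                       # device stays asleep
    launch = words[1].lower()
    if launch not in LAUNCH_WORDS:
        return None
    invocation = words[2]                 # which application to use
    utterance = " ".join(words[3:])       # the request itself
    return {"wake": wake, "launch": launch,
            "invocation": invocation, "utterance": utterance}

print(parse_request("Alexa, ask StockTracker how is this stock performing"))
```

Note how the invocation word can drop out when a default app is configured, which is why real assistants treat it as optional.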
The voice input is then translated through natural language processing (NLP) into text, which is associated with a command in the voice system’s catalog of skills. When an association is found, the system uses a web service, along with the default settings from the user’s account preferences, to execute the command. Most devices with a speaker generate a synthesized text-to-speech response for the user. In some cases, a visual display response providing additional information will accompany the voice response.
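The back-end flow just described can be sketched end to end: transcribed text is matched against a skill catalog, a web service fulfills the command using the user's preferences, and text is returned for speech synthesis. All names here (the catalog entries, service names, and preference keys) are hypothetical, and the web-service call is stubbed:

```python
# Hypothetical skill catalog: transcribed utterances mapped to services.
SKILL_CATALOG = {
    "give me my flash report": "flash_report_service",
    "find a ride home": "ride_service",
}

def fulfill(service_name, user_prefs):
    """Stub for the web-service call that executes the matched command."""
    if service_name == "flash_report_service":
        return f"Here is the flash report for {user_prefs['name']}."
    return "I couldn't reach that service."

def handle_transcript(text, user_prefs):
    """Catalog lookup + fulfillment; the return value feeds the TTS engine."""
    service = SKILL_CATALOG.get(text.lower().rstrip("."))
    if service is None:
        return "Sorry, I don't know how to do that."
    return fulfill(service, user_prefs)

print(handle_transcript("Give me my flash report", {"name": "Alex"}))
```

The "sorry" branch is where the conversation-design work from earlier in this section matters most: unmatched utterances are where users decide whether the interface is worth a second try.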