Voice-first application design and engineering

How companies can inject voice user interface into their customer experience with a practical engineering approach

Building a voice skill

A voice skill is similar to a web application in that it has a front end and a back end. When a user interacts with a skill, the voice-enabled hardware is the front end. Front-end developers for voice skills create the code for this interface using a voice developer console. Amazon, for instance, has a voice developer console for its Alexa assistant.

This front-end programming is merely setup. For a voice-first interface to deliver an appropriate response, back-end code is needed as well. A skill’s back end can be hosted on almost any HTTPS web service. When starting out, the simplest approach is to use a software development kit (SDK) or framework to write the code and then host it on a serverless platform such as Amazon Web Services (AWS) Lambda.

Another thing to remember is that while each digital assistant platform has its own lexicon to describe how it works, all vendors use the same basic process to fulfill a user request: a “wake word” (also known as an utterance), followed by a “launch word” (also known as an intent), which triggers the appropriate services through an event-driven programming model. The table below lists the most popular digital assistants and the nomenclature each uses for every step in the process.

| Voice services provider | Developer kits | Utterance type | Interaction with app | Services hook | Responses |
| --- | --- | --- | --- | --- | --- |
| Amazon Alexa | Alexa Skills Kit | Skills/invocations | Intents/slots | Headless function/Lambda | RenderTemplate cards |
| Apple Siri | SiriKit | Domains/intents | Domains/intents | Web APIs | Response |
| Microsoft Cortana | Cortana Dev Center | Invocations | Entities/intents | Azure function | Response |
| Google Assistant | Google Assistant Kit | Actions/invocations | Conversations | Web-hook/firebases | Fulfillment |

Case study: Voice-enabling of the PowerUp app

Recently, Devbridge extended the functionality of the PowerUp application by adding a voice interface. PowerUp is a proprietary project health reporting tool used to share key performance indicators with clients in real time.

The PowerUp application was chosen because its architecture and purpose naturally lent themselves to voice interaction. It was easy to envision customers using it on the way into the office by asking their digital assistant for a product development update.

The Devbridge development teams began by carding out the user interaction to help define the way in which a voice platform interaction would differ from existing mobile/web application interaction. The illustration in figure 5 shows the high-level invocation, utterance, and response flow. The goal of this design and implementation was to get the voice functionality up and running quickly, not to offer every function that was in the PowerUp application—in other words, create a minimum viable product (MVP). 

This approach aligns with the agile development process, in which short iterations of design and development are validated with user feedback. 

This process ensures any new voice features are necessary and provide value to the user. The main functions of the voice interface allow users to retrieve overview, budget, scope, and schedule information using voice commands.

The PowerUp voice implementation began with Amazon’s Alexa because it is the most widely adopted digital assistant. It was also easy to adapt the PowerUp application to this platform because its architecture uses a RESTful web services layer and a minimized UI design. This allowed the design and development teams to focus efforts on the voice prompts and utterances.

FIGURE 5: The invocation, utterance, and response workflow

For the implementation, the teams used AWS Lambda serverless services. This involved creating a Lambda service layer called “dbVoiceSkills,” which is invoked from the Alexa Skills Kit event handlers. These event handlers manage whatever event is raised by an Alexa digital assistant invocation call. Within the event handlers, the team created the invocation word “PowerUp,” which instructs the Alexa Skills Kit to call the defined Lambda service, dbVoiceSkills. For each intent, or ask, that the PowerUp voice interface supports, the team told the Alexa Skills engine which words to listen for.
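As a rough illustration of how a Lambda service might route Alexa events to per-intent code, here is a minimal Python sketch. It relies only on the standard Alexa JSON request and response envelope. The intent names (`OverviewIntent`, `BudgetIntent`, `ScopeIntent`, `ScheduleIntent`) and the spoken texts are assumptions made for illustration, not the names or wording used in dbVoiceSkills.

```python
def speak(text, end_session=True):
    """Wrap plain text in the Alexa response envelope."""
    return {
        "version": "1.0",
        "response": {
            "outputSpeech": {"type": "PlainText", "text": text},
            "shouldEndSession": end_session,
        },
    }


# Hypothetical intent names: the four areas come from the text above,
# but the identifiers and replies are placeholders.
INTENT_REPLIES = {
    "OverviewIntent": "Here is the project overview.",
    "BudgetIntent": "Here is the budget status.",
    "ScopeIntent": "Here is the scope status.",
    "ScheduleIntent": "Here is the schedule status.",
}


def lambda_handler(event, context):
    """Entry point that AWS Lambda calls with the Alexa request as `event`."""
    request = event.get("request", {})
    request_type = request.get("type")

    if request_type == "LaunchRequest":
        # The user said the invocation word without a specific ask.
        return speak(
            "Welcome to PowerUp. Ask for an overview, budget, scope, or schedule.",
            end_session=False,
        )

    if request_type == "IntentRequest":
        intent_name = request.get("intent", {}).get("name")
        reply = INTENT_REPLIES.get(intent_name)
        if reply is not None:
            return speak(reply)
        return speak("Sorry, PowerUp cannot help with that yet.")

    # SessionEndedRequest and any other event: nothing to say.
    return {"version": "1.0", "response": {}}
```

In a real skill each entry in the lookup table would call the service layer instead of returning fixed text; the table simply keeps the event handler small as intents are added.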

The illustration below shows the Alexa Custom Skills menu. It shows the four custom intents and the default functions built into Alexa. In the future, PowerUp teams will add slots for customers and internal end-users who have more than one assigned or active project so they can select from their list of projects.  
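To make the idea of a project slot concrete, the sketch below shows one way a handler could read a slot value from an Alexa intent and match it against the projects assigned to a user. The slot name `project` and the example project names are hypothetical; the source does not say how the PowerUp teams plan to model this.

```python
def get_slot_value(intent, slot_name):
    """Return the spoken value of a slot, or None if it was not filled."""
    slots = intent.get("slots") or {}
    return (slots.get(slot_name) or {}).get("value")


def select_project(intent, user_projects):
    """Pick the project the user asked for from their assigned projects.

    Returns None when the choice is ambiguous or unknown, so the caller
    can ask a follow-up question instead of guessing.
    """
    spoken = get_slot_value(intent, "project")  # hypothetical slot name
    if spoken is None:
        # No slot filled: only safe when there is exactly one project.
        return user_projects[0] if len(user_projects) == 1 else None
    for name in user_projects:
        if name.lower() == spoken.lower():
            return name
    return None
```

For example, with the hypothetical projects `["Apollo", "Gemini"]`, a request whose `project` slot holds "apollo" resolves to "Apollo", while a request with no slot resolves to nothing and should trigger a clarifying prompt.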

This example uses Alexa, but the same type of configuration and training pattern would be used with other digital assistant platforms.

There’s also a digital assistant AI layer in which you tell the assistant which skills or intents your app will support, as well as a service layer that obtains the data and creates the verbal and screen responses. Much like mobile development, each voice interface is vendor-specific. There are frameworks in the marketplace, but the PowerUp teams chose to code each interface independently because many of these frameworks are not yet mature.

To make the Alexa skill work, the developer connected it to the Lambda service, as shown in the endpoints section of the Alexa Skills menu. Once the Lambda service is selected, it processes each event and maps it to the service that prepares the digital assistant’s response: spoken and, if the device is capable, displayed. Additionally, within the Lambda service, you can use a Skill ID to protect the Lambda service from being used by others. The IDs in the PowerUp example have been blurred for this reason.
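One way to apply the Skill ID check in code, sketched here under assumptions, is to compare the `applicationId` carried in every Alexa request with the ID the service expects. The environment variable name `SKILL_ID` and the placeholder ID are hypothetical; the same restriction can also be configured on the Lambda trigger itself rather than in code.

```python
import os

# Hypothetical configuration: the real skill ID would be supplied at deploy time.
EXPECTED_SKILL_ID = os.environ.get("SKILL_ID", "amzn1.ask.skill.placeholder")


def extract_skill_id(event):
    """Read the skill's applicationId from the Alexa request envelope."""
    application = event.get("context", {}).get("System", {}).get("application")
    if application is None:
        # Fall back to the session block, which carries the same ID.
        application = event.get("session", {}).get("application")
    return (application or {}).get("applicationId")


def is_trusted_request(event, expected_skill_id):
    """True only when the request comes from the expected skill."""
    return extract_skill_id(event) == expected_skill_id


def lambda_handler(event, context):
    if not is_trusted_request(event, EXPECTED_SKILL_ID):
        # Refuse to do any work for callers other than our own skill.
        raise PermissionError("Request did not come from the expected skill")
    return {
        "version": "1.0",
        "response": {
            "outputSpeech": {"type": "PlainText", "text": "Request accepted."},
            "shouldEndSession": True,
        },
    }
```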

The remainder of the software engineering requires coding the Lambda services to support each of the intents. The code needs to follow an event model and be asynchronous, which can be an issue if you are making outbound calls to RESTful services and have to wait for the results of one call to feed parameters to the next. If this is the case, you can force-sync the code that is blocking the process, which is a programming no-no. If you choose this route, keep the blocking section short, or retool your utterance cards to limit call dependencies.
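The dependency problem can be illustrated with a short Python sketch. The source does not state which language the PowerUp Lambda code uses, so this is only an analogy: the three `fetch_*` coroutines are stubs standing in for RESTful calls, and their names and return values are invented for the example. The dependent call is awaited first, and the calls that only need its result then run concurrently.

```python
import asyncio


async def fetch_project_id(user_id):
    """Stub for a REST call that looks up the user's active project."""
    await asyncio.sleep(0)  # stands in for network latency
    return f"project-for-{user_id}"


async def fetch_budget(project_id):
    """Stub for a REST call that needs the project ID as a parameter."""
    await asyncio.sleep(0)
    return {"project": project_id, "kind": "budget"}


async def fetch_schedule(project_id):
    """Stub for a second REST call that also needs the project ID."""
    await asyncio.sleep(0)
    return {"project": project_id, "kind": "schedule"}


async def build_report(user_id):
    # Dependent step: later calls cannot start until this one returns.
    project_id = await fetch_project_id(user_id)
    # Independent steps: both only need project_id, so run them together.
    budget, schedule = await asyncio.gather(
        fetch_budget(project_id),
        fetch_schedule(project_id),
    )
    return {"budget": budget, "schedule": schedule}


def lambda_handler(event, context):
    """Synchronous entry point that drives the async workflow to completion."""
    user_id = event["session"]["user"]["userId"]
    return asyncio.run(build_report(user_id))
```

Keeping the chain of dependent calls to a single step, as here, is the code-level counterpart of limiting call dependencies in the utterance cards.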

The concepts of session and state can also be used. The PowerUp engineering teams are working to add Amazon DynamoDB to the Lambda service to improve the performance of the digital assistant’s verbal and visual responses. This is accomplished by storing preferences and keys to data rather than calling services with each intent request from the end user.
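The caching idea can be sketched as a three-level lookup: the current session first, then a persistent store, and only then the slow service call. In the sketch below, `PreferenceStore` is an in-memory stand-in for a persistent table such as DynamoDB, and the attribute name `projectKey` is hypothetical. The session attributes dictionary corresponds to the `sessionAttributes` field that Alexa echoes back on each turn of a conversation.

```python
class PreferenceStore:
    """In-memory stand-in for a persistent table such as DynamoDB."""

    def __init__(self):
        self._items = {}

    def get(self, user_id):
        return self._items.get(user_id)

    def put(self, user_id, item):
        self._items[user_id] = item


def resolve_project_key(user_id, session_attributes, store, lookup):
    """Find the user's project key, calling `lookup` only as a last resort.

    `lookup` is any callable that performs the slow service call.
    """
    # 1. Cheapest: the key was already resolved earlier in this session.
    key = session_attributes.get("projectKey")
    if key is None:
        # 2. Next: the key was saved during a previous session.
        saved = store.get(user_id)
        key = saved["projectKey"] if saved else None
    if key is None:
        # 3. Last resort: call the service, then remember the answer.
        key = lookup(user_id)
        store.put(user_id, {"projectKey": key})
    # Keep the key in the session so later turns skip the store as well.
    session_attributes["projectKey"] = key
    return key
```

With this arrangement the service is called once per user, and every later intent request is answered from the session or the store.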

Start to finish, software engineering for this minimal functional prototype required approximately 80 hours. It is estimated that adding session persistence and all the features in the PowerUp app would take an additional 40 to 80 hours of work. The next step for PowerUp will be to round out the functions and then add support for the rest of the most popular digital assistants. This quickly built prototype, however, illustrates that if an organization is looking to make voice a new interface for its customers, it’s not out of reach or budget.

Identifying and validating voice-enabled opportunities

Voice opportunities are reminiscent of the emergence of mobile shopping in the last five years. First, the industry reacts defensively, rejecting the likelihood of mainstream adoption. Shortly thereafter, a trend is identified and market leaders invest in research and development to capitalize on the competitive advantage. Lastly, the trend becomes the new status quo, while naysayers go out of business.

Forrester Research found that in 2012, U.S. consumers spent $7.8 billion on retail purchases made on their smartphones. By 2016, this figure had grown to $60.2 billion, and Forrester anticipates it will reach $93.5 billion in 2018 and $175.4 billion in 2022.

The comparison with mobility and mobile e-commerce is apt because voice-enabled applications have similar patterns of usage: hyper-contextual, just-in-time, and performance augmenting. For example, the average commute time in the U.S. has been growing for the last 20 years and stood at 27 minutes one way as of 2018.

A financial advisor leverages a voice-enabled mobile application to get started on his daily tasks: review changes in portfolio positions, delegate tasks in the CRM to connect with specific clients, and review content that has been prepared by the platform. 

A manufacturer leverages voice control to augment an engineer while a complex machine is being operated with both hands—enabling the engineer to double check parameters of the work order without stopping work. A field service worker is able to query the next generation of Google Glass for specific parameters on a valve adjustment at a nuclear power plant—without the need to retrieve a tablet, remove gloves, and so forth. A long-haul driver uses voice control to communicate with dispatch—from acknowledging a new load pickup, to adding commentary on an ongoing delivery.

These are real-world examples of VUI projects we’re working on at Devbridge. There are so many opportunities to use voice to create a competitive advantage. The time for voice is now. 

Where to start?

Knowing what type of user interface customers prefer is paramount. Many learned the hard way when they resisted or ignored the shift in user preferences toward mobile devices. Ignoring the rising use of voice as an interface is just as risky. Fortunately, there is no better time to adapt to this new stage of UI evolution. Best practices have been established and development tools are in place. Most companies just need a development partner who can help them identify where investment in VUI technology makes the most sense for their digital portfolio.