
Conversation Design & Multi-Modal Feedback in a Voice-First Device

Conversation Design | Navigation | Multi-Modal Feedback

Miko is a brand of educational robots designed to engage children through interactive conversations, storytelling, and educational games.

Miko Mini, designed for children aged 7–10 years, engages them through conversation, storytelling, and educational games.
 

Team

  • UX Lead – Sharang

  • Visual Design – Dakshita

  • Product Management – Akshat Adani

  • Engineering Lead – Omkar


Table of Contents

  • Conversation IA & Components

  • Navigation & System States

  • Multi-Modal Feedback


Difference between Voice-First & Voice-Based Devices

Voice-first devices like Miko Mini prioritize voice interaction as the main way for users to initiate actions and receive responses. Other examples in this category include smart speakers like Amazon Echo and Google Nest.

In contrast, voice-based devices such as smartphones and tablets may incorporate voice functionality but also include other input methods, such as touch or visual interfaces.

Essentially, all voice-first devices are voice-based, but not all voice-based devices are considered voice-first.


Design Research


Conversation Design Flow & Components

This is a comprehensive framework for handling user interactions, providing appropriate feedback, and managing various error scenarios to maintain a smooth user experience.

[Diagram: Conversation design flow covering the listening experience, recommendations, and no-internet / backend-error handling]
User-driven triggers

User-triggered voice navigation allows users to interact with Miko Mini in various ways:
 

a. Wake-word trigger: Activates when a user says a specific phrase, such as ‘Hey Miko,’ to initiate a conversation.
 

b. Explicit voice commands: After the wake word, users issue specific commands or questions. For example, "Tell me a story" or "Play a song."
 

c. Follow-up prompts: Some systems allow continued conversation after an initial trigger, without repeating the wake word. For instance, after asking "Tell me a story," the user can follow up with "Play the story of Red Riding Hood" directly.
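To make these trigger types concrete, here is a minimal sketch of how an utterance could be routed. The wake-phrase set, the follow-up window length, and the `TriggerHandler` class are illustrative assumptions, not Miko's actual implementation.

```python
import time

WAKE_WORDS = ("hey miko",)   # assumed wake phrase
FOLLOW_UP_WINDOW_S = 8.0     # hypothetical window for wake-word-free follow-ups


class TriggerHandler:
    """Routes an utterance according to the three user-driven trigger types."""

    def __init__(self):
        self.last_interaction = float("-inf")  # time of the last completed turn

    def handle(self, utterance):
        text = utterance.lower().strip()
        now = time.monotonic()

        # a. Wake-word trigger: opens a conversation session.
        for wake in WAKE_WORDS:
            if text.startswith(wake):
                self.last_interaction = now
                # b. An explicit command may follow the wake word directly.
                command = text[len(wake):].lstrip(" ,")
                return f"command: {command}" if command else "listening"

        # c. Follow-up prompt: accepted without the wake word while the
        # window opened by the previous turn is still active.
        if now - self.last_interaction < FOLLOW_UP_WINDOW_S:
            self.last_interaction = now
            return f"command: {text}"

        return "ignored"  # no session open and no wake word heard
```

Calling `handle("Hey Miko, tell me a story")` and then `handle("Play the story of Red Riding Hood")` within the window exercises paths a, b, and c in turn.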

System-driven triggers

System-triggered voice navigation initiates actions based on predetermined conditions or events.

These conditions can include time-based events, the first user session of the day, or a system value crossing a threshold.

This automated approach can include tasks like slot-filling in an ongoing conversation, providing content notifications, executing scheduled commands, or expressing the robot's personality.
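As a sketch, such condition checks might look like the following; the threshold value, greeting line, and function signature are assumptions (the low-battery phrase is borrowed from the voice-over example later in this case study).

```python
from datetime import date, datetime

BATTERY_LOW_PCT = 20  # hypothetical threshold value (percent)


def system_triggers(battery_pct, last_session, scheduled):
    """Collects system-driven prompts from predetermined conditions.

    `last_session` is the date of the previous session (or None);
    `scheduled` is a list of (datetime, text) pairs for timed commands.
    """
    prompts = []

    # First user session of the day: greet proactively.
    if last_session is None or last_session < date.today():
        prompts.append("Good morning! Ready for today's adventure?")

    # Threshold crossing: a system value dropped below its threshold.
    if battery_pct < BATTERY_LOW_PCT:
        prompts.append("My battery is running low, please recharge me soon.")

    # Time-based events: scheduled commands whose time has arrived.
    now = datetime.now()
    prompts += [text for when, text in scheduled if when <= now]
    return prompts
```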

Error & Edge-Case Handling Framework

The framework covers various error conditions, including server unavailability, internet connectivity issues, ASR failures, and backend processing problems:
 

  • "Server Down Fallback": Handles server unavailability

  • "No Internet Fallback": Manages situations without connectivity

  • "Reprompt Fallback": Likely asks users to repeat their input

  • "Close Fallback": Gracefully ends interactions when recovery isn't possible
     

Consecutive Error Counter: Tracks repeated errors, with specific handling when three or more consecutive errors occur.
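A minimal sketch of how the four fallbacks and the counter could fit together; the error-kind strings and the reset-on-success rule are my assumptions.

```python
MAX_CONSECUTIVE_ERRORS = 3  # escalation point named in the framework


class FallbackManager:
    """Selects a fallback response and tracks consecutive errors."""

    def __init__(self):
        self.consecutive_errors = 0

    def on_success(self):
        self.consecutive_errors = 0  # a successful turn resets the counter

    def on_error(self, kind):
        self.consecutive_errors += 1

        # Three or more consecutive errors: end the interaction gracefully.
        if self.consecutive_errors >= MAX_CONSECUTIVE_ERRORS:
            return "Close Fallback"
        if kind == "server_down":
            return "Server Down Fallback"
        if kind == "no_internet":
            return "No Internet Fallback"
        return "Reprompt Fallback"  # recoverable ASR / understanding errors
```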


Below are some common edge cases that are critical for ensuring a robust and user-friendly conversational experience:
 

  • Invalid response: the user's input cannot be understood or processed by the system due to ambiguous language, incomplete information, or out-of-scope queries.

  • No response: the user does not provide any input after initiating an interaction.

  • Profanity: inappropriate or offensive language used by the user during an interaction.

  • Gibberish: nonsensical or meaningless text or speech, such as random characters, incoherent phrases, or intentionally confusing responses.
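These buckets can be separated by a cheap first-pass classifier before full language understanding runs. The heuristics below, a placeholder profanity list and a vowel-ratio gibberish check, are purely illustrative.

```python
import re

PROFANITY = {"badword"}  # placeholder lexicon, not a real word list


def classify_input(utterance):
    """Buckets an utterance into the edge cases listed above."""
    if not utterance or not utterance.strip():
        return "no_response"

    if any(token in PROFANITY for token in utterance.lower().split()):
        return "profanity"

    # Gibberish heuristic: no letters at all, or almost no vowels.
    letters = re.sub(r"[^a-z]", "", utterance.lower())
    vowels = sum(ch in "aeiou" for ch in letters)
    if not letters or vowels / len(letters) < 0.2:
        return "gibberish"

    # Anything else proceeds to NLU, where it may still prove invalid.
    return "candidate"
```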

Speech Processing Components

Speech Detection: After wake word activation, the system monitors for user speech input.
 

ASR (Automatic Speech Recognition) converts spoken language into text, with potential for errors or timeouts as shown in the error handling paths.
 

Filler Statements: Verbal responses used to maintain engagement while the system processes information, covering the wait until the backend response is ready.
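Put together, the post-wake-word path might look like the sketch below; `mic` and `asr_client` are hypothetical interfaces standing in for the real audio stack, and the timeout value is an assumption.

```python
def detect_and_transcribe(mic, asr_client):
    """Speech detection followed by ASR, with the error paths named above."""
    frames = []
    for frame in mic.stream():        # audio frames after wake-word activation
        if mic.is_speech(frame):      # VAD-style speech detection
            frames.append(frame)
        elif frames:                  # trailing silence closes the utterance
            break

    if not frames:
        return None, "no_response"    # edge case from the previous section

    try:
        # Convert speech to text, bounded so a hang becomes a timeout.
        return asr_client.transcribe(frames, timeout_s=5.0), None
    except TimeoutError:
        return None, "Reprompt Fallback"  # ASR error path of the framework
```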

Keyword spotting and response types

The system recognizes voice commands through keyword spotting. Keyword spotting operates in the background and uses NLP techniques to process user input (utterances), identifying specific words or phrases known as intents and entities.

 

Based on intent and entity identification, the system can serve the request directly, surface the recommendation system, or initiate slot filling, as sketched below.
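A toy illustration of keyword spotting with two intents; the regex grammar is a stand-in for the production NLP models, and the intent and entity names are made up.

```python
import re

# Toy grammar: each intent pattern can also capture a "title" entity.
INTENT_PATTERNS = {
    "play_story": re.compile(r"\b(?:tell|play)\b.*\bstory(?: of (?P<title>.+))?"),
    "play_song": re.compile(r"\bplay\b.*\bsong(?: (?P<title>.+))?"),
}


def spot_keywords(utterance):
    """Returns the first matching intent and any extracted entities."""
    text = utterance.lower().strip()
    for intent, pattern in INTENT_PATTERNS.items():
        match = pattern.search(text)
        if match:
            entities = {k: v for k, v in match.groupdict().items() if v}
            return intent, entities
    return None, {}


# Entity present → serve the request directly.
print(spot_keywords("play the story of red riding hood"))
# ('play_story', {'title': 'red riding hood'})

# Entity missing → slot-fill ("Which story?") or surface a recommendation.
print(spot_keywords("tell me a story"))
# ('play_story', {})
```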

 

A more comprehensive explanation of the concept is available in Conversation systems & Maxims.

Response Management

Response Timing: The system has a 2.5-second (fixed) threshold for response delivery, with different paths for responses received within or beyond this timeframe.


Thinking Mode: Visual feedback (LED change + Eye expression change) indicating the system is processing the request.


Thinking Extended Mode: Extended feedback that covers for longer backend latency.


Recommendation System: Features lingual audio and a "Daily Adventure" push that suggest content or activities based on user interactions.
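A sketch of how this timing logic could be wired up. Only the 2.5-second threshold comes from the flow above; the `robot` facade, its method names, and the filler lines are assumptions.

```python
import concurrent.futures
import random

RESPONSE_THRESHOLD_S = 2.5  # fixed threshold from the flow above
FILLERS = ["Hmm, let me think...", "Give me a second..."]  # sample fillers


def deliver_response(backend_call, robot):
    """Runs the backend call and escalates feedback as latency grows."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(backend_call)
        try:
            # Fast path: the response arrives within the threshold.
            return future.result(timeout=RESPONSE_THRESHOLD_S)
        except concurrent.futures.TimeoutError:
            # Thinking mode: LED change + eye expression while processing.
            robot.set_led("thinking")
            robot.set_eyes("thinking")
        try:
            return future.result(timeout=RESPONSE_THRESHOLD_S)
        except concurrent.futures.TimeoutError:
            # Thinking extended mode: a filler statement covers the latency.
            robot.say(random.choice(FILLERS))
            return future.result()  # block until the backend completes
```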

Multimodal Feedback 

In any conversational experience, whether an app, robot, or game, voice is rarely used in isolation.

Effective system feedback is essential to offer users context, clarity and guidance during voice interactions.

Users benefit from knowing:

  • when they can start speaking,

  • when the system is listening,

  • when to take action based on their input,

  • and how they are informed about system errors or fallbacks.


Other interfaces are also leveraged, such as physical buttons, touch screens, sensors, images, sounds, video or physical motion.


More details in the last section


System Navigation

Since Miko Mini is a voice-first device, the primary mode of interaction is a voice trigger that transitions the system from one state to another. The user can say the wake word "Hey Miko" followed by prompts like "Play Riddles," "Increase Brightness," or "Call Parent" to navigate across functionalities.

[Diagram: Miko Mini voice information architecture]
System state overlaps

In voice-first devices, the transition between system states is not always linear; certain states can take precedence over others.

For example, when the system is in a Listening state, it may either disregard or respond to incoming calls from a parent, depending on its current operational status.

This dynamic interplay between states is crucial for ensuring effective interaction, as the system must determine how to prioritize various inputs in real-time.
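One simple way to encode this precedence is an ordered ranking in which an incoming event takes over only if it outranks the current state. The states and their relative priorities below are illustrative, not Miko's actual state table.

```python
from enum import IntEnum


class State(IntEnum):
    """Higher value = higher precedence when states overlap."""
    IDLE = 0
    SPEAKING = 1
    LISTENING = 2
    PARENT_CALL = 3  # assumed here to outrank normal conversation states


def resolve(current, incoming):
    """Non-linear transitions: preempt only when the incoming state
    outranks the current one; otherwise keep the current state."""
    return incoming if incoming > current else current


# A parent call preempts Listening, but not the other way around.
assert resolve(State.LISTENING, State.PARENT_CALL) is State.PARENT_CALL
assert resolve(State.PARENT_CALL, State.LISTENING) is State.PARENT_CALL
```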


Multi-modal feedback

Miko Mini employs multi-modal feedback mechanisms to enhance user interactions and provide a seamless experience. These mechanisms include LED states, a GUI, motion, SFX (sound effects), and voice-over.

GUI: The graphical user interface (GUI) displays intermediate states, voice skill thumbnails, and system errors. It helps users understand the robot's current status and interact effectively.


Voice-over: Voice-over provides clear auditory feedback and guidance to users. It ensures accessibility and enhances user understanding of system states or errors.

When the battery is low: “My battery is running low, please recharge me soon.”


LED States: The LED system provides visual cues based on the robot’s status or activity. Different colors and animations are used to convey specific information.

Motion: Miko Mini uses dynamic facial expressions and physical movements to make interactions more engaging. These motions complement other feedback modes.
 

SFX (Sound Effects): Sound effects are integrated to reinforce feedback and add personality to interactions. For example, a cheerful chime when a voice skill is activated.
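Conceptually, each system event maps to a bundle of channels that fire together. The mapping below is an illustrative sketch rather than Miko's shipped configuration, and `robot` is a hypothetical facade over the LED, display, speaker, and motor APIs.

```python
# Illustrative event → feedback-channel mapping (all values are assumptions).
FEEDBACK_MAP = {
    "listening": {"led": "pulsing_blue", "gui": "mic_overlay",
                  "motion": "head_tilt"},
    "skill_activated": {"led": "solid_green", "gui": "skill_thumbnail",
                        "sfx": "cheerful_chime"},
    "low_battery": {"led": "blinking_red", "gui": "battery_warning",
                    "voice": "My battery is running low, please recharge me soon."},
}


def give_feedback(event, robot):
    """Fires every channel configured for an event in one coordinated burst."""
    for channel, value in FEEDBACK_MAP.get(event, {}).items():
        getattr(robot, channel)(value)  # e.g. robot.led("pulsing_blue")
```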


Learnings & Takeaways

Designing for any voice-first device involves understanding voice-based triggers, system states, and the importance of multimodal feedback.

 

As technology continues to evolve, so too do the design principles guiding these innovations, which remain a work in progress.

 

By prioritizing voice interaction and optimizing user experience, we can create engaging and educational tools for children. Embracing these evolving principles will lead to even more intuitive devices that cater to young users’ needs.

Related resources

For further insights, check out my other two articles:
Conversation design principles: Elements of a conversation system for crafting voice experiences for chat interfaces, voice-first devices, and social robots

GPT-powered prototyping template for conversation design

Further related reading on this topic

Conversational Design by Erika Hall
Intents and entities in chatbots
10 Essential Chatbot Analytics Metrics to Track Performance


