Conversation Design & Multi-Modal Feedback in a Voice-First Device
Conversation Design | Navigation | Multi-Modal Feedback
Miko is a brand of educational robots designed to engage children through interactive conversations, storytelling, and educational games.
Miko Mini is designed for children aged 7-10 years and engages them through conversation, storytelling, and educational games.
Team
- UX Lead – Sharang
- Visual Design – Dakshita
- Product Management – Akshat Adani
- Engineering Lead – Omkar
Table of Contents
- Conversation IA & Components
- Navigation & System States
- Multi-Modal Feedback

Difference between Voice-first & Voice-based devices
Voice-first devices like Miko Mini prioritize voice interaction as the main way for users to initiate actions and receive responses. Other examples in this category include smart speakers like Amazon Echo and Google Nest.
In contrast, voice-based devices such as smartphones and tablets may incorporate voice functionality but also include other input methods, such as touch or visual interfaces.
Essentially, all voice-first devices are voice-based, but not all voice-based devices are considered voice-first.

Design Research (embedded video)

Conversation Design Flow & Components
This section lays out a comprehensive framework for handling user interactions, providing appropriate feedback, and managing various error scenarios to maintain a smooth user experience.

User-driven triggers
User-triggered voice navigation allows users to interact with Miko Mini in several ways, sketched in the example after this list:
a. Wake word trigger: activates when a user says a specific phrase, such as "Hey Miko," to initiate a conversation.
b. Explicit voice commands: after the wake word, users issue specific commands or questions, for example, "Tell me a story" or "Play a song."
c. Follow-up prompts: some systems allow continued conversation after an initial trigger without repeating the wake word. For instance, after asking "Tell me a story," the user might follow up with "Play the story of Red Riding Hood" without needing to say the wake word again.
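
This flow can be sketched as a simple loop. The snippet below is a minimal illustration in Python; the helper names (`detect_wake_word`, `handle_command`) are hypothetical placeholders, not Miko's actual APIs:

```python
# Minimal sketch of user-driven triggers; all names are illustrative.
WAKE_WORD = "hey miko"

def detect_wake_word(utterance: str) -> bool:
    return utterance.lower().startswith(WAKE_WORD)

def handle_command(command: str) -> None:
    print(f"Handling: {command!r}")

def conversation_turn(utterances: list[str]) -> None:
    awake = False
    for utterance in utterances:
        if not awake and detect_wake_word(utterance):         # a. wake word trigger
            awake = True
            command = utterance[len(WAKE_WORD):].strip(" ,")  # b. explicit command
            if command:
                handle_command(command)
        elif awake:                                           # c. follow-up, no wake word
            handle_command(utterance)

conversation_turn(["Hey Miko, tell me a story",
                   "Play the story of Red Riding Hood"])
```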

System-driven triggers
System-triggered voice navigation initiates actions based on pre-determined conditions or events.
These conditions can include time-based events, the first user session of the day, or a system value crossing a threshold.
This automated approach can cover tasks like slot-filling in an ongoing conversation, providing content notifications, executing scheduled commands, improving efficiency, or expressing the robot's personality.
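
As a rough sketch, such triggers can be modeled as condition checks that let the robot initiate a turn. The conditions and threshold values below are assumptions for illustration, not Miko's actual configuration:

```python
import datetime

BATTERY_THRESHOLD_PCT = 15  # assumed threshold, not Miko's actual value

def system_triggers(battery_pct: int, last_session: datetime.date | None) -> list[str]:
    """Return proactive prompts based on pre-determined conditions."""
    now = datetime.datetime.now()
    prompts = []
    if last_session != now.date():                  # first user session of the day
        prompts.append("Good morning! Ready for today's adventure?")
    if battery_pct < BATTERY_THRESHOLD_PCT:         # a value crosses a threshold
        prompts.append("My battery is running low, please recharge me soon.")
    if now.hour == 19:                              # time-based event
        prompts.append("It's story time! Want to hear one?")
    return prompts

print(system_triggers(battery_pct=10, last_session=None))
```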

Error & Edge-Case Handling Framework
The framework covers various error conditions, including server unavailability, internet connectivity issues, ASR failures, and backend processing problems:
- "Server Down Fallback": handles server unavailability
- "No Internet Fallback": manages situations without connectivity
- "Reprompt Fallback": asks users to repeat their input
- "Close Fallback": gracefully ends interactions when recovery isn't possible
- Consecutive Error Counter: tracks repeated errors, with specific handling when three or more consecutive errors occur
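
A minimal sketch of how this selection logic and the consecutive error counter might fit together; the error codes and control flow are assumptions, not the production implementation:

```python
MAX_CONSECUTIVE_ERRORS = 3

class FallbackHandler:
    """Picks a fallback response and tracks consecutive errors."""

    def __init__(self) -> None:
        self.consecutive_errors = 0

    def handle(self, error: str) -> str:
        self.consecutive_errors += 1
        if self.consecutive_errors >= MAX_CONSECUTIVE_ERRORS:
            return "Close Fallback"            # end gracefully after 3+ errors
        if error == "server_down":
            return "Server Down Fallback"
        if error == "no_internet":
            return "No Internet Fallback"
        return "Reprompt Fallback"             # ASR failure, unclear input, etc.

    def on_success(self) -> None:
        self.consecutive_errors = 0            # reset on any successful turn

handler = FallbackHandler()
for err in ["asr_timeout", "asr_timeout", "asr_timeout"]:
    print(handler.handle(err))                 # third call returns "Close Fallback"
```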
Some common edge cases are critical for ensuring a robust and user-friendly conversational experience; a lightweight router for them is sketched below:
- Invalid response: the user's input cannot be understood or processed by the system due to ambiguous language, incomplete information, or out-of-scope queries.
- No response: the user does not provide any input after initiating an interaction.
- Profanity: the user uses inappropriate or offensive language during the interaction.
- Gibberish: the user provides nonsensical or meaningless text or speech, such as random characters, incoherent phrases, or intentionally confusing responses.
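
These cases can be routed by a lightweight classifier before the main NLP pipeline. The heuristics below (a tiny word list, a vowel check) are deliberately crude placeholders, not Miko's production detectors; out-of-scope "invalid" input is left for downstream NLU:

```python
import re

BLOCKLIST = {"darn"}   # placeholder profanity list, not a real one

def classify_input(text: str) -> str:
    if not text.strip():
        return "no_response"                        # user said nothing
    if any(w in BLOCKLIST for w in text.lower().split()):
        return "profanity"
    if not re.search(r"[aeiou]", text.lower()):
        return "gibberish"                          # crude nonsense heuristic
    return "valid"                                  # invalid/out-of-scope goes to NLU

for sample in ["", "xkcdqr zzpt", "tell me a story"]:
    print(repr(sample), "->", classify_input(sample))
```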

Speech Processing Components
Speech Detection: After wake word activation, the system monitors for user speech input.
ASR (Automatic Speech Recognition) converts spoken language into text, with potential for errors or timeouts as shown in the error handling paths.
Filler Statements: Verbal responses used to maintain engagement while the system processes information, holding backend responses until completion.
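
A sketch of how a filler statement can cover backend latency, with stand-in functions for the ASR and backend components:

```python
import concurrent.futures
import time

def recognize(audio: bytes) -> str:
    return "tell me a story"          # stand-in for the real ASR component

def backend_query(text: str) -> str:
    time.sleep(1.0)                   # simulated backend processing latency
    return "Once upon a time..."

def respond(audio: bytes) -> str:
    text = recognize(audio)           # ASR: spoken language -> text
    with concurrent.futures.ThreadPoolExecutor() as pool:
        future = pool.submit(backend_query, text)
        print("Hmm, let me think...") # filler statement maintains engagement
        return future.result()        # hold the response until completion

print(respond(b""))
```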
Keyword spotting and response types
The system recognizes voice commands through keyword spotting, which operates in the background and uses NLP techniques to process user input (utterances), identifying specific words or phrases known as intents and entities.
Based on what is identified, the system can respond by serving the request, surfacing the recommendation system, or slot-filling.
A more comprehensive explanation of the concept is available in Conversation systems & Maxims.
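
A toy version of this routing, with illustrative phrase tables standing in for the real NLP models:

```python
INTENTS = {"story": "story.play", "song": "song.play"}          # keyword -> intent
ENTITIES = {"red riding hood": "story_name"}                     # phrase -> entity

def route(utterance: str) -> str:
    text = utterance.lower()
    intent = next((i for kw, i in INTENTS.items() if kw in text), None)
    entity = next((e for kw, e in ENTITIES.items() if kw in text), None)
    if intent and entity:
        return f"serve {intent} ({entity})"                      # serve the request
    if intent:
        return f"slot-fill: which {intent.split('.')[0]}?"       # ask for the entity
    return "surface recommendation system"                       # nothing recognized

print(route("Play the story of Red Riding Hood"))  # serve story.play (story_name)
print(route("Tell me a story"))                    # slot-fill: which story?
print(route("I'm bored"))                          # surface recommendation system
```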

Response Management
Response Timing: The system has a fixed 2.5-second threshold for response delivery, with different paths for responses received within or beyond this timeframe.

Thinking Mode: Visual feedback (LED change + eye expression change) indicating the system is processing the request.

Thinking Extended Mode: Additional feedback that covers for longer processing latency.

Recommendation System: Features "Lingual audio + Daily Adventure Push", which suggests content or activities based on user interactions.
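
The timing logic can be sketched as a polling loop. The 2.5-second threshold comes from the design above; the extended threshold and polling interval are assumptions:

```python
import time

RESPONSE_THRESHOLD_S = 2.5    # fixed threshold from the design
EXTENDED_THRESHOLD_S = 6.0    # assumed cut-over to Thinking Extended Mode

def deliver(get_response) -> None:
    start = time.monotonic()
    stage = 0
    while (response := get_response()) is None:
        elapsed = time.monotonic() - start
        if stage == 0 and elapsed > RESPONSE_THRESHOLD_S:
            print("[LED + eyes] Thinking Mode")        # visual processing cue
            stage = 1
        elif stage == 1 and elapsed > EXTENDED_THRESHOLD_S:
            print("[voice] Thinking Extended Mode")    # cover for long latency
            stage = 2
        time.sleep(0.1)                                # assumed polling interval
    print("Response:", response)

# usage: simulate a backend that takes about 3 seconds
deadline = time.monotonic() + 3.0
deliver(lambda: "Here's a riddle!" if time.monotonic() > deadline else None)
```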
Multi-Modal Feedback
In any conversational experience, whether an app, a robot, or a game, voice is rarely used in isolation.
Effective system feedback is essential to offer users context, clarity, and guidance during voice interactions.
Users benefit from knowing:
- when they can start speaking,
- when the system is listening,
- when to take action based on their input,
- and how system errors or fallbacks are communicated.
Other interfaces are also leveraged, such as physical buttons, touch screens, sensors, images, sounds, video, or physical motion.
More details are in the last section.

System Navigation
Since Miko Mini is a voice-first device, the primary mode of interaction is the voice trigger, which transitions the system from one state to another. The user can say the wake word "Hey Miko" followed by prompts like "Play Riddles", "Increase Brightness", or "Call Parent" to navigate across functionalities.
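
This navigation can be thought of as a state machine keyed on recognized phrases. The states and transition table below are illustrative, not the actual system map:

```python
TRANSITIONS = {
    ("idle", "hey miko"): "listening",
    ("listening", "play riddles"): "riddles",
    ("listening", "increase brightness"): "settings",
    ("listening", "call parent"): "calling",
}

def navigate(state: str, utterance: str) -> str:
    # unknown input: stay in the current state
    return TRANSITIONS.get((state, utterance.lower()), state)

state = "idle"
for phrase in ["Hey Miko", "Play Riddles"]:
    state = navigate(state, phrase)
    print(phrase, "->", state)        # idle -> listening -> riddles
```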

System state overlaps
In voice-first devices, the transition between system states is not always linear; certain states can take precedence over others.
For example, when the system is in a Listening state, it may either disregard or respond to incoming calls from a parent, depending on its current operational status.
This dynamic interplay between states is crucial for ensuring effective interaction, as the system must determine how to prioritize various inputs in real-time.
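
One way to model this precedence is a priority table consulted whenever a new event arrives; the state names and ranks below are assumptions for illustration:

```python
PRIORITY = {"idle": 0, "storytelling": 1, "listening": 2, "parent_call": 3}

def resolve(current: str, incoming: str) -> str:
    # an incoming state wins only if it outranks the current one
    return incoming if PRIORITY[incoming] > PRIORITY[current] else current

print(resolve("storytelling", "parent_call"))  # call preempts the story
print(resolve("listening", "storytelling"))    # active listening holds its turn
```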

Multi-Modal Feedback in Miko Mini
Miko Mini employs multi-modal feedback mechanisms to enhance user interactions and provide a seamless experience. These mechanisms include LED states, a GUI, motion, SFX (sound effects), and voice-over.
GUI: The graphical user interface displays intermediate states, voice-skill thumbnails, and system errors. It helps users understand the robot's current status and interact effectively.
Voice-over: Voice-over provides clear auditory feedback and guidance to users. It ensures accessibility and enhances user understanding of system states or errors.
For example, when the battery is low: "My battery is running low, please recharge me soon."
LED States: The LED system provides visual cues based on the robot’s status or activity. Different colors and animations are used to convey specific information.
Motion: Miko Mini uses dynamic facial expressions and physical movements to make interactions more engaging. These motions complement other feedback modes.
SFX (Sound Effects): Sound effects are integrated to reinforce feedback and add personality to interactions, for example, a cheerful chime when a voice skill is activated.
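
These channels can be driven from a single event table so that every modality reinforces the same state. The payloads below are illustrative, not the actual asset names:

```python
FEEDBACK = {
    "low_battery": {
        "led": "pulsing red",
        "gui": "battery warning screen",
        "voice": "My battery is running low, please recharge me soon.",
        "motion": "droopy eyes",
        "sfx": "soft alert tone",
    },
}

def give_feedback(event: str) -> None:
    for channel, payload in FEEDBACK.get(event, {}).items():
        print(f"[{channel}] {payload}")   # one event, five coordinated channels

give_feedback("low_battery")
```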




Learnings & Takeaways
Designing for any voice-first device involves understanding voice-based triggers, system states, and the importance of multimodal feedback.
As technology continues to evolve, so too do the design principles guiding these innovations, which remain a work in progress.
By prioritizing voice interaction and optimizing user experience, we can create engaging and educational tools for children. Embracing these evolving principles will lead to even more intuitive devices that cater to young users’ needs.
Related resources
For further insights, check out my other two articles on:
- Conversation design principles: Elements of a conversation system for crafting voice experiences for chat interfaces, voice-first devices, and social robots
- GPT powered prototyping template for conversation design
Further related reading on this topic
- Conversational Design by Erika Hall
- Intents and entities in chatbots
- 10 Essential Chatbot Analytics Metrics to Track Performance