Chapter 4.1: Voice to Text with Whisper
The ultimate goal of robotics is to create machines that can collaborate with humans naturally. The most natural human interface is language. To build a robot that can understand spoken commands, the first step is to convert the sound of a human voice into written text. This process is called Automatic Speech Recognition (ASR).
This chapter introduces OpenAI's Whisper, a state-of-the-art, open-source ASR system. We'll discuss how it works, how to run it, and how to integrate it into a ROS 2 system to create a node that listens for commands.
What is Whisper?
Whisper is a family of pre-trained ASR models released by OpenAI in 2022. It was trained on a massive and diverse dataset of 680,000 hours of multilingual and multitask supervised data collected from the web. This vast training data makes Whisper incredibly robust and accurate across a wide range of languages, accents, and acoustic conditions.
Key Features of Whisper:
- Multilingual: It supports transcription in dozens of languages and can even translate from those languages into English.
- Robustness: It performs well even with background noise, accents, and technical jargon, which is a significant advantage over many older ASR systems.
- Open Source: The models and the inference code are open source, allowing us to run them locally on our own hardware without relying on a cloud service. This is critical for robotics, where latency and internet connectivity can be issues.
- Multiple Model Sizes: Whisper comes in various sizes, from `tiny` (39 million parameters) to `large` (1.55 billion parameters). This allows us to choose a model that fits our hardware constraints.
  - Smaller models are faster but less accurate.
  - Larger models are more accurate but require more computational resources (ideally a GPU).
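To make the size trade-off concrete, here is an illustrative sketch. The `tiny` and `large` parameter counts come from the text above; the intermediate figures are the published sizes of the other checkpoints, and `pick_model` is a hypothetical helper of ours, not part of the `whisper` API:

```python
# Approximate parameter counts for the main Whisper checkpoints.
WHISPER_SIZES = {
    "tiny": 39_000_000,
    "base": 74_000_000,
    "small": 244_000_000,
    "medium": 769_000_000,
    "large": 1_550_000_000,
}

def pick_model(max_params: int) -> str:
    """Pick the largest checkpoint that fits a parameter budget (hypothetical helper)."""
    best = "tiny"
    for name, params in WHISPER_SIZES.items():
        if params <= max_params:
            best = name
    return best

print(pick_model(100_000_000))  # → base ('base' fits, 'small' does not)
```

On a CPU-only robot computer, `tiny` or `base` is usually the practical choice; with a discrete GPU, `small` or `medium` gives a noticeably better error rate.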
How to Use Whisper
Using Whisper in Python is remarkably simple thanks to the official `openai-whisper` package.
- Installation:

  ```bash
  pip install -U openai-whisper
  # For GPU-accelerated transcription (highly recommended)
  pip install -U torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
  ```

  You will also need `ffmpeg` installed on your system: `sudo apt update && sudo apt install ffmpeg`.
- Basic Transcription: The core of the library is the model's `transcribe()` method.

  ```python
  import whisper

  # Load a model (e.g., 'base', 'medium', 'large')
  model = whisper.load_model("base.en")  # '.en' for English-only model

  # Transcribe an audio file
  result = model.transcribe("path/to/my/audio_file.wav")

  # Print the result
  print(result["text"])
  ```
The `transcribe()` method handles all the complexity: it loads and resamples the audio, splits it into 30-second chunks, runs the neural network on each chunk, and returns the transcribed text.
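Later we will pass raw microphone samples to `transcribe()` instead of a file path. In that case the array must be mono float32 at 16 kHz, scaled to [-1.0, 1.0]. A minimal conversion sketch for int16 PCM input (the helper name is ours, not part of the library):

```python
import numpy as np

def int16_to_whisper_input(pcm: np.ndarray) -> np.ndarray:
    """Convert int16 PCM samples to the float32 [-1, 1] range Whisper expects."""
    return pcm.astype(np.float32) / 32768.0

# Example: a few int16 samples spanning the full range
pcm = np.array([0, 16384, -32768, 32767], dtype=np.int16)
audio = int16_to_whisper_input(pcm)
print(audio)  # → [ 0.  0.5 -1.  0.99996948]
```

If your audio library can deliver float32 samples directly (as `sounddevice` can), this step is unnecessary.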
Creating a Voice-to-Text ROS 2 Node
To use Whisper in our robotics project, we need to create a ROS 2 node that can:
- Access the microphone audio stream.
- Listen for a user to speak.
- Capture the spoken audio.
- Use Whisper to transcribe the audio into text.
- Publish the resulting text to a ROS 2 topic for other nodes (like our planner) to use.
Capturing Microphone Audio
Accessing microphone data in Python can be done with libraries like `sounddevice` or `pyaudio`. The process generally involves:
- Opening an audio stream from the default microphone device.
- Continuously reading chunks of audio data (as NumPy arrays).
- Implementing a simple voice activity detection (VAD) algorithm. A basic VAD can work by monitoring the energy (root mean square) of the audio chunks. When the energy surpasses a certain threshold, we start recording. When it drops below the threshold for a certain duration, we stop recording.
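The energy-threshold idea above can be sketched as a small state machine over audio chunks. The threshold and chunk count are illustrative values, and the chunks are NumPy arrays as they would arrive from `sounddevice`:

```python
import numpy as np

class EnergyVAD:
    """Minimal energy-based voice activity detector over fixed-size chunks."""

    def __init__(self, threshold=0.01, silence_chunks=10):
        self.threshold = threshold            # RMS level that counts as speech
        self.silence_chunks = silence_chunks  # consecutive quiet chunks that end an utterance
        self.recording = False
        self.quiet = 0
        self.frames = []

    def feed(self, chunk: np.ndarray):
        """Feed one chunk; return the full utterance when it ends, else None."""
        rms = float(np.sqrt(np.mean(chunk.astype(np.float32) ** 2)))
        if rms > self.threshold:
            self.recording = True
            self.quiet = 0
            self.frames.append(chunk)
        elif self.recording:
            # Keep the trailing silence so words are not clipped
            self.frames.append(chunk)
            self.quiet += 1
            if self.quiet >= self.silence_chunks:
                utterance = np.concatenate(self.frames)
                self.frames, self.recording, self.quiet = [], False, 0
                return utterance
        return None
```

A real system would also add a short pre-roll buffer (so the first syllable is not cut off) and an adaptive threshold, but this captures the core loop.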
The ROS 2 Node Structure
Here is a conceptual outline of our `voice_commander` node:
```python
import numpy as np
import rclpy
import sounddevice as sd
import whisper
from rclpy.node import Node
from std_msgs.msg import String


class VoiceCommanderNode(Node):
    def __init__(self):
        super().__init__('voice_commander_node')
        # Publisher for the transcribed command
        self.command_publisher_ = self.create_publisher(String, 'recognized_command', 10)

        # Load the Whisper model
        self.model = whisper.load_model("base.en")

        # Audio stream parameters
        self.samplerate = 16000  # Whisper expects 16 kHz audio
        self.channels = 1
        self.blocksize = int(0.1 * self.samplerate)  # ~100 ms per chunk

        # VAD parameters (RMS threshold for float32 samples in [-1, 1])
        self.energy_threshold = 0.01
        self.silence_duration = 1.0  # seconds of silence that end an utterance

        # Recording state
        self.recording = False
        self.silent_chunks = 0
        self.frames = []

        # Start listening
        self.start_listening()
        self.get_logger().info("Voice commander node started. Say a command!")

    def start_listening(self):
        # Open a callback-based input stream from the default microphone
        self.stream = sd.InputStream(
            samplerate=self.samplerate,
            channels=self.channels,
            dtype='float32',
            blocksize=self.blocksize,
            callback=self.audio_callback,
        )
        self.stream.start()

    def audio_callback(self, indata, frames, time, status):
        """Energy-based VAD: record while speech is detected, then transcribe."""
        chunk = indata[:, 0].copy()
        rms = float(np.sqrt(np.mean(chunk ** 2)))
        if rms > self.energy_threshold:
            self.recording = True
            self.silent_chunks = 0
            self.frames.append(chunk)
        elif self.recording:
            self.frames.append(chunk)
            self.silent_chunks += 1
            if self.silent_chunks * 0.1 >= self.silence_duration:
                audio = np.concatenate(self.frames)
                self.frames = []
                self.recording = False
                # NOTE: transcribing inside the audio callback blocks the
                # stream; a production node would hand this to a worker thread.
                self.process_audio(audio)

    def process_audio(self, audio_data_np):
        """Transcribe the captured audio and publish the command."""
        self.get_logger().info("Processing speech...")
        # Whisper expects mono float32 audio sampled at 16 kHz
        result = self.model.transcribe(audio_data_np, fp16=False)  # fp16=False for CPU
        transcribed_text = result["text"].strip()
        if transcribed_text:
            self.get_logger().info(f'Whisper transcribed: "{transcribed_text}"')
            msg = String()
            msg.data = transcribed_text
            self.command_publisher_.publish(msg)
        else:
            self.get_logger().info("Whisper could not transcribe any text.")


# main() function as before
```
When this node is running, it continuously listens to the microphone. When you speak a command, it records the audio, sends it to Whisper for transcription, and publishes the resulting text (e.g., "robot, pick up the red block") on the `/recognized_command` topic. You can watch the output with `ros2 topic echo /recognized_command`.
Summary
OpenAI's Whisper provides a powerful, robust, and open-source solution for converting speech to text. By integrating it into a ROS 2 node that listens to a microphone, we can create the first link in our Vision-Language-Action chain: a voice interface for our robot. The text commands published by this node will be the input for the next stage of our system: the planning and reasoning LLM.