# Chapter 4.2: Planning and Reasoning with LLMs
Once we have a text command from Whisper, such as "bring me the red block from the table," the robot needs to understand the intent behind the words and translate it into a sequence of concrete actions. This is where Large Language Models (LLMs) like GPT, Llama, or Gemini come in.
By leveraging the broad reasoning capabilities of LLMs, we can move beyond simple, hard-coded commands and enable our robot to interpret complex, natural-language instructions. This chapter explores how to use an LLM as a "task planner" within our ROS 2 architecture.
## The LLM as a Planner
A traditional task planner in robotics often requires a highly structured world model and complex algorithms to generate a sequence of actions. An LLM, on the other hand, can function as a zero-shot or few-shot planner, using its vast world knowledge and reasoning ability to break down a high-level command into logical steps.
Our goal is to create a ROS 2 node that:
- Subscribes to the `/recognized_command` topic to receive text from our Whisper node.
- Takes the text command and inserts it into a carefully crafted prompt.
- Sends this prompt to an LLM API (like OpenAI's or a locally-run model).
- Parses the LLM's response to extract a structured plan.
- Publishes this plan to another ROS 2 topic for the action execution node to handle.
## Prompt Engineering: The Key to Success
The quality of the plan generated by the LLM depends almost entirely on the quality of the prompt. This is the art of prompt engineering. Our prompt needs to give the LLM enough context to understand its role and the constraints of its world.
A good prompt for a robot planner should include:
- **The Role/Persona:** Tell the LLM what it is.

  *"You are a helpful robot assistant. Your task is to break down high-level user commands into a sequence of simple, executable actions."*

- **The Available Actions:** Explicitly list the primitive actions the robot can perform. This is the most critical part, as it constrains the LLM's output to things the robot can actually do.

  *"You can only use the following functions:"*
  - `go_to(location)`: Navigate to a named location (e.g., 'table', 'charging_dock').
  - `find_object(object_name, color)`: Visually search for an object.
  - `pick_up(object_id)`: Pick up a previously found object.
  - `place_on(location)`: Place the held object on a location.
  - `respond(text_to_say)`: Speak a response to the user.

- **The Output Format:** Instruct the LLM to provide its response in a structured format, like JSON. This makes the output reliable and easy to parse in our code.

  *"Your output MUST be a JSON array of objects, where each object has a 'function' and a 'parameters' key. Do not produce any other text."*

- **The User Command:** The final part of the prompt is the command itself.
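The four components above can be assembled in code before each API call. Here is a minimal sketch; the `build_prompt` helper and `ACTIONS` constant are illustrative names, and the wording simply mirrors the example prompt shown in this section:

```python
# Assemble the four prompt components: persona, available actions,
# output format, and the user command. Wording mirrors the example
# prompt in this chapter; adapt the action list to your robot.

ACTIONS = """\
- `go_to(location)`: Navigate to a named location (e.g., 'table', 'charging_dock').
- `find_object(object_name, color)`: Visually search for an object.
- `pick_up(object_id)`: Pick up a previously found object.
- `place_on(location)`: Place the held object on a location.
- `respond(text_to_say)`: Speak a response to the user."""

def build_prompt(command_text: str) -> str:
    """Combine persona, action list, output-format rule, and user command."""
    return (
        "You are a helpful robot assistant. Your task is to break down "
        "high-level user commands into a sequence of simple, executable actions.\n\n"
        "You can only use the following functions:\n"
        f"{ACTIONS}\n\n"
        "Your output MUST be a JSON array of objects, where each object has a "
        "'function' and a 'parameters' key. Do not produce any other text.\n\n"
        f'User command: "{command_text}"'
    )
```

Keeping the action list in a single constant means the planner prompt and any later plan-validation code can share one source of truth.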
**Example Prompt:**
```text
You are a helpful robot assistant. Your task is to break down high-level user commands into a sequence of simple, executable actions.

You can only use the following functions:
- `go_to(location)`: Navigate to a named location (e.g., 'table', 'charging_dock').
- `find_object(object_name, color)`: Visually search for an object.
- `pick_up(object_id)`: Pick up a previously found object.
- `place_on(location)`: Place the held object on a location.
- `respond(text_to_say)`: Speak a response to the user.

Your output MUST be a JSON array of objects, where each object has a 'function' and a 'parameters' key. Do not produce any other text.

User command: "robot, please fetch the green apple for me"
```
**Expected LLM Output:**
```json
[
  {
    "function": "go_to",
    "parameters": ["table"]
  },
  {
    "function": "find_object",
    "parameters": ["apple", "green"]
  },
  {
    "function": "pick_up",
    "parameters": ["object_id_from_find_object"]
  },
  {
    "function": "go_to",
    "parameters": ["user"]
  },
  {
    "function": "respond",
    "parameters": ["Here is the green apple you asked for."]
  }
]
```
(Note: Handling object_id requires state management between steps, which we'll cover in the execution chapter.)
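As a preview of that state management, here is a minimal sketch of how an executor might resolve the placeholder at run time. The placeholder convention and the `state` dictionary are assumptions for illustration, not a fixed API:

```python
# Substitute placeholder parameters (e.g. "object_id_from_find_object")
# with values recorded by earlier steps. The executor fills `state` as
# each action completes; unknown strings pass through unchanged.

def resolve_parameters(step: dict, state: dict) -> dict:
    """Return a copy of the step with stored results substituted in."""
    params = [state.get(p, p) if isinstance(p, str) else p
              for p in step["parameters"]]
    return {"function": step["function"], "parameters": params}
```

After `find_object` succeeds, the executor would store the detected object's id under `"object_id_from_find_object"` in `state`, so the subsequent `pick_up` step receives a concrete id.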
## The ROS 2 Planner Node
Here is a conceptual outline of the LLM planner node.
```python
import rclpy
from rclpy.node import Node
from std_msgs.msg import String
import json
import openai  # Or another LLM library


class LLMPlannerNode(Node):
    def __init__(self):
        super().__init__('llm_planner_node')
        # Subscribe to the recognized command topic
        self.subscription = self.create_subscription(
            String,
            'recognized_command',
            self.command_callback,
            10)
        # Publisher for the generated plan
        self.plan_publisher_ = self.create_publisher(String, 'action_plan', 10)
        # Configure LLM API key
        # openai.api_key = "YOUR_API_KEY"
        self.get_logger().info("LLM Planner node started.")

    def build_prompt(self, command_text):
        # This is where you construct the detailed prompt as described above
        prompt = f"""
        You are a helpful robot assistant...
        User command: "{command_text}"
        """
        return prompt

    def command_callback(self, msg):
        self.get_logger().info(f'Received command: "{msg.data}"')
        # 1. Build the prompt
        prompt = self.build_prompt(msg.data)
        # 2. Call the LLM API
        try:
            # response = openai.chat.completions.create(...) or similar
            # For this example, we'll use a hardcoded response
            llm_response_json = """
            [
                {"function": "go_to", "parameters": ["table"]},
                {"function": "find_object", "parameters": ["apple", "green"]},
                {"function": "pick_up", "parameters": ["object_id_from_find_object"]}
            ]
            """
            # 3. Parse the response
            plan = json.loads(llm_response_json)
            # 4. Publish the plan
            plan_msg = String()
            plan_msg.data = json.dumps(plan)  # Re-serialize for publishing
            self.plan_publisher_.publish(plan_msg)
            self.get_logger().info(f'Published plan: {plan_msg.data}')
        except Exception as e:
            self.get_logger().error(f'Error processing command: {e}')


def main(args=None):
    rclpy.init(args=args)
    node = LLMPlannerNode()
    rclpy.spin(node)
    node.destroy_node()
    rclpy.shutdown()
```
This node acts as the brain of our operation. It listens for a high-level goal and uses the LLM's general-purpose reasoning engine to produce a structured, machine-readable plan that the robot's other systems can execute.
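Because LLM output is not guaranteed to be well-formed, it is worth validating the parsed plan against the action list before publishing it. A minimal sketch; the `validate_plan` helper and its error messages are illustrative, though the function names match the prompt in this chapter:

```python
import json

# Reject any plan step that names a function outside the robot's
# primitive action set, or that lacks a parameter list. Catching this
# before publishing keeps hallucinated actions away from the executor.

ALLOWED_FUNCTIONS = {"go_to", "find_object", "pick_up", "place_on", "respond"}

def validate_plan(plan_json: str) -> list:
    """Parse the LLM's JSON output and check every step is executable."""
    plan = json.loads(plan_json)
    if not isinstance(plan, list):
        raise ValueError("Plan must be a JSON array of steps")
    for step in plan:
        if step.get("function") not in ALLOWED_FUNCTIONS:
            raise ValueError(f"Unknown function: {step.get('function')!r}")
        if not isinstance(step.get("parameters"), list):
            raise ValueError("'parameters' must be a list")
    return plan
```

Calling this inside `command_callback` before publishing would turn a malformed LLM response into a logged error rather than an invalid plan on the `action_plan` topic.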
## Local vs. Cloud LLMs
- **Cloud-based LLMs (e.g., OpenAI API, Gemini API):**
  - Pros: Easy to use, access to the most powerful state-of-the-art models.
  - Cons: Requires an internet connection, can have latency, and may have privacy implications or API costs.
- **Locally-run LLMs (e.g., Llama 3, Mistral 7B via Ollama or LM Studio):**
  - Pros: Runs entirely offline, no network latency, full privacy.
  - Cons: Requires powerful hardware (often a GPU with significant VRAM), and the models are typically smaller and less capable than the flagship cloud models.
For robotics, running a smaller, quantized model locally is often the preferred approach for production systems to ensure reliability and low latency. For this book, we will focus on the logic of the prompt and parser, which remains the same regardless of which LLM backend you choose.
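To illustrate how little changes when you swap backends, here is a sketch of the "call the LLM" step against Ollama's HTTP generate endpoint. It assumes an Ollama server running at `localhost:11434` with a `llama3` model already pulled; the endpoint path and request fields follow Ollama's REST API:

```python
import json
import urllib.request

# Swap the cloud API call for a local model served by Ollama.
# Everything else in the planner node (prompt, parsing, publishing)
# stays the same.

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_request(prompt: str, model: str = "llama3") -> bytes:
    """Encode a non-streaming generate request for Ollama."""
    return json.dumps(
        {"model": model, "prompt": prompt, "stream": False}).encode()

def query_local_llm(prompt: str) -> str:
    """Send the prompt to the local model and return its raw text response."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=build_request(prompt),
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

In the planner node, `query_local_llm(prompt)` would simply replace the cloud API call inside `command_callback`.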
## Summary
By using a Large Language Model as a task planner, we can bridge the gap between ambiguous human language and the structured commands a robot needs. Through careful prompt engineering, we can constrain the LLM to act as a reliable reasoning engine, breaking down complex goals into a sequence of primitive, executable actions. This planner node forms the crucial link between understanding a command and acting on it.