LLM agents run as native DimOS modules. They subscribe to the camera, LiDAR, odometry, and spatial memory streams, and they control the robot through skills.

Architecture

Human Input ──→ Agent ──→ Skill Calls ──→ Robot
  (text/voice)     │         (RPC)

          subscribes to streams:
          color_image, odom, spatial_memory
McpClient (dimos/agents/mcp/mcp_client.py) is a Module with:
  • human_input: In[str]: receives text from humancli, WebInput, or agent-send
  • agent: Out[BaseMessage]: publishes agent responses (text, tool calls, images)
  • agent_idle: Out[bool]: signals when the agent is waiting for input
The agent uses LangGraph with a configurable LLM. The default model is gpt-4o, which requires the OPENAI_API_KEY environment variable to be set. On startup, the agent discovers all @skill-annotated methods across deployed modules via RPC and exposes them as LangChain tools.
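
The discovery step described above can be pictured with a small, self-contained sketch. It is not the DimOS implementation: the `skill` decorator here is a stand-in that only tags a function, the `_is_skill` marker and `discover_skills` helper are hypothetical names, and the real agent performs discovery over RPC rather than on local objects.

```python
import inspect
from typing import Callable


def skill(func: Callable) -> Callable:
    """Stand-in for the DimOS @skill decorator: it only tags the function."""
    func._is_skill = True  # hypothetical marker attribute
    return func


class MotionModule:
    @skill
    def wave_hello(self) -> str:
        """Wave at the nearest person."""
        return "Waving!"

    def internal_helper(self) -> None:
        """Not tagged, so it must not be exposed as a tool."""


class SpeechModule:
    @skill
    def speak(self, text: str) -> str:
        """Say the given text out loud."""
        return f"Said: {text}"


def discover_skills(modules: list[object]) -> dict[str, Callable]:
    """Collect every tagged bound method, keyed by method name."""
    tools: dict[str, Callable] = {}
    for module in modules:
        for name, method in inspect.getmembers(module, inspect.ismethod):
            # Bound methods forward attribute lookups to the wrapped function.
            if getattr(method, "_is_skill", False):
                tools[name] = method
    return tools


tools = discover_skills([MotionModule(), SpeechModule()])
```

Keying the result by method name mirrors how the agent addresses tools by name. A real registry would also need a policy for two modules that define the same skill name, which this sketch silently overwrites.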

Skills

Skills are methods decorated with @skill on any Module. The agent discovers them automatically at startup.
from dimos.agents.annotation import skill
from dimos.core.module import Module

class MySkillContainer(Module):
    @skill
    def wave_hello(self) -> str:
        """Wave at the nearest person."""
        # ... robot control logic ...
        return "Waving!"
Rules:
  • Parameters must be JSON-serializable primitives (str, int, float, bool, list, dict).
  • Docstrings become the tool description the LLM sees. Write them clearly so the agent has sufficient context.
  • The function must return a string or an image, which the agent uses to decide what to do next.
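
To make the rules above concrete, the following sketch turns a function's signature and docstring into a tool description and rejects signatures that break the parameter rule. It is an illustration only: `describe_tool` and the dictionary layout are hypothetical, the `float` annotations on `relative_move` are assumed, and the actual schema DimOS hands to LangChain may differ.

```python
import inspect
from typing import Any, Callable, get_type_hints

# Parameter types the rules above allow (bare types only in this sketch).
ALLOWED_PARAM_TYPES = (str, int, float, bool, list, dict)


def describe_tool(func: Callable[..., Any]) -> dict[str, Any]:
    """Build a tool description, raising if the function breaks the rules."""
    hints = get_type_hints(func)
    params: dict[str, str] = {}
    for name in inspect.signature(func).parameters:
        if name == "self":
            continue
        param_type = hints.get(name)  # None when the parameter is unannotated
        if param_type not in ALLOWED_PARAM_TYPES:
            raise TypeError(f"parameter {name!r} must be a JSON-serializable type")
        params[name] = param_type.__name__
    doc = inspect.getdoc(func)
    if not doc:
        raise ValueError("a skill needs a docstring; it is what the LLM reads")
    return {"name": func.__name__, "description": doc, "parameters": params}


def relative_move(forward: float, left: float, degrees: float) -> str:
    """Move the robot relative to its current position."""
    # Parameter types are assumed for illustration.
    return "Move complete."


spec = describe_tool(relative_move)
```

Here `spec["description"]` is exactly the docstring, which is why a vague docstring leaves the agent with little to go on when choosing a tool.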

Built-in Skills

| Skill | Module | Description |
| --- | --- | --- |
| relative_move(forward, left, degrees) | UnitreeSkillContainer | Move robot relative to current position |
| execute_sport_command(command_name) | UnitreeSkillContainer | Unitree sport commands (sit, stand, flip, etc.) |
| wait(seconds) | UnitreeSkillContainer | Pause execution |
| observe() | GO2Connection | Capture and return current camera frame |
| navigate_with_text(query) | NavigationSkillContainer | Navigate to a location by description |
| tag_location(name) | NavigationSkillContainer | Tag current position for later recall |
| stop_navigation() | NavigationSkillContainer | Cancel current navigation goal |
| follow_person(query) | PersonFollowSkill | Visual servoing to follow a described person |
| stop_following() | PersonFollowSkill | Stop person following |
| speak(text) | SpeakSkill | Text-to-speech through robot speakers |
| where_am_i() | GoogleMapsSkillContainer | Current street/area from GPS |
| get_gps_position_for_queries(queries) | GoogleMapsSkillContainer | Look up GPS coordinates |
| set_gps_travel_points(points) | GPSNavSkill | Navigate via GPS waypoints |
| map_query(query) | OsmSkill | Search OpenStreetMap with VLM |

MCP

All agentic blueprints use two modules: McpServer and McpClient.
  • McpServer exposes the methods annotated with @skill as MCP tools. Any external MCP client can connect to the server and call them.
  • McpClient runs a LangGraph-based LLM agent that calls the MCP tools exposed by McpServer.
CLI access:
dimos mcp list-tools                                # List available skills
dimos mcp call relative_move --arg forward=0.5      # Call a skill
dimos mcp status                                    # Server status
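
When scripting these commands from Python, it can help to build the argument list programmatically. The sketch below assembles the same `dimos mcp call` invocation shown above. The helper name `build_mcp_call` is hypothetical, and repeating `--arg` once per parameter is an assumption, since the example above only shows a single argument.

```python
import shlex


def build_mcp_call(skill_name: str, **args: object) -> list[str]:
    """Return the argv for `dimos mcp call <skill> --arg key=value`."""
    argv = ["dimos", "mcp", "call", skill_name]
    for key, value in args.items():
        # Assumption: each parameter gets its own --arg flag.
        argv += ["--arg", f"{key}={value}"]
    return argv


argv = build_mcp_call("relative_move", forward=0.5)
command_line = shlex.join(argv)  # shell-safe string form, handy for logging
```

On a machine with DimOS installed, the list can be passed to `subprocess.run(argv, check=True)`. Passing a list rather than a shell string avoids quoting problems when an argument value contains spaces.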

Input Methods

| Method | How it works |
| --- | --- |
| humancli | Standalone terminal: type messages, see responses |
| dimos agent-send "text" | One-shot CLI command via LCM |
| WebInput | Web interface at localhost:7779 with optional Whisper STT |

Models

| Config | Model | Notes |
| --- | --- | --- |
| Default | gpt-4o | Best quality, requires OPENAI_API_KEY |
| ollama:llama3.1 | Local Ollama | Requires ollama serve running |
| Custom | Any LangChain-compatible | Set via McpClient.blueprint(model="...") |
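
Because the default model fails without an API key, a small preflight check before launching a blueprint can save a confusing startup error. The sketch below is not part of DimOS: `missing_requirements` is a hypothetical helper, and matching on the `gpt-` and `ollama:` prefixes is an assumption based only on the model strings listed above.

```python
import os
from typing import Mapping, Optional

DEFAULT_MODEL = "gpt-4o"


def missing_requirements(
    model: str = DEFAULT_MODEL,
    env: Optional[Mapping[str, str]] = None,
) -> list[str]:
    """Return the environment variables the chosen model still needs."""
    env = os.environ if env is None else env
    if model.startswith("ollama:"):
        # Local model: needs `ollama serve` running, which is not probed here.
        return []
    if model.startswith("gpt-") and not env.get("OPENAI_API_KEY"):
        return ["OPENAI_API_KEY"]
    # Other LangChain-compatible models: requirements unknown to this sketch.
    return []


problems = missing_requirements("gpt-4o", env={})
```

If the list is empty, the same model string can then be passed on, for example as `McpClient.blueprint(model="ollama:llama3.1")`.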