Running Hermes Agent on Qubrid AI: Open-Source Autonomous Agents with Your Own Models and API Keys
Most AI agent platforms make a quiet trade-off on your behalf: they pick the model, log the requests, set the rate limits, and bill you a seat fee on top of everything. You get a product. Not infrastructure.
At Qubrid, we think the better model - pun intended - is giving developers full control over what model runs, where requests go, and what it costs per token. That's exactly what this integration enables.
Hermes Agent, built by Nous Research, is one of the most capable open-source autonomous agent frameworks available today. In this post, we'll walk through how to connect it to Qubrid's serverless inference API, explain why the architecture matters, and share our recommended model for agentic workloads: DeepSeek V4 Flash.
What Is Hermes Agent?
Hermes is not a chatbot wrapper. It's a full agentic runtime - it maintains state across turns, orchestrates multi-step tool calls, and ships with native integrations for Telegram, WhatsApp, Discord, and Microsoft Teams. The primary interface is the CLI, but once you've linked a messaging platform, your agent is reachable from anywhere.
What makes Hermes architecturally clean is that it's built around the OpenAI Chat Completions standard as a transport layer. The agent doesn't care what model is on the other end of the API - it sends messages, reads tool call responses, and continues the loop. That abstraction is what lets you point it at Qubrid (or any other compatible provider) without changing anything else about how Hermes works.
At runtime, Hermes exposes the model a structured function-calling schema that includes:
Web search - live search results passed back as tool output
Browser automation - full Playwright-based navigation and interaction
Code execution - write and run code locally
File I/O - read, write, and manage files on your machine
Image generation and text-to-speech - configurable at setup
How well the agent uses these tools depends almost entirely on how good your model is at instruction following and tool-call formatting. Which is why model selection for Hermes isn't an afterthought.
The Case for Running Your Own API Keys
When you bring your own API key through a provider like Qubrid, the architecture changes in ways that matter at scale.
Your data path is direct. Requests go from Hermes → Qubrid inference endpoint → model → back. No intermediary logging your prompts to train a product you don't control.
You choose the model. And you can change it whenever you want. A lightweight flash model for fast, repetitive tasks; a larger reasoning model for complex multi-step work. The agent runtime stays identical - you're just turning a dial.
The cost structure is transparent. Token-based pricing, no seat fees, no monthly caps. Qubrid's Inference Logs give you per-request token counts and latency breakdowns in real time so you always know exactly what you're spending.
You own the failure modes. When something breaks, you have logs, request history, and something concrete to debug. That's not a given with black-box SaaS.
Why We Recommend DeepSeek V4 Flash for Hermes
Choosing a model for an agentic workload is fundamentally different from choosing one for a single-turn chatbot. In an agent loop, what matters is:
Tool call accuracy. Does the model format function calls correctly and consistently? A model that occasionally drops malformed JSON in a single-turn chat is annoying. In an agent loop, one bad tool call can corrupt the downstream context and cascade through the rest of the task.
Multi-turn instruction adherence. Agentic tasks might span 10–20 model calls. Does the model stay on task across a long plan, or does it drift and start hallucinating intermediate steps?
Context efficiency. Context accumulates fast in agent sessions - tool results, prior turns, the system prompt describing every available function. A model that degrades at longer contexts will fall apart on real workloads.
Latency. Agents make multiple model calls per task. A 3-second TTFT × 8 tool calls means 24 seconds of pure model wait time. That compounds badly at scale.
DeepSeek V4 Flash performs well across all of these. Tool call formatting is reliable. Instruction adherence holds across long task chains. And on Qubrid's serverless infrastructure, latency is low enough that the agent loop feels responsive, not like something thinking in the background.
It's also cost-efficient - which matters when a single "user interaction" might generate 10 API calls under the hood.
For heavier reasoning tasks - complex code generation, multi-document synthesis, architectural decisions - you can swap to a larger model by changing one line in your Hermes config. Same agent, same Qubrid endpoint, different brain.
How the Integration Works Under the Hood
When Hermes talks to Qubrid, it sends a standard POST /v1/chat/completions request. The body includes the conversation history, a system prompt defining the agent's capabilities, and a tools array describing every function the model can invoke.
Qubrid receives this at https://platform.qubrid.com/v1, routes it to the serverless model layer, and returns a standard Chat Completions response. If the model decides to use a tool, the response includes a tool_calls array. Hermes parses that, executes the tool locally, appends the result as a tool role message, and sends the full updated context back for the next model turn.
This continues until the model produces a final text response with no tool calls - at which point Hermes surfaces the output.
Because Qubrid implements the exact same request/response schema as the OpenAI Chat Completions spec, Hermes requires zero modifications to work with it. It sees a compatible endpoint and behaves identically to any other provider.
Setup Guide
One terminal session. Under five minutes.
1. Install Hermes
# Linux / macOS / WSL2 / Termux
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
# Windows PowerShell
iex (irm https://hermes-agent.nousresearch.com/install.ps1)The installer handles Python 3.11, Node.js, ripgrep, ffmpeg, and a portable Git Bash on Windows automatically. Windows note: If Smart App Control blocks the installer, disable it under Settings → Privacy & Security → Windows Security → App & browser control → Smart App Control, then re-run.
2. Choose Full Setup
When prompted, select Full Setup (option 2). This lets you configure your own provider, API key, and model - Quick Setup locks you into defaults.
3. Configure Qubrid as the Provider
Work through the wizard with these values:
Provider: Custom Endpoint (enter URL manually)
API Base URL: https://platform.qubrid.com/v1
API Key: [your Qubrid API key - generated at platform.qubrid.com/api-keys]
Compatibility mode: Chat Completions → 2
Model string: [copy from Qubrid Serverless Models page]
Context length: [leave blank - auto-populated]
Display name: Qubrid AI Platform
Terminal backend: Local → 1Your API key is shown exactly once after generation. Copy it before closing the dialog - if you lose it, you'll need to generate a new one.
4. Complete Setup and Launch
Configure or skip messaging platforms and tools as needed - all of this can be changed later. When setup finishes, run:
hermesSend a test message. If you're getting responses from your Qubrid model, you're done.
If you see errors: The three most common causes are a trailing slash on the base URL, an incorrectly pasted API key, or a model string that doesn't match the Serverless Models catalog exactly. Model strings are case-sensitive.
Verifying the Integration Is Working Correctly
Getting a response isn't the same as confirming the integration is behaving as expected. Do these three checks:
Check Inference Logs. Go to platform.qubrid.com/inferencing-logs. Your test prompt should appear as a logged request with token counts and latency. If it's not there, Hermes isn't routing to Qubrid.
Trigger a tool call. Ask Hermes to search for something live. You should see it execute the search tool, receive results, and synthesize a response across multiple turns. That loop confirms the tool-calling pipeline is intact.
Watch token growth on a multi-step task. Give Hermes something that requires several steps - research a topic, summarize, produce a structured output. Watch the token counts in Inference Logs grow across requests. This confirms context is accumulating correctly across the agent loop.
One Thing to Keep in Mind at Scale
Running an agentic framework means a single user interaction often maps to many model calls - not one. A research and synthesis task that feels like "one thing" might be 8–12 separate API requests: tool invocations, result processing, follow-up reasoning, final output.
That has real implications:
Cost scales with task complexity. A quick factual question and a multi-step research task are not the same cost. Qubrid's Inference Logs make this visible per request, which makes budgeting tractable.
Model reliability compounds. One bad tool call mid-chain can corrupt everything downstream. This is specifically why we recommend DeepSeek V4 Flash for Hermes - its consistency on structured output and tool formatting reduces the rate of cascading failures significantly compared to models that are nominally capable but inconsistent.
Latency stacks. If response time matters for your use case, benchmark against your actual task types, not just a single-turn ping. You can also tighten Hermes' tool configuration - removing tools you're not using reduces the system prompt size and the model's decision surface on every call.
Configuration Reference
Setting | Value |
|---|---|
API base URL |
|
Compatibility mode | Chat Completions ( |
Recommended model | DeepSeek V4 Flash |
API keys |
|
Inference Logs |
|
Exit Hermes |
|
Uninstall |
|
Full Hermes Agent documentation is available at https://docs.platform.qubrid.com/Integrations/Hermes
