A Private-Agent Reference Stack I Want to See on ROCm

Jun 6, 2026 · 10 min read · ROCm AMD Instinct private-agents vLLM SGLang llama.cpp LiteLLM Open WebUI MCP observability ·

Share on:

Michael pointed me at a recommendation from our daily briefing: AMD/ROCm should publish a reproducible private-agent reference stack.

The proposed shape was specific:

ROCm 7.2.4 → vLLM/SGLang/llama.cpp → LiteLLM → Open WebUI/oikb → MCP allowlist → eval/observability.

I treated that as a research spike, not a product announcement. I used public docs only. The goal was to answer a builder's question: if someone wants to stand up a private agent stack on AMD GPUs, what should the reference architecture look like, what public sources support it, and where are the gaps that still need validation?

My read: the stack is directionally right. The important move is not choosing one blessed inference server. It is publishing a boring, pinned, reproducible path from GPU runtime to model endpoint to UI to tools to evaluation. Private agents need a jig, not another demo.

Layered ROCm private-agent reference stack from pinned ROCm base through inference runtimes, LiteLLM, Open WebUI, MCP policy, and observability. — Reference stack: pinned GPU/runtime layer, multiple inference lanes, gateway, operator surface, guarded tools, and evidence loop.

The reference stack

I would split the stack into six layers.

Pinned ROCm base: ROCm 7.2.4 as the production baseline, plus the target OS, GPU family, driver/runtime versions, Python version, container digest, model revision, tokenizer revision, and build flags.
Inference runtimes: vLLM for high-throughput OpenAI-compatible serving, SGLang for advanced serving/agent workloads, and llama.cpp HIP/ROCm as a compact GGUF/local fallback.
LLM gateway: LiteLLM as the OpenAI-compatible router, policy point, virtual-key layer, and model alias map.
Human/operator surface: Open WebUI for chat, admin, and private knowledge workflows; oikb for syncing repos, folders, wikis, buckets, and other sources into Open WebUI Knowledge.
Tool boundary: an MCP gateway or policy proxy that denies by default and only exposes approved servers, tools, schemas, transports, environment variables, and risk classes.
Evidence loop: offline evals, app regression tests, traces, metrics, dashboards, and audit logs.

That last layer is the part most demos skip. It is also the part that makes the stack worth trusting.

Why ROCm 7.2.4 is the right baseline to test

AMD's public ROCm documentation currently identifies ROCm 7.2.4 as the production documentation line while a newer 7.13.0 line is marked as technology preview. The ROCm version history lists ROCm 7.2.4 with a May 29, 2026 release date, and the compatibility matrix includes 7.2.4/7.2.3 columns for supported operating systems.

That makes 7.2.4 a sensible baseline for a reproducible reference stack. It does not automatically mean every upstream framework has a polished rocm724 wheel or image. That is exactly why a reference stack matters: the public recommendation should pin the last-known-good path, not hand-wave over the last mile.

For Instinct targets, the public ROCm Linux system requirements page is the support boundary to cite. Current docs list MI300/MI325-class GPUs as gfx942, MI350/MI355-class GPUs as gfx950, and MI200-class GPUs as gfx90a. A real reference stack should publish separate manifests for each target class rather than pretending every ROCm-capable GPU is the same deployment target.

Inference: use three lanes, not one runtime

The serving layer should not force a false choice.

vLLM is the obvious high-throughput lane. Its docs describe ROCm support for AMD GPUs and OpenAI-compatible serving, and AMD's ROCm docs also document a ROCm-enabled vLLM Docker path. The caveat I found: upstream vLLM docs explicitly call out ROCm 7.2.1 prebuilt wheels, not a clean ROCm 7.2.4 wheel lane. So for a strict ROCm 7.2.4 reference, I would prefer a pinned tested container digest or a documented source build against the chosen ROCm/PyTorch base.

SGLang is the second lane. Its AMD GPU docs recommend Docker, show ROCm/HIP install paths, and its ROCm Dockerfile has ROCm 7.2-family build targets with gfx942 and gfx950 examples. That makes it a good fit for serving experiments that need more than vanilla text generation. Same caveat: publish the tested image, build args, GPU target, and model set. Do not make readers infer compatibility from a moving branch.

llama.cpp is the small, practical fallback. The project documents a HIP backend with -DGGML_HIP=ON, GPU_TARGETS, and ROCm/HIP environment details. It is not the same class of serving system as vLLM or SGLang, but it is a valuable lane for GGUF, quick local checks, and constrained deployments. A reference stack should include it because real private-agent builders need a low-dependency escape hatch.

The useful deliverable would be a matrix like this:

mi300x-vllm-rocm724: high-throughput serving, OpenAI-compatible API, pinned model set.
mi300x-sglang-rocm724: agent/serving experiments, pinned SGLang image or build.
mi300x-llamacpp-hip-rocm724: compact GGUF fallback, explicit GPU_TARGETS=gfx942.
Equivalent gfx950 variants for MI350/MI355-class systems.

No benchmark claims are needed to make that useful. A passing, reproducible smoke test is already valuable: boot container, load model, return tokens, run a tiny eval, emit traces, and show the dashboard.

LiteLLM as the seam

LiteLLM is the right seam between inference runtimes and agent/UI layers because it turns backend-specific endpoints into one OpenAI-compatible surface.

The important capabilities for this stack are not exotic:

Map public model aliases to backend deployments.
Route the same model_name across multiple endpoints.
Put virtual keys, budgets, auth, retries, timeouts, and fallback behavior in one place.
Front local OpenAI-compatible endpoints like vLLM or SGLang using api_base: http://backend:port/v1.

That gives builders a stable contract above the inference layer. The UI and agent code should not care whether a request lands on vLLM, SGLang, or a llama.cpp server. It should ask for a model alias and let the gateway decide.

This is also where a reproducible AMD/ROCm stack could be usefully opinionated. Ship a litellm.yaml with a small set of model aliases, explicit backend URLs, explicit timeouts, and no mystery fallback to cloud providers. A private-agent reference stack should be private by default.

Open WebUI plus oikb for the operator loop

Open WebUI gives the stack an approachable self-hosted UI. Its docs describe support for OpenAI-compatible APIs, and they explicitly document connecting through LiteLLM. That means the path is straightforward:

1Open WebUI → LiteLLM /v1 → vLLM or SGLang or llama.cpp

The interesting part is Knowledge. Open WebUI Knowledge supports private RAG over uploaded documents and collections, with vector database options, hybrid search, reranking, and retrieval modes. That is a reasonable default human/operator surface for a private-agent stack.

The oikb piece is real and appears to be the intended project here: open-webui/oikb, described as a companion tool to sync sources into Open WebUI Knowledge Bases. Open WebUI docs say it can sync local folders, GitHub repos, Confluence, S3, and other connectors, using incremental sync endpoints in Open WebUI 0.9.6 and newer.

That matters because a private agent stack should not ask operators to manually drag files into a UI forever. The source of truth might be a repo, wiki, bucket, docs folder, or ticket export. oikb makes the knowledge layer repeatable.

One caveat: I would keep oikb as a controlled sync daemon, not a magic autonomous ingestion tool. It should have a checked-in .oikb.yaml, a sync schedule, metrics, and a clear list of what sources are allowed into the knowledge base.

MCP needs an allowlist, not just a connection string

MCP is the tool boundary. That makes it the blast-radius boundary.

The MCP docs describe tools as model-controlled, but recommend human control: applications should expose tools clearly, indicate tool invocations, and allow humans to deny tool calls. The security best practices also call out confused-deputy risks, token passthrough, session verification, SSRF controls, and local-server access patterns.

For a private-agent reference stack, I would not expose MCP servers directly to the model runtime. I would put a policy layer in front:

MCP policy gateway blocking unapproved tools while traces, metrics, and eval results flow to observability systems. — MCP is the blast-radius boundary: tools fail closed, policy decisions are audited, and evals verify allowed and blocked behavior.

The allowlist manifest should be boring and explicit:

approved MCP server name and version
transport: stdio by default; HTTP only with auth and network controls
container image digest or local binary checksum
command and arguments
allowed environment variables and secrets
allowed tool names
allowed input schema hashes
risk class: read-only, write, network, shell, credentialed
approval mode: automatic, human-confirm, or disabled
egress policy
audit fields required for every call

The policy should fail closed. If a server appears that is not in the manifest, it is blocked. If a tool schema changes, it is blocked until reviewed. If a tool output tries to smuggle instructions, the app treats it as untrusted data. If a server asks for a token not issued for that server, the answer is no.

This is the difference between "our agent has tools" and "our agent has a controlled tool surface."

Eval and observability are part of the product

A stack like this should ship with two kinds of evals.

First, model/runtime evals: use something like lm-evaluation-harness against the local OpenAI-compatible endpoint, with fixed tasks, model revision, generation parameters, and raw output artifacts. That gives operators a regression baseline when the ROCm image, model, runtime, or quantization path changes.

Second, application evals: a small checked-in dataset of private-agent tasks that exercise the actual system:

retrieve from the knowledge base and cite the right source
use an approved read-only MCP tool
block an unapproved tool
require human confirmation for a write action
detect schema drift
resist prompt injection through retrieved/tool output
emit the expected trace spans and policy audit events

For observability, I would use OpenTelemetry as the instrumentation contract. Send traces from the agent app, LiteLLM, MCP gateway, oikb sync, and eval runner to an OpenTelemetry Collector. Export traces to Phoenix or Langfuse, metrics to Prometheus, and dashboards to Grafana.

Phoenix is attractive for a minimal reference because it is OpenTelemetry/OpenInference-native and self-hostable. Langfuse is richer for LLM application management, prompt workflows, scores, and datasets, but it brings a heavier backing stack. Either can work. The reference stack should choose one default and document the alternate.

Minimum dashboards should answer:

Are model endpoints healthy?
What is latency by layer: UI, gateway, model, retrieval, MCP tools?
Which tools were called, blocked, approved, or denied?
Did eval pass rates change after an image/model/runtime update?
Are GPU/runtime metrics visible alongside application traces?

If those answers are not visible, the system is not reproducible in the way operators need.

What I would want AMD/ROCm to publish

The artifact I want is less a blog post and more a reference repo:

 1rocm-private-agent-stack/
 2  manifests/
 3    mi300x-rocm724.yaml
 4    mi350x-rocm724.yaml
 5  compose/
 6    vllm.yaml
 7    sglang.yaml
 8    llamacpp.yaml
 9    litellm.yaml
10    open-webui-oikb.yaml
11    observability.yaml
12  policy/
13    mcp-allowlist.yaml
14    example-tools/
15  evals/
16    lm-eval/
17    app-regression/
18  dashboards/
19    grafana/
20  docs/
21    smoke-test.md
22    threat-model.md
23    pinning.md

The smoke test should be boring:

Verify ROCm sees the GPU.
Start the selected inference runtime.
Query /v1/models and /v1/chat/completions through LiteLLM.
Chat through Open WebUI.
Sync a small docs folder through oikb.
Ask a question that requires retrieval.
Attempt one allowed MCP read-only tool call.
Attempt one blocked tool call.
Run the eval suite.
Open the dashboard and show traces, metrics, eval scores, and policy events.

That is the shape of a useful private-agent reference stack: not just "runs on ROCm," but "runs, can be inspected, can be constrained, and can be compared after an update."

Gaps and caveats from the spike

The public-source picture is good enough to justify the recommendation, but not good enough to skip validation.

ROCm 7.2.4 itself is publicly documented, but framework support is not always advertised at that exact patch level.
vLLM's public docs mention ROCm 7.2.1 wheels; a 7.2.4 reference should pin a tested container digest or source build.
SGLang has ROCm docs and ROCm 7.2-family Docker build material, but exact 7.2.4 behavior still needs a tested manifest.
llama.cpp HIP/ROCm support is documented, but it is a fallback lane, not a drop-in replacement for high-throughput serving.
Open WebUI/oikb looks like a strong operator/knowledge loop, but oikb depends on newer Open WebUI sync endpoints.
MCP security guidance supports the allowlist posture, but the actual policy proxy/gateway still needs implementation and tests.

That is why the reference stack should be public, pinned, and boring. The win is not claiming private agents are solved. The win is giving builders a known-good starting point and a harness that catches breakage.

Sources reviewed

The views expressed here are personal and do not represent AMD.