proxmox-iac/docs/superpowers/specs/2026-06-18-hermes-agent-lxc-design.md
21in7 f6dc709793 docs: features set in Terraform (token can); only bind mounts via console
Correct README/plan/spec after the apply-failure root cause: nesting/keyctl
are settable by the API token on an unprivileged CT and are required at create
to avoid the systemd-252 TASK WARNINGS that fails apply. Console step reduced
to bind mounts only. README apply uses -target (PBS disk drift).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 00:18:23 +09:00

# Hermes Agent LXC — Design Spec
- **Date:** 2026-06-18
- **Author:** gihyeon (with Claude Code)
- **Status:** Approved design → ready for implementation plan
- **Repo:** `proxmox-iac` (Terraform / bpg/proxmox provider)
## 1. Goal
Deploy [Hermes Agent](https://hermes-agent.nousresearch.com/) (Nous Research,
open-source MIT agent platform) as a new container on **node1 (`gihyeon`)**, using
the existing **litellm** LXC as its LLM gateway. Primary use is **messaging
connectors** (Telegram / Discord / Slack). The agent must be able to store code
and generated files on the host's large disks via direct bind mounts.
## 2. Context (verified 2026-06-18 via Proxmox API)
### litellm LXC (existing)
| Item | Value |
|---|---|
| VMID / host | `117` / `gihyeon` (node1) |
| Spec | 2 core / 2GB RAM / 4GB disk (`hdd`) |
| Network | SDN vnet `intra01`, IP `10.1.10.22/24` (DHCP) |
| Endpoint | LiteLLM proxy, default port `4000` → `http://10.1.10.22:4000` |
| Type | unprivileged LXC, Debian, community-script install, `nesting=1` |
### node1 (`gihyeon`) headroom
- CPU 12 threads / RAM 64GB (~32GB free)
- Storage: `local-lvm` 93GB free (SSD/LVM-thin), `hdd` 10TB free, `media` 1.3TB free
- intra01 has internet egress (litellm was installed from the internet and shows outbound traffic)
### Storage host paths
| Proxmox storage | Host path | Disk | Free |
|---|---|---|---|
| `media` | `/media/2tb` | nvme (SSD) | 1.3TB |
| `hdd` | `/mnt/pve/hdd` | bulk | 10TB |
### Hermes Agent facts (from official docs)
- Two install paths: **Docker image** `nousresearch/hermes-agent` (compose provided) or native `install.sh` (uv/python3.11/node/ripgrep/ffmpeg).
- LLM connection: supports **OpenAI-compatible `base_url`** → `provider: custom`, `base_url: <litellm>`. Config in `~/.hermes/config.yaml`, secrets in `~/.hermes/.env`.
- Ports: `8642` (gateway API, OpenAI-compatible), `9119` (web dashboard). **Neither required for messaging-only use.**
- Resources: min 1C/1GB, **recommended 2C/24GB / 2GB+ disk**. Browser tools want `--shm-size=1g`.
- **Not privileged by default.** Subagent sandbox backends: local / Docker / SSH / Singularity / Modal. Docker sandbox needs `/var/run/docker.sock` (DinD) — **not used here**; we start with `sandbox=local`.
- Single data mount inside the image: `/opt/data` (maps to host `~/.hermes`): config, sessions, memories, skills, logs, credentials.
## 3. Decisions
| Decision | Choice | Rationale |
|---|---|---|
| Deployment form | **Docker LXC (unprivileged)** | Matches homelab convention (multiple docker LXCs: 101/104/119/124); low overhead; official image + clean upgrades; Hermes needs no privileged mode. |
| Provisioning | **Terraform (container incl. features) + console for bind mounts** | TF mirrors `pbs.tf` and also sets `features { nesting/keyctl }` (token CAN do this on an unprivileged CT; nesting at create time avoids the systemd-252 "enable nesting" warning that fails the apply). **Only bind mounts** can't be done by the token (host paths require `root@pam`), so `mp0/mp1` are added via console `pct set` — same method already used for jellyfin(115)/tos-api(700). `terraform import` of the mounts is a follow-up. |
| Primary interface | **Messaging connectors** | Outbound-only → **zero inbound ports exposed.** |
| Subagent sandbox | **local** | Avoids Docker-in-Docker friction in an unprivileged LXC; revisit later if isolation needed. |
| Large workspace | **Direct host bind mount (both disks)** | Aligns with the user's **Plan A** (same-host LXC → host bind mount, not nfs LXC re-share). No network hop, no nfs-LXC SPOF. See `nfs-lxc-sharing-redesign` memory. |
## 4. Architecture
```
[Messaging platforms]          node1 (gihyeon) / intra01 (10.1.10.0/24)
Telegram/Discord ──outbound──▶ ┌────────────────────────────────────┐
/Slack ...                     │ hermes LXC #118 (unpriv+Docker)    │
                               │ └ nousresearch/hermes-agent        │
                               │   (compose, sandbox=local)         │
                               │ /data ◀─ bind /mnt/pve/hdd/hermes  │
                               │ /fast ◀─ bind /media/2tb/hermes    │
                               └──────────┬─────────────────────────┘
                                          │ LLM (OpenAI-compatible)
                               litellm LXC #117 (10.1.10.22:4000)
                                          │ routes to upstream providers
                               Anthropic / OpenAI / local / ...
```
## 5. Container spec (Terraform, bpg provider)
| Field | Value |
|---|---|
| VMID | `118` (adjacent to litellm `117`, AI group) |
| Node | `gihyeon` |
| Type | unprivileged LXC, Debian 12 |
| Features | `nesting = 1`, `keyctl = 1` (required for Docker) — **set in Terraform** (token can set these on an unprivileged CT; nesting at create avoids the systemd-252 warning that fails the apply) |
| CPU / RAM | 2 cores / 4096 MB dedicated (+512 MB swap) |
| rootfs | 24 GB on `local-lvm` |
| Network | `eth0` on bridge `intra01`, IPv4 DHCP |
| Options | `start_on_boot = true`, tags `ai;agent;terraform` |
| Hostname | `hermes` |
### Bind mounts (large workspace)
| mount | Host path | Container path | Purpose |
|---|---|---|---|
| `mp0` | `/mnt/pve/hdd/hermes` | `/data` | 14TB bulk: code, artifacts, downloads |
| `mp1` | `/media/2tb/hermes` | `/fast` | SSD: fast workspace / builds |
**Bind mounts are NOT in Terraform.** The Proxmox API token cannot create bind
mounts (root@pam/SSH only), so `mp0/mp1` are added in the console with
`pct set 118 -mp0 /mnt/pve/hdd/hermes,mp=/data -mp1 /media/2tb/hermes,mp=/fast`.
Both container paths are then passed into the Hermes Docker container as volumes
so the agent's outputs land on the large disks. `~/.hermes` (`/opt/data`,
small/fast config + memory + sqlite) stays on rootfs (SSD), **not** on the bulk disk.
A `terraform import` of these mount points is tracked as a follow-up (same as 115/700).
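The volume pass-through described above could look roughly like the following sketch. It assumes the bulk paths keep the same names (`/data`, `/fast`) inside the Docker container, which this spec does not state; `volume_flags` and `HERMES_HOME` are hypothetical names used only for illustration.

```sh
#!/bin/sh
# Sketch only. Assumption: /data and /fast are mounted at the same paths
# inside the Hermes Docker container; ~/.hermes maps to /opt/data as
# described in the Hermes Agent facts.
HERMES_HOME="${HERMES_HOME:-$HOME/.hermes}"

# volume_flags: print the docker -v flags for config plus both bulk mounts.
volume_flags() {
    printf '%s\n' "-v ${HERMES_HOME}:/opt/data -v /data:/data -v /fast:/fast"
}
```

The same three mappings would go under `volumes:` when the container is run via compose.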
### Unprivileged UID mapping (critical)
Unlike jellyfin(115)/tos-api(700), which are *privileged* (root→root, no
permission issue), hermes is **unprivileged**, so its root maps to host UID
`100000`. The bind-mount host directories must therefore be owned by the mapped
root. A dedicated subdirectory per disk (`…/hermes`) is chowned to
`100000:100000`, so **only that subtree is remapped** (isolation preserved),
not the whole disk.
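The mapping is plain arithmetic: container ID `n` appears on the host as `100000 + n`. A minimal sketch of a host-side ownership check follows, assuming the default unprivileged offset of `100000` stated above and GNU `stat`; `host_id` and `check_owner` are hypothetical helper names.

```sh
#!/bin/sh
# Assumption: default unprivileged idmap, container ID n -> host ID 100000 + n.
ID_OFFSET=100000

# host_id <container-id>: print the host-side UID/GID for a container-side ID.
host_id() {
    echo $(( ID_OFFSET + $1 ))
}

# check_owner <dir>: succeed only if <dir> is owned by the mapped container
# root (uid:gid 100000:100000 with the default offset). Uses GNU stat.
check_owner() {
    want="$(host_id 0):$(host_id 0)"
    have="$(stat -c '%u:%g' "$1")" || return 1
    [ "$have" = "$want" ]
}
```

On the host, `check_owner /mnt/pve/hdd/hermes || echo "ownership not remapped"` would flag a directory that still needs the `chown`.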
## 6. Networking & security
- On `intra01` (same subnet as litellm) → reaches `10.1.10.22:4000` directly.
- Messaging connectors poll outbound → **no inbound port forwarding / no firewall opening.**
- Dashboard (`9119`) and gateway API (`8642`) **not exposed**. If first-time setup needs the dashboard, use it transiently via console / temporary port-forward, or `HERMES_DASHBOARD_INSECURE=1` on the trusted net.
- Secrets (litellm key, bot tokens) live only in the container's `~/.hermes/.env`; **never committed**.
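Reachability of the gateway from inside the hermes LXC can be probed with a short script. This is a sketch under assumptions: it relies on LiteLLM's OpenAI-compatible `/v1/models` route and reads the key from `OPENAI_API_KEY`; `models_url` and `probe_litellm` are hypothetical helper names.

```sh
#!/bin/sh
# Base URL taken from this spec; override via the environment if it changes.
LITELLM_BASE_URL="${LITELLM_BASE_URL:-http://10.1.10.22:4000/v1}"

# models_url: print the model-listing URL, tolerating one trailing slash.
models_url() {
    printf '%s/models\n' "${LITELLM_BASE_URL%/}"
}

# probe_litellm: return 0 if the gateway answers within 5 seconds.
# The key is read from OPENAI_API_KEY and never printed.
probe_litellm() {
    curl -fsS --max-time 5 \
        -H "Authorization: Bearer ${OPENAI_API_KEY:-}" \
        "$(models_url)" >/dev/null
}
```

Running `probe_litellm && echo reachable` after sourcing `~/.hermes/.env` exercises only the outbound path, so it does not require opening any port.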
## 7. Software stack & LLM connection
- Docker + docker-compose-plugin installed in the LXC.
- `nousresearch/hermes-agent` run via compose (`gateway run`), `restart: unless-stopped`.
- `~/.hermes/config.yaml`:
```yaml
model:
  default: <model name exposed by litellm>
  provider: custom
  base_url: http://10.1.10.22:4000/v1
```
- `~/.hermes/.env`: litellm API key (`OPENAI_API_KEY`), messaging bot tokens.
- Messaging extras (Telegram/Discord/Slack) enabled in the gateway image.
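A bootstrap step might generate these two files as sketched below. The function name `write_hermes_config` and its arguments are hypothetical, and the `.env` file is only created empty with owner-only permissions; the key and bot tokens are filled in by hand.

```sh
#!/bin/sh
# write_hermes_config <dir> <model-name>
# Writes config.yaml pointing at litellm and creates an empty, owner-only
# .env if none exists. Sketch only; not the actual bootstrap script.
write_hermes_config() {
    dir="$1"
    model="$2"
    mkdir -p "$dir"
    cat > "$dir/config.yaml" <<EOF
model:
  default: ${model}
  provider: custom
  base_url: http://10.1.10.22:4000/v1
EOF
    # Never overwrite an existing .env; it may already hold secrets.
    [ -e "$dir/.env" ] || ( umask 077; : > "$dir/.env" )
    chmod 600 "$dir/.env"
}
```

A call such as `write_hermes_config "$HOME/.hermes" <model name exposed by litellm>` would leave secrets untouched on re-runs.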
## 8. Provisioning sequence (order matters)
1. **Host prep** (node1 web console, once): create + chown bind-mount targets.
```sh
mkdir -p /mnt/pve/hdd/hermes /media/2tb/hermes
chown 100000:100000 /mnt/pve/hdd/hermes /media/2tb/hermes
```
2. **Terraform apply** (from workstation, `-target` hermes only): creates LXC #118
with rootfs, network, cpu/mem, unprivileged, onboot, **and `features { nesting/keyctl }`**.
No bind mounts (host paths need root@pam). `-target` avoids the pre-existing PBS disk drift.
3. **Add bind mounts** (node1 console, once): use `pct set` (mounts only — features already in TF):
```sh
pct set 118 -mp0 /mnt/pve/hdd/hermes,mp=/data \
-mp1 /media/2tb/hermes,mp=/fast
pct reboot 118
```
4. **Container bootstrap** (LXC console, once): `scripts/hermes-bootstrap.sh` —
install Docker (rootful) + compose plugin → write `docker-compose.yml` +
`config.yaml` pointing at litellm → fill `.env` (litellm key, bot tokens) →
`hermes setup` → `gateway run`.
> In-container / host shell work is performed by the user via the **PVE web
> console** (per `proxmox-access` memory — host SSH intentionally unused).
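For step 2, the targeted apply can be composed and reviewed before it is run. The sketch below assumes a resource address of `proxmox_virtual_environment_container.hermes`: the resource type is the bpg provider's container resource, but the name `hermes` is a guess at what `hermes.tf` will declare, and `tf_cmd` is a hypothetical helper.

```sh
#!/bin/sh
# Assumption: the container is declared as
# proxmox_virtual_environment_container.hermes in hermes.tf.
HERMES_TF_ADDR="${HERMES_TF_ADDR:-proxmox_virtual_environment_container.hermes}"

# tf_cmd <plan|apply|destroy>: print the targeted terraform command so it
# can be inspected before execution.
tf_cmd() {
    printf 'terraform %s -target=%s\n' "$1" "$HERMES_TF_ADDR"
}
```

Printing `tf_cmd plan` first and then `tf_cmd apply` keeps the PBS disk drift out of the run, since only the hermes resource is targeted.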
## 9. Repo changes
- **New:** `hermes.tf` (container resource — **no bind mounts**),
`hermes-variables.tf`, `scripts/hermes-bootstrap.sh` (host prep + `pct set` mounts + Docker/hermes install).
- **Modified:** `terraform.tfvars` + `terraform.tfvars.example` (hermes vars),
`outputs.tf` (VMID / IP), `README.md` (install steps), `.gitignore` (ensure `.env` / secrets excluded).
## 10. Values to fill at setup time
- litellm master/virtual key and the exact **model name** litellm exposes.
- Messaging bot tokens (Telegram / Discord / Slack as chosen).
## 11. Out of scope / future
- Docker sandbox backend (DinD) for stronger subagent isolation — deferred; start `local`.
- Static IP instead of DHCP — deferred (DHCP matches litellm).
- Dashboard/gateway-API exposure with auth — only if a non-messaging use appears.
- `terraform import` of the hermes `mp0/mp1` bind mounts into TF state — follow-up (same pattern as 115/700 in `nfs-lxc-sharing-redesign`).
- Use **rootful** Docker in the LXC (not rootless): Hermes' gateway↔dashboard talk over localhost in one container, so a single netns is required. The ZFS overlay2→vfs caveat from public writeups does not apply here (storage is LVM-thin/ext4/dir, not ZFS).
## 12. Rollback
- `terraform destroy -target` the hermes container, or `pct destroy 118`.
- Bind-mount host dirs (`/mnt/pve/hdd/hermes`, `/media/2tb/hermes`) remain unless manually removed.
## 13. Verification (post-deploy)
- LXC 118 running; `pct config 118` shows mp0/mp1 + `nesting=1`.
- Inside container: `/data` and `/fast` writable by container root; `docker ps` shows hermes healthy.
- Hermes can call litellm: a test prompt routes through `10.1.10.22:4000` and returns.
- A messaging connector responds end-to-end; agent-written file appears under `/mnt/pve/hdd/hermes` on the host.
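The first check can be scripted by piping `pct config 118` into a small parser. This is a sketch that assumes the usual `key: value` output format of `pct config`; `verify_pct_config` is a hypothetical helper name.

```sh
#!/bin/sh
# verify_pct_config: read `pct config <vmid>` output on stdin and confirm
# both bind mounts and the nesting feature are present.
verify_pct_config() {
    cfg="$(cat)"
    printf '%s\n' "$cfg" | grep -Eq '^mp0: /mnt/pve/hdd/hermes,.*mp=/data' \
        || { echo "mp0 missing or wrong"; return 1; }
    printf '%s\n' "$cfg" | grep -Eq '^mp1: /media/2tb/hermes,.*mp=/fast' \
        || { echo "mp1 missing or wrong"; return 1; }
    printf '%s\n' "$cfg" | grep -Eq '^features: .*nesting=1' \
        || { echo "nesting not enabled"; return 1; }
    echo ok
}
```

On the node1 console this would be run as `pct config 118 | verify_pct_config`.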