proxmox-iac/docs/superpowers/plans/2026-06-18-hermes-agent-lxc.md
21in7 92851a384f docs: add Hermes Agent LXC implementation plan + spec amendments
Plan: 10 tasks splitting workstation Terraform (token-safe container skeleton)
from PVE-console host ops (features nesting/keyctl + bind mounts via pct set,
which the API token cannot do) and in-container Docker/hermes bootstrap.

Spec amended for the discovered API-token limitation: bind mounts AND container
features require root@pam/SSH, so both are applied via console pct set rather
than Terraform; terraform import tracked as follow-up.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-18 23:42:27 +09:00

# Hermes Agent LXC Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** Deploy Nous Research Hermes Agent as an unprivileged Docker LXC (#118) on node1 (`gihyeon`), using the existing litellm LXC (`10.1.10.22:4000`) as its OpenAI-compatible LLM gateway, with large-disk bind mounts for the agent workspace.
**Architecture:** Terraform creates a token-safe LXC skeleton (rootfs, network, cpu/mem). Host-security settings the API token cannot set — container `features` (nesting/keyctl) and bind mounts — are applied once via the PVE web console with `pct set`. A bootstrap script then installs rootful Docker and runs the official `nousresearch/hermes-agent` image via compose, pointed at litellm, with `sandbox=local` and messaging connectors.
**Tech Stack:** Terraform (bpg/proxmox provider), Proxmox VE 9.1 LXC, Docker + docker-compose, Hermes Agent (Nous Research).
**Spec:** [docs/superpowers/specs/2026-06-18-hermes-agent-lxc-design.md](../specs/2026-06-18-hermes-agent-lxc-design.md)
**Execution split:**
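- **Workstation (Terraform):** Tasks 1-5, 8, 9: `terraform` runs against the API token; Tasks 8 and 9 also author the bootstrap script and README changes.
- **PVE web console (user runs, pastes output back):** Tasks 6, 7, the script run in Task 9, and the Task 10 checks: host/in-container ops (per `proxmox-access`: host SSH is intentionally unused).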
---
## File Structure
| File | Responsibility |
|---|---|
| `hermes-variables.tf` (new) | All Hermes LXC input variables with defaults |
| `hermes.tf` (new) | Debian 12 template download (gihyeon) + token-safe container resource (no features, no mounts) |
| `terraform.tfvars` (modify) | Set Hermes values for this homelab |
| `terraform.tfvars.example` (modify) | Document Hermes values for other users |
| `outputs.tf` (modify) | Expose Hermes VMID + hostname |
| `scripts/hermes-bootstrap.sh` (new) | In-container bootstrap: Docker install + compose file + Hermes `.env` with placeholders for secrets (host prep and `pct set` for features+mounts are run by hand in Tasks 6 and 7) |
| `README.md` (modify) | Document the 4-phase deploy flow |
---
## Task 1: Hermes input variables
**Files:**
- Create: `hermes-variables.tf`
- [ ] **Step 1: Write `hermes-variables.tf`**
```hcl
variable "hermes_vmid" {
  description = "VMID for the Hermes Agent LXC"
  type        = number
  default     = 118
}

variable "hermes_hostname" {
  description = "Hostname for the Hermes Agent LXC"
  type        = string
  default     = "hermes"
}

variable "hermes_node" {
  description = "Proxmox node to host the Hermes Agent LXC"
  type        = string
  default     = "gihyeon"
}

variable "hermes_cores" {
  description = "CPU cores for the Hermes Agent LXC"
  type        = number
  default     = 2
}

variable "hermes_memory" {
  description = "Dedicated memory (MB) for the Hermes Agent LXC"
  type        = number
  default     = 4096
}

variable "hermes_swap" {
  description = "Swap (MB) for the Hermes Agent LXC"
  type        = number
  default     = 512
}

variable "hermes_disk_size" {
  description = "Root filesystem size (GB) for the Hermes Agent LXC"
  type        = number
  default     = 24
}

variable "hermes_datastore" {
  description = "Datastore for the Hermes Agent LXC root filesystem"
  type        = string
  default     = "local-lvm"
}

variable "hermes_network_bridge" {
  description = "Network bridge (SDN VNET) for the Hermes Agent LXC"
  type        = string
  default     = "intra01"
}
```
- [ ] **Step 2: Format + validate**
Run: `terraform fmt hermes-variables.tf && terraform validate`
Expected: `Success! The configuration is valid.` (Declaring variables that nothing references yet is valid, so no error is expected before `hermes.tf` exists in Task 2; the goal here is no HCL syntax error in this file.)
- [ ] **Step 3: Commit**
```bash
git add hermes-variables.tf
git commit -m "feat: add Hermes Agent LXC variables"
```
---
## Task 2: Hermes container resource (token-safe skeleton)
**Files:**
- Create: `hermes.tf`
Reuses the existing `var.dns_servers` (defined in `pbs-variables.tf`).
- [ ] **Step 1: Write `hermes.tf`**
```hcl
# Download Debian 12 LXC template to gihyeon (node1).
resource "proxmox_virtual_environment_download_file" "debian12_template_gihyeon" {
  content_type = "vztmpl"
  datastore_id = "local"
  node_name    = var.hermes_node
  url          = "http://download.proxmox.com/images/system/debian-12-standard_12.12-1_amd64.tar.zst"
}

# Hermes Agent LXC: token-safe skeleton.
# IMPORTANT: container `features` (nesting/keyctl) and bind mounts are NOT set
# here. The Proxmox API token cannot set host-security settings; they are applied
# once via the PVE web console with `pct set` (see scripts/hermes-bootstrap.sh
# and docs/superpowers/specs/2026-06-18-hermes-agent-lxc-design.md).
resource "proxmox_virtual_environment_container" "hermes" {
  description   = "Hermes Agent (Nous Research) - Managed by Terraform"
  node_name     = var.hermes_node
  vm_id         = var.hermes_vmid
  start_on_boot = true
  unprivileged  = true
  tags          = ["ai", "agent", "terraform"]

  operating_system {
    template_file_id = proxmox_virtual_environment_download_file.debian12_template_gihyeon.id
    type             = "debian"
  }

  cpu {
    cores = var.hermes_cores
  }

  memory {
    dedicated = var.hermes_memory
    swap      = var.hermes_swap
  }

  disk {
    datastore_id = var.hermes_datastore
    size         = var.hermes_disk_size
  }

  network_interface {
    name   = "eth0"
    bridge = var.hermes_network_bridge
  }

  initialization {
    hostname = var.hermes_hostname

    ip_config {
      ipv4 {
        address = "dhcp"
      }
    }

    dns {
      servers = var.dns_servers
    }
  }
}
```
- [ ] **Step 2: Format + validate**
Run: `terraform fmt hermes.tf && terraform validate`
Expected: `Success! The configuration is valid.`
- [ ] **Step 3: Commit**
```bash
git add hermes.tf
git commit -m "feat: add Hermes Agent LXC container resource"
```
---
## Task 3: tfvars values
**Files:**
- Modify: `terraform.tfvars`
- Modify: `terraform.tfvars.example`
Defaults in `hermes-variables.tf` already match this homelab, so tfvars only needs an explicit override block for clarity/discoverability.
- [ ] **Step 1: Append to `terraform.tfvars`**
Add after the existing DNS line:
```hcl
# Hermes Agent LXC settings (node1 / intra01)
hermes_vmid           = 118
hermes_node           = "gihyeon"
hermes_network_bridge = "intra01"
```
- [ ] **Step 2: Append the same block to `terraform.tfvars.example`**
```hcl
# Hermes Agent LXC settings (node1 / intra01)
hermes_vmid           = 118
hermes_node           = "gihyeon"
hermes_network_bridge = "intra01"
```
- [ ] **Step 3: Validate**
Run: `terraform fmt && terraform validate`
Expected: `Success! The configuration is valid.`
- [ ] **Step 4: Commit**
```bash
git add terraform.tfvars terraform.tfvars.example
git commit -m "feat: set Hermes Agent LXC tfvars"
```
---
## Task 4: Outputs
**Files:**
- Modify: `outputs.tf`
- [ ] **Step 1: Append to `outputs.tf`**
```hcl
output "hermes_container_id" {
  description = "Hermes Agent LXC container ID"
  value       = proxmox_virtual_environment_container.hermes.vm_id
}

output "hermes_hostname" {
  description = "Hermes Agent LXC hostname (IP is DHCP-assigned; discover via PVE/API)"
  value       = var.hermes_hostname
}
```
- [ ] **Step 2: Validate**
Run: `terraform validate`
Expected: `Success! The configuration is valid.`
- [ ] **Step 3: Commit**
```bash
git add outputs.tf
git commit -m "feat: add Hermes Agent LXC outputs"
```
---
## Task 5: Plan + apply the container (workstation)
**Files:** none (infra apply)
- [ ] **Step 1: Review the plan**
Run: `terraform plan`
Expected: plan shows `2 to add`: `proxmox_virtual_environment_download_file.debian12_template_gihyeon` and `proxmox_virtual_environment_container.hermes`. **0 to change, 0 to destroy.** Confirm it does NOT touch `proxmox_virtual_environment_container.pbs`.
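The expected counts can also be checked mechanically rather than by eye. A minimal sketch, assuming the plan summary line keeps the `Plan: N to add, N to change, N to destroy.` wording of current Terraform releases; `plan_matches` is a hypothetical helper name, not part of this repo:

```bash
#!/usr/bin/env bash
# Sketch: read `terraform plan -no-color` output on stdin and succeed only if
# the summary line reports exactly the expected add/change/destroy counts.
# plan_matches is a hypothetical helper, not something Terraform provides.
plan_matches() {
  local add="$1" change="$2" destroy="$3"
  grep -q "^Plan: ${add} to add, ${change} to change, ${destroy} to destroy\.$"
}

# Usage on the workstation:
#   terraform plan -no-color | plan_matches 2 0 0 && echo PLAN_OK
```

This only checks the counts; confirming that the PBS container is untouched still needs a look at the resource list in the plan.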
- [ ] **Step 2: Apply**
Run: `terraform apply`
Expected: `Apply complete! Resources: 2 added, 0 changed, 0 destroyed.` Outputs include `hermes_container_id = 118`.
> If apply errors with a permission/`root@pam`-only message on any container attribute, STOP — it means an attribute in `hermes.tf` is host-restricted. The skeleton here is intentionally limited to attributes the PBS container already created successfully via the same token, so this is not expected.
- [ ] **Step 3: Confirm via API (read-only)**
Run:
```bash
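# Assumption: PVE_API_TOKEN (hypothetical variable name) is exported in this shell
# as 'USER@REALM!TOKENID=SECRET', so the token secret stays out of the repo.
curl -sk -H "Authorization: PVEAPIToken=${PVE_API_TOKEN}" \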
"https://192.168.50.87:8006/api2/json/nodes/gihyeon/lxc/118/status/current" \
| python3 -c "import json,sys; d=json.load(sys.stdin)['data']; print(d['name'], d['status'])"
```
Expected: `hermes running` (or `stopped` — the container may not auto-start before features/mounts; Task 7 reboots it).
- [ ] **Step 4: Commit state**
```bash
git add terraform.tfstate terraform.tfstate.backup
git commit -m "chore: apply Hermes Agent LXC (state)"
```
---
## Task 6: Host prep — create + chown bind-mount targets (PVE console)
**Run in the node1 (`gihyeon`) shell via PVE web console. Paste output back.**
- [ ] **Step 1: Create the workspace dirs and chown to the unprivileged-mapped root**
```sh
mkdir -p /mnt/pve/hdd/hermes /media/2tb/hermes
chown 100000:100000 /mnt/pve/hdd/hermes /media/2tb/hermes
ls -lnd /mnt/pve/hdd/hermes /media/2tb/hermes
```
Expected: both dirs exist and `ls -lnd` shows owner/group `100000 100000`.
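The `100000:100000` owner follows from the default ID mapping of an unprivileged Proxmox container, where container IDs 0 to 65535 are shifted by 100000 on the host, so container root appears as host UID 100000. A minimal sketch of that arithmetic, assuming the default mapping is in effect (a custom `lxc.idmap` would change the offset); `host_id` is a hypothetical helper name:

```bash
#!/usr/bin/env bash
# Sketch: translate a container UID/GID into the host-side ID under the default
# unprivileged mapping (offset 100000, range 65536). Assumes no custom lxc.idmap.
host_id() {
  local container_id="$1" offset=100000 range=65536
  if [ "$container_id" -lt 0 ] || [ "$container_id" -ge "$range" ]; then
    echo "id ${container_id} is outside the mapped range" >&2
    return 1
  fi
  echo $((offset + container_id))
}

host_id 0      # container root
host_id 1000   # a regular container user
```

If the agent later runs as a non-root user inside the LXC, the same helper gives the host ID to `chown` the bind-mount targets to.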
---
## Task 7: Apply features + bind mounts, reboot (PVE console)
**Run in the node1 (`gihyeon`) shell via PVE web console. Paste output back.**
- [ ] **Step 1: Set features (Docker) and the two bind mounts**
```sh
pct set 118 -features nesting=1,keyctl=1 \
-mp0 /mnt/pve/hdd/hermes,mp=/data \
-mp1 /media/2tb/hermes,mp=/fast
pct reboot 118
```
Expected: no error output from `pct set`; container reboots.
- [ ] **Step 2: Verify config + writable mounts**
```sh
pct config 118 | grep -E 'features|mp0|mp1'
pct exec 118 -- sh -c 'touch /data/.w /fast/.w && ls -l /data/.w /fast/.w && rm /data/.w /fast/.w && echo MOUNTS_OK'
```
Expected: `features: keyctl=1,nesting=1`, `mp0: /mnt/pve/hdd/hermes,mp=/data`, `mp1: /media/2tb/hermes,mp=/fast`, and `MOUNTS_OK` (proves the unprivileged container's root can write to both bind mounts).
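The config half of this check can be scripted so that a missing flag or mount is reported by name. A minimal sketch, assuming the `pct config` output lines look like the expected values listed above; `check_hermes_config` is a hypothetical helper name:

```bash
#!/usr/bin/env bash
# Sketch: read `pct config <vmid>` output on stdin and confirm that the
# feature flags and mount points from Task 7 are present.
# Prints each missing item to stderr and returns non-zero if any is absent.
check_hermes_config() {
  local cfg want missing=0
  cfg="$(cat)"
  for want in 'nesting=1' 'keyctl=1' 'mp=/data' 'mp=/fast'; do
    if ! printf '%s\n' "$cfg" | grep -q -- "$want"; then
      echo "missing: ${want}" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage on the node1 shell:
#   pct config 118 | check_hermes_config && echo CONFIG_OK
```

It does not replace the `touch` test, which is what proves the mounts are writable from inside the container.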
---
## Task 8: Bootstrap script (workstation authoring)
**Files:**
- Create: `scripts/hermes-bootstrap.sh`
This script is authored and committed on the workstation, then **run inside the LXC console** in Task 9. It contains NO real secrets — only placeholders the operator edits in-container.
- [ ] **Step 1: Write `scripts/hermes-bootstrap.sh`**
```bash
#!/usr/bin/env bash
# Hermes Agent bootstrap — run INSIDE the hermes LXC (#118) console, once.
# Prereqs (already done): features nesting/keyctl set, /data and /fast bind mounts present.
set -euo pipefail
LITELLM_BASE_URL="http://10.1.10.22:4000/v1" # litellm gateway (#117)
HERMES_DATA="/opt/hermes" # ~/.hermes equivalent on rootfs (fast)
COMPOSE_DIR="/opt/hermes-stack"
echo "==> 1/5 Install rootful Docker + compose plugin"
apt-get update
apt-get install -y ca-certificates curl gnupg
install -m 0755 -d /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/debian/gpg -o /etc/apt/keyrings/docker.asc
chmod a+r /etc/apt/keyrings/docker.asc
. /etc/os-release
echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/debian ${VERSION_CODENAME} stable" \
> /etc/apt/sources.list.d/docker.list
apt-get update
apt-get install -y docker-ce docker-ce-cli containerd.io docker-compose-plugin
systemctl enable --now docker
# Kept on its own line so that `set -e` aborts the script if the test container fails.
docker run --rm hello-world >/dev/null
echo " docker OK"
echo "==> 2/5 Prepare data + workspace dirs"
mkdir -p "${HERMES_DATA}" "${COMPOSE_DIR}"
# /data (hdd, bulk) and /fast (2tb ssd) are the bind mounts from the LXC.
mkdir -p /data/workspace /fast/workspace
echo "==> 3/5 Write docker-compose.yml"
cat > "${COMPOSE_DIR}/docker-compose.yml" <<EOF
services:
  hermes:
    image: nousresearch/hermes-agent:latest
    container_name: hermes
    restart: unless-stopped
    command: gateway run
    shm_size: "1g"                    # browser tools (Playwright/Chromium)
    volumes:
      - ${HERMES_DATA}:/opt/data      # config, memory, skills, sessions (rootfs/SSD)
      - /data:/data                   # bulk workspace (hdd 14TB)
      - /fast:/fast                   # fast workspace (2tb SSD)
    env_file:
      - ${COMPOSE_DIR}/.env
    deploy:
      resources:
        limits:
          memory: 3G
          cpus: "2.0"
EOF
echo "==> 4/5 Write .env (EDIT secrets before 'gateway run')"
if [ ! -f "${COMPOSE_DIR}/.env" ]; then
  cat > "${COMPOSE_DIR}/.env" <<EOF
# --- litellm gateway (OpenAI-compatible) ---
OPENAI_BASE_URL=${LITELLM_BASE_URL}
OPENAI_API_KEY=REPLACE_WITH_LITELLM_KEY
# --- messaging connectors (fill the ones you use) ---
TELEGRAM_BOT_TOKEN=
DISCORD_BOT_TOKEN=
SLACK_BOT_TOKEN=
EOF
  chmod 600 "${COMPOSE_DIR}/.env"
  echo " wrote ${COMPOSE_DIR}/.env; edit OPENAI_API_KEY + bot tokens now."
fi
echo "==> 5/5 First-time interactive setup (model -> litellm, sandbox=local, connectors)"
echo " Run setup, then start the gateway:"
echo " cd ${COMPOSE_DIR}"
echo " docker compose run --rm hermes setup # pick provider=custom, base_url=${LITELLM_BASE_URL}, sandbox=local"
echo " docker compose up -d # start 'gateway run'"
echo " docker compose logs -f hermes"
echo "Done. (config.yaml lives under ${HERMES_DATA}; secrets stay in ${COMPOSE_DIR}/.env)"
```
- [ ] **Step 2: Lint the script**
Run: `shellcheck scripts/hermes-bootstrap.sh` (if `shellcheck` is unavailable, run `bash -n scripts/hermes-bootstrap.sh`)
Expected: no errors (info/style notes acceptable). `bash -n` prints nothing on success.
- [ ] **Step 3: Mark executable + commit**
```bash
chmod +x scripts/hermes-bootstrap.sh
git add scripts/hermes-bootstrap.sh
git commit -m "feat: add Hermes Agent in-container bootstrap script"
```
---
## Task 9: Run bootstrap + finalize (PVE console for run, workstation for docs)
**Files:**
- Modify: `README.md`
- [ ] **Step 1: Get the script into the LXC and run it (node1 / LXC console)**
The script lives in the repo on the workstation, so its contents have to be pasted into the container. Either open an editor in the LXC's web-console shell (`nano /root/hermes-bootstrap.sh`) and paste the file, or, from the node1 shell, run `pct exec 118 -- tee /root/hermes-bootstrap.sh`, paste, and end the input with Ctrl-D. Then, from the node1 shell:
```sh
pct exec 118 -- bash /root/hermes-bootstrap.sh
```
Expected: script reaches `Done.` with `docker OK`. Then, inside the container, edit `/opt/hermes-stack/.env` (litellm key + bot tokens) and run the `docker compose run --rm hermes setup` / `up -d` lines it printed.
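Because the bootstrap script writes a placeholder key, it is easy to start the gateway before `.env` has been edited. A minimal guard, assuming the `.env` layout written by the script in Task 8; `env_ready` is a hypothetical helper name:

```bash
#!/usr/bin/env bash
# Sketch: refuse to continue while OPENAI_API_KEY is empty or still the
# placeholder written by hermes-bootstrap.sh.
env_ready() {
  local file="$1" key
  # cut -f2- keeps any '=' characters that are part of the key itself.
  key="$(grep '^OPENAI_API_KEY=' "$file" | head -n 1 | cut -d= -f2- || true)"
  if [ -z "$key" ] || [ "$key" = "REPLACE_WITH_LITELLM_KEY" ]; then
    echo "OPENAI_API_KEY is not set in ${file}" >&2
    return 1
  fi
}

# Usage inside the container:
#   env_ready /opt/hermes-stack/.env && docker compose up -d
```

The bot tokens are left unchecked on purpose, since only the connectors actually in use need a value.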
- [ ] **Step 2: Update `README.md` structure table**
Add these rows after the `pbs-variables.tf` row:
```markdown
| `hermes.tf` | Hermes Agent LXC container definition (token-safe skeleton) |
| `hermes-variables.tf` | Hermes-related variables |
| `scripts/hermes-bootstrap.sh` | Hermes in-container install script |
```
- [ ] **Step 3: Append a deploy-flow section to `README.md`**
```markdown
## Hermes Agent (LXC #118)
Nous Research Hermes Agent, using litellm (#117, `10.1.10.22:4000`) as its LLM gateway.
Deployment takes 4 steps (bind mounts and features cannot be set with the API token, so they go through console `pct set`):
1. Host prep (node1 console): `mkdir -p /mnt/pve/hdd/hermes /media/2tb/hermes && chown 100000:100000 /mnt/pve/hdd/hermes /media/2tb/hermes`
2. `terraform apply` (creates the container)
3. node1 console: `pct set 118 -features nesting=1,keyctl=1 -mp0 /mnt/pve/hdd/hermes,mp=/data -mp1 /media/2tb/hermes,mp=/fast && pct reboot 118`
4. LXC console: run `scripts/hermes-bootstrap.sh`, fill in `/opt/hermes-stack/.env`, then run `docker compose run --rm hermes setup` followed by `docker compose up -d`
> Secrets (litellm key, bot tokens) live only in the container's `/opt/hermes-stack/.env` and are never committed to the repo.
> TODO: hermes `mp0/mp1` are not in TF state; catch up later with `terraform import`.
```
- [ ] **Step 4: Commit docs**
```bash
git add README.md
git commit -m "docs: document Hermes Agent deploy flow"
```
---
## Task 10: End-to-end verification
**Files:** none
- [ ] **Step 1: Container + Docker health (node1 console)**
```sh
pct exec 118 -- docker ps --format '{{.Names}} {{.Status}}'
```
Expected: `hermes Up ...` (healthy/running).
- [ ] **Step 2: LLM path through litellm (node1 console)**
```sh
# sh -c makes the $(...) key lookup run inside the container, where .env lives.
pct exec 118 -- sh -c 'curl -s http://10.1.10.22:4000/v1/models -H "Authorization: Bearer $(grep "^OPENAI_API_KEY=" /opt/hermes-stack/.env | cut -d= -f2-)" | head -c 400'
```
Expected: a JSON model list from litellm (proves hermes's network path + key reach the gateway). Note the model id(s) — set Hermes `model.default` to one of these during `setup`.
- [ ] **Step 3: Workspace persistence on the big disk (node1 console)**
```sh
pct exec 118 -- sh -c 'echo hi > /data/workspace/_probe.txt'
cat /mnt/pve/hdd/hermes/workspace/_probe.txt && rm /mnt/pve/hdd/hermes/workspace/_probe.txt
```
Expected: `hi` printed from the **host** path — proves the agent's `/data` writes land on `/mnt/pve/hdd/hermes` (14TB disk).
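The probe works because each container mount point is just a view of a host directory. A small sketch of that path translation, using the two mounts from Task 7; `host_path` is a hypothetical helper name and the table is hard-coded rather than read from `pct config`:

```bash
#!/usr/bin/env bash
# Sketch: translate a path inside the hermes LXC to the host path that backs it,
# based on the Task 7 bind mounts (mp0 -> /data, mp1 -> /fast).
host_path() {
  case "$1" in
    /data|/data/*) printf '%s\n' "/mnt/pve/hdd/hermes${1#/data}" ;;
    /fast|/fast/*) printf '%s\n' "/media/2tb/hermes${1#/fast}" ;;
    *) echo "not on a bind mount: $1" >&2; return 1 ;;
  esac
}

host_path /data/workspace/_probe.txt
host_path /fast/workspace
```

Paths outside `/data` and `/fast`, such as `/opt/hermes`, live on the container rootfs and have no host-side directory to inspect this way.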
- [ ] **Step 4: Messaging connector end-to-end (manual)**
Send a test message from the configured platform (e.g. Telegram) to the bot; confirm Hermes replies. Check `docker compose logs -f hermes` for the round-trip.
- [ ] **Step 5: Final commit (if any uncommitted state/docs)**
```bash
git add -A && git commit -m "chore: Hermes Agent LXC deploy verified" || echo "nothing to commit"
```
---
## Notes / Follow-ups
- **TF import:** add the `mp0/mp1` bind mounts to TF state later via `terraform import` once a root@pam/SSH path is available (same outstanding task as 115/700 in `nfs-lxc-sharing-redesign`).
- **Sandbox:** start `local`; revisit Docker sandbox backend (DinD) only if subagent isolation is needed.
- **Memory:** after deploy, record the hermes LXC + the API-token-can't-bind-mount/features constraint in project memory.