We’ve spent the last year building RunOS, a platform that spins up production-ready Kubernetes clusters in 5-10 minutes, with databases, message queues, observability, and AI tooling already configured.
The Problem
Every team rebuilds the same Kubernetes infrastructure: networking, certificates, monitoring, databases, storage. The existing solutions either lock you into a vendor ecosystem or dump you into raw Kubernetes complexity. We wanted the control of self-hosting without weeks of setup.
Architecture
Our system uses two agent types:
Server agents run on VM hosts and communicate with our backend via gRPC bidirectional streams. When users request a cluster node, the agent provisions a KVM-based VM and bootstraps it.
Node agents run on each Kubernetes node and handle cluster operations, monitoring, and service installations.
Key insight: gRPC streams initiated by agents eliminate firewall configuration and public IP requirements. Agents reach out to our backend, not vice versa.
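To make the connection direction concrete, here is a minimal sketch of the same pattern. It is an illustration only, not the RunOS agent code: it swaps gRPC bidirectional streams for a plain TCP socket carrying newline-delimited JSON, and the names `run_backend`, `run_agent`, `demo`, and the `provision_vm` action are all hypothetical. The point it shows is that the agent dials out, and the backend then pushes commands down the connection the agent opened, so the agent never needs a listening port.

```python
import json
import socket
import threading


def run_backend(listener, commands, results):
    """Backend side: wait for an agent to dial in, then push commands to it."""
    conn, _addr = listener.accept()
    with conn, conn.makefile("r") as reader, conn.makefile("w") as writer:
        results.append(json.loads(reader.readline()))  # agent identifies itself
        for command in commands:
            writer.write(json.dumps(command) + "\n")
            writer.flush()
            results.append(json.loads(reader.readline()))  # agent's reply


def run_agent(backend_addr, agent_id):
    """Agent side: open the connection outbound, then serve commands on it."""
    conn = socket.create_connection(backend_addr)
    with conn, conn.makefile("r") as reader, conn.makefile("w") as writer:
        writer.write(json.dumps({"agent": agent_id}) + "\n")
        writer.flush()
        for line in reader:  # ends when the backend closes the connection
            command = json.loads(line)
            reply = {"agent": agent_id, "done": command["action"]}
            writer.write(json.dumps(reply) + "\n")
            writer.flush()


def demo():
    # Only the backend listens; port 0 lets the OS pick a free port.
    listener = socket.create_server(("127.0.0.1", 0))
    results = []
    backend = threading.Thread(
        target=run_backend,
        args=(listener, [{"action": "provision_vm"}], results),
    )
    backend.start()
    run_agent(listener.getsockname(), "server-agent-1")
    backend.join()
    listener.close()
    return results
```

Calling `demo()` runs one round trip on the loopback interface: the agent registers, the backend sends one command, and the agent acknowledges it.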
Why KVM?
– Battle-tested, works great with Ubuntu
– Solid Go bindings via libvirt
– Excellent GPU passthrough for AI workloads like Ollama
– Good isolation/performance balance
Sometimes boring technology is the right choice.
Provisioning Flow
1. User clicks “Create Cluster”
2. Backend selects available server agents
3. gRPC commands sent to provision VMs
4. KVM VMs spin up (Ubuntu Cloud 24.04, 30-60 seconds)
5. Node agents install and connect
6. Kubernetes bootstrap with kubeadm + Cilium
7. WireGuard mesh established between nodes
8. Storage configured (OpenEBS + Longhorn)
9. Cluster ready (5-10 minutes total)
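The flow above is a fixed sequence in which each step depends on the previous one. A minimal sketch of how such a sequence could be driven is below. The step names mirror the list, but `STEPS`, `ClusterRun`, `provision`, and the handler callables are hypothetical and say nothing about how RunOS implements it; real handlers would send commands to the agents instead of returning a constant.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

# Hypothetical step names, in the order of the numbered flow above.
STEPS = [
    "select_server_agents",
    "provision_vms",
    "connect_node_agents",
    "bootstrap_kubernetes",
    "establish_wireguard_mesh",
    "configure_storage",
]


@dataclass
class ClusterRun:
    name: str
    completed: List[str] = field(default_factory=list)
    failed_step: Optional[str] = None

    @property
    def ready(self) -> bool:
        # Ready only if every step finished and none failed.
        return self.failed_step is None and self.completed == STEPS


def provision(name: str, handlers: Dict[str, Callable[[], bool]]) -> ClusterRun:
    """Run each step in order; stop at the first one that reports failure."""
    run = ClusterRun(name)
    for step in STEPS:
        if not handlers[step]():
            run.failed_step = step
            break
        run.completed.append(step)
    return run
```

Recording the failed step and the steps already completed gives the backend enough state to retry from the failure point or tear down what was created.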
The WireGuard Decision
We manage WireGuard at the OS level, not at the Kubernetes level. Why?
– Same VPN secures both K8s traffic and SSH access
– Nodes communicate securely even if Kubernetes fails
– Simpler troubleshooting with separated layers
– Easier multi-cluster peering (coming soon)
Our backend orchestrates WireGuard configs across nodes via the agents. Centrally coordinated, locally executed.
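As a rough sketch of what "centrally coordinated, locally executed" could look like, the function below renders a full-mesh config for one node from a central node list, in the `wg-quick` file format. Everything specific here is an assumption made for illustration: the `Node` fields, the /24 mesh subnet, port 51820 (the conventional WireGuard default), and the choice to leave the private key to be filled in on the node itself.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Node:
    name: str
    public_key: str  # placeholder strings below; real keys come from `wg pubkey`
    endpoint: str    # host:port that other nodes dial
    mesh_ip: str     # address inside the tunnel


def render_config(node: Node, all_nodes: List[Node], listen_port: int = 51820) -> str:
    """Render a wg-quick style config in which `node` peers with every other node."""
    lines = [
        "[Interface]",
        f"Address = {node.mesh_ip}/24",
        f"ListenPort = {listen_port}",
        "# PrivateKey is assumed to be added locally by the agent",
    ]
    for peer in all_nodes:
        if peer.name == node.name:
            continue  # a node is not its own peer
        lines += [
            "",
            "[Peer]",
            f"PublicKey = {peer.public_key}",
            f"AllowedIPs = {peer.mesh_ip}/32",
            f"Endpoint = {peer.endpoint}",
        ]
    return "\n".join(lines) + "\n"


nodes = [
    Node("node-a", "PUBKEY_A", "203.0.113.1:51820", "10.10.0.1"),
    Node("node-b", "PUBKEY_B", "203.0.113.2:51820", "10.10.0.2"),
    Node("node-c", "PUBKEY_C", "203.0.113.3:51820", "10.10.0.3"),
]
config_for_a = render_config(nodes[0], nodes)
```

In this sketch the backend only needs public keys and addresses, so each node would receive one rendered file per change to the node list.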
Version Management Hell
The hardest problem? Keeping 20+ services compatible across updates.
The platform supports one-click installation of: PostgreSQL, MySQL, ClickHouse, Kafka, RabbitMQ, MinIO, Longhorn, Harbor, Traefik, Grafana, Prometheus, Ollama, LiteLLM, Open WebUI, and more.
Each has opinions about K8s versions, storage, and networking. We use Helm charts, operators, and custom YAML as appropriate. The real work is maintaining compatibility matrices and testing every combination.
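A compatibility matrix can be as simple as a lookup from a service version to the range of Kubernetes versions it has been tested against. The sketch below is hypothetical throughout: the service names, version numbers, and ranges in `COMPAT` are invented for illustration and are not real compatibility data, and `incompatible` is not a RunOS function.

```python
from typing import Dict, List, Tuple

Minor = Tuple[int, int]

# Invented example data: (service, version) -> (lowest, highest) tested
# Kubernetes minor version, both inclusive.
COMPAT: Dict[Tuple[str, str], Tuple[Minor, Minor]] = {
    ("example-db", "16"): ((1, 28), (1, 31)),
    ("example-queue", "3.7"): ((1, 29), (1, 32)),
}


def parse_minor(version: str) -> Minor:
    """Turn 'v1.28.4' or '1.28' into (1, 28)."""
    major, minor = version.lstrip("v").split(".")[:2]
    return int(major), int(minor)


def incompatible(k8s_version: str, installed: Dict[str, str]) -> List[str]:
    """List installed services that are untested or out of range for this cluster."""
    k8s = parse_minor(k8s_version)
    problems = []
    for service, version in sorted(installed.items()):
        supported = COMPAT.get((service, version))
        if supported is None:
            problems.append(f"{service} {version}: untested")
            continue
        low, high = supported
        if not low <= k8s <= high:
            problems.append(
                f"{service} {version}: needs Kubernetes "
                f"{low[0]}.{low[1]} to {high[0]}.{high[1]}"
            )
    return problems
```

Treating an unknown combination as a problem instead of assuming it works keeps untested pairs from slipping into an upgrade.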
Deployment Models
Managed option: Dedicated servers with fixed 8 CPU/16GB instances. KVM handles VM provisioning with GPU passthrough for AI workloads. Strict security since it’s early access.
Self-hosted option: Run node agents on any hardware. Complete tenant isolation, since you control the infrastructure.
Working on: Self-managed VM hosts with custom sizing.
What’s Next
The agent code will be open source eventually. One company runs three production clusters already. Common feedback: “I can’t believe how fast I went from zero to a working cluster with Postgres, Kafka, and monitoring.”
We’re planning weekly updates here on Hacker News about new features, technical challenges, and production lessons learned while building RunOS.
Questions? Happy to discuss architecture in the comments.
Comments URL: https://news.ycombinator.com/item?id=45936611
Source: news.ycombinator.com

