---
title: The Kubernetes Alternative I Wish Existed
date: 2023-10-01
draft: true
---

I use Kubernetes on my personal server, largely because I wanted to get some experience working with it. It's certainly been helpful in that regard, but after a couple of years I think I can pretty confidently say that it's not the ideal tool for my use-case. Duh, I guess? But I think it's worth talking about _why_ that's the case, and what exactly _would_ be the ideal tool.

## The Kubernetes Way™

Kubernetes is a very intrusive orchestration system. It would very much like the apps you're running to be doing things _its_ way, and although that's not a _hard_ requirement, it tends to make everything subtly more difficult when that isn't the case. In particular, Kubernetes is targeting the situation where you:

* Have a broad variety of applications that you want to support,
* Have written all or most of those applications yourself ("you" in the organizational sense, not the personal one), and
* Need those applications to operate at massive scale, e.g. concurrent users in the millions.

That's great if you're Google, and surprise! Kubernetes is a largely Google-originated project (I'm told it's a derivative of Borg, Google's in-house orchestration platform). But it's absolute _garbage_ if you (like me) are just self-hosting apps for your own personal use and enjoyment. It's garbage because, while you still want to support a broad variety of applications, you typically _didn't_ write them yourself, and you _most definitely don't_ need to scale to millions of concurrent users.

More particularly, this means that the Kubernetes approach of expecting everything to be aware that it's running in Kubernetes and to make use of the platform (via cluster roles, CRDs, etc.) is very much _not_ going to fly. Instead, you want your orchestration platform to be as transparent as possible: ideally, a running application should need to behave no differently in this hypothetical self-hosting-focused orchestration system than it would if it were running by itself on a Raspberry Pi in your garage. _Most especially_, all the distributed-systems crap that Kubernetes forces on you is pretty much unnecessary, because you don't need to support millions of concurrent users (in fact, your number of concurrent users is typically going to be either 1 or 0), and you don't care if you incur a little downtime when an application needs to be upgraded or whatever.

## But Wait

So then why do you need an orchestration platform at all? Why not just use something like [Harbormaster](https://gitlab.com/stavros/harbormaster) and call it a day? That's a valid question, and maybe you don't! In fact, it's quite likely that you don't - orchestration platforms really only make sense when you want to distribute your workload across multiple physical servers, so if you only have the one, why bother? However, I can still think of a couple of reasons why you'd want a cluster even for your personal stuff:

* You don't want everything you host to become completely unavailable if you bork up your server somehow. Yes, I did say above that you can tolerate some downtime, and that's still true - but especially if you like tinkering around with low-level stuff like filesystems and networking, it's quite possible that you'll break things badly enough (and be sufficiently busy with other things, given that we're assuming this is just a hobby for you) that it will be days or weeks before you can find the time to fix them.
  If you have multiple servers to which the workloads can migrate while one is down, that problem goes away.
* You don't want to shell out up front for something hefty enough to run All Your Apps, especially as you add more down the road. Maybe you're starting out with a Raspberry Pi, and when that becomes insufficient you'd like to just add more Pis rather than putting together a beefy machine with enough RAM to feed your [Paperless](https://github.com/paperless-ngx/paperless-ngx) installation, your [UniFi controller](https://help.ui.com/hc/en-us/articles/360012282453-Self-Hosting-a-UniFi-Network-Server), your Minecraft server(s), and your [Matrix](https://matrix.org) server.
* You have things running in multiple geographical locations and you'd like to be able to manage them all together. Maybe you built your parents a NAS with Jellyfin on it for their files and media, or you run a tiny little proxy (another Raspberry Pi, presumably) in your grandparents' network so that you can inspect things directly when they call you for help because they can't print their tax return.

Okay, sure, maybe this is still a bit niche. But you know what? This is my blog, so I get to be unrealistic if I want to.

## So what's different?

Our hypothetical orchestrator starts out in the same place as Kubernetes--you have a bunch of containerized applications that need to be run, and a pile of physical servers on which you'd like to run them. You want to be able to specify, at a high level, what should run, how it should run, how many copies, and so on. You don't want to worry about fiddly details like deciding which container goes on which host, or manually moving all of `odin`'s containers to `thor` when the Roomba runs over `odin`'s power cable while you're on vacation on the other side of the country.

You _might_ even want to be able to specify that a certain service should run _n_ replicas, and be able to scale that up and down as needed, though that's a decidedly less central feature for our orchestrator than it is for Kubernetes. Like I said above, you don't typically need to replicate your services for traffic capacity, so _if_ you're replicating anything it's probably for availability reasons instead. But true HA is usually quite a pain to achieve, especially for anything that wasn't explicitly designed with that in mind, so I doubt a lot of people bother.

So that much is the same. But we're going to do everything else differently. Where Kubernetes is intrusive, we want to be transparent. Where Kubernetes is flexible and pluggable, we will be opinionated. Where Kubernetes wants to proliferate statelessness and distributed-systems-ism, we will be perfectly content with stateful monoliths (and smaller things, too--microliths?). Where Kubernetes expects cattle, we will accept pets. And so on.

The basic resources of servering are ~~wheat~~ ~~stone~~ ~~lumber~~ compute, storage, and networking, so let's look at each in detail.

### Compute

"Compute" is an amalgam of CPU and memory, with a side helping of GPU when necessary. Obviously these are all different things, but they tend to work together more directly than any of them does with the other two major resources.
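To make that a bit more concrete, here's a rough sketch of how a workload might declare its compute needs in this hypothetical system. The format and every field name are invented for illustration; the point is just that you state requirements in plain units and let the scheduler figure out placement:

```yaml
# Hypothetical workload spec (illustrative only -- not the syntax of any real tool).
name: paperless
image: example/paperless:latest   # placeholder image reference
compute:
  cpus: 2          # whole cores, no millicore arithmetic
  memory: 2GiB     # a simple ceiling, not a request/limit pair
  gpu: false       # flip on for the rare workload that needs one
```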
### Scheduling

Every orchestrator of which I am aware is modeled as a first-class distributed system: it's assumed that it will consist of more than one instance, often _many_ more than one, and this is baked in at the ground level. (Shoutout to [K3s](https://k3s.io) here for bucking this trend a bit: while it's perfectly capable of functioning in multi-node mode, it can also run as a single node with SQLite as its storage backend, which is actually quite nice for the single-node use case.)

I'm not entirely sure this needs to be the case! Sure, for systems like Kubernetes that are, again, intended to map _massive_ amounts of work across _huge_ pools of resources, it definitely makes sense; the average `$BIGCORP`-sized Kubernetes deployment probably couldn't even _fit_ the control plane on anything short of practically-a-supercomputer. But for those of us who _don't_ have to support massive scale, I question how necessary this is.

The obvious counterpoint is that distributing the system isn't just for scale, it's also for resiliency. Which is true, and if you don't care about resiliency at all then you should (again) probably just be using Harbormaster or something. But here's the thing: we care about stuff running _on_ the cluster being resilient, but how much do we care about the _control plane_ being resilient? If there's only a single control node, and it's down for a few hours, can't the workers just continue happily running their little things until told otherwise? We actually have a large-scale example of something sort of like this in the Cloudflare outage from a while back: their control plane was completely unavailable for quite a while (over a day, if I recall correctly), but their core CDN and anti-DDoS services seemingly continued to function pretty well.

### Virtualization

The de facto standard unit of virtualization in the orchestration world is the container. Containers have been around in one form or another for quite a while, but they really started to take off with the advent of Docker, because Docker made them easy. I want to break with the crowd here, though, and use a different virtualization primitive, namely:

#### AWS Firecracker

You didn't write all these apps yourself, and you don't trust them any further than you can throw them. Containers are great and all, but you'd like a little more isolation. Enter Firecracker. This does add some complexity where resource management is concerned, especially memory, since by default Firecracker wants you to allocate everything up front. But maybe that's okay, or maybe we can build in some [ballooning](https://github.com/firecracker-microvm/firecracker/blob/main/docs/ballooning.md) to keep things under control.
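From the user's side, I'd want the VM-ness to be almost invisible: you ask for some CPU and memory, and maybe opt into a balloon so the host can claw back what the guest isn't using. Something like the sketch below--invented syntax that would merely translate into a Firecracker machine configuration under the hood, not Firecracker's actual API:

```yaml
# Hypothetical per-workload isolation settings (illustrative only).
name: gitea
image: example/gitea:latest   # placeholder image reference
vm:
  vcpus: 2
  memory: 1GiB     # allocated up front, the way Firecracker prefers
  balloon: true    # let the host reclaim unused guest memory
```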
VMs are (somewhat rightfully) regarded as being a lot harder to manage than containers, partly because (as mentioned previously) they tend to be less flexible with regard to memory requirements, but also because it's typically a lot more difficult to do things like keep them all up to date. Managing a fleet of VMs is usually just as operationally difficult as managing a fleet of physical machines. But [it doesn't have to be this way!](https://fly.io/blog/docker-without-docker/)

It's 2023 and the world has more or less decided on Docker images (I know we're supposed to call them "OCI images" now, but they'll always be Docker images to me. Docker started them, Docker popularized them, and then Docker died because it couldn't figure out how to monetize an infrastructure/tooling product. The least we can do is honor its memory by keeping the name alive.) as the preferred format for packaging server applications. Are they efficient? Hell no. Are they annoying and fiddly, with plenty of [hidden footguns](https://danaepp.com/finding-api-secrets-in-hidden-layers-within-docker-containers)? You bet. But they _work_, and they've massively simplified the process of getting a server application up and running. As someone who has had to administer a Magento 2 installation, it's hard not to find that appealing.

They're especially attractive to the self-hosting-ly inclined, because a well-maintained Docker image tends to keep _itself_ up to date with a bare minimum of automation. I know "automatic updates" are anathema to some, but remember, we're talking self-hosted stuff here--sure, the occasional upgrade may break your Gitea server (or maybe not: I've been running Gitea for years now and never had a blip), but I can almost guarantee that you'll spend less time fixing that than you would have manually applying every update to every app you ever wanted to host, forever.

So we're going to use Docker _images_, but we aren't going to use Docker to run them. This is definitely possible, as alluded to above. Aside from the linked Fly post, though, other [attempts](https://github.com/weaveworks-liquidmetal/flintlock) in the same [direction](https://github.com/firecracker-microvm/firecracker-containerd) don't seem to have taken off, so there's probably a fair bit of complexity here that needs to be sorted out.

## Networking

Locked down by default. You don't trust these apps, so they don't get access to the soft underbelly of your LAN. It's principle-of-least-privilege all the way. Ideally, it should be possible, when specifying a new app, to grant it network access to an existing app, rather than having to go back and modify the existing one.

## Storage is yes

Kubernetes is famous for kinda just punting on storage, at least if you're running it on bare metal. Oh sure, there are lots of [storage-related resources](https://kubernetes.io/docs/reference/kubernetes-api/config-and-storage-resources/persistent-volume-v1/), but if you look closely at those you'll notice they mostly just _describe_ storage, and leave it up to the cluster operator to bring their own _actual implementation_ that provisions, attaches, maintains, and cleans up the ~~castles in the sky~~ PersistentVolumeClaims and StorageClasses and whatnot.

This makes sense for Kubernetes because, although it took me an embarrassingly long time to realize this, Kubernetes has never been about enabling self-hosting. Its primary purpose has always been _cloud mobility_, i.e. enabling you to pick up your cloud-hosted systems and plonk them down in a completely different cloud. Unfortunately this leaves the self-hosters among us out in the cold: since we don't typically have the luxury of EBS or its equivalent in our dinky little homelabs, we are left to bring our own storage systems, which is something of a [nightmare hellscape of doom](https://kubernetes-csi.github.io/docs/introduction.html).

I want my hypothetical storage system to completely flip this on its head. There _should_ be a built-in storage implementation, and it _should_ be usable when you're self-hosting, with minimal configuration. None of this faffing about with the layers and layers of abstraction hell that Kubernetes forces on you as soon as you give in to the siren song of having a single stateful application in your cluster. If I want to give my application a persistent disk, I should jolly well be able to do that with no questions asked.
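Something in the spirit of the sketch below, say. The syntax is made up, but the point stands: a persistent disk should be one small stanza in the app's spec, not a PersistentVolumeClaim plus a StorageClass plus a CSI driver:

```yaml
# Hypothetical app spec (illustrative only): asking for a persistent disk
# should be exactly this boring.
name: paperless
image: example/paperless:latest   # placeholder image reference
volumes:
  - name: data
    pool: big-slow     # which class of disk to use; pools are sketched below
    size: 50GiB
    mount: /data
```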
## Sounds great, but how?

For starters, we're going to give up on synchronous replication. Synchronous replication is one of those things that _sounds_ great, because it makes your distributed storage system theoretically indistinguishable from a purely-local filesystem, but having used a [storage system that prioritizes synchronous replication](https://longhorn.io/) I can pretty confidently say that I would be much happier without it. It absolutely _murders_ performance, causing anything from a 3x to 20x slowdown in my [testing](https://serverfault.com/a/1145529/409057), and the worst part is that I'm pretty sure it's _completely unnecessary_.

Here's the thing: you only _really_ need synchronous replication if you have multiple instances of some application using the same files at the same time. But nobody actually does this! In _any_ clustering setup I've ever encountered, you handle multi-consumer access to persistent state in one of three ways:

1. You delegate your state management to something else that _doesn't_ need to run multiple copies, i.e. the "replicate your web app but run one DB" approach,
2. You shard your application and make each shard the exclusive owner of its slice of state, or
3. You do something really fancy with distributed-systems theory and consensus algorithms.

The thing is, _none of these approaches requires synchronous replication._ Really, the _only_ use case I've found so far for _actually_ sharing state between multiple instances of the same application is things like Docker registry layer storage, which is a special case because it's basically a content-addressed filesystem and therefore _can't_ suffer from write contention. Maybe this is my lack of experience showing, but I have a lot of difficulty imagining a use case for simultaneous multi-writer access to the same files that isn't better served by something else.

Conceptually, then, our storage system will consist of a set of directories somewhere on disk which we mount into containers, and which are _asynchronously_ replicated to other nodes with a last-writer-wins policy. Actually, we'll probably want multiple locations on disk (we can call them "pools", like ZFS does) so that we can expose multiple different types of media to the cluster (e.g. small/fast, large/slow).

This is super simple as long as we're willing to store a full copy of all the data on every node. That might be fine! But I lean toward thinking it's not, because it's not all that uncommon in my experience to have a heterogeneous "cluster" where one machine is your Big Storage Monster and the other machines are much more bare-bones. There are two basic ways of dealing with this:

1. We can restrict scheduling so that workloads can only be scheduled on nodes that have a copy of their data, or
2. We can make the data accessible over the network, SAN-style.

My inclination is to go with 1) here, because 2) introduces some pretty hefty performance penalties. We could maybe mitigate that with aggressive caching, but then you've got wildly unpredictable performance for your storage depending on whether the data is in cache or not.
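Concretely, option 1) might look something like the following pool definition--invented syntax again, but it captures the idea that data only lives on the nodes you name, and workloads that use a pool can only be scheduled onto those nodes:

```yaml
# Hypothetical cluster-level storage pools (illustrative only). A workload
# that mounts a volume from a pool can only be scheduled on that pool's nodes.
pools:
  - name: big-slow
    nodes: [odin, thor]          # the Big Storage Monster(s)
    path: /tank/cluster          # where the pool lives on each node
  - name: small-fast
    nodes: [odin, thor, loki]    # "loki" is a made-up third node
    path: /var/lib/cluster/fast
```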
Practically--remember, we're targeting _small_ setups here--I don't think it would be much of a problem to specify a set of nodes when defining a storage pool, or even just to make pools a node-local configuration, so that each node declares which pools it participates in and each pool is replicated to every participating node. Again, we're not dealing with Big Data here; we don't need to spread our storage across N machines because it's literally too big to fit on one.

Kubernetes tends to work best with stateless applications. It's not entirely devoid of [tools](https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/) for dealing with state, but state requires persistent storage, and persistent storage is hard in clusters. (In fact, I get the sense that for a long time you were almost completely on your own with storage, unless you were using a managed Kubernetes service like GKE, where you're just supposed to use whatever the provider offers. More recently things like Longhorn have begun improving the situation, but "storage on bare-metal Kubernetes" still feels decidedly like a second-class citizen to me.) Regardless, we're self-hosting here, which means virtually _everything_ has state.

But fear not! Distributed state is hard, yes, but most of our apps aren't going to be truly distributed. That is, typically there's only going to be one instance running at a time, and it's acceptable to shut down the existing instance before spinning up a new one. So let's look at what kind of complexity we can avoid by keeping that in mind.

**We don't need strong consistency:** You're probably just going to be running a single instance of anything that involves state. Sure, you can have multiple SQLite processes writing to the same database, and it _can_ be okay with that--unless it isn't. From the SQLite FAQ: "SQLite uses reader/writer locks to control access to the database. But use caution: this locking mechanism might not work correctly if the database file is kept on an NFS filesystem. This is because `fcntl()` file locking is broken on many NFS implementations." You _probably_ don't want to run the risk of data corruption just to save yourself a few seconds of downtime.

This means that whatever solution we come up with for storage is going to be distributed and replicated almost exclusively for durability reasons, rather than for keeping things in sync. Which in turn means that it's _probably fine_ to default to an asynchronous-replication mode, where (from the application's point of view) writes complete before they're confirmed to have safely made it to all the other replicas in the cluster. This is good because the storage target will now appear to function largely like local storage rather than networked storage, so applications that were written with the expectation of keeping their state on a local disk will work just fine. _Most especially_, this makes it _actually realistic_ to distribute our storage across multiple geographic locations, whereas with a synchronous-replication model the latency impact of doing that would make it a non-starter.

**Single-writer, multi-reader is the default:** With all that said, inevitably people are going to find a reason to try mounting the same storage target into multiple workloads at once, which will eventually cause conflicts. There's only so much we can do to prevent people from shooting themselves in the foot, but one easy win would be to default to a single-writer, multi-reader mode of operation. That way at least we can prevent write conflicts unless someone intentionally flips the enable-write-conflicts switch, in which case, well, they asked for it.
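In config terms, that could be as simple as an `access` field on each volume attachment that defaults to exclusive ownership. As with every other snippet in this post, the syntax is invented:

```yaml
# Hypothetical volume attachments (illustrative only). "owner" is the default
# and means exclusive read-write access; anything laxer must be asked for.
name: paperless
volumes:
  - name: data
    pool: big-slow
    access: owner           # default: single writer, no sharing
  - name: shared-media
    pool: big-slow
    access: read-only       # safe to mount alongside other readers
    # access: shared-write  # the foot-gun: available, but never the default
```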
## Configuration

YAML, probably? It's fashionable to hate on YAML right now, but I've always found it rather pleasant. (Maybe people hate it because their primary experience of using it has been in Kubernetes manifests, which, fair enough.) JSON is out because no comments. TOML is out because nesting sucks. Weird niche supersets of JSON like HuJSON and JSON5 are out because they've been around long enough that if they were going to catch on, they would have by now. Docker Swarm config files (which are basically just Compose files with a few extra bits) are my exemplar par excellence here. (Of course they are; DX has always been Docker's Thing.) A side-by-side comparison with the equivalent Kubernetes YAML would probably make the point better than I can in prose.

We are also _definitely_ going to eschew the Kubernetes model of exposing implementation details in the name of extensibility. (See: ReplicaSets and EndpointSlices. There's no reason for these to be first-class API resources like Deployments or Secrets, other than to enable extensibility. You never want users creating EndpointSlices manually, but you might, if you're Kubernetes, want to allow an "operator" service to fiddle with them, so you make them first-class resources, because you have no concept of the distinction between external and internal APIs.)

## Workload Grouping

It's always struck me as odd that Kubernetes doesn't have a native concept for a heterogeneous grouping of pods. Maybe it's because Kubernetes assumes it's being used to deploy mostly microservices, which are typically managed by independent teams--so workloads that are independent but in a provider/consumer relationship are being managed by different people, probably in different cluster namespaces anyway, so why bother trying to group them?

Regardless, I think Nomad gets this exactly right with the job/group/task hierarchy. I'd like to just copy that wholesale, but with more network isolation.
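To close the loop, here's roughly what I imagine a complete job might look like. The job/group/task shape is borrowed from Nomad (which actually uses HCL rather than YAML); everything else, including the network grants, is invented for illustration:

```yaml
# Hypothetical job spec (illustrative only). Groups are isolated from each
# other and from the LAN by default; access is granted explicitly, and from
# the side of the thing that wants it.
job: paperless
groups:
  - name: app
    tasks:
      - name: web
        image: example/paperless:latest   # placeholder image reference
        compute: { cpus: 2, memory: 2GiB }
        volumes:
          - { name: data, pool: big-slow, access: owner }
        network:
          expose: [8000]         # reachable from the LAN
          allow: [db.postgres]   # may talk to the db group's postgres task
  - name: db
    tasks:
      - name: postgres
        image: example/postgres:15        # placeholder image reference
        compute: { cpus: 1, memory: 1GiB }
        volumes:
          - { name: pgdata, pool: small-fast, access: owner }
```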