A better Kubernetes, from the ground up

Recently I had a chat with the excellent Vallery Lancey, about Kubernetes. Specifically, what we would do differently if we built something new, from the ground up, with no regard for compatibility with Kubernetes. I found that conversation so stimulating that I feel the need to write things down, so here we are.

Before we get started, I want to stress a few things.

Guiding principles

My experience of Kubernetes comes from two very different places: authoring MetalLB for bare metal clusters, and operating a large fleet of clusters-as-a-service in GKE SRE. Both of these taught me that Kubernetes is extremely complex, and that most people who are trying to use it are not prepared for the sheer amount of work that lies between the marketing brochure and the system it promises.

MetalLB taught me that it’s not possible to build robust software that integrates with Kubernetes. I think MetalLB makes a damn good go of it, but Kubernetes still makes it far too easy to construct broken configurations, and far too hard to debug them. GKE SRE taught me that even the foremost Kubernetes experts cannot safely operate Kubernetes at scale. (Although GKE SRE does a spectacular job with the tools they’re given.)

Kubernetes is the C++ of orchestration software. Immensely powerful, includes all the features, looks deceptively simple, and will hurt you repeatedly until you join its priesthood and devote your life to its mysteries. And even then, the matrix of possible ways to configure and deploy it is so large that you’re never on firm footing.

Continuing that analogy, my guide star is Go. If Kubernetes is C++, what would the Go of orchestration systems look like? Aggressively simple, opinionated, grown slowly and warily, and you can learn it in under a week and get on with what you were actually trying to accomplish.

With that, let’s get going. Starting with Kubernetes, and with a license to completely and utterly break compatibility, what would I do?

Mutable pods

In Kubernetes, pods are mostly (but not entirely) immutable after creation. If you want to change a pod, you don’t. Make a new one and delete the old one. This is unlike most other things in Kubernetes, which are mostly mutable and gracefully reconcile towards the new spec.

So, I’m going to make pods not special. Make them entirely read-write, and reconcile them like you would any other object.

The immediately useful thing I get from that is in-place restarts. If scheduling constraints and resource allocations haven’t changed, guess what? SIGTERM runc, restart runc with different parameters, and you’re done. Now pods look like regular old systemd services that can move between machines if necessary.

Note that this doesn’t require doing mutability at the runtime layer. If you change a pod definition, it’s still mostly fine to terminate the container and restart it with a new configuration. The pod is still holding onto the resource reservation that got it scheduled onto this machine, so conceptually it’s equivalent to systemctl restart blah.service. You could try to be fancy and make some operations actually update in place at the runtime level as well, but you don’t have to. The main benefit is decoupling scheduling, pod lifetime, and lifetime at the runtime layer.
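
To make that concrete, here's a rough Go sketch of what the node agent's reconcile step could look like once pods are mutable. All of the types here are made up for illustration; nothing in Kubernetes looks like this.

```go
// A minimal sketch of in-place pod reconciliation, assuming a hypothetical
// node agent with a container runtime and a scheduler client. These types
// are invented stand-ins for the idea in the text.
package nodeagent

import "reflect"

type PodSpec struct {
	Name      string
	Node      string           // where the pod is currently placed
	Resources map[string]int64 // cpu, memory, ... reservations
	Image     string
	Args      []string
}

type Runtime interface {
	Stop(name string) error
	Start(spec PodSpec) error
}

type Scheduler interface {
	Reschedule(spec PodSpec) error
}

type Agent struct {
	runtime   Runtime
	scheduler Scheduler
}

// Reconcile runs whenever the stored pod definition changes.
func (a *Agent) Reconcile(old, updated PodSpec) error {
	// If placement inputs are unchanged, the existing resource reservation
	// is still valid; this is morally "systemctl restart blah.service".
	if old.Node == updated.Node && reflect.DeepEqual(old.Resources, updated.Resources) {
		if err := a.runtime.Stop(old.Name); err != nil {
			return err
		}
		return a.runtime.Start(updated)
	}
	// Placement changed: the pod has to go back through scheduling.
	return a.scheduler.Reschedule(updated)
}
```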

Version control all the things

Sticking at the pod layer for a bit longer: now that they’re mutable, the next obvious thing I want is rollbacks. For that, let’s keep old versions of pod definitions around, and make it trivial to “go back to version N”.

Now, a pod update looks like: write an updated definition of the pod, and it updates to match. Update broken? Write back version N-1, and you’re done.

Bonus things you get from this: a diffable history of what happened to your cluster, without needing GitOps nonsense. By all means keep the GitOps nonsense if you want, it has benefits, but you can answer a basic “what changed?” question using only data in the cluster.

This needs a bit more design. In particular, I want to separate out external changes (human submits a new pod) from mechanical changes (some internals of k8s alter a pod definition). I haven’t thought through how to encode both those histories and make both accessible to operators and automation. Maybe it could also be completely generic, wherein a “changer” identifies itself when submitting a new version, and you can then query for changes by or excluding particular changers (think similar to how label queries work at the minute). Again, more design needed there, I just know that I want versioned objects with an accessible history.
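
As a rough illustration of the shape I have in mind (names and API entirely invented), a versioned object history with changer attribution might look something like this:

```go
// A sketch of a versioned object store, with each revision attributed to the
// "changer" that submitted it. Everything here is illustrative, not a design.
package store

import (
	"fmt"
	"time"
)

type Revision struct {
	Version int
	Changer string // e.g. "human:dave", "controller:scheduler"
	Time    time.Time
	Spec    []byte // the serialized object at this version
}

type History struct {
	revs []Revision
}

// Put appends a new revision and returns its version number.
func (h *History) Put(changer string, spec []byte) int {
	v := len(h.revs) + 1
	h.revs = append(h.revs, Revision{Version: v, Changer: changer, Time: time.Now(), Spec: spec})
	return v
}

// Rollback makes version n current again by re-submitting it as a new
// revision, so the act of rolling back is itself recorded in the history.
func (h *History) Rollback(changer string, n int) (int, error) {
	if n < 1 || n > len(h.revs) {
		return 0, fmt.Errorf("no such version %d", n)
	}
	return h.Put(changer, h.revs[n-1].Spec), nil
}

// ByChanger filters the history, similar in spirit to a label query.
func (h *History) ByChanger(changer string) []Revision {
	var out []Revision
	for _, r := range h.revs {
		if r.Changer == changer {
			out = append(out, r)
		}
	}
	return out
}
```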

We’ll need garbage collection eventually. That said, changes to single pods should delta-compress really well, so my default would be to just keep everything until it becomes a truly dumb amount of data, and figure something out at that point. Keeping everything also acts as a useful mild pressure to avoid “death by a thousand changes” in the rest of the system. Prefer to have fewer, more meaningful changes over a flurry of control loops each changing one field in pursuit of convergence.

Once we have this history, we can do some neat minor things too. For example, the node software could keep container images for the last N versions pinned to the machine, so that rollbacks are as fast as they can possibly be. With an accessible history, you can do this more precisely than “GC older than 30 days and hope”. Generalizing, all the orchestration software can use older versions as GC roots for various resources, to make rollbacks faster. Rollbacks being the primary way of ending outages, this is a very valuable thing to have.
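
A tiny sketch of the image-pinning idea, with invented types, just to show how little machinery it needs once the history exists:

```go
// A sketch of using version history as GC roots: the node keeps the container
// images referenced by the last N revisions of every pod pinned, so rollbacks
// never have to re-pull anything. Types are illustrative only.
package imagegc

type Revision struct {
	Version int
	Images  []string // images referenced by the pod at this revision
}

// PinnedImages returns the set of images that must not be garbage collected,
// given each pod's revision history (newest last) and a pin depth.
func PinnedImages(histories map[string][]Revision, keep int) map[string]bool {
	pinned := make(map[string]bool)
	for _, revs := range histories {
		start := len(revs) - keep
		if start < 0 {
			start = 0
		}
		for _, rev := range revs[start:] {
			for _, img := range rev.Images {
				pinned[img] = true
			}
		}
	}
	return pinned
}
```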

Replace Deployment with PinnedDeployment

This is a short section to basically say that Vallery knocked it out of the park with her PinnedDeployment resource, which lets operators explicitly control a rollout by tracking 2 versions of the deployment state. It’s a deployment object designed by an SRE, with a crisp understanding of what SREs want in a deployment. I love it.

This combines super well with the versioned, in-place pod updates above, and I really don’t have anything to add. It’s clearly how multi-pod things should work. There’s probably some tweaking required to adapt from the Kubernetes-constrained world to this new wonderful unconstrained universe, but the general design is perfect.
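
To give a flavour of the idea without claiming to reproduce Vallery's actual spec (these field names are my guesses; go read her design for the real thing), the shape is roughly two pinned pod templates plus an operator-controlled split:

```go
// My rough guess at the shape of a PinnedDeployment-style resource, not the
// real proposal: two explicitly pinned versions of the pod template, and an
// explicit, operator-controlled split of replicas between them.
package api

type PinnedDeployment struct {
	Name     string
	Replicas int

	// The two versions being tracked. Rolling forward means moving
	// replicas from Previous to Next; rolling back means the reverse.
	Previous PodTemplate
	Next     PodTemplate

	// What fraction of replicas should run Next. 0 means all Previous,
	// 100 means all Next; anything in between is a deliberate canary.
	NextPercent int
}

type PodTemplate struct {
	Image string
	Args  []string
	// ... the rest of the pod definition
}
```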

Explicit orchestration workflows

The biggest issue I have with the “API machinery” bits of Kubernetes is the idea of orchestration as a loose choreography of independent control loops. On the surface, this seems like a nice idea: you have dozens of little control loops, each focused on doing one small thing. When combined in a cluster, they indirectly cooperate with each other to push the state forward and converge on the desired end state. So, what’s the problem?

The problem is that it’s entirely impossible to debug when it goes wrong. A typical failure mode in Kubernetes is that you submit a change to the cluster, then repeatedly refresh waiting for stuff to converge. When it doesn’t… Well, you’re screwed. Kubernetes doesn’t know the difference between “the system has converged successfully” and “a control loop is wedged and is blocking everything else.” You can hope that the offending control loop posted some events to the object to help you, but by and large they don’t.

At which point your only option is to cat the logs of every control loop that might be involved, looking for the one that was wedged. You can make this a bit faster if you have intimate knowledge of all the control loops and what each one does, because that lets you infer from the object’s current state which loop might be trying to run right now.

The key thing to notice here is that the complexity has been shifted from the designer of the control loop to the cluster operator. It’s easy (though not trivial) to make a control loop that does a dinky little thing in isolation. But to operate a cluster with dozens of these control loops requires the operator to assimilate the behavior of all of them, their interactions with each other, and try to reason about an extremely loosely coupled system. This is a problem because you have to write and test the control loop once, but work with it and its bugs many more times. And yet, the bias is to simplify the thing you only do once.

To fix this, I would look to systemd. It solves a similar lifecycle problem: given a current state and a target, how do you get from A to B? The difference is that in systemd, the steps and their dependencies are made explicit. You tell systemd that your unit is a required part of multi-user.target (aka “normally-booted happy system”), that it must run after filesystems have been mounted, but before networking is brought up, and so forth. You can also depend on other concrete parts of the system, for example to say that your thing needs to run whenever sshd is running (sounds like a sidecar, right?).

The net result of this is that systemd can tell you precisely what piece of the system malfunctioned, or is still working on its thing, or failed a precondition. It can also print you a graph of the system’s boot process, and analyze it for things like “what’s the long pole of bootup?”

I want to steal all this wholesale, and plop it into my cluster orchestration system. It does need some adjusting to this new world, but roughly: control loops must declare their dependencies on other control loops, must produce structured logs such that I can trivially search for “all control loop activity regarding pod X”, and the orchestration system handles lifecycle events like systemd handles switching to a new target unit.

What does that look like in practice? Let’s focus on pod lifecycle. Probably we’ll define an abstract “running” target, which is where we want to end up - the pod has started and is happy. Working backwards, the container runtime will add a task that happens before “running”, to start the containers. But it should probably not run until storage systems have had a chance to set up networked mounts, so it’ll order itself after a “storage” target. Similarly for networking, container startup wants to happen after the “networking” target.

Now your Ceph control loop schedules itself to run before the “storage” target, since it’s responsible for bringing up the storage. Other storage control loops do the same (local bind mount, NFS, …). Note this means their setups can run concurrently, because they all declare that they want to run before storage is considered ready, but don’t care if they run before or after other stuff. Or maybe they do care! Maybe you wrote a cool storage addon that does something amazing, but NFS mounting has to happen before you can execute. Cool, that’s fine, add a dependency on the nfs-mounts step, and you’re done. Like systemd, we would have both ordering requirements and hard “I need this other thing to function at all” requirements, so you can have graceful optional ordering of steps.
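
Here's a rough sketch, with a completely invented registration API, of what declaring those dependencies could look like, mirroring systemd's Before=/After=/Requires= split between ordering hints and hard requirements:

```go
// A sketch of control loops declaring their place in the pod lifecycle graph.
// The Register API and step names are invented for illustration.
package lifecycle

type Step struct {
	Name string
	// Ordering-only constraints: run before/after these targets if they
	// are present, but don't require them to exist.
	Before []string
	After  []string
	// Hard dependencies: this step cannot run at all without these.
	Requires []string

	Run func(pod string) error
}

var steps []Step

func Register(s Step) { steps = append(steps, s) }

func Example() {
	Register(Step{
		Name:   "ceph-mounts",
		Before: []string{"storage"}, // storage isn't "done" until we are
		Run:    setupCephMounts,
	})
	Register(Step{
		Name:     "fancy-storage-addon",
		Before:   []string{"storage"},
		After:    []string{"nfs-mounts"},
		Requires: []string{"nfs-mounts"}, // hard dependency on NFS being mounted
		Run:      setupFancyAddon,
	})
	Register(Step{
		Name:   "start-containers",
		After:  []string{"storage", "networking"},
		Before: []string{"running"},
		Run:    startContainers,
	})
}

// Stub implementations, just so the sketch is self-contained.
func setupCephMounts(pod string) error { return nil }
func setupFancyAddon(pod string) error { return nil }
func startContainers(pod string) error { return nil }
```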

(I’m simplifying a little here and assuming the steps aren’t too intertwined. This generalizes to a more complex flow if needed - but see sections further down about working hard to avoid the need for very complex flows in the first place.)

With this in place, your orchestrator can help you answer “why isn’t my pod starting?” You can simply dump the work graph for the pod, and see which steps have completed, which have failed, which are still running. NFS mounting has been going for 5 minutes? I’m guessing the server’s down and the control loop is missing a timeout. Going back to the observation about the matrix of possible configurations and states being immense: that can be okay, if you provide the tools to debug it. Systemd allows you to add arbitrary anythings to the boot process, in any order, with any constraints. But I can still troubleshoot it when it goes wrong, because the constraints it does have combined with the tooling it offers let me quickly make sense of a particular machine from first principles.

Similar to the benefit systemd brings to system startup, this also lets you parallelize lifecycle operations as aggressively as possible, but no more. And because the workflow graph is explicit, it’s extensible. Does your cluster have some company-specific step that happens for every pod, and must happen at a specific place in the lifecycle? Define a new intermediate target for that, make it depend on the right pre- and post-requisites, and hook your control loop on there. The orchestration system will ensure that your control loop always gets involved at exactly the right point in the lifecycle.

Note this also fixes the weirdness with things like Istio, where they have to hackily inject themselves into the human-provided definition in order to function. There’s no need for that! Insert the appropriate control loops into the lifecycle graph, and have it adjust things as needed on the inside. No need to muck with operator-provided objects, as long as you can express to the system where in the lifecycle you need to do stuff.

This section is both very long, and way too short. This is a very large departure from k8s’s API machinery, and so would require a lot of new design work to flesh out. In particular, the major change is that control loops no longer simply observe all cluster state and race to do whatever, but have to wait to be called upon by the orchestrator for specific objects, when those objects reach the right point in their lifecycle. You can bolt this onto k8s as it is now using annotations and programming conventions, but the crisp observability and debuggability benefits don’t fully materialize until you burn the existing thing to the ground.

Interestingly, Kubernetes sort-of already has a prototypical implementation of some of these ideas: initializers and finalizers are effectively happens-before hooks for two lifecycle steps. They let you hook your control loop onto two hardcoded “targets”, splitting control loops into three buckets: initialization, “default,” and finalization. It’s the hardcoded beginnings of an explicit workflow graph. I’m arguing to push that to its logical conclusion.

Explicit field ownership

A modest expansion of the previous section: make each field of an object owned explicitly by a particular control loop. That loop is the only one allowed to write to that field. If no owner is defined, the field is writable by the cluster operator, and nothing else. This is enforced by API machinery, not convention.

This is already mostly the case, but the ownerships are implicit. This leads to two problems: if a field is wrong, it’s hard to figure out who’s responsible; and it’s easy to accidentally run “duelling controllers” that fight indefinitely over a field.

The latter is the bane of MetalLB’s existence, wherein it gets into fights with other load-balancer implementations. That should never happen. The orchestrator should have rejected MetalLB’s addition to the cluster, because LB-related fields would have two owners.

There probably needs to be an escape valve that lets you explicitly permit multiple ownership of fields, but I would start without it and see how far we get. Shared ownership is a code smell and/or a bug until proven otherwise.

If you also require explicit registration of what fields a control loop reads (and strip out those it doesn’t - no cheating) this also lets you do exciting things like prove that the system converges (no loops of read->write->read), or at least reason about the long pole in your orchestration.
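
A sketch of what that enforcement could look like (invented API, but the MetalLB scenario maps directly onto the rejection path):

```go
// A sketch of explicit field ownership, enforced at registration time: adding
// a second owner for a field is rejected outright, which is exactly what
// should have happened when two load-balancer implementations both tried to
// own the same Service fields. Illustrative API only.
package ownership

import "fmt"

type Registry struct {
	owners map[string]string   // field path -> owning control loop
	reads  map[string][]string // control loop -> fields it reads
}

func NewRegistry() *Registry {
	return &Registry{owners: map[string]string{}, reads: map[string][]string{}}
}

// ClaimWrite registers loop as the sole writer of a field path.
func (r *Registry) ClaimWrite(loop, field string) error {
	if owner, ok := r.owners[field]; ok && owner != loop {
		return fmt.Errorf("%s rejected: %s is already owned by %s", loop, field, owner)
	}
	r.owners[field] = loop
	return nil
}

// DeclareRead records that loop reads a field; anything not declared gets
// stripped from the loop's view of the object, so there's no cheating.
func (r *Registry) DeclareRead(loop, field string) {
	r.reads[loop] = append(r.reads[loop], field)
}
```

Once both reads and writes are declared, building the read-to-write dependency graph and checking it for cycles is the easy part.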

IPv6 only, mostly

I’m intimately familiar with Kubernetes networking, and as such it’s the piece I most want to rip out wholesale. There are reasons for why it looks the way it does, and I don’t mean to say there’s not a place for k8s networking. But that place is not in my orchestration system. This is going to be a long section, so strap in.

So, for starters, let’s rip out all k8s networking. Overlay networks, gone. Services, gone. CNI, gone. kube-proxy, gone. Network addons, gone.

(That last one is why this could never happen in k8s proper, by the way. By now there’s an ecosystem of companies selling network addons, and you’d better believe they’re not going to stand by and let me get rid of their reason for existing. The first priority of all ecosystems, in nature and software, is to ensure their continued existence. You can’t ask an ecosystem to evolve itself into extinction, you have to trigger extinction from the outside.)

Right, clean slate. We have containers, and they probably do need to talk to each other and the world. What do?

Let’s give every pod an IPv6 address. Yes, only an IPv6 address for now. Where do they come from? Your LAN has a /64 already (if it doesn’t, get with the program, I’m not designing for the past here), so pluck IPs from there. You barely even need to do duplicate address detection, 2^64 is large enough that rolling random numbers will mostly just work. We’ll need a teensy bit of machinery on each node to make neighbour discovery work, so that machines can find where other pods are hosted, but that’s very easy to do and reason about: to the rest of the LAN, a pod appears to be on the machine that’s running it.

Or maybe we just make up a ULA, and do the routing on each node manually. It’s really easy to implement, and the addressing plan remains mostly “pick a random number and you’re done”. Maybe a tiny bit of subsetting so that node-to-node routes are more efficient, but this is all easy stuff.
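
The addressing scheme really is about as dumb as it sounds. Here's a sketch, assuming a /64 prefix picked for the cluster (the ULA prefix in the comment is an arbitrary example):

```go
// A minimal sketch of "pick a random number and you're done" pod addressing:
// fill the bottom 64 bits of a /64 (LAN or ULA prefix) with random bits.
package ipam

import (
	"crypto/rand"
	"net/netip"
)

// RandomPodIP returns a random address inside the given /64 prefix.
func RandomPodIP(prefix netip.Prefix) netip.Addr {
	addr := prefix.Addr().As16()
	// Randomize the interface-identifier half; with 2^64 possibilities,
	// collisions are rare enough that duplicate address detection is
	// mostly a formality.
	if _, err := rand.Read(addr[8:]); err != nil {
		panic(err) // crypto/rand should never fail
	}
	return netip.AddrFrom16(addr)
}

// Example usage, with a made-up ULA prefix:
//   ip := RandomPodIP(netip.MustParsePrefix("fd5c:1234:5678:1::/64"))
```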

An annoyance is that clouds love to break basic networking primitives, so the IPAM portion will likely have to remain pluggable (within the workflow model from above), so that we can do things like explain to AWS how the traffic is meant to flow. And of course, using IPv6 will make it impossible to run this on GCP. Hahahaha.

Anyway, there’s a couple of ways to skin this cat, but fundamentally, we’re going to use IPv6 and basic, boring routing between nodes. That takes care of pod<>pod connectivity in as close to zero config as we can get, because IPv6 is a large enough space that we can throw random numbers at the wall and come out on top.

If you have more elaborate connectivity needs, you bolt those on as additional network interfaces and boring, predictable IPv6 routes. Need to secure node-to-node comms? Bring up wireguard tunnels, add routes to push node IPs through the wireguard tunnel, and you’re done. The orchestration system doesn’t need to know about any of this, other than probably adding a small control loop to node lifecycle, such that it doesn’t become ready until the tunnels are up.

Okay, so we have pod<>pod connectivity, and pod<>internet connectivity, albeit IPv6-only. How do we get IPv4 into this?

First off, we decree that IPv4 is only for pod<>internet. Thou shalt use IPv6 within the cluster.

Given that constraint, we can do this a couple of ways. Trivially, we can have each node masquerade IPv4 traffic, and allocate out of some small rfc1918 space (the same space on all nodes) for pods. That lets them reach the IPv4 internet, but it’s all static per-node configuration that doesn’t need to be visible to the cluster at all. You could even entirely hide the IPv4 stuff from the control plane, it’s just an implementation detail of the per-machine runtime.

We could also have some fun with NAT64 and CLAT: make the entire network IPv6-only, but use CLAT to trick pods into thinking they have v4 connectivity. Within the pod, do 4-to-6 translation and send the traffic onwards to a NAT64 gateway. It could be a per-machine NAT64, or a deployment within the cluster, or even a big iron external CGNAT type thing if you’re fancy and need to push large amounts of NAT64 traffic. CLAT and NAT64 are well trodden ground at this point: your cell phone is probably doing exactly that to give you IPv4 internet access.

I would probably start with dumb v4 masquerade (the first option), because the amount of configuration required is minimal, it can all be handled locally by each machine without any cross-contamination, and it’s much easier to get us started. Also, it’s easy to change out later, because it all looks the same to pods, and we’re not allowing network addons to run arbitrary code.

That deals with the outbound side, we have dual-stacked internet access. How do we handle the inbound side? Explicit load-balancers. Don’t try to build it into the core orchestration layer. The orchestrator should focus on one thing: if a packet’s destination IP is a pod IP, deliver that packet to the pod.

As it happens, this should mostly just work for clouds. They want that model anyway so they can sell you the cloud LB. Well, fine, you win this time, just give me a control loop to control your LB, and integrate it with your IPAM such that your VPC understands how to route to my pod IPs.

That leaves bare metal clusters out in the cold, sort-of. I argue this is a good thing, because there is no one-size-fits-all load balancing. If I try to give you a load-balancer, it’s not going to work exactly how you want it to, and at some point you’ll snap and install Istio, at which point all my LB’s complexity is dead weight.

We’re going to focus on doing one thing really well: if you send me a packet for a pod, I’ll get the packet to the pod. You can take that and build LBs based on LVS, nginx, maglev-style things, cloud LBs, F5 boxes, the world’s your oyster. And maybe I’ll even provide a couple “default” implementations, as a treat. I do have lots of opinions about load-balancers, so maybe I can make you a good one. But the key is that the orchestration system knows nothing about any of this, all it does is deliver packets to pods.

I didn’t touch on the IPv4 ingress issue, mostly because again I think it’s something each LB is going to solve in the way that works best for it. Proxying LBs like nginx will just dial backends over IPv6 and have zero issues. Stateless maglev-style LBs can do 4-to-6 translation easily, since there is a standard for v4-mapped IPv6 addresses. The packets will arrive at the pod with a source of ::ffff:1.2.3.4, which the pod can decode back to IPv4 if it feels so inclined… Or just treat it as IPv6, and it gets to pretend the internet is v6-only. If stateless translation is used, the outbound leg will need some stateful tracking to map back to IPv4, but that’s no worse than the masquerading we’re already doing all over the place for IPv4, and from the node’s perspective that can be handled entirely with an extra route for ::ffff:0000:0000/96.
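
For the curious, here's what the v4-mapped trick looks like from a pod's point of view, using Go's net/netip. This part is plain standard-library behaviour, not hypothetical orchestrator API:

```go
// Decoding a v4-mapped IPv6 source address, as a pod behind a stateless
// 4-to-6 load balancer would see it.
package main

import (
	"fmt"
	"net/netip"
)

func main() {
	src := netip.MustParseAddr("::ffff:1.2.3.4")
	fmt.Println(src.Is4In6()) // true: this is a v4-mapped IPv6 address
	fmt.Println(src.Unmap())  // 1.2.3.4: the original IPv4 source
}
```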

… Or just don’t

As an alternative to all the above networking stuff, we could just… not. Go back to Borg-style port allocation, where everything runs in the host network namespace, and has to request ports to be allocated to it. Rather than listen on :80, you listen on :%port%, and the orchestration system substitutes that for an unused high port number. So, you end up listening on host-ip:53928 instead.

This is really, really simple. So simple there’s basically nothing to even do. There is some annoying futzing required to allocate the ports in a way that doesn’t result in collisions, which is admittedly a bit of a headache. There’s also the issue of port exhaustion, because 65k ports isn’t actually that much if you have clients that are very chatty. But it’s really, really simple. And I like simple.
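
The futzing itself isn't much code. Here's a sketch of a dumb per-node allocator (invented types, and no claim that this is how Borg does it):

```go
// A sketch of a per-node port allocator: hand out unused ports from the
// ephemeral range, to be substituted for %port% in the workload's config.
package ports

import "fmt"

type Allocator struct {
	inUse map[int]bool
}

func NewAllocator() *Allocator {
	return &Allocator{inUse: map[int]bool{}}
}

// Allocate returns a free port from the ephemeral range, or an error if the
// node is exhausted (with chatty clients, that happens sooner than you'd like).
func (a *Allocator) Allocate() (int, error) {
	for p := 49152; p <= 65535; p++ {
		if !a.inUse[p] {
			a.inUse[p] = true
			return p, nil
		}
	}
	return 0, fmt.Errorf("no free ports on this node")
}

func (a *Allocator) Release(p int) { delete(a.inUse, p) }
```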

We could also go with a hybrid of the two, in the classic Docker style: you run in your own netns, on some ephemeral private IPs. You can use whatever port(s) you want, but the only ones visible to other pods and the world are those that the runtime has been told to expose. And you only get to say which target port to expose, the resulting host port is picked by the container runtime. (With probably some escape hatch where you can tell it you really want port 80, and that acts as a scheduling constraint - similar to k8s’s facility in that area).

The key part of all these is that they drastically simplify the network layer, to the point where I can explain it to someone in a few minutes, and trust that they’ll be able to reason about what’s going on.

The downside is that this pushes the complexity into service discovery. You can’t use “just DNS” as the discovery mechanism, because most DNS clients don’t grok SRV records, and therefore won’t be able to dynamically discover the random port.
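
To be clear about what “just DNS” would demand: every client would need to do something like the SRV lookup below to discover both the host and the randomly allocated port (the service name here is made up), and most application stacks simply never will.

```go
// SRV-based discovery, which is what dynamic host ports would require of
// every client in the cluster.
package main

import (
	"fmt"
	"net"
)

func main() {
	// Looks up _myapp._tcp.cluster.example and returns host/port pairs.
	_, srvs, err := net.LookupSRV("myapp", "tcp", "cluster.example")
	if err != nil {
		fmt.Println("lookup failed:", err)
		return
	}
	for _, s := range srvs {
		fmt.Printf("dial %s:%d\n", s.Target, s.Port)
	}
}
```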

Amusingly, the trend to service meshes has made this moot, because people now assume the existence of a local smart proxy that can do whatever service discovery is required, and map that onto a trivial view of the network that’s only visible to the pod that needs it. I’m loath to embrace that, however, because service meshes add so much complexity and pain that I don’t want to mandate them… At least not until someone else figures out how to make them work well.

So, we could do something service-mesh-ey but simpler, and do some clever ip:port translation on the source host… but this is starting to look a lot like kube-proxy, with all the attendant complexity and debugging challenges (you can’t just tcpdump at various places, because the traffic mutates between hops).

So, that suggests to me that explicit host port mapping could be a thing, but there’s a lot of hidden complexity there (which I believe is why Kubernetes went with ip-per-pod in the first place). Borg solves that complexity by fiat, decreeing that Thou Shalt Use Our Client Libraries For Everything, so there’s an obvious point at which to plug in arbitrarily fancy service discovery and load-balancing. Unless we go full service mesh, we don’t have that luxury.

I’m still thinking on the options here, but I think I prefer the previous section’s implementation. It’s a bit more upfront complexity, but the result is a composable, debuggable, understandable system that doesn’t need to grow without bounds to meet new needs.

Security is yes

After that long networking discussion, a short bit on security. The default should be maximum sandboxing, with explicit double-opt-in steps required to weaken it.

That means actually using all the great work Jessie Frazelle did on container security. Turn on the default apparmor and seccomp profiles. Disallow running as root in the container. Use a userns so that even if you manage to escalate to root in the container, that’s not system root. Block all devices and other nonsense by default. No host bind mounts. In effect, write the most restrictive pod security policy you can, then make that the default, and make it hard to depart from the default.
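
Written down as a policy object (the field names are invented, but the settings are the ones just listed), the default would look something like this:

```go
// A sketch of the default sandbox policy: the most restrictive settings we
// know how to express, applied to every pod unless both the operator and the
// workload explicitly opt out.
package policy

type SandboxPolicy struct {
	SeccompProfile  string
	AppArmorProfile string
	AllowRoot       bool
	UserNamespace   bool
	AllowDevices    bool
	AllowHostMounts bool
}

// DefaultPolicy is what every pod gets, no questions asked.
var DefaultPolicy = SandboxPolicy{
	SeccompProfile:  "runtime/default",
	AppArmorProfile: "runtime/default",
	AllowRoot:       false, // no root inside the container
	UserNamespace:   true,  // container root is not host root, even on escape
	AllowDevices:    false,
	AllowHostMounts: false,
}
```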

Pod security policies get reasonably close to this, in that they enforce double-opt-in: the cluster operator has to permit you to do unsafe things, and you have to request the unsafe thing explicitly. What they don’t do, sadly, is fix Kubernetes’s terrible defaults. But we’re not caring about backwards compatibility, so we can make the default as secure as we know how.

(Warning: from here on, the sections get more vague and less justified. They’re things I want and feel strongly about, but haven’t thought through nearly as much.)

gVisor? Firecracker?

Speaking of maximum security by default, this is a good time to think on even more extreme sandboxing measures. Could we make gVisor or Firecracker the default, and require double-opt-in work to even graduate to “a regular container that’s as secure as possible, but shares a kernel with the host”?

I’m on the fence about this. On the one hand, I really want the extreme degree of security these tools promise. On the other, it’s a whole bunch of extra code that now has to run, meaning more potential attack surface and more complexity. And these sandboxes place extreme constraints on what you can do, e.g. anything to do with storage devolves into “no, you can’t have any.” It’s great for some workloads, but maybe making it the default is a step too far?

On the storage front at least, the maturation of virtio-fs would solve a lot of these issues, effectively enabling these sandboxes to bind-mount stuff efficiently without breaking the security model. Maybe we should revisit at that point?…

Very distributed clusters

I guess the hip term for this would be “edge compute”, but I really do just mean that I want all my servers to be under one orchestration system, and operate them as a single unit. That means the computers in my home server rack, the DigitalOcean VM I have, and the couple other servers dotted around the internet. These should all be part of one cluster, and behave as such.

This leads to a couple of departures from k8s. First, nodes should be more independent of the control plane than they currently are. It should be possible to operate for extended periods of time (in the extreme, weeks) without a control plane, as long as there are no machine failures that require rescheduling.

I think this mostly translates to syncing more data down to nodes in persistent storage, so that nodes have everything they need to come back up into the programmed state, even from a cold boot. Conceptually, I want the cluster orchestrator to populate a set of systemd units on each machine, and then switch to a very passive role in the node’s life. It has everything it needs locally, and unless those things need to change, the node is independent of the control plane.
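
A sketch of the node-side half of that, with invented types and paths: the last-synced desired state lives on disk, and a cold boot just replays it.

```go
// A sketch of node-local persisted state: the control plane syncs the node's
// full desired state to disk, and on (re)boot the node converges to it
// without needing the control plane at all.
package node

import (
	"encoding/json"
	"os"
)

type DesiredState struct {
	Pods []PodSpec
}

type PodSpec struct {
	Name  string
	Image string
	Args  []string
}

// LoadLastSynced reads whatever the control plane last pushed down. Even if
// the control plane has been unreachable for weeks, this is enough to bring
// the node back up into its programmed state from a cold boot.
func LoadLastSynced(path string) (DesiredState, error) {
	var s DesiredState
	b, err := os.ReadFile(path)
	if err != nil {
		return s, err
	}
	err = json.Unmarshal(b, &s)
	return s, err
}
```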

That does lead to questions of what to do about node loss. This is a key signal in a “centralized” cluster to trigger workload rescheduling, but in a distributed universe I’m more likely to go “eh, that’s probably an internet blip and it’ll be back soon.” So, the node lifecycle and how it relates to pod lifecycle would have to change. Maybe you have to pick explicitly whether you want pods to be “HA” (i.e. reschedule aggressively when nodes vanish) or “best effort” (reschedule only when we know for certain a node is dead - i.e. operator action).

One way to view this is that in my “distributed” cluster my pods are more likely to be unreplicated pets. I want to enable web-scale things, but unlike k8s I also want to scale down to “I run one of everything, and I care fairly deeply about where geographically things run.” This is not a view that k8s encourages or helps with, and so we have to depart somewhat from what k8s does here.

Another way to view this would be cluster federation as a first class object. Maybe these spread-out machines are actually individual clusters, with their own control planes, and there just happens to be an amazingly nice way to wield them as one mega-cluster. That could certainly work, and answer some questions about how to decouple nodes from the control plane (answer: don’t, but move the control plane as close as possible to the nodes, and have many of them). I would want the control plane to be extremely lean in that case, because the overhead of doing this in k8s would be comically bad, and I want to be far away from that.

This also raises the difficulty of the networking component, since we now have to link up across WANs. Obviously, my answer to that would be some kind of Tailscale integration, since that’s literally what we do, but maybe we need something more bespoke with fewer moving parts (e.g. no need for the magical NAT traversal, mostly).

VMs as primitives

Note: when I say VMs here, I don’t mean “the things you run Kubernetes on”. I mean virtual machines, which I make go myself on a hypervisor and do arbitrary things with. Think Proxmox or ESXi, not EC2.

I want my orchestration system to seamlessly mix containers and VMs. Otherwise, in practice, I’m going to need a separate hypervisor, and I’ll have two different systems to manage.

I’m not sure what this means exactly, other than a rough idea that the functionality provided by kubevirt should end up built-in and feel part of the core system, just as much as containers. This is a fairly large can of worms, because this could mean anything from “let me run a system with virtual floppies” to “run a hypervisor that looks and feels kind of like EC2”, which are very different things. All I know is that I don’t want to run Proxmox and this hypothetical container orchestrator, but I definitely do want both VMs and containers.

How to storage?

Storage is the big nebulous unknown in my mind’s eye. I simply don’t have the requisite experience to have opinions. I feel like CSI is too complex and should get trimmed, but I also don’t have good ideas to put forward, beyond the tiny bit relating to lifecycle workflows above. If I were to build any of this, storage is where I’d likely keep things overly pluggable, and end up regretting it once I learn enough to have opinions.

The end

I’m sure I’ve forgotten some things that I’d like to fix, or some whacky ideas I have in this space. But, if I were to embark on replacing Kubernetes tomorrow, these points are where I would start.

I didn’t mention the other players in this space too much - Hashicorp’s Nomad, Facebook’s Twine, Google’s Borg and Omega, Apache Mesos (made famous by Twitter). Other than Borg, I haven’t used them enough to form strong opinions about them, and if I were to embark on this project I’d definitely invest more time getting to know them first, so I could steal their good ideas and avoid their mistakes. I’d also think deeply about Nix, and how to blend its ideas in with all of this.

And let’s be honest, I’m probably not going to do anything about these ideas. Borg sold me on the idea of this kind of computing, and Kubernetes un-sold me on it. I currently believe that the very best container orchestration system is no container orchestration system, and that effort would be well spent avoiding that rabbit hole at all costs. Obviously, that idea is incompatible with building a container orchestration system :).