Devlog #35: Infrastructure lessons at scale

2024-07-04 · From the vault

What we've learned running the world for two years. Failover, monitoring, and the night we don't want to repeat. War stories, sanitized. Lessons, shared.

Two years of open beta. We've had nights where something broke and we didn't know why for hours. We've fixed that: observability is now non-negotiable. When the world degrades, we need to attribute the cause to a specific build, region, or resource bottleneck—and ideally respond automatically. That's the control plane we're building: not just "servers run" but "we know why they're running (or not), and we can act."
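To make "attribute the cause" concrete, here is a minimal sketch, not our production pipeline. It assumes each health sample is tagged with a build, a region, and a resource label, and it picks the slice whose failure rate most exceeds the overall rate. The names `Sample`, `attribute_degradation`, and `min_samples` are hypothetical and exist only for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One health observation. All field names are illustrative assumptions."""
    build: str
    region: str
    resource: str  # hypothetical label, e.g. "cpu", "memory", "network"
    failed: bool


def attribute_degradation(samples, min_samples=3):
    """Return (dimension, value, failure_rate) for the slice whose failure
    rate most exceeds the overall rate, or None if no slice stands out."""
    if not samples:
        return None

    # (dimension, value) -> [failures, total]
    counts = defaultdict(lambda: [0, 0])
    for s in samples:
        for dim in ("build", "region", "resource"):
            key = (dim, getattr(s, dim))
            counts[key][0] += int(s.failed)
            counts[key][1] += 1

    overall = sum(s.failed for s in samples) / len(samples)

    best = None  # (dimension, value, rate, excess over overall rate)
    for (dim, value), (failures, total) in counts.items():
        if total < min_samples:
            continue  # too few observations to say anything useful
        rate = failures / total
        excess = rate - overall
        if excess > 0 and (best is None or excess > best[3]):
            best = (dim, value, rate, excess)

    return best[:3] if best else None
```

A real system would work on time windows and use a proper statistical test instead of a raw difference in rates; the point here is only the shape of the question: slice by dimension, compare against the baseline, name the outlier.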

Failover and the virtuous cycle

Better instrumentation → faster iteration → more ambitious live ops → more value from further automation, which in turn pays for better instrumentation. We've learned that the hard way. We've also learned that a multiplayer world needs more than server orchestration. It needs matchmaking as policy: who gets placed where, and why. And it needs real-time safety: risk assessment, enforcement, abuse prevention. Not as an afterthought. As part of the runtime. The night we don't want to repeat taught us that. We're building so we don't have to repeat it.
