The Build System Illusion: What We Lose When Everything Looks Like a Cloud Deploy
Someone on my team asked yesterday: "Can we just containerize this and deploy it like everything else?"
We were talking about pushing a security agent to endpoint devices. Not EC2 instances. Not Kubernetes nodes. Actual routers sitting in apartments across Taiwan, maybe getting power-cycled when someone's toddler finds the button.
The question reveals something interesting about how we think about infrastructure now. We've spent a decade building elegant abstractions for cloud deployments, and somewhere along the way we convinced ourselves these abstractions are universal. They're not. And when you try to force them onto problems they weren't designed for, you start discovering all the assumptions baked into your tooling.
This isn't a tutorial. This is me documenting what breaks when you stop deploying to datacenters you control.
What We Forgot While Building CI/CD Pipelines
Every CI/CD tutorial follows the same script: code → test → build → deploy. The tutorials work because they make assumptions they never mention out loud.
Your deployment target can reach out and pull updates. It has enough bandwidth to do this efficiently. When something fails, you can just retry. If the retry fails, you can roll back by deploying something else. The network is reliable enough that these operations complete in reasonable time.
These aren't features of "deployment." These are features of deploying to infrastructure you control.
I started mapping our current pipeline to see what would need to change for distributed device deployment. Standard setup:
- Git push triggers CI
- Run tests, security scans, build artifacts
- Push to artifact registry
- Orchestrator pulls and distributes
This is elegant. It works beautifully. And it will likely fall apart when your deployment target is 10,000 clients that:
- Might be offline when you push the update
- Might stay offline for days
- Might have 5 Mbps connections shared with a family streaming Netflix
- Might fail the update halfway through and just... stay that way
- Might be running old firmware you didn't even know existed
The scary part isn't that the pipeline doesn't work. The scary part is how long it took us to realize it wouldn't work. We've gotten so good at cloud deployments that we've forgotten these are solved problems for a very specific environment.
What Breaks First: The Optimistic Network Assumptions
Cloud deployments assume the network is basically a non-issue. Sure, you might have a timeout here or there, but fundamentally you trust that HTTP requests complete and your orchestrator can talk to your services.
This assumption dies immediately with distributed devices.
Picture this: You push an update at 2 AM (because that's when traffic is lowest). Of your 10,000 clients, maybe 7,000 are online. They start pulling the artifact. Except it's 50MB, and some of these devices are on rural connections, and now you're essentially DDoS'ing your own artifact registry.
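Back-of-the-envelope, using those numbers and assuming everyone who's online pulls within an hour (the one-hour window is my invention):

```python
# Rough load estimate for one push: 7,000 online clients pulling a 50 MB artifact.
artifact_mb = 50
online_clients = 7_000
window_seconds = 3600

total_gb = artifact_mb * online_clients / 1000                        # ~350 GB of egress
sustained_mbps = artifact_mb * 8 * online_clients / window_seconds    # ~780 Mbps, sustained

print(f"~{total_gb:.0f} GB total, ~{sustained_mbps:.0f} Mbps sustained")
```

Nearly 800 Mbps of sustained egress, for an hour, from one routine update.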
The ones that do start downloading—what happens when the connection drops halfway through? In Kubernetes, a failed pull just reschedules. The container runtime handles it. With a router? You might have a partially written filesystem and a device that won't boot on the next power cycle.
"Just use resumable downloads," someone will say. Sure. But now you need:
- Devices to track download state locally
- Verification that partial downloads aren't corrupted
- A way to clean up attempts that never complete
- Some kind of backoff so 3,000 devices don't hammer your CDN simultaneously when they retry
You're rebuilding a download manager. For every device type. Because you can't just docker pull anymore.
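Here's a minimal sketch of what that client-side download manager ends up looking like. It assumes the artifact server honors HTTP Range requests and that the expected SHA-256 arrives out of band in a signed manifest; the function and its parameters are mine, not from any framework:

```python
"""Minimal resumable-download sketch, not production code."""
import hashlib
import os
import random
import time
import urllib.request


def fetch_resumable(url: str, dest: str, expected_sha256: str, max_attempts: int = 8) -> None:
    part = dest + ".part"  # partial download state lives on the device itself
    for attempt in range(max_attempts):
        offset = os.path.getsize(part) if os.path.exists(part) else 0
        try:
            req = urllib.request.Request(url, headers={"Range": f"bytes={offset}-"})
            with urllib.request.urlopen(req, timeout=30) as resp, open(part, "ab") as out:
                while block := resp.read(64 * 1024):
                    out.write(block)
            break  # body finished without an exception
        except OSError:
            # Jittered exponential backoff so thousands of devices retrying at
            # once don't hammer the CDN in lockstep.
            time.sleep(min(2 ** attempt, 300) * random.uniform(0.5, 1.5))
    else:
        raise RuntimeError("gave up after repeated download failures")

    # Verify before committing; a corrupt partial gets thrown away, not applied.
    digest = hashlib.sha256()
    with open(part, "rb") as f:
        while block := f.read(64 * 1024):
            digest.update(block)
    if digest.hexdigest() != expected_sha256:
        os.remove(part)
        raise RuntimeError("checksum mismatch; discarded partial download")
    os.replace(part, dest)  # atomic rename, so a power cycle never sees half a file
```

Thirty lines, and it still ignores parts of the list above, like cleaning up partials that are never going to complete.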
The deeper issue is that cloud infrastructure trained us to ignore these problems. When was the last time you thought about how kubectl apply actually gets your deployment to the nodes? You don't, because Kubernetes handles it. That's the abstraction working.
But abstractions leak, and when they leak at the edge, you're left rebuilding things you thought were solved problems.
The Prototype: Reinventing What We Thought Was Solved
I built a prototype because I wanted to understand what we're actually trading when we move from cloud to edge.
The architecture isn't novel:
- Build system creates signed artifacts (same as before)
- Artifacts get split into chunks with content-addressed storage (sketched below)
- Regional edge nodes cache the chunks
- Devices pull from nearest node when online
- Each device verifies integrity before applying
- Interrupted downloads can resume
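The chunk-and-manifest step, the part the second bullet glosses over, looks roughly like this. The 4 MB chunk size, the manifest layout, and the file names are illustrative, not what my prototype literally ships:

```python
"""Sketch of splitting an artifact into content-addressed chunks plus a manifest."""
import hashlib
import os

CHUNK_SIZE = 4 * 1024 * 1024  # small enough to resume cheaply, big enough to keep per-chunk overhead low


def chunk_artifact(artifact_path: str, chunk_dir: str) -> dict:
    os.makedirs(chunk_dir, exist_ok=True)
    manifest = {"artifact": os.path.basename(artifact_path), "chunks": []}
    with open(artifact_path, "rb") as f:
        while block := f.read(CHUNK_SIZE):
            digest = hashlib.sha256(block).hexdigest()
            # Content addressing: a chunk's name is its hash, so identical chunks
            # across versions dedupe on edge caches, and devices fetch only what changed.
            path = os.path.join(chunk_dir, digest)
            if not os.path.exists(path):
                with open(path, "wb") as out:
                    out.write(block)
            manifest["chunks"].append({"sha256": digest, "size": len(block)})
    return manifest
```

A device compares a manifest like this against the chunks it already has and fetches only the missing ones from its nearest edge node, verifying each hash as it goes.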
This is roughly what every OTA update system does. Which raises the question: why doesn't everyone just use an existing OTA framework?
Because OTA frameworks assume you're updating firmware. They live in a world where:
- Updates happen maybe monthly
- You're replacing the entire system image
- The device reboots as part of the process
- "Rollback" means keeping two full system partitions
We're trying to update application code. Potentially daily. Without rebooting. While the application is serving traffic. With minimal storage overhead.
The existing solutions don't fit. So you end up rebuilding pieces of:
- Package managers (dependency resolution, version management)
- Container runtimes (layered filesystems, atomic updates)
- Service orchestrators (health checking, rollback logic)
- Download managers (chunking, resume, verification)
All the things we thought were solved because Kubernetes and Docker handle them for us. Except now you can't use Kubernetes or Docker because they assume resources you don't have.
What You Give Up: The Observability Black Hole
Here's where it gets uncomfortable.
In cloud deployments, observability is basically solved. Prometheus scrapes metrics. Logs stream to Elasticsearch. Traces go to Jaeger. You can see what's happening across your entire fleet in real time.
With distributed devices? You're flying blind.
A device pulls an update. It applies the update. Maybe it works. Maybe it doesn't. You won't know unless:
- The device can phone home (costs bandwidth)
- It's currently online (can't assume this)
- Your telemetry isn't broken (but how would you know?)
- The update didn't break the telemetry itself (classic chicken-egg)
I added a lightweight telemetry system to my prototype. Devices queue status updates locally and flush when they have connectivity. Sounds reasonable until you realize:
If an update breaks the device, the telemetry can't report the failure. So you see... nothing. Just silence. Which could mean "device is offline" or "device is broken" or "device is fine but busy" or "telemetry is broken but device is fine."
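For concreteness, the queue-and-flush agent is roughly this, with a placeholder spool path and endpoint. Every line of it assumes the process running it survived the update:

```python
"""Queue-and-flush telemetry sketch. Paths, endpoint, and payload shape are placeholders."""
import json
import os
import time
import urllib.request

SPOOL = "/var/lib/agent/telemetry.jsonl"               # assumed writable local storage
ENDPOINT = "https://telemetry.example.invalid/ingest"  # placeholder, not a real endpoint


def record(event: str, **fields) -> None:
    # Always write locally first; the device may be offline for days.
    with open(SPOOL, "a") as f:
        f.write(json.dumps({"ts": time.time(), "event": event, **fields}) + "\n")


def flush() -> None:
    # Best-effort upload. On any failure, keep the spool and try again later.
    if not os.path.exists(SPOOL):
        return
    try:
        with open(SPOOL, "rb") as f:
            req = urllib.request.Request(
                ENDPOINT,
                data=f.read(),
                headers={"Content-Type": "application/x-ndjson"},
            )
            urllib.request.urlopen(req, timeout=15)
        os.remove(SPOOL)  # drop the queue only after the server accepted it
    except OSError:
        pass  # still offline, or the endpoint is down; nothing is lost
```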
You've lost the feedback loop that makes cloud development tolerable. Push a bad update and you might not know for hours or days. By then, understanding what failed requires physical access to the device.
This isn't a solvable problem. It's a fundamental trade-off. You're giving up observability for distribution. The question is whether you understand what you're trading.
The Parts Where Cloud Thinking Fails Completely
Rollback: The Illusion of Control
Cloud deployment rollback is simple. You deploy version N-1. The orchestrator handles it. Done.
This works because you control both ends of the transaction. You control when the rollback happens. You control which nodes get rolled back first. You can verify the rollback worked.
Now try rolling back 10,000 distributed devices. You push the rollback command. What happens?
3,000 devices are offline. They'll get the rollback... eventually. Maybe tomorrow. Maybe next week. 2,000 devices are in the middle of applying the broken update. Do they abort? Do they finish and then roll back? 1,000 devices already applied the update and have been running it for 6 hours. Their local state might be incompatible with the old version.
The remaining 4,000 roll back successfully. Probably. You think. You won't actually know for a while because of the observability problem.
What you've lost is the ability to reason about system state. In Kubernetes, rollback is atomic (or near enough). With distributed devices, you're managing a multi-day eventually-consistent rollback process where you can't observe success and can't guarantee completion.
The solution? Most OTA systems don't really support rollback. They support "keep the old version around and boot into it if the new version crashes repeatedly." Which helps with catastrophic failures but does nothing for subtle bugs that don't crash.
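That pattern, in sketch form, with illustrative paths and a made-up crash limit:

```python
"""Sketch of 'keep the old version around and fall back if the new one keeps crashing'."""
import json
import os
import subprocess

STATE_FILE = "/var/lib/agent/update_state.json"  # must survive restarts and power cycles
CRASH_LIMIT = 3
VERSIONS = {"current": "/opt/agent/current/agent", "previous": "/opt/agent/previous/agent"}


def load_state() -> dict:
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            return json.load(f)
    return {"active": "current", "crashes": 0}


def save_state(state: dict) -> None:
    tmp = STATE_FILE + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, STATE_FILE)  # atomic, so a power cycle can't half-write it


def supervise_once() -> None:
    state = load_state()
    result = subprocess.run([VERSIONS[state["active"]]])
    if result.returncode == 0:
        state["crashes"] = 0  # clean exit resets the counter
    else:
        state["crashes"] += 1
        if state["crashes"] >= CRASH_LIMIT and state["active"] == "current":
            # Fall back to the previous version. Note what this does NOT catch:
            # a build that runs fine but does the wrong thing never trips it.
            state["active"] = "previous"
            state["crashes"] = 0
    save_state(state)
```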
You're rediscovering why mobile app developers test so carefully before release. Because once it's deployed, "rollback" isn't really an option.
Heterogeneity: When Your Build Matrix Explodes
Cloud infrastructure is beautifully uniform. You might have different instance types, but they're all running basically the same OS, same CPU architecture, same capabilities.
Edge devices laugh at this uniformity.
Half your devices have 512MB RAM. The other half have 2GB. Some have hardware crypto. Some don't. Different CPU architectures. Different kernel versions. Different storage constraints.
Do you:
- Build separate artifacts for each variant? (Your CI now builds 15 different versions)
- Build one binary with feature detection? (Bloated, complex, testing nightmare)
- Modular plugins? (Adds deployment complexity, version compatibility matrix)
Each choice trades something. More build complexity for smaller binaries. Simpler builds for harder testing. Runtime efficiency for deployment flexibility.
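That first choice sounds manageable until you multiply the axes out. A made-up but not unrealistic matrix:

```python
# Illustrative only; these axes and counts are assumptions, not a real fleet inventory.
from itertools import product

archs = ["armv7", "aarch64", "mips"]
ram_tiers = ["512MB", "2GB"]
crypto = ["hw", "sw"]

variants = list(product(archs, ram_tiers, crypto))
print(len(variants))  # 12 build targets before kernel versions or storage layouts enter the picture
```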
The cloud abstracted this away. EC2 instances are predictable. Kubernetes nodes are predictable. You built your tools for predictability.
Now you're remembering why embedded developers have such elaborate build systems. Because building for heterogeneous hardware is genuinely complicated, and there's no abstraction that makes it simple without hiding real constraints.
State Management: The Distributed Database You Didn't Mean to Build
Here's a fun one: device configuration.
In cloud deployments, configuration is in git. You push changes, they roll out, done. Want to update a feature flag? Change the value, deploy.
With distributed devices, configuration becomes a distributed database problem.
Device A gets new config version 5. Device B is offline, still running config version 3. Device C failed to apply version 4 and rolled back to version 3, but its local state reflects changes from version 4.
Now a user reports a bug. Which config version were they running? Which version should they be running? Is the bug because of the config, or because of the application, or because of interaction between mismatched versions?
You can't just "check the current config" because there is no current config. There are 10,000 different configs in various states of consistency.
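Answering "which config version were they running?" means keeping per-device records like the sketch below, and accepting that they're only as fresh as the last report. The field names are mine:

```python
"""Sketch of per-device bookkeeping kept server-side, assuming devices report back at all."""
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class DeviceConfigState:
    device_id: str
    desired_version: int                    # what we want the device to run
    reported_version: Optional[int] = None  # what it last told us it runs
    reported_at: Optional[float] = None     # when it last told us anything
    failed_version: Optional[int] = None    # a version it tried and rolled back from


def needs_attention(s: DeviceConfigState, stale_after: float = 7 * 86400) -> bool:
    """True when we cannot honestly say what this device is running."""
    if s.reported_version is None:
        return True
    if time.time() - (s.reported_at or 0) > stale_after:
        return True  # a week of silence: offline? bricked? telemetry dead? no way to tell
    return s.reported_version != s.desired_version or s.failed_version is not None
```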
The cloud solution—configuration management tools, service discovery, distributed consensus—doesn't work when devices are offline more often than they're online.
You're reinventing eventual consistency patterns from distributed databases. Except you're doing it for configuration management. And you're discovering why distributed databases are hard.
What We Forgot While Building Kubernetes
This started as a simple deployment problem. It turned into a reminder about how much knowledge we've lost.
A generation of developers now knows how to deploy to Kubernetes but not how to build systems that don't assume Kubernetes exists. We know how to write microservices but not how to design for intermittent connectivity. We know how to scale horizontally but not how to handle actual physical distribution.
This isn't their fault. The abstractions worked so well that we stopped teaching the fundamentals they abstracted away.
Want to deploy an update? kubectl apply. Want to roll back? kubectl rollout undo. Want observability? Install Prometheus. These are solved problems. Solved so thoroughly that we forgot what the problems were.
Edge computing is forcing us to remember.
The network isn't reliable. Deployment targets aren't uniform. State isn't eventually consistent—it's just inconsistent. Rollback isn't atomic. Observability isn't centralized. Configuration isn't unified.
These aren't bugs in edge computing. These are properties of distributed systems that cloud infrastructure successfully hid from us. We thought we'd solved distributed systems. We just built a very expensive abstraction that works when you control the datacenter.
What This Means for the Next Decade
We're about to see a lot of bad architecture. Not because people are incompetent, but because the tooling they know doesn't apply and the knowledge base to build new tooling has atrophied.
Mobile developers know these problems. Embedded systems engineers know these problems. Game developers know these problems.
But there's a whole generation of backend/cloud engineers who've never had to think about:
- Offline-first operation
- Bandwidth-constrained deployments
- Devices that can't be SSH'd into
- Updates that take days to propagate
- Debugging without centralized logs
- State that's distributed by default, not by choice
The industry is hiring cloud engineers to build edge systems. Then expressing surprise when the solutions look like complex, expensive reinventions of problems that were solved in embedded systems 20 years ago.
We need to rediscover institutional knowledge we thought we didn't need anymore. Or admit that "edge computing" is just going to be "cloud computing with extra steps and worse performance."
The Reading List Nobody Talks About
The resources I've found useful aren't from cloud computing thought leaders. They're from the places that never forgot these problems:
- SWUpdate documentation (embedded Linux) - Atomic updates, rollback strategies
- Mender architecture - OTA updates for embedded devices
- Chrome OS update_engine - Delta updates, cryptographic verification
- FreeRTOS OTA - Working with device constraints
- Balena documentation - Fleet management for IoT
These aren't sexy. They're not written by FAANG engineers blogging about scaling to billions of requests. They're written by people who've been deploying to heterogeneous hardware with unreliable networks for decades.
The irony is that we're now reinventing their solutions, but worse, because we're trying to apply cloud patterns to problems cloud patterns don't fit.
What I'm Actually Building (And Why)
My prototype isn't trying to be production-ready. It's trying to understand the problem well enough to recognize good solutions when I see them.
Right now I'm working through:
- Chunk size optimization (smaller = better resumability, larger = less overhead; see the sketch after this list)
- Testing at scale without 10,000 physical devices
- Security/verification vs. resource constraints trade-offs
- Configuration schema migrations when updates aren't atomic
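The chunk-size trade-off from the first item, worked through for a 50 MB artifact (all the sizes here are assumptions):

```python
# Chunk-size trade-off for an assumed 50 MB artifact; the candidate sizes are illustrative.
artifact = 50 * 1024 * 1024

for chunk in (256 * 1024, 1 * 1024 * 1024, 4 * 1024 * 1024):
    n = -(-artifact // chunk)   # ceiling division: number of chunks
    manifest_bytes = n * 32     # one 32-byte SHA-256 per chunk in the manifest
    print(f"{chunk >> 10:>5} KB chunks: {n:>4} of them, "
          f"{manifest_bytes} B of hashes, "
          f"{chunk >> 10} KB worst-case re-download after an interruption")
```

Smaller chunks cap the re-download cost of an interruption; bigger chunks shrink the manifest and the bookkeeping.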
These aren't new problems. But they're new to me, because cloud infrastructure let me ignore them for a decade.
The goal isn't to build another OTA framework the world doesn't need. The goal is to develop enough understanding that I can make informed architectural decisions when edge deployment becomes unavoidable.
Which, if current trends continue, will be soon.
Because the next wave of computing isn't happening in data centers. It's happening in cars, factories, homes, and cities. And all those devices need software updates.
Your CI/CD pipeline wasn't designed for this. Neither was mine.