<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:media="http://search.yahoo.com/mrss/"><channel><title><![CDATA[César D. Velandia]]></title><description><![CDATA[Architecting Secure Software | Building with Rust | Automating LLM Flows | Embedded Future]]></description><link>https://cesar.velandia.co/</link><image><url>https://cesar.velandia.co/favicon.png</url><title>César D. Velandia</title><link>https://cesar.velandia.co/</link></image><generator>Ghost 4.1</generator><lastBuildDate>Fri, 06 Mar 2026 00:04:03 GMT</lastBuildDate><atom:link href="https://cesar.velandia.co/rss/" rel="self" type="application/rss+xml"/><ttl>60</ttl><item><title><![CDATA[The Build System Illusion: What We Lose When Everything Looks Like a Cloud Deploy]]></title><description><![CDATA[<p>Someone on my team asked yesterday: &quot;Can we just containerize this and deploy it like everything else?&quot;</p><p>We were talking about pushing a security agent to endpoint devices. Not EC2 instances. Not Kubernetes nodes. Actual routers sitting in living rooms, maybe getting power-cycled when someone&apos;s toddler</p>]]></description><link>https://cesar.velandia.co/build-system-illusion/</link><guid isPermaLink="false">6948ad4715a35701511cccd8</guid><category><![CDATA[DevOps]]></category><dc:creator><![CDATA[César D. 
Velandia]]></dc:creator><pubDate>Mon, 22 Dec 2025 02:34:56 GMT</pubDate><media:content url="https://cesar.velandia.co/content/images/2025/12/ComfyUI_00604_.png" medium="image"/><content:encoded><![CDATA[<img src="https://cesar.velandia.co/content/images/2025/12/ComfyUI_00604_.png" alt="The Build System Illusion: What We Lose When Everything Looks Like a Cloud Deploy"><p>Someone on my team asked yesterday: &quot;Can we just containerize this and deploy it like everything else?&quot;</p><p>We were talking about pushing a security agent to endpoint devices. Not EC2 instances. Not Kubernetes nodes. Actual routers sitting in living rooms, maybe getting power-cycled when someone&apos;s toddler finds the button (real story).</p><p>The question reveals something interesting about how we think about infrastructure now. We&apos;ve spent a decade building elegant abstractions for cloud deployments, and somewhere along the way we convinced ourselves these abstractions are universal. They&apos;re not. And when you try to force them onto problems they weren&apos;t designed for, you start discovering all the assumptions baked into your tooling.</p><p>This isn&apos;t a tutorial. This is me documenting what breaks when you stop deploying to datacenters you control.</p><h2 id="what-we-forgot-while-building-cicd-pipelines">What We Forgot While Building CI/CD Pipelines</h2><p>Every CI/CD tutorial follows the same script: code &#x2192; test &#x2192; build &#x2192; deploy. The tutorials work because they make assumptions they never mention out loud.</p><p>Your deployment target can reach out and pull updates. It has enough bandwidth to do this efficiently. When something fails, you can just retry. If the retry fails, you can roll back by deploying something else. 
The network is reliable enough that these operations complete in reasonable time.</p><p>These aren&apos;t features of &quot;deployment.&quot; These are features of deploying to infrastructure you control.</p><p>I started mapping our current pipeline to see what would need to change for distributed device deployment. Standard setup:</p><ul><li>Git push triggers CI</li><li>Run tests, security scans, build artifacts</li><li>Push to artifact registry</li><li>Orchestrator pulls and distributes</li></ul><p>This is elegant. It works beautifully. And it will likely fall apart when your deployment target is 10,000 clients that:</p><ul><li>Might be offline when you push the update</li><li>Might stay offline for days</li><li>Might have 5 Mbps connections shared with a family streaming Netflix</li><li>Might fail the update halfway through and just... stay that way</li><li>Might be running old firmware you didn&apos;t even know existed</li></ul><p>The scary part isn&apos;t that the pipeline doesn&apos;t work. The scary part is how long it took us to realize it wouldn&apos;t work. We&apos;ve gotten so good at cloud deployments that we&apos;ve forgotten these are solved problems for a very specific environment.</p><h2 id="what-breaks-first-the-optimistic-network-assumptions">What Breaks First: The Optimistic Network Assumptions</h2><p>Cloud deployments assume the network is basically a non-issue. Sure, you might have a timeout here or there, but fundamentally you trust that HTTP requests complete and your orchestrator can talk to your services.</p><p>This assumption dies immediately with distributed devices.</p><p>Picture this: You push an update at 2 AM (because that&apos;s when traffic is lowest). Of your 10,000 clients, maybe 7,000 are online. They start pulling the artifact. 
Except it&apos;s 50MB, and some of these devices are on rural connections, and now you&apos;re essentially DDoS&apos;ing your own artifact registry.</p><p>The ones that do start downloading&#x2014;what happens when the connection drops halfway through? In Kubernetes, a failed pull just reschedules. The container runtime handles it. With a router? You might have a partially written filesystem and a device that won&apos;t boot on the next power cycle.</p><p>&quot;Just use resumable downloads,&quot; someone will say. Sure. But now you need:</p><ul><li>Devices to track download state locally</li><li>Verification that partial downloads aren&apos;t corrupted</li><li>A way to clean up attempts that never complete</li><li>Some kind of backoff so 3,000 devices don&apos;t hammer your CDN simultaneously when they retry</li></ul><p>You&apos;re rebuilding a download manager. For every device type. Because you can&apos;t just <code>docker pull</code> anymore.</p><p>The deeper issue is that cloud infrastructure trained us to ignore these problems. When was the last time you thought about how <code>kubectl apply</code> actually gets your deployment to the nodes? You don&apos;t, because Kubernetes handles it. 
That&apos;s the abstraction working.</p><p>But abstractions leak, and when they leak at the edge, you&apos;re left rebuilding things you thought were solved problems.</p><h2 id="the-prototype-reinventing-what-we-thought-was-solved">The Prototype: Reinventing What We Thought Was Solved</h2><p>I built a prototype because I wanted to understand what we&apos;re actually trading when we move from cloud to edge.</p><p>The architecture isn&apos;t novel:</p><ul><li>Build system creates signed artifacts (same as before)</li><li>Artifacts get split into chunks with content-addressed storage</li><li>Regional edge nodes cache the chunks</li><li>Devices pull from nearest node when online</li><li>Each device verifies integrity before applying</li><li>Interrupted downloads can resume</li></ul><p>This is roughly what every OTA update system does. Which raises the question: why doesn&apos;t everyone just use an existing OTA framework?</p><p>Because OTA frameworks assume you&apos;re updating firmware. They live in a world where:</p><ul><li>Updates happen maybe monthly</li><li>You&apos;re replacing the entire system image</li><li>The device reboots as part of the process</li><li>&quot;Rollback&quot; means keeping two full system partitions</li></ul><p>We&apos;re trying to update application code. Potentially daily. Without rebooting. While the application is serving traffic. With minimal storage overhead.</p><p>The existing solutions don&apos;t fit. So you end up rebuilding pieces of:</p><ul><li>Package managers (dependency resolution, version management)</li><li>Container runtimes (layered filesystems, atomic updates)</li><li>Service orchestrators (health checking, rollback logic)</li><li>Download managers (chunking, resume, verification)</li></ul><p>All the things we thought were solved because Kubernetes and Docker handle them for us. 
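</p><p>The chunking layer, at least, is small enough to sketch. A toy version of the content-addressed step, assuming SHA-256 digests and an in-memory chunk store (a real system would persist chunks on the regional edge nodes):</p>

```python
import hashlib

def chunk_artifact(blob, chunk_size=1024 * 1024):
    """Split an artifact into chunks keyed by content hash.

    Chunks shared between versions dedupe for free: an edge node only
    fetches digests it doesn't already hold, and the ordered manifest is
    all a device needs to verify each piece as it arrives.
    """
    manifest, store = [], {}
    for i in range(0, len(blob), chunk_size):
        piece = blob[i:i + chunk_size]
        digest = hashlib.sha256(piece).hexdigest()
        store[digest] = piece        # content-addressed store
        manifest.append(digest)      # order is needed to reassemble
    return manifest, store

def reassemble(manifest, store):
    """Rebuild the artifact, verifying every chunk against its digest."""
    parts = []
    for digest in manifest:
        piece = store[digest]
        if hashlib.sha256(piece).hexdigest() != digest:
            raise ValueError("corrupt chunk: " + digest)
        parts.append(piece)
    return b"".join(parts)
```

<p>Because identical chunks hash to identical keys, a new artifact version that shares most of its bytes with the previous one costs the fleet only the chunks that actually changed.</p><p>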
Except now you can&apos;t use Kubernetes or Docker because they assume resources you don&apos;t have.</p><h2 id="what-you-give-up-the-observability-black-hole">What You Give Up: The Observability Black Hole</h2><p>Here&apos;s where it gets uncomfortable.</p><p>In cloud deployments, observability is basically solved. Prometheus scrapes metrics. Logs stream to Elasticsearch. Traces go to Jaeger. You can see what&apos;s happening across your entire fleet in real-time.</p><p>With distributed devices? You&apos;re flying blind.</p><p>A device pulls an update. It applies the update. Maybe it works. Maybe it doesn&apos;t. You won&apos;t know unless:</p><ul><li>The device can phone home (costs bandwidth)</li><li>It&apos;s currently online (can&apos;t assume this)</li><li>Your telemetry isn&apos;t broken (but how would you know?)</li><li>The update didn&apos;t break the telemetry itself (classic chicken-egg)</li></ul><p>I added a lightweight telemetry system to my prototype. Devices queue status updates locally and flush when they have connectivity. Sounds reasonable until you realize:</p><p>If an update breaks the device, the telemetry can&apos;t report the failure. So you see... nothing. Just silence. Which could mean &quot;device is offline&quot; or &quot;device is broken&quot; or &quot;device is fine but busy&quot; or &quot;telemetry is broken but device is fine.&quot;</p><p>You&apos;ve lost the feedback loop that makes cloud development tolerable. Push a bad update and you might not know for hours or days. By then, understanding what failed requires physical access to the device.</p><p>This isn&apos;t a solvable problem. It&apos;s a fundamental trade-off. You&apos;re giving up observability for distribution. 
The question is whether you understand what you&apos;re trading.</p><h2 id="the-parts-where-cloud-thinking-fails-completely">The Parts Where Cloud Thinking Fails Completely</h2><p><strong>Rollback: The Illusion of Control</strong></p><p>Cloud deployment rollback is simple. You deploy version N-1. The orchestrator handles it. Done.</p><p>This works because you control both ends of the transaction. You control when the rollback happens. You control which nodes get rolled back first. You can verify the rollback worked.</p><p>Now try rolling back 10,000 distributed devices. You push the rollback command. What happens?</p><p>3,000 devices are offline. They&apos;ll get the rollback... eventually. Maybe tomorrow. Maybe next week. 2,000 devices are in the middle of applying the broken update. Do they abort? Do they finish and then roll back? 1,000 devices already applied the update and have been running it for 6 hours. Their local state might be incompatible with the old version.</p><p>The remaining 4,000 successfully roll back. Probably. You think. You won&apos;t actually know for a while because of the observability problem.</p><p>What you&apos;ve lost is the ability to reason about system state. In Kubernetes, rollback is atomic (or near enough). With distributed devices, you&apos;re managing a multi-day eventually-consistent rollback process where you can&apos;t observe success and can&apos;t guarantee completion.</p><p>The solution? Most OTA systems don&apos;t really support rollback. They support &quot;keep the old version around and boot into it if the new version crashes repeatedly.&quot; Which helps with catastrophic failures but does nothing for subtle bugs that don&apos;t crash.</p><p>You&apos;re rediscovering why mobile app developers test so carefully before release. 
Because once it&apos;s deployed, &quot;rollback&quot; isn&apos;t really an option.</p><p><strong>Heterogeneity: When Your Build Matrix Explodes</strong></p><p>Cloud infrastructure is beautifully uniform. You might have different instance types, but they&apos;re all running basically the same OS, same CPU architecture, same capabilities.</p><p>Edge devices laugh at this uniformity.</p><p>Half your devices have 512MB RAM. The other half have 2GB. Some have hardware crypto. Some don&apos;t. Different CPU architectures. Different kernel versions. Different storage constraints.</p><p>Do you:</p><ul><li>Build separate artifacts for each variant? (Your CI now builds 15 different versions)</li><li>Build one binary with feature detection? (Bloated, complex, testing nightmare)</li><li>Modular plugins? (Adds deployment complexity, version compatibility matrix)</li></ul><p>Each choice trades something. More build complexity for smaller binaries. Simpler builds for harder testing. Runtime efficiency for deployment flexibility.</p><p>The cloud abstracted this away. EC2 instances are predictable. Kubernetes nodes are predictable. You built your tools for predictability.</p><p>Now you&apos;re remembering why embedded developers have such elaborate build systems. Because building for heterogeneous hardware is genuinely complicated, and there&apos;s no abstraction that makes it simple without hiding real constraints.</p><p><strong>State Management: The Distributed Database You Didn&apos;t Mean to Build</strong></p><p>Here&apos;s a fun one: device configuration.</p><p>In cloud deployments, configuration is in git. You push changes, they roll out, done. Want to update a feature flag? Change the value, deploy.</p><p>With distributed devices, configuration becomes a distributed database problem.</p><p>Device A gets new config version 5. Device B is offline, still running config version 3. 
Device C failed to apply version 4 and rolled back to version 3, but its local state reflects changes from version 4.</p><p>Now a user reports a bug. Which config version were they running? Which version should they be running? Is the bug because of the config, or because of the application, or because of interaction between mismatched versions?</p><p>You can&apos;t just &quot;check the current config&quot; because there is no current config. There are 10,000 different configs in various states of consistency.</p><p>The cloud solution&#x2014;configuration management tools, service discovery, distributed consensus&#x2014;doesn&apos;t work when devices are offline more often than they&apos;re online.</p><p>You&apos;re reinventing eventual consistency patterns from distributed databases. Except you&apos;re doing it for configuration management. And you&apos;re discovering why distributed databases are hard.</p><h2 id="what-we-forgot-while-building-kubernetes">What We Forgot While Building Kubernetes</h2><p>This started as a simple deployment problem. It turned into a reminder about how much knowledge we&apos;ve lost.</p><p>A generation of developers now knows how to deploy to Kubernetes but not how to build systems that don&apos;t assume Kubernetes exists. We know how to write microservices but not how to design for intermittent connectivity. We know how to scale horizontally but not how to handle actual physical distribution.</p><p>This isn&apos;t their fault. The abstractions worked so well that we stopped teaching the fundamentals they abstracted away.</p><p>Want to deploy an update? <code>kubectl apply</code>. Want to rollback? <code>kubectl rollout undo</code>. Want observability? Install Prometheus. These are solved problems. Solved so thoroughly that we forgot what the problems were.</p><p>Edge computing is forcing us to remember.</p><p>The network isn&apos;t reliable. Deployment targets aren&apos;t uniform. 
State isn&apos;t eventually consistent&#x2014;it&apos;s just inconsistent. Rollback isn&apos;t atomic. Observability isn&apos;t centralized. Configuration isn&apos;t unified.</p><p>These aren&apos;t bugs in edge computing. These are properties of distributed systems that cloud infrastructure successfully hid from us. We thought we&apos;d solved distributed systems. We just built a very expensive abstraction that works when you control the datacenter.</p><h2 id="what-this-means-for-the-next-decade">What This Means for the Next Decade</h2><p>We&apos;re about to see a lot of bad architecture. Not because people are incompetent, but because the tooling they know doesn&apos;t apply and the knowledge base to build new tooling has atrophied.</p><p>Mobile developers know these problems. Embedded systems engineers know these problems. Game developers know these problems.</p><p>But there&apos;s a whole generation of backend/cloud engineers who&apos;ve never had to think about:</p><ul><li>Offline-first operation</li><li>Bandwidth-constrained deployments</li><li>Devices that can&apos;t be SSH&apos;d into</li><li>Updates that take days to propagate</li><li>Debugging without centralized logs</li><li>State that&apos;s distributed by default, not by choice</li></ul><p>The industry is hiring cloud engineers to build edge systems. Then expressing surprise when the solutions look like complex, expensive reinventions of problems that were solved in embedded systems 20 years ago.</p><p>We need to rediscover institutional knowledge we thought we didn&apos;t need anymore. Or admit that &quot;edge computing&quot; is just going to be &quot;cloud computing with extra steps and worse performance.&quot;</p><h2 id="the-reading-list-nobody-talks-about">The Reading List Nobody Talks About</h2><p>The resources I&apos;ve found useful aren&apos;t from cloud computing thought leaders. 
They&apos;re from the places that never forgot these problems:</p><ul><li>SWUpdate documentation (embedded Linux) - Atomic updates, rollback strategies</li><li>Mender architecture - OTA updates for embedded devices</li><li>Chrome OS update_engine - Delta updates, cryptographic verification</li><li>FreeRTOS OTA - Working with device constraints</li><li>Balena documentation - Fleet management for IoT</li></ul><p>These aren&apos;t sexy. They&apos;re not written by FAANG engineers blogging about scaling to billions of requests. They&apos;re written by people who&apos;ve been deploying to heterogeneous hardware with unreliable networks for decades.</p><p>The irony is that we&apos;re now reinventing their solutions, but worse, because we&apos;re trying to apply cloud patterns to problems cloud patterns don&apos;t fit.</p><h2 id="what-im-actually-building-and-why">What I&apos;m Actually Building (And Why)</h2><p>My prototype isn&apos;t trying to be production-ready. It&apos;s trying to understand the problem well enough to recognize good solutions when I see them.</p><p>Right now I&apos;m working through:</p><ul><li>Chunk size optimization (smaller = better resumability, larger = less overhead)</li><li>Testing at scale without 10,000 physical devices</li><li>Security/verification vs. resource constraints trade-offs</li><li>Configuration schema migrations when updates aren&apos;t atomic</li></ul><p>These aren&apos;t new problems. But they&apos;re new to me, because cloud infrastructure let me ignore them for a decade.</p><p>The goal isn&apos;t to build another OTA framework the world doesn&apos;t need. The goal is to develop enough understanding that I can make informed architectural decisions when edge deployment becomes unavoidable.</p><p>Which, if current trends continue, will be soon.</p><p>Because the next wave of computing isn&apos;t happening in data centers. It&apos;s happening in cars, factories, homes, and cities. 
And all those devices need software updates.</p><p>Your CI/CD pipeline wasn&apos;t designed for this. Neither was mine.</p>]]></content:encoded></item><item><title><![CDATA[The Tool Collector's Fallacy]]></title><description><![CDATA[<p>Most teams have 10+ tools in their DevOps stack. Sometimes it feels like keeping up with DevOps tooling is overwhelming.</p><p>Kubernetes for orchestration. Terraform for infrastructure. ArgoCD for GitOps. Prometheus for metrics. Grafana for dashboards. Loki for logs. Jaeger for traces. Vault for secrets. Trivy for container scanning. Snyk for</p>]]></description><link>https://cesar.velandia.co/the-tool-collectors-fallacy/</link><guid isPermaLink="false">6948b07915a35701511cccf1</guid><category><![CDATA[DevOps]]></category><dc:creator><![CDATA[César D. Velandia]]></dc:creator><pubDate>Wed, 09 Jul 2025 01:54:00 GMT</pubDate><media:content url="https://cesar.velandia.co/content/images/2025/12/ComfyUI_00662_.png" medium="image"/><content:encoded><![CDATA[<img src="https://cesar.velandia.co/content/images/2025/12/ComfyUI_00662_.png" alt="The Tool Collector&apos;s Fallacy"><p>Most teams have 10+ tools in their DevOps stack. Sometimes it feels like keeping up with DevOps tooling is overwhelming.</p><p>Kubernetes for orchestration. Terraform for infrastructure. ArgoCD for GitOps. Prometheus for metrics. Grafana for dashboards. Loki for logs. Jaeger for traces. Vault for secrets. Trivy for container scanning. Snyk for dependency scanning. SonarQube for code quality. Jenkins for legacy pipelines. GitHub Actions for new ones. Slack for notifications. PagerDuty for incidents. Datadog for... something, I forget what Datadog is supposed to do that Prometheus doesn&apos;t.</p><p>Each tool was added to solve a real problem. Each tool made sense at the time. 
Together, they&apos;ve created a different problem: nobody actually understands the system anymore.</p><h2 id="how-we-became-tool-collectors">How We Became Tool Collectors</h2><p>The pattern is always the same:</p><p><strong>Act 1: The Problem</strong><br>&quot;We need better observability. Prometheus metrics aren&apos;t enough.&quot;</p><p><strong>Act 2: The Research</strong><br>&quot;Grafana Loki looks promising. Integrates with our existing stack. Open source. Good community.&quot;</p><p><strong>Act 3: The POC</strong><br>&quot;Loki works great in our test environment. Let&apos;s roll it out.&quot;</p><p><strong>Act 4: The Rollout</strong><br>&quot;Now we have centralized logging. Problem solved.&quot;</p><p><strong>Act 5: Six Months Later</strong><br>&quot;Why isn&apos;t anyone using Loki? Why are we still grepping container logs?&quot;</p><p>Because adding the tool didn&apos;t solve the problem. It added another thing to learn, another thing to maintain, another thing to debug when it breaks. And it broke last Tuesday when the Loki cluster ran out of disk space and nobody noticed for six hours because we monitor Loki&apos;s health with... Prometheus. Which doesn&apos;t alert when Loki is down because that check was never added.</p><p>We collected a tool. We didn&apos;t solve the problem.</p><h2 id="the-hidden-costs-nobody-talks-about">The Hidden Costs Nobody Talks About</h2><p>Each new tool in your stack has an obvious cost (license, hosting, maintenance) and several hidden ones:</p><p><strong>Cognitive Load Tax</strong><br>Your team now needs to know 24 tools instead of 23. Each with its own:</p><ul><li>Configuration syntax (YAML, HCL, JSON, TOML, all slightly different)</li><li>CLI interface and flags</li><li>API patterns and authentication</li><li>Debugging approaches</li><li>Failure modes</li><li>Update procedures</li></ul><p>A new engineer joining the team used to learn Kubernetes, Terraform, and CI/CD. That was already a lot. 
Now they also need to learn Loki, Jaeger, Vault, ArgoCD, and whatever we add next quarter.</p><p>We&apos;re not making them more productive. We&apos;re just raising the barrier to productivity.</p><p><strong>Integration Debt</strong><br>Every tool needs to talk to every other tool. Prometheus scrapes Loki&apos;s metrics. Loki ingests logs from Kubernetes. Kubernetes pulls images scanned by Trivy. Trivy&apos;s results go to Slack via Jenkins. Jenkins auth uses Vault. Vault secrets come from Terraform.</p><p>This web of dependencies is fine until something breaks. Then you spend three hours discovering that the Slack notifications stopped because the Vault token expired, which Jenkins didn&apos;t detect properly, which meant Trivy scan results weren&apos;t being posted, which meant nobody noticed the critical CVE in the base image we&apos;ve been deploying all week.</p><p>The tools work. The integrations are fragile.</p><p><strong>The Knowledge Silo Problem</strong><br>Person A knows Terraform. Person B knows Kubernetes. Person C knows Vault. Nobody knows all three well enough to debug the interaction when Kubernetes can&apos;t pull secrets from Vault because Terraform configured the IAM role wrong.</p><p>So you have a meeting. Three people spend two hours teaching each other enough about their respective tools to understand the problem. You fix it. Document it. Move on.</p><p>Next quarter, Person B leaves. New person joins. Doesn&apos;t know Kubernetes. The institutional knowledge evaporated.</p><p>We&apos;ve built a system that requires tribal knowledge to operate. Then we&apos;re surprised when it&apos;s fragile.</p><p><strong>Update Paralysis</strong><br>Remember when updating your infrastructure meant updating one thing? 
Neither do I.</p><p>Now updates cascade:</p><ul><li>Terraform has a new provider version</li><li>Which requires updating the Vault configuration syntax</li><li>Which breaks the Kubernetes integration</li><li>Which means ArgoCD can&apos;t sync</li><li>Which blocks deployments</li><li>Which means we defer the update</li></ul><p>Six months later, we&apos;re three major versions behind on everything. Security vulnerabilities pile up. Features we want are in newer versions. But the cost of updating&#x2014;testing 23 tools and all their integrations&#x2014;is too high.</p><p>So we stay on old versions and tell ourselves we&apos;ll &quot;catch up next quarter.&quot;</p><h2 id="the-specialization-trap">The Specialization Trap</h2><p>Here&apos;s how we used to solve observability:</p><p><strong>2015 Approach</strong></p><ul><li>App writes logs to stdout</li><li>Logs get aggregated</li><li>Grep for errors</li><li>Write scripts to parse patterns</li><li>Done</li></ul><p>It was basic. It worked. Anyone on the team could debug it.</p><p><strong>2024 Approach</strong></p><ul><li>Apps expose Prometheus metrics (RED method)</li><li>Prometheus scrapes and stores</li><li>Grafana visualizes</li><li>Loki ingests logs</li><li>Jaeger traces requests</li><li>Correlation between metrics/logs/traces requires:</li><li>Understanding PromQL</li><li>Writing Grafana queries</li><li>Configuring Loki labels correctly</li><li>Instrumenting code with trace IDs</li><li>Ensuring all three systems agree on time synchronization</li></ul><p>It&apos;s sophisticated. It&apos;s powerful. It requires specialists.</p><p>We&apos;ve turned observability into a discipline that requires dedicated engineers. Not because the problems got harder&#x2014;debugging is still debugging&#x2014;but because the tools got more complex.</p><p>The tools were supposed to make us more productive. 
Instead, they created new job titles.</p><h2 id="when-solutions-become-problems">When Solutions Become Problems</h2><p>I watched a team spend two weeks setting up GitOps with ArgoCD. Their deployment process was:</p><ul><li>Push to main</li><li>CI builds and pushes image</li><li>Kubernetes manifests update</li><li>Deploy</li></ul><p>It worked. Nobody complained. But someone went to a conference, saw a GitOps talk, came back convinced we needed it.</p><p>Now the process is:</p><ul><li>Push to main</li><li>CI builds and pushes image</li><li>CI updates manifests in Git</li><li>ArgoCD watches Git</li><li>ArgoCD syncs to cluster</li><li>Deploy</li></ul><p>The new process does the same thing with an extra tool and two more failure points. The benefits:</p><ul><li>Declarative state (we already had this, manifests were in Git)</li><li>Audit trail (we already had this, Git history)</li><li>Rollback capability (we already had this, redeploy old version)</li></ul><p>The costs:</p><ul><li>Another tool to learn, maintain, debug</li><li>Another place to check when deploys fail</li><li>Another integration to keep working</li><li>Another source of &quot;why isn&apos;t this deploying?&quot;</li></ul><p>We solved a problem we didn&apos;t have. Then celebrated the solution.</p><h2 id="the-vendor-love-affair">The Vendor Love Affair</h2><p>Tech Twitter convinced everyone that if you&apos;re not using the latest tools, you&apos;re falling behind. So we chase:</p><p><strong>Last Year&apos;s Hotness</strong></p><ul><li>Service mesh! Install Istio!</li><li>(Six months later: why is our cluster so slow?)</li><li>(One year later: nobody remembers how Istio works)</li><li>(18 months later: we removed Istio)</li></ul><p><strong>This Year&apos;s Hotness</strong></p><ul><li>eBPF observability! 
Install Pixie!</li><li>(Three months later: it&apos;s using 2GB per node)</li><li>(Six months later: conflicts with our CNI plugin)</li><li>(Nine months later: we&apos;re evaluating replacements)</li></ul><p><strong>Next Year&apos;s Hotness</strong></p><ul><li>Platform engineering! Build an IDP!</li><li>(Currently in POC phase)</li><li>(Check back in 18 months)</li></ul><p>Each wave promises to solve all our problems. Each wave adds complexity. Each wave eventually gets replaced by the next wave.</p><p>We&apos;re not building infrastructure. We&apos;re collecting tools like Pok&#xE9;mon.</p><h2 id="what-we-lost-along-the-way">What We Lost Along the Way</h2><p>Somewhere between &quot;deploy code&quot; and &quot;orchestrate 23 tools,&quot; we lost something important:</p><p><strong>Simplicity</strong><br>The ability to explain how your infrastructure works to a new team member in under an hour. Now it takes weeks.</p><p><strong>Debuggability</strong><br>The ability to trace a problem from symptom to cause without consulting five different UIs and correlating timestamps across systems.</p><p><strong>Ownership</strong><br>When everything requires specialists, nobody owns the whole system. Problems fall into gaps between tool boundaries.</p><p><strong>Velocity</strong><br>Each new tool slows down the next change. Want to add a feature? First make sure it works with all 23 existing tools.</p><p><strong>Understanding</strong><br>Most people on the team can operate the tools. Few people understand how they actually work. 
So when something breaks in a novel way, nobody knows how to fix it.</p><p>We traded these things for:</p><ul><li>Better metrics (that we don&apos;t look at)</li><li>Fancier dashboards (that we don&apos;t maintain)</li><li>More automation (that we don&apos;t trust)</li><li>Modern architecture (that we don&apos;t fully understand)</li></ul><h2 id="the-best-practices-trap">The &quot;Best Practices&quot; Trap</h2><p>Industry best practices say you should have:</p><ul><li>Infrastructure as Code (Terraform)</li><li>Container orchestration (Kubernetes)</li><li>GitOps (ArgoCD)</li><li>Observability (Prometheus + Grafana + Loki)</li><li>Secret management (Vault)</li><li>Security scanning (Trivy + Snyk)</li><li>Service mesh (Istio/Linkerd)</li></ul><p>So teams adopt all of them. Even when their needs are:</p><ul><li>3 microservices</li><li>100 requests per second</li><li>2 person team</li></ul><p>The best practices aren&apos;t wrong. They&apos;re just designed for problems most teams don&apos;t have. Netflix needs sophisticated observability. Your startup probably doesn&apos;t.</p><p>But we cargo cult the architecture because that&apos;s what &quot;good engineering&quot; looks like. Then we spend 60% of our time maintaining the infrastructure instead of building features.</p><h2 id="what-actually-matters">What Actually Matters</h2><p>I&apos;ve been thinking about what effective DevOps actually requires. Not the tools, the capabilities:</p><p><strong>Deployment Confidence</strong><br>Can you deploy without anxiety? Do you trust your deployment process?</p><p>This doesn&apos;t require GitOps. It requires:</p><ul><li>Automated tests that catch real issues</li><li>Rollback mechanism that works</li><li>Monitoring that detects problems quickly</li></ul><p>You can have this with GitHub Actions and <code>kubectl apply</code>. 
Or you can not have it with ArgoCD and Flux.</p><p><strong>Incident Response Speed</strong><br>When things break at 3 AM, can you fix them?</p><p>This doesn&apos;t require Grafana + Loki + Jaeger. It requires:</p><ul><li>Logs accessible from one place</li><li>Metrics that show what&apos;s wrong</li><li>Runbooks that work</li></ul><p>Simple alerting and basic log aggregation often work better than a sophisticated observability stack nobody understands.</p><p><strong>Development Velocity</strong><br>Can developers ship features without waiting on the platform team?</p><p>This doesn&apos;t require a sophisticated IDP. It requires:</p><ul><li>Clear ownership boundaries</li><li>Self-service capabilities</li><li>Good documentation</li></ul><p>Sometimes a well-documented <code>kubectl</code> template works better than a custom UI nobody maintains.</p><h2 id="what-im-actually-recommending">What I&apos;m Actually Recommending</h2><p>Not &quot;stop using tools.&quot; But &quot;stop collecting tools.&quot;</p><p><strong>Before Adding a New Tool, Ask:</strong></p><ul><li>What problem does this actually solve?</li><li>Do we have this problem?</li><li>Can we solve it with existing tools?</li><li>What&apos;s the maintenance cost?</li><li>Who will own this long-term?</li><li>What happens if this breaks?</li><li>Can we remove it later if it doesn&apos;t work out?</li></ul><p><strong>When You Already Have Too Many Tools:</strong></p><ul><li>Audit honestly: which tools are actually used?</li><li>Which ones could be replaced by simpler alternatives?</li><li>Which ones exist because someone went to a conference?</li><li>What would break if we removed X?</li></ul><p>Sometimes the answer is &quot;we really do need all 23 tools.&quot; More often it&apos;s &quot;we could probably do this with 12.&quot;</p><p><strong>Prefer Boring Technology</strong><br>Not because boring is better. 
Because boring is:</p><ul><li>Well understood</li><li>Well documented</li><li>Well supported</li><li>Debuggable by your entire team</li><li>Less likely to break in novel ways</li></ul><p>Postgres is boring. Kubernetes is becoming boring. Terraform is boring. Prometheus is boring.</p><p>Boring doesn&apos;t make exciting blog posts. But boring works at 3 AM when you&apos;re on call.</p><h2 id="the-discipline-we-need">The Discipline We Need</h2><p>The hard part isn&apos;t adding tools. Any team can do that. The hard part is saying no.</p><p>No to the latest hype. No to the conference talk solution. No to the tool that&apos;s technically superior but operationally complex. No to complexity for its own sake.</p><p>We need the discipline to:</p><ul><li>Keep infrastructure boring</li><li>Add tools only for real problems</li><li>Remove tools that don&apos;t work out</li><li>Resist FOMO about latest trends</li><li>Value simplicity over sophistication</li><li>Optimize for long-term maintainability</li></ul><p>This is unpopular. It feels like falling behind. It looks like you&apos;re not innovating.</p><p>But maintaining 12 tools well beats maintaining 23 tools poorly. And being able to debug your infrastructure at 3 AM beats having the most sophisticated architecture that nobody understands.</p><p>The goal isn&apos;t to have the best tools. The goal is to have working systems that don&apos;t require heroic efforts to operate.</p><p>Most teams would be better off with half the tools and twice the understanding.</p>]]></content:encoded></item><item><title><![CDATA[Why Jujutsu (jj) Is Perfect for AI-Generated Code]]></title><description><![CDATA[<p><em>If you&apos;re using AI to write code, you need better version control. Git wasn&apos;t designed for the iterative, experimental nature of AI-assisted development. 
Jujutsu is.</em></p><h2 id="the-problem-git-fights-ai-workflows">The Problem: Git Fights AI Workflows</h2><p>When AI writes your code, your development process changes fundamentally. Instead of carefully crafted commits,</p>]]></description><link>https://cesar.velandia.co/why-jujutsu-jj-is-perfect-for-ai-generated-code/</link><guid isPermaLink="false">68c90d9d15a35701511ccc23</guid><category><![CDATA[AI development]]></category><category><![CDATA[writing]]></category><dc:creator><![CDATA[César D. Velandia]]></dc:creator><pubDate>Wed, 02 Jul 2025 07:18:00 GMT</pubDate><media:content url="https://cesar.velandia.co/content/images/2025/09/jujutsu.png" medium="image"/><content:encoded><![CDATA[<img src="https://cesar.velandia.co/content/images/2025/09/jujutsu.png" alt="Why Jujutsu (jj) Is Perfect for AI-Generated Code"><p><em>If you&apos;re using AI to write code, you need better version control. Git wasn&apos;t designed for the iterative, experimental nature of AI-assisted development. Jujutsu is.</em></p><h2 id="the-problem-git-fights-ai-workflows">The Problem: Git Fights AI Workflows</h2><p>When AI writes your code, your development process changes fundamentally. Instead of carefully crafted commits, you&apos;re dealing with:</p><ul><li><strong>Rapid iteration cycles</strong>: AI generates, you test, refine the prompt, regenerate</li><li><strong>Experimental branches</strong>: Multiple attempts at the same problem with different approaches</li><li><strong>Frequent rewrites</strong>: AI rarely matches your actual architecture on the first try</li><li><strong>History chaos</strong>: Your commit log becomes a graveyard of &quot;fix AI-generated bug&quot; commits</li></ul><p>Traditional Git workflows crumble under this pressure. You end up with messy histories, complex rebases, and the constant fear of losing work during cleanup. 
It&apos;s like trying to write a novel with a typewriter instead of a word processor (technically possible, but you&apos;re fighting your tools).</p><h2 id="enter-jujutsu-version-control-for-the-ai-age">Enter Jujutsu: Version Control for the AI Age</h2><p>Jujutsu (jj) reimagines version control around &quot;changes&quot; instead of &quot;commits.&quot; This subtle shift unlocks workflows that are perfect for AI-assisted development.</p><h3 id="changes-vs-commits-the-time-machine-effect">Changes vs. Commits: The Time Machine Effect</h3><p>In Git, once you commit, you&apos;re committed (pun intended). Want to fix something three commits back? Welcome to rebase hell.</p><p>In jj, every change is mutable. You can edit any change at any time, and jj automatically rebases everything downstream. It&apos;s like having a time machine for your code.</p><p><strong>Real scenario</strong>: Your AI generates a new API endpoint, but after testing, you realize the interface needs adjustment. In Git, you&apos;d either:</p><ul><li>Add a &quot;fix interface&quot; commit (messy history)</li><li>Interactive rebase (risk breaking things)</li></ul><p>In jj, you just edit the original change. Everything depending on it updates automatically. No Git surgery required.</p><h3 id="perfect-for-ai-experimentation">Perfect for AI Experimentation</h3><p>AI development is inherently experimental. You&apos;ll often have multiple AI-generated solutions to compare. Jj&apos;s branching model makes this trivial:</p><pre><code class="language-bash"># Create three different AI attempts at the same feature
# Start each attempt from main so the three are siblings, not a stack
jj new main -m &quot;AI attempt 1: REST API approach&quot;
# ... let AI generate solution 1

jj new main -m &quot;AI attempt 2: GraphQL approach&quot;
# ... let AI generate solution 2

jj new main -m &quot;AI attempt 3: RPC approach&quot;
# ... let AI generate solution 3

# Check your current status
jj st
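
# Compare the attempts side by side; &lt;attempt-N&gt; below are placeholders
# for the change IDs that jj log prints, not literal syntax
jj log
jj diff --from &lt;attempt-1&gt; --to &lt;attempt-2&gt;

# Keep the winner and discard the rest
jj abandon &lt;rejected-attempt&gt;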
</code></pre><p>Your history stays clean, and you can easily compare approaches without complex Git gymnastics.</p><h3 id="stacked-changes-for-iterative-development">Stacked Changes for Iterative Development</h3><p>AI rarely gets complex features right on the first try. You&apos;ll typically iterate:</p><ol><li>AI generates basic structure</li><li>AI adds error handling</li><li>AI adds tests</li><li>AI optimizes performance</li></ol><p>In jj, these become a natural stack of changes:</p><pre><code>@ Add performance optimizations
&#x25C9; Add comprehensive tests
&#x25C9; Add error handling  
&#x25C9; Basic feature implementation
&#x25C9; main
</code></pre><p>When the AI needs to fix something in the basic implementation, you edit that change, and all the dependent changes automatically rebase. It&apos;s like editing the foundation of a house and having all the floors automatically adjust (no manual reconstruction required).</p><h3 id="bookmarks-git-branches-that-actually-make-sense">Bookmarks: Git Branches That Actually Make Sense</h3><p>Here&apos;s where jj really shines for GitHub workflows. Instead of Git&apos;s branch model, jj uses &quot;bookmarks.&quot;</p><p>In Git, you have to decide on a branch name before you know what you&apos;re building:</p><pre><code class="language-bash">git checkout -b feature/maybe-user-auth-or-something
</code></pre><p>In jj, you build first, name later:</p><pre><code class="language-bash"># Work on several changes
jj new -m &quot;Add user model&quot;
jj new -m &quot;Add authentication&quot;
jj new -m &quot;Add middleware&quot;

# Later, when you know what you built:
jj bookmark create user-auth-system
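
# If you stack more changes afterwards, point the bookmark at the new tip
jj bookmark set user-auth-system -r @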
</code></pre><p>Bookmarks are just pointers to changes. Multiple bookmarks can point to the same change, and they move automatically as you rebase. When you&apos;re ready for a PR:</p><pre><code class="language-bash">jj git push --bookmark user-auth-system
</code></pre><p>This pushes your bookmark as a Git branch to GitHub. Your PR workflow remains unchanged, but your local development becomes infinitely more flexible.</p><h3 id="the-ai-development-workflow">The AI Development Workflow</h3><p>Here&apos;s how jj transforms AI-assisted development:</p><p><strong>1. Start with Architecture</strong> (Install jj first: <a href="https://github.com/martinvonz/jj/releases">GitHub releases</a>)</p><pre><code class="language-bash"># In your existing Git repo
jj init --git-repo .   # newer jj releases use: jj git init --colocate

jj new -m &quot;Define API interface&quot;
# You design the interface, AI fills implementation
</code></pre><p><strong>2. Iterative Implementation</strong></p><pre><code class="language-bash">jj new -m &quot;Implement user service&quot;
# Let AI implement based on your interface

jj new -m &quot;Add validation&quot;
# AI adds validation layer

jj st  # Check your current status

jj new -m &quot;Add error handling&quot;
# AI improves error handling
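
# If a fix really belongs in the previous change, fold the working
# copy into its parent instead of stacking yet another change
jj squash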
</code></pre><p><strong>3. Discover Issues and Fix Retroactively</strong></p><pre><code class="language-bash"># Oh no, the interface needs adjustment
jj edit &lt;change-id&gt;   # jj edit takes a change ID (shown by jj log), not a description
# Make changes, everything downstream updates automatically
</code></pre><p><strong>4. Ship When Ready</strong></p><pre><code class="language-bash">jj bookmark create feature-user-service
jj git push --bookmark feature-user-service
</code></pre><p>Your final history tells the story of what was built, not how many times the AI hallucinated.</p><h3 id="handling-ais-favorite-mistake-the-everything-sandwich">Handling AI&apos;s Favorite Mistake: The Everything Sandwich</h3><p>AI loves to mix concerns. It&apos;ll add logging, error handling, database migrations, and new features all in one glorious mess. With jj, you can easily split these apart after the fact:</p><pre><code class="language-bash"># AI generated everything in one messy change
jj split
# Interactively split into logical pieces
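
# Then give each resulting piece a clear message;
# &lt;change-id&gt; is a placeholder for an ID from jj log
jj describe &lt;change-id&gt; -m &quot;Extract logging into its own change&quot;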
</code></pre><p>This is nearly impossible in Git without performing open-heart surgery on your repository.</p><h3 id="the-refactoring-advantage">The Refactoring Advantage</h3><p>Mitchell Hashimoto (creator of Vagrant, Terraform) notes that AI agents excel at refactoring: &quot;<a href="https://zed.dev/blog/agentic-engineering-with-mitchell-hashimoto">Anytime I ask it to do that, it&apos;s always perfect.</a>&quot;</p><p>Jj makes AI refactoring risk-free. Since any change is editable, you can let AI aggressively refactor, knowing you can always edit or revert specific changes without losing other work. It&apos;s like having an undo button that actually understands context.</p><h2 id="why-this-matters-for-ai-generated-code">Why This Matters for AI-Generated Code</h2><p>AI is changing how we write software. The traditional Git model (linear commits, careful history curation) was designed for human development patterns.</p><p>AI generates code differently:</p><ul><li>More experimental (they don&apos;t have egos to protect)</li><li>Rapid iteration (they don&apos;t get tired)</li><li>Frequent architectural changes (they don&apos;t fall in love with their first solution)</li><li>Multiple attempts at solutions (they&apos;re happy to start over)</li></ul><p>Jj was built for exactly this kind of workflow. It&apos;s not just better at handling AI-generated code; it transforms how you think about version control entirely. 
Git feels like accounting software after using jj (technically correct but unnecessarily painful).</p><h2 id="the-learning-curve-hours-not-weeks">The Learning Curve: Hours, Not Weeks</h2><p>Basic jj concepts map to Git:</p><ul><li><code>git add</code> &#x2192; <code>jj new</code> (creates a change)</li><li><code>git commit</code> &#x2192; automatic (changes are always &quot;committed&quot;)</li><li><code>git log</code> &#x2192; <code>jj log</code></li><li><code>git status</code> &#x2192; <code>jj st</code></li><li><code>git push</code> &#x2192; <code>jj git push --bookmark &lt;name&gt;</code></li><li><code>git branch</code> &#x2192; <code>jj bookmark create &lt;name&gt;</code></li></ul><p>The difference is that jj&apos;s model is simpler and more forgiving. It&apos;s version control designed for the AI era.</p><h2 id="takeaways">Takeaways</h2><ul><li><strong>Mutable Changes</strong>: Edit any change anytime, automatic downstream rebasing</li><li><strong>Experiment Freely</strong>: Easy comparison of multiple AI solutions</li><li><strong>Stack Naturally</strong>: Iterative development cycles become manageable</li><li><strong>Risk-Free Refactoring</strong>: Let AI aggressively refactor with easy rollback</li><li><strong>Bookmarks &gt; Branches</strong>: Name things when you understand them, not before</li><li><strong>Zero Migration Cost</strong>: Works with existing Git repos and GitHub workflows</li><li><strong>Simple Mental Model</strong>: Changes instead of commits reduces cognitive overhead</li></ul><h2 id="dive-deeper">Dive deeper</h2><p><strong><a href="https://reasonablypolymorphic.com/blog/jj-strategy">Jujutsu Strategies</a></strong></p><p><strong><a href="https://steveklabnik.github.io/jujutsu-tutorial/">Steve&apos;s Jujutsu Tutorial</a></strong></p>]]></content:encoded></item></channel></rss>