The Tool Collector's Fallacy

Most teams have 10+ tools in their DevOps stack. Just keeping up with the tooling can feel overwhelming.

Kubernetes for orchestration. Terraform for infrastructure. ArgoCD for GitOps. Prometheus for metrics. Grafana for dashboards. Loki for logs. Jaeger for traces. Vault for secrets. Trivy for container scanning. Snyk for dependency scanning. SonarQube for code quality. Jenkins for legacy pipelines. GitHub Actions for new ones. Slack for notifications. PagerDuty for incidents. Datadog for... something, I forget what Datadog is supposed to do that Prometheus doesn't.

Each tool was added to solve a real problem. Each tool made sense at the time. Together, they've created a different problem: nobody actually understands the system anymore.

How We Became Tool Collectors

The pattern is always the same:

Act 1: The Problem
"We need better observability. Prometheus metrics aren't enough."

Act 2: The Research
"Grafana Loki looks promising. Integrates with our existing stack. Open source. Good community."

Act 3: The POC
"Loki works great in our test environment. Let's roll it out."

Act 4: The Rollout
"Now we have centralized logging. Problem solved."

Act 5: Six Months Later
"Why isn't anyone using Loki? Why are we still grepping container logs?"

Because adding the tool didn't solve the problem. It added another thing to learn, another thing to maintain, another thing to debug when it breaks. And it broke last Tuesday when the Loki cluster ran out of disk space and nobody noticed for six hours because we monitor Loki's health with... Prometheus. Which doesn't alert when Loki is down because that check was never added.

We collected a tool. We didn't solve the problem.
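For what it's worth, the check that was never added is not exotic. It boils down to alerting on the "up" metric for the Loki scrape job. In practice that would be a Prometheus alerting rule rather than a script, but here is the same idea sketched in Python against the standard Prometheus query API (the URL and job name are placeholders, not anyone's actual config):

    # The check that was never added, in standalone form: ask Prometheus
    # whether the Loki scrape target is up. URL and job name are placeholders.
    import sys
    import requests

    PROM_URL = "http://prometheus:9090"

    def loki_is_up() -> bool:
        resp = requests.get(
            f"{PROM_URL}/api/v1/query",
            params={"query": 'up{job="loki"}'},
            timeout=5,
        )
        resp.raise_for_status()
        results = resp.json()["data"]["result"]
        # No series at all means Prometheus isn't even scraping Loki.
        return bool(results) and all(float(r["value"][1]) == 1.0 for r in results)

    if __name__ == "__main__":
        if loki_is_up():
            print("Loki looks healthy")
            sys.exit(0)
        print("Loki is down or not being scraped; page someone")
        sys.exit(1)

A few lines, and they were never written.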

The Hidden Costs Nobody Talks About

Each new tool in your stack has an obvious cost (license, hosting, maintenance) and several hidden ones:

Cognitive Load Tax
Your team now needs to know 24 tools instead of 23. Each with its own:

  • Configuration syntax (YAML, HCL, JSON, TOML, all slightly different)
  • CLI interface and flags
  • API patterns and authentication
  • Debugging approaches
  • Failure modes
  • Update procedures

A new engineer joining the team used to learn Kubernetes, Terraform, and CI/CD. That was already a lot. Now they also need to learn Loki, Jaeger, Vault, ArgoCD, and whatever we add next quarter.

We're not making them more productive. We're just raising the barrier to productivity.

Integration Debt
Every tool needs to talk to every other tool. Prometheus scrapes Loki's metrics. Loki ingests logs from Kubernetes. Kubernetes pulls images scanned by Trivy. Trivy's results go to Slack via Jenkins. Jenkins auth uses Vault. Vault secrets come from Terraform.

This web of dependencies is fine until something breaks. Then you spend three hours discovering that the Slack notifications stopped because the Vault token expired, which Jenkins didn't detect properly, which meant Trivy scan results weren't being posted, which meant nobody noticed the critical CVE in the base image we've been deploying all week.

The tools work. The integrations are fragile.

The Knowledge Silo Problem
Person A knows Terraform. Person B knows Kubernetes. Person C knows Vault. Nobody knows all three well enough to debug the interaction when Kubernetes can't pull secrets from Vault because Terraform configured the IAM role wrong.

So you have a meeting. Three people spend two hours teaching each other enough about their respective tools to understand the problem. You fix it. Document it. Move on.

Next quarter, Person B leaves. New person joins. Doesn't know Kubernetes. The institutional knowledge evaporated.

We've built a system that requires tribal knowledge to operate. Then we're surprised when it's fragile.

Update Paralysis
Remember when updating your infrastructure meant updating one thing? Neither do I.

Now updates cascade:

  • Terraform has a new provider version
  • Which requires updating the Vault configuration syntax
  • Which breaks the Kubernetes integration
  • Which means ArgoCD can't sync
  • Which blocks deployments
  • Which means we defer the update

Six months later, we're three major versions behind on everything. Security vulnerabilities pile up. Features we want are in newer versions. But the cost of updating—testing 23 tools and all their integrations—is too high.

So we stay on old versions and tell ourselves we'll "catch up next quarter."

The Specialization Trap

Here's how we used to solve observability:

2015 Approach

  • App writes logs to stdout
  • Logs get aggregated
  • Grep for errors
  • Write scripts to parse patterns
  • Done

It was basic. It worked. Anyone on the team could debug it.
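The "scripts to parse patterns" part was usually something like this (a sketch; the log path and patterns are made up):

    # Count error patterns in the aggregated log file -- roughly the entire
    # 2015 observability stack. Path and patterns are hypothetical.
    import re
    from collections import Counter

    LOG_FILE = "/var/log/app/aggregated.log"
    PATTERNS = {
        "timeout": re.compile(r"timed? ?out", re.IGNORECASE),
        "5xx": re.compile(r'" 5\d\d '),
        "exception": re.compile(r"Traceback|Exception"),
    }

    counts = Counter()
    with open(LOG_FILE) as fh:
        for line in fh:
            for name, pattern in PATTERNS.items():
                if pattern.search(line):
                    counts[name] += 1

    for name, count in counts.most_common():
        print(f"{name}: {count}")

Crude, but when it misbehaved, anyone could read it top to bottom in a minute.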

2024 Approach

  • Apps expose Prometheus metrics (RED method)
  • Prometheus scrapes and stores
  • Grafana visualizes
  • Loki ingests logs
  • Jaeger traces requests
  • Correlation between metrics/logs/traces requires:
      • Understanding PromQL
      • Writing Grafana queries
      • Configuring Loki labels correctly
      • Instrumenting code with trace IDs
      • Ensuring all three systems agree on time synchronization

It's sophisticated. It's powerful. It requires specialists.
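Even the application-side slice of this, RED metrics plus a trace ID that also lands in the logs, means touching code in every service. A minimal sketch using the prometheus_client library (the metric names and the hand-rolled trace ID are illustrative; a real setup would get the ID from its tracing SDK):

    # RED-style metrics plus a trace ID that appears in the log line, so
    # metrics, logs, and traces can be correlated later. Illustrative only.
    import logging
    import time
    import uuid

    from prometheus_client import Counter, Histogram

    log = logging.getLogger("app")

    REQUESTS = Counter("http_requests_total", "Requests", ["route", "status"])
    LATENCY = Histogram("http_request_duration_seconds", "Latency", ["route"])

    def handle_request(route: str) -> None:
        trace_id = uuid.uuid4().hex   # normally propagated in, not generated here
        start = time.monotonic()
        status = "200"
        try:
            pass                      # actual request handling goes here
        except Exception:
            status = "500"
            log.exception("request failed trace_id=%s route=%s", trace_id, route)
            raise
        finally:
            elapsed = time.monotonic() - start
            REQUESTS.labels(route=route, status=status).inc()
            LATENCY.labels(route=route).observe(elapsed)
            log.info("handled trace_id=%s route=%s status=%s duration=%.3fs",
                     trace_id, route, status, elapsed)

And that is the easy part. Getting Loki to index the trace ID, Grafana to join it against the metrics, and Jaeger to agree on the same ID is where the specialists come in.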

We've turned observability into a discipline that requires dedicated engineers. Not because the problems got harder—debugging is still debugging—but because the tools got more complex.

The tools were supposed to make us more productive. Instead, they created new job titles.

When Solutions Become Problems

I watched a team spend two weeks setting up GitOps with ArgoCD. Their deployment process was:

  • Push to main
  • CI builds and pushes image
  • Kubernetes manifests update
  • Deploy

It worked. Nobody complained. But someone went to a conference, saw a GitOps talk, came back convinced we needed it.

Now the process is:

  • Push to main
  • CI builds and pushes image
  • CI updates manifests in Git
  • ArgoCD watches Git
  • ArgoCD syncs to cluster
  • Deploy

The new process does the same thing with an extra tool and two more failure points. The benefits:

  • Declarative state (we already had this, manifests were in Git)
  • Audit trail (we already had this, Git history)
  • Rollback capability (we already had this, redeploy old version)

The costs:

  • Another tool to learn, maintain, debug
  • Another place to check when deploys fail
  • Another integration to keep working
  • Another source of "why isn't this deploying?"

We solved a problem we didn't have. Then celebrated the solution.

The Vendor Love Affair

Tech Twitter convinced everyone that if you're not using the latest tools, you're falling behind. So we chase:

Last Year's Hotness

  • Service mesh! Install Istio!
  • (Six months later: why is our cluster so slow?)
  • (One year later: nobody remembers how Istio works)
  • (18 months later: we removed Istio)

This Year's Hotness

  • eBPF observability! Install Pixie!
  • (Three months later: it's using 2GB per node)
  • (Six months later: conflicts with our CNI plugin)
  • (Nine months later: we're evaluating replacements)

Next Year's Hotness

  • Platform engineering! Build an IDP!
  • (Currently in POC phase)
  • (Check back in 18 months)

Each wave promises to solve all our problems. Each wave adds complexity. Each wave eventually gets replaced by the next wave.

We're not building infrastructure. We're collecting tools like Pokémon.

What We Lost Along the Way

Somewhere between "deploy code" and "orchestrate 23 tools," we lost something important:

Simplicity
The ability to explain how your infrastructure works to a new team member in under an hour. Now it takes weeks.

Debuggability
The ability to trace a problem from symptom to cause without consulting five different UIs and correlating timestamps across systems.

Ownership
When everything requires specialists, nobody owns the whole system. Problems fall into gaps between tool boundaries.

Velocity
Each new tool slows down the next change. Want to add a feature? First make sure it works with all 23 existing tools.

Understanding
Most people on the team can operate the tools. Few people understand how they actually work. So when something breaks in a novel way, nobody knows how to fix it.

We traded these things for:

  • Better metrics (that we don't look at)
  • Fancier dashboards (that we don't maintain)
  • More automation (that we don't trust)
  • Modern architecture (that we don't fully understand)

The "Best Practices" Trap

Industry best practices say you should have:

  • Infrastructure as Code (Terraform)
  • Container orchestration (Kubernetes)
  • GitOps (ArgoCD)
  • Observability (Prometheus + Grafana + Loki)
  • Secret management (Vault)
  • Security scanning (Trivy + Snyk)
  • Service mesh (Istio/Linkerd)

So teams adopt all of them. Even when their needs are:

  • 3 microservices
  • 100 requests per second
  • 2-person team

The best practices aren't wrong. They're just designed for problems most teams don't have. Netflix needs sophisticated observability. Your startup probably doesn't.

But we cargo cult the architecture because that's what "good engineering" looks like. Then we spend 60% of our time maintaining the infrastructure instead of building features.

What Actually Matters

I've been thinking about what effective DevOps actually requires. Not the tools, the capabilities:

Deployment Confidence
Can you deploy without anxiety? Do you trust your deployment process?

This doesn't require GitOps. It requires:

  • Automated tests that catch real issues
  • Rollback mechanism that works
  • Monitoring that detects problems quickly

You can have this with GitHub Actions and kubectl apply. You can also lack it with ArgoCD and Flux.
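The deploy side of that list can be embarrassingly small. A sketch of an apply-wait-rollback step, the kind of thing a GitHub Actions job can run directly (deployment name, manifest path, and timeout are placeholders):

    # Apply manifests, wait for the rollout to become healthy, undo if it
    # doesn't. Deployment name, manifest path, and timeout are placeholders.
    import subprocess
    import sys

    DEPLOYMENT = "deployment/web"
    MANIFESTS = "k8s/"

    def run(*cmd: str) -> int:
        print("+", " ".join(cmd))
        return subprocess.call(cmd)

    def main() -> int:
        if run("kubectl", "apply", "-f", MANIFESTS) != 0:
            return 1
        if run("kubectl", "rollout", "status", DEPLOYMENT, "--timeout=120s") != 0:
            print("rollout did not become healthy, rolling back")
            run("kubectl", "rollout", "undo", DEPLOYMENT)
            return 1
        print("deploy ok")
        return 0

    if __name__ == "__main__":
        sys.exit(main())

It won't win architecture points, but it covers the rollback and fast-feedback bullets above; the tests and the monitoring do the rest.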

Incident Response Speed
When things break at 3 AM, can you fix them?

This doesn't require Grafana + Loki + Jaeger. It requires:

  • Logs accessible from one place
  • Metrics that show what's wrong
  • Runbooks that work

Simple alerting and basic log aggregation often work better than a sophisticated observability stack nobody understands.

Development Velocity
Can developers ship features without waiting on the platform team?

This doesn't require a sophisticated IDP. It requires:

  • Clear ownership boundaries
  • Self-service capabilities
  • Good documentation

Sometimes a well-documented kubectl template works better than a custom UI nobody maintains.

What I'm Actually Recommending

Not "stop using tools." But "stop collecting tools."

Before Adding a New Tool, Ask:

  • What problem does this actually solve?
  • Do we have this problem?
  • Can we solve it with existing tools?
  • What's the maintenance cost?
  • Who will own this long-term?
  • What happens if this breaks?
  • Can we remove it later if it doesn't work out?

When You Already Have Too Many Tools:

  • Audit honestly: which tools are actually used?
  • Which ones could be replaced by simpler alternatives?
  • Which ones exist because someone went to a conference?
  • What would break if we removed X?

Sometimes the answer is "we really do need all 23 tools." More often it's "we could probably do this with 12."

Prefer Boring Technology
Not because boring is better. Because boring is:

  • Well understood
  • Well documented
  • Well supported
  • Debuggable by your entire team
  • Less likely to break in novel ways

Postgres is boring. Kubernetes is becoming boring. Terraform is boring. Prometheus is boring.

Boring doesn't make exciting blog posts. But boring works at 3 AM when you're on call.

The Discipline We Need

The hard part isn't adding tools. Any team can do that. The hard part is saying no.

No to the latest hype. No to the conference talk solution. No to the tool that's technically superior but operationally complex. No to complexity for its own sake.

We need the discipline to:

  • Keep infrastructure boring
  • Add tools only for real problems
  • Remove tools that don't work out
  • Resist FOMO about latest trends
  • Value simplicity over sophistication
  • Optimize for long-term maintainability

This is unpopular. It feels like falling behind. It looks like you're not innovating.

But maintaining 12 tools well beats maintaining 23 tools poorly. And being able to debug your infrastructure at 3 AM beats having the most sophisticated architecture that nobody understands.

The goal isn't to have the best tools. The goal is to have working systems that don't require heroic efforts to operate.

Most teams would be better off with half the tools and twice the understanding.


Cesar maintains too many tools in Taipei. Considering consolidation. Probably won't happen.