Hi HN, author here.
TL;DR: I wrote this because I believe the hype around AI agents in observability is getting ahead of reality. After building an MCP server for our observability backend, I'm convinced they are powerful hypothesis generators, but not yet reliable problem solvers.
After reading a few articles claiming MCP would be the "end of observability," I felt the need to write down my own, more sceptical take, based on my experience building one of these systems.
My core argument is that these tools are effective at identifying known failure patterns, but they struggle with novel issues. During a high-stakes incident, the risk of following a confident-sounding LLM hallucination down a rabbit hole is dangerously high. Verifying the AI's suggestions can often be just as much work as finding the root cause yourself.
I would look at it from a demand-supply perspective: demand will reduce significantly, while supply has increased. I also love to look at it from a survival-of-the-fittest perspective. If you are genuinely good at what you do and can drive exceptional results, you don't have to worry. But if not, you might have to find something where you can "actually provide value" and not just be a fly on the wall.
Ultimately, I see these agents as a co-pilot that can brainstorm, but can't yet be trusted to fly the plane.
Curious to hear from other SREs and developers: how are you really using these tools? Are you finding them reliable for RCA, or are you also spending significant time manually verifying their "confident" suggestions?
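For context, the MCP server I mention is essentially a set of read-only query tools sitting in front of our telemetry backend. Below is a minimal sketch of what one such tool can look like using the MCP Python SDK's FastMCP helper; the tool name, endpoint, and parameters are placeholders, not our actual implementation.

    # Sketch only: expose a read-only log search tool to an agent over MCP.
    # The backend URL and query parameters below are hypothetical placeholders.
    import httpx
    from mcp.server.fastmcp import FastMCP

    mcp = FastMCP("observability")

    @mcp.tool()
    def search_logs(service: str, query: str, minutes: int = 15) -> str:
        """Return raw matching log lines so a human can verify any hypothesis."""
        resp = httpx.get(
            "http://localhost:8080/api/logs",  # placeholder backend endpoint
            params={"service": service, "q": query, "window_m": minutes},
            timeout=10,
        )
        resp.raise_for_status()
        return resp.text  # hand back evidence, not conclusions

    if __name__ == "__main__":
        mcp.run(transport="stdio")

Keeping the tools read-only like this is deliberate: the agent can gather evidence and propose hypotheses, but the conclusions still have to be verified by a human.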
As the original author, a few things I could have included to make it a more complete guide:
- how to collect new telemetry alongside KPS
- showcase and correlate application-level metrics alongside infra metrics in a single-view dashboard (see the sketch after this list)
- include the Operator way as well
Anything more to add? Trying to really make this a one-stop guide.
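For the application-level metrics point, the shape I have in mind is roughly: instrument the app with a Prometheus client so it exposes /metrics, then point a ServiceMonitor from KPS (kube-prometheus-stack) at it. Here is a minimal sketch of the app side only; the metric names, labels, and port are illustrative, not taken from the guide.

    # Sketch: expose application-level metrics for the KPS Prometheus to scrape.
    # Metric names, labels, and the port are illustrative placeholders.
    import random
    import time

    from prometheus_client import Counter, Histogram, start_http_server

    REQUESTS = Counter("app_requests_total", "Total requests handled", ["route"])
    LATENCY = Histogram("app_request_duration_seconds", "Request latency", ["route"])

    def handle_request(route: str) -> None:
        """Simulate a request and record app-level metrics for it."""
        with LATENCY.labels(route=route).time():
            time.sleep(random.uniform(0.01, 0.1))  # stand-in for real work
        REQUESTS.labels(route=route).inc()

    if __name__ == "__main__":
        start_http_server(8000)  # serves /metrics for Prometheus to scrape
        while True:
            handle_request("/checkout")

The scrape side (a ServiceMonitor or an additional scrape config in the KPS values) is the part the guide would still need to spell out, alongside the infra metrics the chart already collects.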
There are worse ways to start your day than sitting on still water in a boat that doesn’t want you there, trying to move forward anyway. Turns out that’s a decent metaphor for most things.
There are integrations that let you monitor your AWS resources on SigNoz as well. That said, I personally think CloudWatch is painful in plenty of other ways too.