What SREs Actually Do (And What Everyone Gets Wrong)


Jan 26

By: Nick Vecellio

Everyone’s got an opinion on what SREs do. Most of them are wrong.

Let’s get one thing out of the way: SREs don’t just “keep the site up.” If you think that’s the job, you’ve probably never actually done it—or you’ve only done the theory part. Because the reality of the work is messier, more impactful, and a heck of a lot more valuable when done right.

You want to know what an SRE actually does? I’ll tell you.

We Make Sure the Data Tells the Story

You can’t fix what you can’t see—and half the time, the story your dashboards tell is missing the key plot points. Part of the SRE job is working with teams to help them understand the data they have, the gaps they’re ignoring, and what that data actually means in the context of their apps.

Then we use that data to guide them toward better observability, smarter root cause analysis, and fewer middle-of-the-night Slack threads that end with “wait… why is this alert even firing?”

We Reduce Toil by Default

If you’re doing something more than once, you should automate it.

This doesn’t mean you need a perfect CI/CD setup or custom Terraform modules for everything (though… that helps). It means you create repeatable, self-documenting automation for anything that shouldn’t require a human every time. Whether it’s Ansible, Python, or Bash, the goal is clear: fewer manual actions, fewer mistakes, and fewer reasons to be paged at 3am.
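Here's a minimal sketch of what "repeatable" can look like in Python. The function name ensure_line and the file it edits are made up for illustration. The point is that the script checks the current state first, so running it twice is as safe as running it once.

```python
from pathlib import Path


def ensure_line(path, line):
    """Make sure `line` is present in the file at `path`.

    Returns True if the file was changed, False if it was already correct.
    (Hypothetical helper for illustration, not from any particular tool.)
    """
    target = Path(path)
    lines = target.read_text().splitlines() if target.exists() else []
    if line in lines:
        return False  # already in the desired state, nothing to do
    lines.append(line)
    target.write_text("\n".join(lines) + "\n")
    return True
```

Because it reports whether anything changed, the same call works as a one-off fix, a cron job, or a step in a larger pipeline.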

We Don’t Just Alert on Problems. We Alert on Symptoms.

Monitoring disk space is easy. Monitoring “what’s actually going wrong” is harder. That’s why we build symptom-based observability. High latency? Rising error rates? Saturation spikes? These are the things that tell you something’s off—before your app bursts into flames.

Trying to monitor for every possible failure mode is a waste of time (and usually a noisy disaster). Monitor for the signals that something’s wrong, then troubleshoot from there. Think like a doctor: start from the symptoms and work back to the underlying problem.
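To make that concrete, here's a rough sketch of a symptom check over one window of traffic. Everything in it is an assumption for illustration: the WindowStats fields, the function name, and especially the thresholds, which you'd tune per service rather than copy.

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    """Aggregated stats for one evaluation window (hypothetical shape)."""
    requests: int
    errors: int
    p99_latency_ms: float
    cpu_utilization: float  # fraction between 0.0 and 1.0


def symptoms(stats, max_error_rate=0.01, max_p99_ms=500.0, max_cpu=0.9):
    """Return a list of user-facing symptoms seen in this window."""
    found = []
    # Avoid dividing by zero when the window saw no traffic.
    error_rate = stats.errors / stats.requests if stats.requests else 0.0
    if error_rate > max_error_rate:
        found.append(f"error rate {error_rate:.1%} above {max_error_rate:.1%}")
    if stats.p99_latency_ms > max_p99_ms:
        found.append(f"p99 latency {stats.p99_latency_ms:.0f}ms above {max_p99_ms:.0f}ms")
    if stats.cpu_utilization > max_cpu:
        found.append(f"cpu saturation {stats.cpu_utilization:.0%} above {max_cpu:.0%}")
    return found
```

Notice there's no check for any specific cause here. Three signals cover a lot of failure modes you never thought to enumerate.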

On-Call Sucks—We Make It Suck Less

Look, on-call sucks. That’s just the truth. But it doesn’t have to suck this badly.

If your alerts aren’t actionable—or worse, if they’re just warnings with no clear next step—they shouldn’t exist. A good SRE team ruthlessly tunes alerts so that only the important ones fire. Then we make sure there’s a runbook. Then we make sure that runbook is readable by someone half-asleep at 2am.
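One way to enforce this is a small lint that runs over your alert definitions. The sketch below assumes alerts are plain dictionaries with hypothetical "name", "runbook", and "next_step" keys. Your alerting tool will have its own schema, so treat the field names as placeholders.

```python
def unactionable_alerts(alerts):
    """Return (alert name, missing fields) for alerts nobody can act on.

    Each alert is assumed to be a dict with a "name" key; "runbook" and
    "next_step" are hypothetical field names used for this example.
    """
    problems = []
    for alert in alerts:
        # A field counts as missing if it is absent or empty.
        missing = [field for field in ("runbook", "next_step") if not alert.get(field)]
        if missing:
            problems.append((alert["name"], missing))
    return problems
```

Run something like this in CI and an alert without a runbook never makes it to the pager in the first place.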

This is how you give sleep back to your team. This is how you build trust. This is how you keep good engineers from burning out.

We’re Embedded—Not External

An SRE isn’t some shadow team fixing stuff behind the curtain. We’re in the mix with the app teams. We’re helping them build reliable systems from day one. We’re writing infrastructure as code, building in observability, and using actual usage data to tighten up what’s already out there.

It’s not just “DevOps” and it’s not just “SRE”—because, frankly, those should be the same thing. You can’t have rapid deployment without reliability. You can’t build reliable systems without observability. And you can’t do any of it well if everyone’s working in silos.

We Learn How to Learn

No one knows everything. But good SREs know how to learn anything—and fast.

Whether it's debugging JVM memory issues, tracing Redis evictions, or figuring out why some synthetic monitor is flipping out (again), we troubleshoot with consistency. We ask good questions. We zoom out when needed. And yes, sometimes we just take action—safely—to fix the thing that’s broken.

We Prioritize Ugly but Functional Code

Automation doesn’t have to be pretty. It has to work. And it has to work safely.

Pretty code is nice, but well-documented, functional, recoverable code is what actually matters. Especially when you're gone and someone else has to run it. If it logs what it did, rolls back cleanly on failure, and doesn't require tribal knowledge to understand, you’re winning.
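As a sketch of "logs what it did, rolls back cleanly on failure," here's one simple pattern: pair every step with its undo, and unwind in reverse order if anything throws. The run_steps name and the tuple layout are assumptions for this example, not a standard API.

```python
import logging

log = logging.getLogger("automation")


def run_steps(steps):
    """Run (name, apply, undo) steps in order; undo completed ones on failure.

    Returns True if every step succeeded, False if a rollback happened.
    """
    done = []
    for name, apply, undo in steps:
        log.info("applying %s", name)
        try:
            apply()
        except Exception:
            log.exception("step %s failed, rolling back", name)
            # Undo in reverse order so later changes are unwound first.
            for prev_name, prev_undo in reversed(done):
                log.info("undoing %s", prev_name)
                prev_undo()
            return False
        done.append((name, undo))
    return True
```

It's not pretty, and it doesn't handle an undo that itself fails. But anyone reading the log afterward can see exactly what ran and what got reverted.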

At the end of the day, what does an SRE do?

We keep things running. We prevent things from breaking. And when things do break (because they will), we make sure it sucks a whole lot less for everyone involved.

We don’t just respond to alerts—we reduce them.
We don’t just fix problems—we automate them out of existence.
We don’t just monitor—we make the data make sense.

Need a partner who knows how to build reliable systems? We can help—just get in touch.


Miranda Gaudet